Correcting inaccurate background mortality in excess hazard models through breakpoints

Background Methods for estimating relative survival are widely used in population-based cancer survival studies. These methods are based on splitting the observed (the overall) mortality into excess mortality (due to cancer) and background mortality (due to other causes, as expected in the general population). The latter is derived from life tables usually stratified by age, sex, and calendar year but not by other covariates (such as the deprivation level or the socioeconomic status) which may lack though they would influence background mortality. The absence of these covariates leads to inaccurate background mortality, thus to biases in estimating the excess mortality. These biases may be avoided by adjusting the background mortality for these covariates whenever available. Methods In this work, we propose a regression model of excess mortality that corrects for potentially inaccurate background mortality by introducing age-dependent multiplicative parameters through breakpoints, which gives some flexibility. The performance of this model was first assessed with a single and two breakpoints in an intensive simulation study, then the method was applied to French population-based data on colorectal cancer. Results The proposed model proved to be interesting in the simulations and the applications to real data; it limited the bias in parameter estimates of the excess mortality in several scenarios and improved the results and the generalizability of Touraine’s proportional hazards model. Conclusion Finally, the proposed model is a good approach to correct reliably inaccurate background mortality by introducing multiplicative parameters that depend on age and on an additional variable through breakpoints.


Background
Many medical research works dedicated to prognosis or to the impact of some covariates on a given disease outcome rely largely on population-based indicators. In cancer epidemiology, using observational data from cancer registry, survival after cancer diagnosis is the most widely used indicator but there are now several aspects of survival. Among these aspects, net survival is especially interesting because it provides the survival that would be observed if only deaths from cancer were considered [1]; it eliminates the part of mortality due to other causes and allows then fair comparisons between populations or periods [2]. Unfortunately, in cancer registries, the causes of death are often unreliable [3,4]. For this purpose, methods for estimating excess mortality that do not rely on the cause of death have been developed.
These methods may be applied in a parametric framework [5][6][7][8] or a non-parametric framework [9][10][11][12]. In the parametric framework, the models consist in splitting λ O , the observed (or overall) mortality, into two components: λ E , the excess mortality due to the disease, and λ P , the background mortality (i.e., the expected mortality due to other causes in the general population). The latter component is usually derived from life tables adjusted for age, sex, and calendar year. More formally, a represents the age at diagnosis, t the time since diagnosis, and z a vector of covariates that includes vector z D . The availability of covariates z D varies between countries and some are not available at the population level; this may affect the estimates of both the excess mortality and the background mortality. For instance, both cancer-specific mortality and all-cause mortality may differ according to the socioeconomic status [13,14] and, in some populations, the deprivation level is associated with reduced life expectancy [15]. Moreover, it has been shown that overlooking certain relevant covariates in estimating the background mortality induces a bias in estimating the effects of these covariates on the excess mortality [16,17]. Thus, searching for more accurate estimates of excess mortality should take into account differences in background mortality due to specific covariates.
When population data are not available, several approaches have been proposed to overcome the problem of insufficiently stratified life tables using individual information. For instance, in a Bayesian framework, Morfeld and McCunney [18] proposed standardized mortality ratios (SMRs) using some prior distributions. Other authors proposed methods to construct life tables stratified by additional variables such as ethnicity [19,20], socioeconomic deprivation [16,21,22], or smoking status [23]. Bower et al. [24] proposed adjusting the background mortality for covariates other than age, sex, and calendar year using information from a control population that would accurately match the reference population. Such approaches allow to improve the estimate of the excess mortality by correcting the background mortality through involving a specific variable, which may be of particular interest in some epidemiological studies. However, although it is best to use external information, on the whole or on a part of the reference population, to construct life tables stratified by additional variables, it is important to note that this is not always possible. Indeed, such external information do not exist and/or are not available at regional or national level. An alternative approach focusing on correction involving a specific variable is based on modelling.
Within the context of long-term clinical trials of cancer treatment, Cheuvart and Ryan [25] proposed a model for rescaling patients' background mortality using a single multiplicative parameter. However, this model relied on aggregate data, which may have involved a loss of information. Furthermore, this scale parameter was common to all patients and allowed the mortality from other causes to differ between the studied group and the general population. Therefore, Touraine et al. [26] proposed a model where the population hazard is modelled using life table mortality rates and multiplicative parameters that depend on the level of an additional variable. However, this model relies on an assumption of proportional hazards; i.e., the background mortality differs from the life table mortality in a multiplicative way. This assumption (which cannot always be checked) may not be true with certain covariates or at certain age intervals. For instance, in the American life tables (that include ethnicity), the background mortality functions of Blacks and Whites deviate from proportionality and intersect between ages 80 and 90.
In line with the modelling approach of correcting a life table by adjusting the population hazard for an additional variable, the aim of the present study was to relax the assumption of proportional hazards between the levels of the additional variable. To this end, the study proposes a model with age-dependent multiplicative parameters using breakpoints. This allows the effect of the additional variable on the background mortality to change according to age.
The present manuscript is organized as follows: the Methods section presents Estève's model --considered as the classical model for estimating the effects of certain covariates on the excess mortality--, Touraine's model, the proposed model, and the simulation study. The Results section presents the results of the assessment of the empirical performance of each model through intensive simulations. This section also illustrates the use of these methods on French population-based colorectal cancer data. The manuscript ends with practical recommendations, a discussion about the study limitations, and ways for further research.

Estève's model (Model 1)
The model proposed by Estève et al. [5] assumes that, at time t after diagnosis of a subject aged a at diagnosis and considering a vector of covariates z that includes a vector of demographic variables z D , the observed hazard of death λ O may be written: The first component, λ E , is the excess hazard. It represents the disease-related mortality function and may be expressed as: In this expression, β T is the transpose matrix of vector β (the latter being the vector of regression parameters which are the logarithms of the hazard ratios), τ k is the logarithm of the baseline excess hazard in the k th interval for a subject with z = 0, I k is an indicator function (I k =1 when t k − 1 < t < t k , 0 otherwise. These time points are predefined if there is an a priori epidemiological knowledge or they are chosen according to the proportion of deaths in each interval). The baseline excess hazard is a piecewise constant over K intervals. The vector of regression parameters β and the baseline parameters τ k (k = 1, …, K) are estimated by the maximum likelihood method.
The second component, λ Ã P , is the background mortality. This population-hazard term is not estimated from the data but derived from a life table adjusted for some common variables such as age, sex, and calendar year. These variables belong to z D , which is a subset of z.

Touraine's model (Model 2)
Touraine et al. [26] proposed a model that allows the background mortality of the studied population to differ from that of the general population using a multiplicative parameter. Unlike Estève's model, Touraine's model takes into account an additional variable by which the life table is not initially stratified. This parameter that multiplies the potentially inaccurate background mortality represents the effect of the additional variable on the background mortality. There may exist as many multiplicative parameters as levels of the additional variable. More formally, considering a categorical variable x with M levels (included in z) and a life table not stratified by x (i.e., vector of variables z D does not include x), Touraine's model may be written: In this expression, λ E and λ Ã P are defined as in Model 1, I is an indicator function (I =1 when x = m, 0 otherwise), and α m are multiplicative parameters that correct a potentially inaccurate background mortality of subjects whose additional variable x = m. As in Model 1, parameters β and τ k of the excess mortality and α m are simultaneously estimated by the maximum likelihood method,α m is estimated so that eα m ¼ α m . Similarly, λ Ã P is not estimated from the data but derived from a life table.

Proposed model (Model 3)
The proposed model is an extension of Touraine's model; it allows the background mortality of the studied population to differ from that of the general population by introducing an age-dependent multiplicative parameter through breakpoints. This means that the effect of the additional variable on the background mortality could be not constant over time and that there may not be constant proportionality between the background mortality functions associated with the levels of the additional variable.

The model with B breakpoints
As in Model 2, let us consider x with M levels and a vector Ɛ (ε 1 < ε 2 < … < ε B ) of B breakpoints. The proposed model may be written: with m = 1, …, M; b = 1, …, B + 1. In this equation, λ E and λ Ã P are defined as in Models 1 and 2, α mb are multiplicative parameters that correct a potentially inaccurate background mortality of subjects whose additional variable x = m over segment b, and I b is an indicator function (I b =1 when ε b − 1 ≤ a + t < ε b , 0 otherwise). As in Models 1 and 2, parameters β and τ k of the excess mortality and α mb (m = 1, …, M; b = 1, …, B + 1) are simultaneously estimated by the maximum likelihood method, α mb is estimated so that eα mb ¼ α mb . Similarly, λ Ã P is not estimated from the data but derived from a life table. The log-likelihood is defined as: where ψ = (β, τ k , α mb ) represents the vector of model parameters, n the number of subjects, δ i the indicator of death for subject i, Λ E the cumulative excess hazard function, and Λ Ã P the cumulative population hazard function. Unlike Model 1, the latter does not cancel when maximizing the log-likelihood. It is easily computed because λ Ã P is a piecewise constant function derived from a life table that provides mortality rates by age and calendar year units.
From this log-likelihood, the first derivatives are: In these derivatives, z il is the component l of the vector of covariates of subject i and t ki is the time spent in the k th interval by subject i.

Breakpoint number and location
Model 3 assumes that both the number and the locations of the breakpoints are fixed. The literature has reported several approaches for determining the number and the locations of the breakpoints. In the case of a single breakpoint, Kunst et al. [27] proposed a graphical method to determine the location of the breakpoint by a simple examination of a scatter plot. Some authors developed an exact or grid-search type algorithm for breakpoint determination [28][29][30]. Others proposed a Bayesian MCMC approach [31,32] but faced a computational bulk even with simple models. Braun et al. [33] proposed an approach using quasi-deviance to measure the quality of the fitted model and adapted the Schwarz criterion for the choice of the number of breakpoints. Molinari et al. [34] and Bessaoud et al. [35] proposed a heuristic approach. More specifically, the range of the variable of interest is divided into 10 segments (i.e. 9 breakpoints since there are the lower and upper bounds of the variable). Thus, in case of B breakpoints, there are ð 9 B Þ ¼ 9! B!ð9 − BÞ! potential vectors of location. For each combination of potential locations, a Bayesian Information Criterion (BIC) is calculated and used to select the best model (the one with lowest criterion). Muggeo [36] proposed a linearization technique where a single or more breakpoints are parameters of the model. Goodman et al. [37] proposed a sequential process. To find the model with the optimal number of breakpoints k (k = 0, …, K) that best fits the data, they performed sequential testings; i.e., they compared successively model pairs (with k vs. k + 1 breakpoints) until failing to reject the null hypothesis (no breakpoint against the alternative of a single breakpoint), which made them retain k and not k + 1 breakpoints.
In this work, we investigated more specifically models with a single (ε) and two breakpoints (ε 1 , ε 2 ). Indeed, we have chosen this limited number of breakpoints since it allows the number of parameters to remain low while ensuring a sufficient flexibility to reflect plausible patterns of changes in background mortality over age. Model 3 with a single (Model 3.1) and two (Model 3.2) breakpoints may then be written, respectively: To determine the location of breakpoint(s), we retained the strategy of Molinari et al. [34] and Bessaoud et al. [35]. Thus, there are 9 potential locations with a single breakpoint and 36 combinations of potential locations with two breakpoints. The one with the lowest Akaike Information Criterion (AIC) is selected.
To estimate the parameters of the proposed model, standard optimization functions from R software were used to maximize the log-likelihood (programs available on request).

Simulations
Intensive simulations were used to assess the performance of the proposed Model 3 (Models 3.1 and 3.2). We also considered selecting between Models 3.1 and 3.2 using AIC, in order to identify the model favoured by the data, and this model is referred to as Model 4. Comparisons were made with the performance of Estève's and Touraine's models.

Design
Each simulation considered N = 1000 samples of n = 2000 subjects. A first life table --considered as 'incomplete'--was used to construct a 'complete' life table; i.e., a life table adjusted for an additional variable. The complete life table was used to generate T P , the time to death from other causes than cancer (which allows deriving the background mortality), T E , the time to death from cancer (which allows deriving the excess mortality), and a censoring time. These times are assumed to be independent conditionally on z. The observed time to death was considered as the lowest value between T P , T E , and the censoring time.
Within each simulation, several scenarios were considered by varying the impact of the additional variable on the background mortality. The models' parameters were estimated from the incomplete life table and the estimates of the covariate effects on the excess mortality were used to compare model performances. The performance criteria were: the bias, the relative bias, the empirical coverage rate (ECR), and the root mean squared error (RMSE). Model selection used the AIC which allows model penalization according to the number of parameters to satisfy parameter parsimony. The model with the lowest AIC was the best model.

Simulated data
Patient covariates and excess mortality The age at cancer diagnosis (a) and the additional variable (x) were the covariates that influenced the excess mortality. a was simulated from a mixture of uniform distributions with 25% of subjects in age class [30-65[, 35% in [65-75[, and 40% in [75-85[. x was simulated as a binary variable with occurrence probability p = 90% or 10%. The baseline excess hazard was the hazard function of a generalized Weibull distribution [38] with parameters (k, λ, θ) = (2, 0.2, 0.5), where k is the shape parameter, λ the scale parameter and θ the location parameter. T E was then simulated using the inverse transformation method with covariate effects β a = 0.3 and β x = − 0.2. The excess mortality was obtained by: λ E (t| a, x) = λ 0 E ðtÞe β a aþβ x x . An administrative censoring was assumed at 6 years, which resulted 40% of censoring rate in generated data. All subjects were considered to be men diagnosed within the same year.
Background mortality The incomplete life table was the one available from function survexp.us in package survival of R. This table provides the background mortality of the American population by age, sex, and calendar year from 1940 to 2014. For simplicity, the incomplete life table was considered as adjusted for age only and the selected background mortality was that of men in year 1990. As a and x may also influence the background mortality, the complete life table was stratified by a and x and T P was simulated from this complete life table. Various mismatches in the life table were considered by varying the impact of x on the background mortality. This led to six scenarios: Scenario A: No mismatch; i.e., x has no effect on the background mortality. Scenario B: Proportional mismatch; i.e., the two levels of x have proportional effects on the background mortality. Scenario C: Non-proportional crossover mismatch; i.e., the two levels of x have non-proportional effects on the background mortality and the background mortality functions intersect. Scenario D: Non-proportional converging mismatch; i.e., the two levels of x have non-proportional effects on the background mortality and the background mortality functions converge. Scenario E: Non-proportional diverging mismatch; i.e., the two levels of x have non-proportional effects on the background mortality and the background mortality functions diverge. Scenario F: Non-proportional three-level mismatch; i.e., the three levels of x have non-proportional effects on the background mortality.
All models were run with each of these scenarios. Scenario A ensures that Models 3.1 and 3.  Table 1 displays the bias, the relative bias, the ECR, and the RMSE for a and x from all models with Scenarios A to E when the proportion of subjects with x = 1 is 90% (for proportion = 10%, see Additional file 2). The results from Scenario F when the proportions of subjects with x = 1 and x = 2 are respectively 10 and 80% are presented in Additional file 3. Table 2 displays the percentage of times each model was retained (%AIC) with Scenarios A to E when the proportion of subjects with x = 1 is 90% (for proportion = 10%, see Additional file 4) and Scenario F. Figure 2 shows the boxplots of the estimates of the effects of covariates a and x on the excess mortality in Scenarios A to E when the proportion of subjects with x = 1 is 90% (for proportion = 10%, see Additional file 5, and for Scenario F, see Additional file 6). The true values of the parameters lay on the horizontal line.

Simulation results
In Scenario A (No mismatch), the incomplete and complete life table were the same (no use of additional variable); they provided a true value of the background mortality. For β a , Models 3.1 and 3.2 provided unbiased estimates and ECRs equal to 91.4 and 90.4% respectively; for β x , they provided biases close to 0 and ECRs equal to 96.9% and 96.6 respectively. Models 1 and 2 yielded In Scenarios B to F below, the incomplete and complete life tables were different (use of an additional variable). The incomplete life table was used for estimation; it provided inaccurate values of the background mortality.
In Scenario B (Proportional mismatch), for β a , Models 3.1 and 3.2 provided unbiased estimates and ECRs close to 90%; for β x , they provided biases close to 0 and ECRs close to 95%. Model 2 provided the same results. Thus, Models 3.1 and 3.2 performed as well as Model 2 though Model 3.1 was selected only a little more than once in two times (50.56%) while Model 3.2 was selected nearly three times out of ten (26.87%), and Model 4 favoured Model 3.1 (89.58%) vs. Model 3.2. In contrast, Model 1 led to biased parameter estimates, poor ECR, and low AIC-based selection.
In Scenario C (Non-proportional crossover mismatch), Models 3.1 and 3.2 provided biases close to 0 and ECRs greater than 90%; they showed also a lower variability than Models 1 and 2. Model 2 provided biased parameter estimates (of β x in particular) and ECRs lower than 90%. It underestimated the effect of x on the excess mortality. Model 1 provided biased parameter estimates, poor ECRs, and low AIC-based selection. It overestimated the effects of the covariates on the excess mortality. As expected, Models 1 and 2 performed poorly while Model 3.1 performed well and was selected nearly three times out of four (74.37%), and Model 4 favoured Model 3.1 vs Model 3.2.
In Scenario D (Non-proportional converging mismatch), for β a , Models 2, 3.1 and 3.2 provided unbiased estimates and ECRs close to 90%, whereas Model 1 In Scenario E (Non-proportional diverging mismatch), Models 2, 3.1 and 3.2 provided biases close to 0 and ECRs greater than 90%, whereas Model 1 provided biased estimates, ECRs close to 0 and underestimated the effects of the covariates on the excess mortality. Model 3.1 is always favoured over Model 3.2 (89.97% vs. 10.03%). This scenario highlighted significantly the fact that Model 1 does not apply to life tables stratified by an additional variable.
In Scenario F (Non-proportional 3-level mismatch), Models 3.1 and 3.2 provided less biased effects of the covariates than Model 1 or Model 2 and ECRs greater than 90%. Model 1 provided small ECRs and overestimated the effects of the covariates on the excess mortality, whereas the other three models underestimated them. In addition, Model 3.1 was selected about half the times (51.56%) and favoured over Model 3.2. Furthermore, whenever the levels of the additional variable had small effects on the background mortality (not shown here), the four models gave comparable results.

Applications to population-based data on colorectal cancer
The interest of the proposed model and performance comparison between Models 1 and 2 were tested on population-based data on colorectal cancer from nine French cancer registries of network FRANCIM. This testing analyzed mortality data on 1398 patients with colorectal cancer diagnosed between January 1 and December 31, 1995. Two applications studied the effects of three prognostic factors on the excess mortality: sex, age at diagnosis, and cancer stage at diagnosis. These applications excluded patients with missing data on any factor and patients aged > 90 years. Age at diagnosis was considered under three categories (≤ 64, 65-74, and 75-90) and cancer stage categories III and IV were merged to balance the number of patients into three categories.

Application 1
Method In Application 1, Model 1 was used with the 'complete' French life table that included covariates age, sex, and calendar year and was considered as the gold standard (Model 1*). The purpose of this model was to provide reference values as in a simulation study. Models 1, 2, and 3 (3.1 and 3.2) were used with the 'incomplete' life table that included covariates age and calendar year only; covariate sex was considered as the additional variable.
Results Table 3 shows the covariates used for the estimation of the excess mortality.
Patients' ages ranged from 21 to 90 years (mean = 69.8) and category sizes were rather balanced. At 10 years post-diagnosis, 664 deaths had occurred (i.e., 50.3% of 1304 included patients). The results of the application of the models are summarized in Table 4. AIC values for Model 3 with a single and two breakpoints were calculated (4230 and 4228, respectively), resulting in the selection of a model with two-breakpoints, located at 83 and 88 years.
Using the incomplete life table, all three models (Models 1, 2, and 3) found that age and cancer stage at diagnosis were significantly associated with excess mortality. With Model 3, the estimated excess hazard effects (EHR) of these two factors were the closest to the reference values given by Model 1*. With 'sex' as an additional variable, Model 1 found wrongly a significant effect (EHR = 0.69 [0.53-0.90]) because this effect was

Application 2
Method In addition to the prognostic factors considered in Application 1, the socioprofessional category (SPC) has also shown an impact on the survival of patients with colorectal cancer [39,40]. The present application considered four SPCs: no occupational activity, clerical and manual workers, farmers, and other occupational activities. Results Patients' ages ranged from 21 to 90 years (mean = 68.2). Age and stage category sizes were rather balanced; however, clerical or manual workers formed the largest category (Table 3). At 10 years postdiagnosis, 394 deaths had occurred (i.e., 50% of 788 included patients). All models used the inaccurate background mortality from the incomplete life table. The results are summarized in Table 5. The AIC values for Model 3 with single and two breakpoints were the same (2539). We have chosen the model with a singlebreakpoint, located at 75 years, which is more parsimonious.
In this application, all three models (Models 1, 2, and 3) showed that the excess mortality increased significantly with age at diagnosis and cancer stage but no significant difference between men and women. The three models gave close estimates of the effect of age but the effect of cancer stages III-IV (versus I) was greater with   In this application, no model found that additional variable SPC was significantly associated with excess mortality. Model 2 found a lower background mortality in patients with 'No occupational activity' and 'Other occupational activities' (respectively 0.72 and 0.70 times that provided by the life table) but a higher background mortality in 'Farmers' and 'Clerical and manual workers' (respectively, 1.10 and 1.05 times that provided by the life table), irrespective of age. In contrast, Model 3 showed that the effect of SPC on the background mortality was not constant over time. Indeed, only patients with "Other occupational activities" had practically the same background mortality before and after 75 years; e.g., 'Farmers' had a lower background mortality than the overall mortality from the life table before 75 years but a higher one after that age. In addition, 'Other occupational activities' --including intermediate and higher occupations--had a lower background mortality than the overall mortality from the life table. Furthermore, Model 1 and Model 3 had the same AIC whereas Model 2 performed the worst in terms of AIC (2545).

Discussion
The present work proposes a regression model of excess mortality (Model 3) able to correct for potentially inaccurate background mortality due to the unavailability of a specific variable on population level. It thus provides an interesting alternative to answer epidemiological questions involving a specific variable affecting both the excess mortality and the background mortality in the absence of life table stratified by this variable and when no external information exists and/or is available to construct such a stratified life table. Specifically, it increases the flexibility of Touraine's model (Model 2) [26] by introducing age-dependent multiplicative parameters through breakpoints. Whenever a currently available life table is not stratified by an additional variable x, Model 3 considers an x-specific age-dependent corrective parameter that multiplies the background mortality. In practice, for the proposed Model 3, with a particular focus on model with a single or two breakpoints, we used a heuristic approach to determine the number and locations of breakpoints [34,35]. We divide age into segments and calculate the AIC for all combinations of 1, 2, ..., B breakpoints. We retain the one with lowest AIC. As explained above, the choice of this limited number of breakpoints is based on both statistical and epidemiological criteria.
As detailed in the Breakpoint number and location sub-section, other approaches may be used concerning the determination of breakpoint number and location. In addition to graphic and numerical approaches [27-33, 36, 37], it is also possible to "fix" the number and the location based on prior information, for example when a life table stratified by the additional variable x exist in another country. More generally, the approaches and strategy are similar to those proposed in the framework of the use of spline functions concerning the choice of the number and location of nodes.
In the simulations, Model 3, whether it's with a single or two breakpoints, showed a good performance; its estimated parameters and ECRs were close to the nominal values. In all scenario, Model 3 with a single-breakpoint has been favoured over two breakpoints. Thus, Model 3 with a single-breakpoint was sufficient. The simulations also showed that Model 3 was as performant as Model 1 in the absence of additional variable and as performant as Model 2 in case of proportional mismatch. Furthermore, Model 3 eliminated or limited the bias in parameter estimates of the excess mortality in several other mismatch scenarios. However, although it has lower bias, it had a higher variability than Model 2, but was better in terms of AIC. This cost of higher variance may be explained by the additional parameters. Indeed, to estimate the effect of the additional variable on the background mortality, Model 3 has M*B additional multiplicative parameters than Model 2 (M: levels of the additional variable; B: number of the breakpoints).
In the two practical applications on registry colorectal cancer data and an 'incomplete' life table (obtained by  removing a variable from a real complete table), Model 3 proved to be useful; it performed better than Models 1 and 2 vs. gold-standard estimates obtained with a 'complete' life table. Note that our results differ slightly from those of Touraine et al. work [26] simply because of minor modifications in the choice of criteria for the inclusion of patients in our study. In the second application (SPC as additional variable), Models 1 and 3 had the same AIC. However, the simulations have shown that these models may give comparable results in situations where the effect of the additional variable on the background mortality is not significant (results not shown). Although the 95% CIs of the multiplicative parameters overlap between the two age categories (before and after the breakpoint), there is no interest in determining whether this difference is really significant, or whether there is a way to test for such differences. Indeed, our goal is to correct inaccurate background mortality in excess hazard models in order to eliminate or limit the bias in estimating the effects of prognostic factors on excess mortality.
In addition, the results obtained with Model 3 are consistent with the literature. Indeed, SPCs 'Farmers' and 'Other occupational activities' have a lower early background mortality (< 65) than the overall mortality from the life table [41]; Farmers would be healthier than the general population [42,43]. The present study results showed that, before age 75 years, the working population had a lower background mortality than the overall mortality from the general population. People with "No occupational activity" showed a higher background mortality; this may relate to the 'Healthy Worker Effect' (healthy individuals keep being employable) [44]. In this work, the socioprofessional category was defined as the longest occupational activity of each subject. Another but highly debatable choice would be the first subject's occupation.
Given the present results, Model 3 would improve the results of Model 2 by making it more generalizable; specifically, when the assumption of proportionality is not valid at certain age intervals (e.g. in the American life tables that include ethnicity, Black and White background mortality functions deviate from proportionality and intersect between ages 80 and 90). Nevertheless, Model 3 presents some limitations. First, it was found here essentially suitable for estimating the parameters related to a single and necessarily categorical additional variable with no more than two breakpoints. It would be interesting to carry out a study on three or several breakpoints. Nevertheless, the use of several breakpoints may lead to over-parameterization of the model. Fortunately, in medical research, a low number of breakpoints is usually sufficient [34]. Second, Model 3 is still a piecewise proportional population hazards where the parameters related to the additional variable (used to correct the background mortality) vary with age though they remain constant within intervals. Another interesting work would be the use of smooth or flexible functions, such as splines or penalized splines. Then, a generalization to situations where mismatch in the life table may be due to numerous variables may be attractive. A model with a random effect (i.e., a frailty term) was proposed [45]. This term corrects for the effects of several potentially unavailable or unobservable covariates and may differ between subjects, which is of epidemiological interest. Possible limitation of this random effect model, pointed out by the authors, is due to the challenges in estimating the parameters in case of insufficient sample size (simulations done with data sets of size n = 5000). Another one may come from the difficulty in interpreting the epidemiological effects captured by the fragility term. However, such model tries to answer to another epidemiological question than the one investigated with our Model 3. In line with multiple mismatches in the life table, an interesting perspective would be to use latent class approach to correct background mortality, which would allow a better description of the epidemiological profiles and their impact on expected mortality.

Conclusion
In absence of life table stratified by a specific variable and when no external information exists and/or is available to construct life table stratified by this additional variable, the proposed model is a good approach to correct reliably inaccurate background mortality by introducing multiplicative parameters that depend on age and on an additional variable through breakpoints.