Skip to main content
  • Technical advance
  • Open access
  • Published:

Correcting inaccurate background mortality in excess hazard models through breakpoints

Abstract

Background

Methods for estimating relative survival are widely used in population-based cancer survival studies. These methods are based on splitting the observed (the overall) mortality into excess mortality (due to cancer) and background mortality (due to other causes, as expected in the general population). The latter is derived from life tables usually stratified by age, sex, and calendar year but not by other covariates (such as the deprivation level or the socioeconomic status) which may lack though they would influence background mortality. The absence of these covariates leads to inaccurate background mortality, thus to biases in estimating the excess mortality. These biases may be avoided by adjusting the background mortality for these covariates whenever available.

Methods

In this work, we propose a regression model of excess mortality that corrects for potentially inaccurate background mortality by introducing age-dependent multiplicative parameters through breakpoints, which gives some flexibility. The performance of this model was first assessed with a single and two breakpoints in an intensive simulation study, then the method was applied to French population-based data on colorectal cancer.

Results

The proposed model proved to be interesting in the simulations and the applications to real data; it limited the bias in parameter estimates of the excess mortality in several scenarios and improved the results and the generalizability of Touraine’s proportional hazards model.

Conclusion

Finally, the proposed model is a good approach to correct reliably inaccurate background mortality by introducing multiplicative parameters that depend on age and on an additional variable through breakpoints.

Peer Review reports

Background

Many medical research works dedicated to prognosis or to the impact of some covariates on a given disease outcome rely largely on population-based indicators. In cancer epidemiology, using observational data from cancer registry, survival after cancer diagnosis is the most widely used indicator but there are now several aspects of survival. Among these aspects, net survival is especially interesting because it provides the survival that would be observed if only deaths from cancer were considered [1]; it eliminates the part of mortality due to other causes and allows then fair comparisons between populations or periods [2]. Unfortunately, in cancer registries, the causes of death are often unreliable [3, 4]. For this purpose, methods for estimating excess mortality that do not rely on the cause of death have been developed.

These methods may be applied in a parametric framework [5,6,7,8] or a non-parametric framework [9,10,11,12]. In the parametric framework, the models consist in splitting λO, the observed (or overall) mortality, into two components: λE, the excess mortality due to the disease, and λP, the background mortality (i.e., the expected mortality due to other causes in the general population). The latter component is usually derived from life tables adjusted for age, sex, and calendar year. More formally,

$$ {\lambda}_O\left(t|z\right)={\lambda}_E\left(t|z\right)+{\lambda}_P\left(a+t|{z}_D\right). $$

a represents the age at diagnosis, t the time since diagnosis, and z a vector of covariates that includes vector zD. The availability of covariates zD varies between countries and some are not available at the population level; this may affect the estimates of both the excess mortality and the background mortality. For instance, both cancer-specific mortality and all-cause mortality may differ according to the socioeconomic status [13, 14] and, in some populations, the deprivation level is associated with reduced life expectancy [15]. Moreover, it has been shown that overlooking certain relevant covariates in estimating the background mortality induces a bias in estimating the effects of these covariates on the excess mortality [16, 17]. Thus, searching for more accurate estimates of excess mortality should take into account differences in background mortality due to specific covariates.

When population data are not available, several approaches have been proposed to overcome the problem of insufficiently stratified life tables using individual information. For instance, in a Bayesian framework, Morfeld and McCunney [18] proposed standardized mortality ratios (SMRs) using some prior distributions. Other authors proposed methods to construct life tables stratified by additional variables such as ethnicity [19, 20], socioeconomic deprivation [16, 21, 22], or smoking status [23]. Bower et al. [24] proposed adjusting the background mortality for covariates other than age, sex, and calendar year using information from a control population that would accurately match the reference population. Such approaches allow to improve the estimate of the excess mortality by correcting the background mortality through involving a specific variable, which may be of particular interest in some epidemiological studies. However, although it is best to use external information, on the whole or on a part of the reference population, to construct life tables stratified by additional variables, it is important to note that this is not always possible. Indeed, such external information do not exist and/or are not available at regional or national level. An alternative approach focusing on correction involving a specific variable is based on modelling. Within the context of long-term clinical trials of cancer treatment, Cheuvart and Ryan [25] proposed a model for rescaling patients’ background mortality using a single multiplicative parameter. However, this model relied on aggregate data, which may have involved a loss of information. Furthermore, this scale parameter was common to all patients and allowed the mortality from other causes to differ between the studied group and the general population. Therefore, Touraine et al. [26] proposed a model where the population hazard is modelled using life table mortality rates and multiplicative parameters that depend on the level of an additional variable. However, this model relies on an assumption of proportional hazards; i.e., the background mortality differs from the life table mortality in a multiplicative way. This assumption (which cannot always be checked) may not be true with certain covariates or at certain age intervals. For instance, in the American life tables (that include ethnicity), the background mortality functions of Blacks and Whites deviate from proportionality and intersect between ages 80 and 90.

In line with the modelling approach of correcting a life table by adjusting the population hazard for an additional variable, the aim of the present study was to relax the assumption of proportional hazards between the levels of the additional variable. To this end, the study proposes a model with age-dependent multiplicative parameters using breakpoints. This allows the effect of the additional variable on the background mortality to change according to age.

The present manuscript is organized as follows: the Methods section presents Estève’s model --considered as the classical model for estimating the effects of certain covariates on the excess mortality--, Touraine’s model, the proposed model, and the simulation study. The Results section presents the results of the assessment of the empirical performance of each model through intensive simulations. This section also illustrates the use of these methods on French population-based colorectal cancer data. The manuscript ends with practical recommendations, a discussion about the study limitations, and ways for further research.

Methods

Estève’s model (Model 1)

The model proposed by Estève et al. [5] assumes that, at time t after diagnosis of a subject aged a at diagnosis and considering a vector of covariates z that includes a vector of demographic variables zD, the observed hazard of death λO may be written:

$$ {\lambda}_O\left(t|z\right)={\lambda}_E\left(t|z\right)+{\lambda}_P^{\ast}\left(a+t|{z}_D\right) $$
(1)

The first component, λE, is the excess hazard. It represents the disease-related mortality function and may be expressed as:

$$ {\lambda}_E\left(t|z\right)={\sum}_{k=1}^K\exp \left({\tau}_k\right){I}_k(t)\exp \left({\beta}^Tz\right) $$

In this expression, βT is the transpose matrix of vector β (the latter being the vector of regression parameters which are the logarithms of the hazard ratios), τk is the logarithm of the baseline excess hazard in the k th interval for a subject with z = 0, Ik is an indicator function (Ik =1 when tk − 1 < t < tk, 0 otherwise. These time points are predefined if there is an a priori epidemiological knowledge or they are chosen according to the proportion of deaths in each interval). The baseline excess hazard is a piecewise constant over K intervals. The vector of regression parameters β and the baseline parameters τk (k = 1, …, K) are estimated by the maximum likelihood method.

The second component, \( {\lambda}_P^{\ast } \), is the background mortality. This population-hazard term is not estimated from the data but derived from a life table adjusted for some common variables such as age, sex, and calendar year. These variables belong to zD, which is a subset of z.

Touraine’s model (Model 2)

Touraine et al. [26] proposed a model that allows the background mortality of the studied population to differ from that of the general population using a multiplicative parameter. Unlike Estève’s model, Touraine’s model takes into account an additional variable by which the life table is not initially stratified. This parameter that multiplies the potentially inaccurate background mortality represents the effect of the additional variable on the background mortality. There may exist as many multiplicative parameters as levels of the additional variable. More formally, considering a categorical variable x with M levels (included in z) and a life table not stratified by x (i.e., vector of variables zD does not include x), Touraine’s model may be written:

$$ {\lambda}_O\left(t|z\right)={\lambda}_E\left(t|z\right)+{\sum}_m{\alpha}_mI\left(x=m\right){\lambda}_P^{\ast}\left(a+t|{z}_D\right)\ \mathrm{with}\ m= 1,\dots, M $$
(Model  2)

In this expression, λE and \( {\lambda}_P^{\ast } \) are defined as in Model 1, I is an indicator function (I =1 when x = m, 0 otherwise), and αm are multiplicative parameters that correct a potentially inaccurate background mortality of subjects whose additional variable x = m. As in Model 1, parameters β and τk of the excess mortality and αm are simultaneously estimated by the maximum likelihood method, \( {\overset{\sim }{\alpha}}_m \) is estimated so that \( {e}^{{\overset{\sim }{\alpha}}_m}={\alpha}_m \). Similarly, \( {\lambda}_P^{\ast } \) is not estimated from the data but derived from a life table.

Proposed model (Model 3)

The proposed model is an extension of Touraine’s model; it allows the background mortality of the studied population to differ from that of the general population by introducing an age-dependent multiplicative parameter through breakpoints. This means that the effect of the additional variable on the background mortality could be not constant over time and that there may not be constant proportionality between the background mortality functions associated with the levels of the additional variable.

The model with B breakpoints

As in Model 2, let us consider x with M levels and a vector Ɛ (ε1 < ε2 < … < εB) of B breakpoints. The proposed model may be written:

$$ {\lambda}_O\left(t|z\right)={\lambda}_E\left(t|z\right)+{\sum}_m{\sum}_b{\alpha}_{mb}{I}_b\left(a+t|x=m\right){\lambda}_P^{\ast}\left(a+t|{z}_D\right) $$
(Model  3)

with m = 1, …, M; b = 1, …, B + 1. In this equation, λE and \( {\lambda}_P^{\ast } \) are defined as in Models 1 and 2, αmb are multiplicative parameters that correct a potentially inaccurate background mortality of subjects whose additional variable x = m over segment b, and Ib is an indicator function (Ib =1 when εb − 1 ≤ a + t < εb, 0 otherwise). As in Models 1 and 2, parameters β and τk of the excess mortality and αmb (m = 1, …, M; b = 1, …, B + 1) are simultaneously estimated by the maximum likelihood method, \( {\overset{\sim }{\alpha}}_{mb} \) is estimated so that \( {e}^{{\overset{\sim }{\alpha}}_{mb}}={\alpha}_{mb} \). Similarly, \( {\lambda}_P^{\ast } \) is not estimated from the data but derived from a life table. The log-likelihood is defined as:

$$ l\left(\psi \right)={\sum}_{i=1}^n\left(-{\Lambda}_E\left({t}_i|{z}_i\right)-{\sum}_m{\sum}_b{\alpha}_{mb}{I}_b\left({a}_i+{t}_i|{x}_i=m\right){\Lambda}_P^{\ast}\left({a}_i+{t}_i|{z}_{D_i}\right)+{\delta}_i\mathit{\log}\left[{\lambda}_E\left({t}_i|{z}_i\right)+{\sum}_m{\sum}_b{\alpha}_{mb}{I}_b\left({a}_i+{t}_i|{x}_i=m\right){\lambda}_P^{\ast}\left({a}_i+{t}_i|{z}_{D_i}\right)\right]\ \right) $$

where ψ = (β, τk, αmb) represents the vector of model parameters, n the number of subjects, δi the indicator of death for subject i, ΛE the cumulative excess hazard function, and \( {\Lambda}_P^{\ast } \) the cumulative population hazard function. Unlike Model 1, the latter does not cancel when maximizing the log-likelihood. It is easily computed because \( {\lambda}_P^{\ast } \) is a piecewise constant function derived from a life table that provides mortality rates by age and calendar year units.

From this log-likelihood, the first derivatives are:

$$ \frac{\partial l\left(\psi \right)}{\partial {\beta}_l}=\sum \limits_{i=1}^n\left(-{z}_{il}{\Lambda}_E\left({t}_i|{z}_i\right)+{\delta}_i\frac{z_{il}{\lambda}_E\left({t}_{\mathrm{i}}|{z}_i\right)}{\lambda_E\left({t}_i|{z}_i\right)+\sum \limits_m\sum \limits_b{\alpha}_{mb}{I}_b\left({a}_i+{t}_i|{x}_i=m\right){\lambda}_P^{\ast}\left({a}_i+{t}_i|{z}_{D_i}\right)}\ \right) $$
$$ \frac{\partial l\left(\psi \right)}{\partial {\tau}_k}=\sum \limits_{i=1}^n\left(-\exp \left({\tau}_k\right){t}_{ki}\exp \left({\beta}^T{z}_i\right)+{\delta}_i\frac{\exp \left({\tau}_k\right){I}_k\left({t}_i\right)\exp \left({\beta}^T{z}_i\right)}{\lambda_E\left({t}_i|{z}_i\right)+\sum \limits_m\sum \limits_b{\alpha}_{mb}{I}_b\left({a}_i+{t}_i|{x}_i=m\right){\lambda}_P^{\ast}\left({a}_i+{t}_i|{z}_{D_i}\right)}\right) $$
$$ \frac{\partial l\left(\psi \right)}{\partial {\overset{\sim }{\alpha}}_{mb}}=\sum \limits_{i=1}^n\left(-{\alpha}_{mb}{I}_b\left({a}_i+{t}_i|{x}_i=m\right){\Lambda}_P^{\ast}\left({a}_i+{t}_i|{z}_{D_i}\right)+{\delta}_i\frac{\alpha_{mb}{I}_b\left({a}_i+{t}_i|{x}_i=m\right){\lambda}_P^{\ast}\left({a}_i+{t}_i|{z}_{D_i}\right)}{\lambda_E\left({t}_i|{z}_i\right)+\sum \limits_m\sum \limits_b{\alpha}_{mb}{I}_b\left({a}_i+{t}_i|{x}_i=m\right){\lambda}_P^{\ast}\left({a}_i+{t}_i|{z}_{D_i}\right)}\right) $$

In these derivatives, zil is the component l of the vector of covariates of subject i and tki is the time spent in the kth interval by subject i.

Breakpoint number and location

Model 3 assumes that both the number and the locations of the breakpoints are fixed. The literature has reported several approaches for determining the number and the locations of the breakpoints. In the case of a single breakpoint, Kunst et al. [27] proposed a graphical method to determine the location of the breakpoint by a simple examination of a scatter plot. Some authors developed an exact or grid-search type algorithm for breakpoint determination [28,29,30]. Others proposed a Bayesian MCMC approach [31, 32] but faced a computational bulk even with simple models. Braun et al. [33] proposed an approach using quasi-deviance to measure the quality of the fitted model and adapted the Schwarz criterion for the choice of the number of breakpoints. Molinari et al. [34] and Bessaoud et al. [35] proposed a heuristic approach. More specifically, the range of the variable of interest is divided into 10 segments (i.e. 9 breakpoints since there are the lower and upper bounds of the variable). Thus, in case of B breakpoints, there are \( \left(\genfrac{}{}{0pt}{}{9}{B}\right)=\frac{9!}{B!\left(9-B\right)!} \) potential vectors of location. For each combination of potential locations, a Bayesian Information Criterion (BIC) is calculated and used to select the best model (the one with lowest criterion). Muggeo [36] proposed a linearization technique where a single or more breakpoints are parameters of the model. Goodman et al. [37] proposed a sequential process. To find the model with the optimal number of breakpoints k (k = 0, …, K) that best fits the data, they performed sequential testings; i.e., they compared successively model pairs (with k vs. k + 1 breakpoints) until failing to reject the null hypothesis (no breakpoint against the alternative of a single breakpoint), which made them retain k and not k + 1 breakpoints.

In this work, we investigated more specifically models with a single (ε) and two breakpoints (ε1, ε2). Indeed, we have chosen this limited number of breakpoints since it allows the number of parameters to remain low while ensuring a sufficient flexibility to reflect plausible patterns of changes in background mortality over age. Model 3 with a single (Model 3.1) and two (Model 3.2) breakpoints may then be written, respectively:

$$ {\lambda}_O\left(t|z\right)={\lambda}_E\left(t|z\right)+{\sum}_m\left[{\alpha}_{m1}\ I\left(a+t\le \varepsilon |x=m\right)+{\alpha}_{m2}\ I\left(a+t>\varepsilon |x=m\right)\right]{\lambda}_P^{\ast}\left(a+t|{z}_D\right) $$
(Model  3.1)
$$ {\lambda}_O\left(t|z\right)={\lambda}_E\left(t|z\right)+{\sum}_m\left[\ {\alpha}_{m1}\ I\left(a+t\le {\varepsilon}_1|x=m\right)+{\alpha}_{m2}\ I\left({\varepsilon}_1<a+t\le {\varepsilon}_2|x=m\right)\right. $$
$$ \left.+{\alpha}_{m3}\ I\left(a+t>{\varepsilon}_2|x=m\right)\right]{\lambda}_P^{\ast}\left(a+t|{z}_D\right) $$
(Model  3.2)

To determine the location of breakpoint(s), we retained the strategy of Molinari et al. [34] and Bessaoud et al. [35]. Thus, there are 9 potential locations with a single breakpoint and 36 combinations of potential locations with two breakpoints. The one with the lowest Akaike Information Criterion (AIC) is selected.

To estimate the parameters of the proposed model, standard optimization functions from R software were used to maximize the log-likelihood (programs available on request).

Simulations

Intensive simulations were used to assess the performance of the proposed Model 3 (Models 3.1 and 3.2). We also considered selecting between Models 3.1 and 3.2 using AIC, in order to identify the model favoured by the data, and this model is referred to as Model 4. Comparisons were made with the performance of Estève’s and Touraine’s models.

Design

Each simulation considered N = 1000 samples of n = 2000 subjects. A first life table --considered as ‘incomplete’-- was used to construct a ‘complete’ life table; i.e., a life table adjusted for an additional variable. The complete life table was used to generate TP, the time to death from other causes than cancer (which allows deriving the background mortality), TE, the time to death from cancer (which allows deriving the excess mortality), and a censoring time. These times are assumed to be independent conditionally on z. The observed time to death was considered as the lowest value between TP, TE, and the censoring time.

Within each simulation, several scenarios were considered by varying the impact of the additional variable on the background mortality. The models’ parameters were estimated from the incomplete life table and the estimates of the covariate effects on the excess mortality were used to compare model performances. The performance criteria were: the bias, the relative bias, the empirical coverage rate (ECR), and the root mean squared error (RMSE). Model selection used the AIC which allows model penalization according to the number of parameters to satisfy parameter parsimony. The model with the lowest AIC was the best model.

Simulated data

Patient covariates and excess mortality

The age at cancer diagnosis (a) and the additional variable (x) were the covariates that influenced the excess mortality. a was simulated from a mixture of uniform distributions with 25% of subjects in age class [30–65[, 35% in [65–75[, and 40% in [75–85[. x was simulated as a binary variable with occurrence probability p = 90% or 10%. The baseline excess hazard was the hazard function of a generalized Weibull distribution [38] with parameters (k, λ, θ) = (2, 0.2, 0.5), where k is the shape parameter, λ the scale parameter and θ the location parameter. TE was then simulated using the inverse transformation method with covariate effects βa = 0.3 and βx = − 0.2. The excess mortality was obtained by: λE(t| a, x) = \( {\lambda}_E^0(t){e}^{\beta_aa+{\beta}_xx} \). An administrative censoring was assumed at 6 years, which resulted 40% of censoring rate in generated data. All subjects were considered to be men diagnosed within the same year.

Background mortality

The incomplete life table was the one available from function survexp.us in package survival of R. This table provides the background mortality of the American population by age, sex, and calendar year from 1940 to 2014. For simplicity, the incomplete life table was considered as adjusted for age only and the selected background mortality was that of men in year 1990. As a and x may also influence the background mortality, the complete life table was stratified by a and x and TP was simulated from this complete life table. Various mismatches in the life table were considered by varying the impact of x on the background mortality. This led to six scenarios:

  • Scenario A: No mismatch; i.e., x has no effect on the background mortality.

  • Scenario B: Proportional mismatch; i.e., the two levels of x have proportional effects on the background mortality.

  • Scenario C: Non-proportional crossover mismatch; i.e., the two levels of x have non-proportional effects on the background mortality and the background mortality functions intersect.

  • Scenario D: Non-proportional converging mismatch; i.e., the two levels of x have non-proportional effects on the background mortality and the background mortality functions converge.

  • Scenario E: Non-proportional diverging mismatch; i.e., the two levels of x have non-proportional effects on the background mortality and the background mortality functions diverge.

  • Scenario F: Non-proportional three-level mismatch; i.e., the three levels of x have non-proportional effects on the background mortality.

All models were run with each of these scenarios. Scenario A ensures that Models 3.1 and 3.2 perform well in the absence of an additional variable. Scenario B ensures that Models 3.1 and 3.2 perform well in the situation taken into account by Model 2. The other scenarios ensure that Models 3.1 and 3.2 improve the performance of Model 2 in some realistic cases. Figure 1 illustrates all these scenarios. It shows, in each scenario, the background mortality according to the incomplete life table (solid line) and the background mortalities associated with various levels of the additional variable of the complete life table (dotted or dashed lines). For a much clearer view, an additional figure shows the mismatches in the life table used for the simulations in patients under 65 years (see Additional file 1).

Fig. 1
figure 1

Mismatches (Scenarios A to F) in the life table used for simulation. Note: The solid lines represent the background mortality functions from the incomplete life table. The dotted or dashed lines represent the background mortality functions from the complete life table adjusted for x

Results

Simulation results

Table 1 displays the bias, the relative bias, the ECR, and the RMSE for a and x from all models with Scenarios A to E when the proportion of subjects with x = 1 is 90% (for proportion = 10%, see Additional file 2). The results from Scenario F when the proportions of subjects with x = 1 and x = 2 are respectively 10 and 80% are presented in Additional file 3.

Table 1 Performance criteria stemming from the simulation study with Scenarios A to E

Table 2 displays the percentage of times each model was retained (%AIC) with Scenarios A to E when the proportion of subjects with x = 1 is 90% (for proportion = 10%, see Additional file 4) and Scenario F.

Table 2 Percentage of times each model was retained on the basis of its AIC

Figure 2 shows the boxplots of the estimates of the effects of covariates a and x on the excess mortality in Scenarios A to E when the proportion of subjects with x = 1 is 90% (for proportion = 10%, see Additional file 5, and for Scenario F, see Additional file 6). The true values of the parameters lay on the horizontal line.

Fig. 2
figure 2

Boxplots of the estimates from the simulation study with Scenarios A to E. Note: Panels from top to bottom correspond to Scenarios A to E, respectively. 1, 2, 3.1, 3.2 and 4 correspond to Models 1, 2, 3.1, 3.2 and 4, respectively. Scenarios A to E: Proportion of subjects with x = 1 is 90%

In Scenario A (No mismatch), the incomplete and complete life table were the same (no use of additional variable); they provided a true value of the background mortality. For βa, Models 3.1 and 3.2 provided unbiased estimates and ECRs equal to 91.4 and 90.4% respectively; for βx, they provided biases close to 0 and ECRs equal to 96.9% and 96.6 respectively. Models 1 and 2 yielded practically the same results. Thus, in the absence of additional variable, Models 3.1 and 3.2 are as performant as Model 1 or 2 although they showed a higher variability vs. Model 1. Model 1 performed better than Models 2, 3.1 and 3.2 in terms of AIC; it was selected nearly seven times out of ten (66.70 and 73.99%). Model 4 was also centered around the true generated values and tended to favour Model 3.1 (90.89%) vs. Model 3.2.

In Scenarios B to F below, the incomplete and complete life tables were different (use of an additional variable). The incomplete life table was used for estimation; it provided inaccurate values of the background mortality.

In Scenario B (Proportional mismatch), for βa, Models 3.1 and 3.2 provided unbiased estimates and ECRs close to 90%; for βx, they provided biases close to 0 and ECRs close to 95%. Model 2 provided the same results. Thus, Models 3.1 and 3.2 performed as well as Model 2 though Model 3.1 was selected only a little more than once in two times (50.56%) while Model 3.2 was selected nearly three times out of ten (26.87%), and Model 4 favoured Model 3.1 (89.58%) vs. Model 3.2. In contrast, Model 1 led to biased parameter estimates, poor ECR, and low AIC-based selection.

In Scenario C (Non-proportional crossover mismatch), Models 3.1 and 3.2 provided biases close to 0 and ECRs greater than 90%; they showed also a lower variability than Models 1 and 2. Model 2 provided biased parameter estimates (of βx in particular) and ECRs lower than 90%. It underestimated the effect of x on the excess mortality. Model 1 provided biased parameter estimates, poor ECRs, and low AIC-based selection. It overestimated the effects of the covariates on the excess mortality. As expected, Models 1 and 2 performed poorly while Model 3.1 performed well and was selected nearly three times out of four (74.37%), and Model 4 favoured Model 3.1 vs Model 3.2.

In Scenario D (Non-proportional converging mismatch), for βa, Models 2, 3.1 and 3.2 provided unbiased estimates and ECRs close to 90%, whereas Model 1 provided biased estimates and a poor ECR (40.1%). Models 3.1 and 3.2 provided less biased effects of x on the excess mortality and ECRs greater than that of Model 2. All models underestimated the effect of x on the excess mortality. Models 3.1 and 3.2 presented low variability vs. Model 1 or Model 2 and were selected nearly six times out of ten (63.36%) and four times out of ten (43.05%) respectively. Model 4 favoured Model 3.1 vs Model 3.2.

In Scenario E (Non-proportional diverging mismatch), Models 2, 3.1 and 3.2 provided biases close to 0 and ECRs greater than 90%, whereas Model 1 provided biased estimates, ECRs close to 0 and underestimated the effects of the covariates on the excess mortality. Model 3.1 is always favoured over Model 3.2 (89.97% vs. 10.03%). This scenario highlighted significantly the fact that Model 1 does not apply to life tables stratified by an additional variable.

In Scenario F (Non-proportional 3-level mismatch), Models 3.1 and 3.2 provided less biased effects of the covariates than Model 1 or Model 2 and ECRs greater than 90%. Model 1 provided small ECRs and overestimated the effects of the covariates on the excess mortality, whereas the other three models underestimated them. In addition, Model 3.1 was selected about half the times (51.56%) and favoured over Model 3.2. Furthermore, whenever the levels of the additional variable had small effects on the background mortality (not shown here), the four models gave comparable results.

Applications to population-based data on colorectal cancer

The interest of the proposed model and performance comparison between Models 1 and 2 were tested on population-based data on colorectal cancer from nine French cancer registries of network FRANCIM. This testing analyzed mortality data on 1398 patients with colorectal cancer diagnosed between January 1 and December 31, 1995.

Two applications studied the effects of three prognostic factors on the excess mortality: sex, age at diagnosis, and cancer stage at diagnosis. These applications excluded patients with missing data on any factor and patients aged > 90 years. Age at diagnosis was considered under three categories (≤ 64, 65–74, and 75–90) and cancer stage categories III and IV were merged to balance the number of patients into three categories.

Application 1

Method

In Application 1, Model 1 was used with the ‘complete’ French life table that included covariates age, sex, and calendar year and was considered as the gold standard (Model 1*). The purpose of this model was to provide reference values as in a simulation study. Models 1, 2, and 3 (3.1 and 3.2) were used with the ‘incomplete’ life table that included covariates age and calendar year only; covariate sex was considered as the additional variable.

Results

Table 3 shows the covariates used for the estimation of the excess mortality.

Table 3 Covariate categories of the study’s colorectal cancer patients

Patients’ ages ranged from 21 to 90 years (mean = 69.8) and category sizes were rather balanced. At 10 years post-diagnosis, 664 deaths had occurred (i.e., 50.3% of 1304 included patients). The results of the application of the models are summarized in Table 4. AIC values for Model 3 with a single and two breakpoints were calculated (4230 and 4228, respectively), resulting in the selection of a model with two-breakpoints, located at 83 and 88 years.

Table 4 EHR estimates with data on colorectal cancer using life tables stratified and not stratified by sex

Using the incomplete life table, all three models (Models 1, 2, and 3) found that age and cancer stage at diagnosis were significantly associated with excess mortality. With Model 3, the estimated excess hazard effects (EHR) of these two factors were the closest to the reference values given by Model 1*. With ‘sex’ as an additional variable, Model 1 found wrongly a significant effect (EHR = 0.69 [0.53–0.90]) because this effect was not found with Model 1* (EHR = 0.93 [0.72–1.20]). The estimates of α from Model 2 and Model 3 were broadly consistent with the values obtained with the complete life table because men had a higher and women a lower background mortality than that of the general population from the incomplete life table. Specifically, Model 2 found that, irrespective of age, the background mortalities in men and women were respectively 1.33 and 0.74 times the overall mortality as per the incomplete life table. In contrast, Model 3 showed that the background mortality was different before and after 88 years. Moreover, the AIC values for Model 2 and Model 3 were 4230 and 4228, respectively; and Model 3 was less biased. Model 1 performed the worst in terms of AIC (4238) and had the most biased effects.

Application 2

Method

In addition to the prognostic factors considered in Application 1, the socioprofessional category (SPC) has also shown an impact on the survival of patients with colorectal cancer [39, 40]. The present application considered four SPCs: no occupational activity, clerical and manual workers, farmers, and other occupational activities.

The use of all models with the available French life table that included covariates age, sex, and calendar year. This life table was not stratified by SPC. Thus, SPC was considered as the additional variable.

Results

Patients’ ages ranged from 21 to 90 years (mean = 68.2). Age and stage category sizes were rather balanced; however, clerical or manual workers formed the largest category (Table 3). At 10 years post-diagnosis, 394 deaths had occurred (i.e., 50% of 788 included patients). All models used the inaccurate background mortality from the incomplete life table. The results are summarized in Table 5. The AIC values for Model 3 with single and two breakpoints were the same (2539). We have chosen the model with a single-breakpoint, located at 75 years, which is more parsimonious.

Table 5 EHR estimates with data on colorectal cancer using a life table not stratified by SPC

In this application, all three models (Models 1, 2, and 3) showed that the excess mortality increased significantly with age at diagnosis and cancer stage but no significant difference between men and women. The three models gave close estimates of the effect of age but the effect of cancer stages III-IV (versus I) was greater with Model 1 (EHR = 10.96 [4.70–25.53]) than with Model 2 (EHR = 8.61 [3.40–21.78]) or Model 3 (EHR = 7.33 [2.48–21.62]).

In this application, no model found that additional variable SPC was significantly associated with excess mortality. Model 2 found a lower background mortality in patients with ‘No occupational activity’ and ‘Other occupational activities’ (respectively 0.72 and 0.70 times that provided by the life table) but a higher background mortality in ‘Farmers’ and ‘Clerical and manual workers’ (respectively, 1.10 and 1.05 times that provided by the life table), irrespective of age. In contrast, Model 3 showed that the effect of SPC on the background mortality was not constant over time. Indeed, only patients with “Other occupational activities” had practically the same background mortality before and after 75 years; e.g., ‘Farmers’ had a lower background mortality than the overall mortality from the life table before 75 years but a higher one after that age. In addition, ‘Other occupational activities’ --including intermediate and higher occupations-- had a lower background mortality than the overall mortality from the life table. Furthermore, Model 1 and Model 3 had the same AIC whereas Model 2 performed the worst in terms of AIC (2545).

Discussion

The present work proposes a regression model of excess mortality (Model 3) able to correct for potentially inaccurate background mortality due to the unavailability of a specific variable on population level. It thus provides an interesting alternative to answer epidemiological questions involving a specific variable affecting both the excess mortality and the background mortality in the absence of life table stratified by this variable and when no external information exists and/or is available to construct such a stratified life table. Specifically, it increases the flexibility of Touraine’s model (Model 2) [26] by introducing age-dependent multiplicative parameters through breakpoints. Whenever a currently available life table is not stratified by an additional variable x, Model 3 considers an x-specific age-dependent corrective parameter that multiplies the background mortality. In practice, for the proposed Model 3, with a particular focus on model with a single or two breakpoints, we used a heuristic approach to determine the number and locations of breakpoints [34, 35]. We divide age into segments and calculate the AIC for all combinations of 1, 2, ..., B breakpoints. We retain the one with lowest AIC. As explained above, the choice of this limited number of breakpoints is based on both statistical and epidemiological criteria.

As detailed in the Breakpoint number and location sub-section, other approaches may be used concerning the determination of breakpoint number and location. In addition to graphic and numerical approaches [27,28,29,30,31,32,33, 36, 37], it is also possible to “fix” the number and the location based on prior information, for example when a life table stratified by the additional variable x exist in another country. More generally, the approaches and strategy are similar to those proposed in the framework of the use of spline functions concerning the choice of the number and location of nodes.

In the simulations, Model 3, whether it’s with a single or two breakpoints, showed a good performance; its estimated parameters and ECRs were close to the nominal values. In all scenario, Model 3 with a single-breakpoint has been favoured over two breakpoints. Thus, Model 3 with a single-breakpoint was sufficient. The simulations also showed that Model 3 was as performant as Model 1 in the absence of additional variable and as performant as Model 2 in case of proportional mismatch. Furthermore, Model 3 eliminated or limited the bias in parameter estimates of the excess mortality in several other mismatch scenarios. However, although it has lower bias, it had a higher variability than Model 2, but was better in terms of AIC. This cost of higher variance may be explained by the additional parameters. Indeed, to estimate the effect of the additional variable on the background mortality, Model 3 has M*B additional multiplicative parameters than Model 2 (M: levels of the additional variable; B: number of the breakpoints).

In the two practical applications on registry colorectal cancer data and an ‘incomplete’ life table (obtained by removing a variable from a real complete table), Model 3 proved to be useful; it performed better than Models 1 and 2 vs. gold-standard estimates obtained with a ‘complete’ life table. Note that our results differ slightly from those of Touraine et al. work [26] simply because of minor modifications in the choice of criteria for the inclusion of patients in our study. In the second application (SPC as additional variable), Models 1 and 3 had the same AIC. However, the simulations have shown that these models may give comparable results in situations where the effect of the additional variable on the background mortality is not significant (results not shown). Although the 95% CIs of the multiplicative parameters overlap between the two age categories (before and after the breakpoint), there is no interest in determining whether this difference is really significant, or whether there is a way to test for such differences. Indeed, our goal is to correct inaccurate background mortality in excess hazard models in order to eliminate or limit the bias in estimating the effects of prognostic factors on excess mortality.

In addition, the results obtained with Model 3 are consistent with the literature. Indeed, SPCs ‘Farmers’ and ‘Other occupational activities’ have a lower early background mortality (< 65) than the overall mortality from the life table [41]; Farmers would be healthier than the general population [42, 43]. The present study results showed that, before age 75 years, the working population had a lower background mortality than the overall mortality from the general population. People with “No occupational activity” showed a higher background mortality; this may relate to the ‘Healthy Worker Effect’ (healthy individuals keep being employable) [44]. In this work, the socioprofessional category was defined as the longest occupational activity of each subject. Another but highly debatable choice would be the first subject’s occupation.

Given the present results, Model 3 would improve the results of Model 2 by making it more generalizable; specifically, when the assumption of proportionality is not valid at certain age intervals (e.g. in the American life tables that include ethnicity, Black and White background mortality functions deviate from proportionality and intersect between ages 80 and 90). Nevertheless, Model 3 presents some limitations. First, it was found here essentially suitable for estimating the parameters related to a single and necessarily categorical additional variable with no more than two breakpoints. It would be interesting to carry out a study on three or several breakpoints. Nevertheless, the use of several breakpoints may lead to over-parameterization of the model. Fortunately, in medical research, a low number of breakpoints is usually sufficient [34]. Second, Model 3 is still a piecewise proportional population hazards where the parameters related to the additional variable (used to correct the background mortality) vary with age though they remain constant within intervals. Another interesting work would be the use of smooth or flexible functions, such as splines or penalized splines. Then, a generalization to situations where mismatch in the life table may be due to numerous variables may be attractive. A model with a random effect (i.e., a frailty term) was proposed [45]. This term corrects for the effects of several potentially unavailable or unobservable covariates and may differ between subjects, which is of epidemiological interest. Possible limitation of this random effect model, pointed out by the authors, is due to the challenges in estimating the parameters in case of insufficient sample size (simulations done with data sets of size n = 5000). Another one may come from the difficulty in interpreting the epidemiological effects captured by the fragility term. However, such model tries to answer to another epidemiological question than the one investigated with our Model 3. In line with multiple mismatches in the life table, an interesting perspective would be to use latent class approach to correct background mortality, which would allow a better description of the epidemiological profiles and their impact on expected mortality.

Conclusion

In absence of life table stratified by a specific variable and when no external information exists and/or is available to construct life table stratified by this additional variable, the proposed model is a good approach to correct reliably inaccurate background mortality by introducing multiplicative parameters that depend on age and on an additional variable through breakpoints.

Availability of data and materials

The R codes and data used in this paper are available on request from the authors.

Abbreviations

AIC:

Akaike information criterion

BIC:

Bayesian information criterion

CI:

Confidence interval

CMW:

Clerical and manual workers

CS:

Cancer stage

ECR:

Empirical coverage rate

EHR:

Excess hazard ratio

NOA:

No occupational activity

OOA:

Other occupational activities

RMSE:

Root mean squared error

SMR:

Standardized mortality ratio

SPC:

Socioprofessional category

References

  1. Estève J, Benhamou E, Raymond L. Statistical methods in cancer research. Volume IV. Descriptive epidemiology. IARC Sci Publ. 1994;128:1–302.

  2. Danieli C, Remontet L, Bossard N, et al. Estimating net survival: the importance of allowing for informative censoring. Stat Med. 2012;31:775–86.

    Article  PubMed  Google Scholar 

  3. Ederer F, Cutler SJ, Goldenberg IS, et al. Causes of death among long-term survivors from breast cancer in Connecticut. J Natl Cancer Inst. 1963;30:933–47.

    Google Scholar 

  4. Hakulinen T, Teppo L. Causes of death among female patients with cancer of the breast and intestines. Ann Clin Res. 1977;9:15–24.

    CAS  PubMed  Google Scholar 

  5. Estève J, Benhamou E, Croasdale M, et al. Relative survival and the estimation of net survival: elements for further discussion. Stat Med. 1990;9:529–38.

    Article  PubMed  Google Scholar 

  6. Giorgi R, Abrahamowicz M, Quantin C, et al. A relative survival regression model using B-spline functions to model non-proportional hazards. Stat Med. 2003;22:2767–84.

  7. Dickman PW, Sloggett A, Hills M, et al. Regression models for relative survival. Stat Med. 2004;23:51–64.

    Article  PubMed  Google Scholar 

  8. Remontet L, Bossard N, Belot A, et al. An overall strategy based on regression models to estimate relative survival and model the effects of prognostic factors in cancer survival studies. Stat Med. 2007;26:2214–28.

    Article  CAS  PubMed  Google Scholar 

  9. Ederer F, Heise H. Instructions to IBM 650 programmers in processing survival computations. Methodological note No.10. End Results Evaluation Section. Technical report. Bethesda: National Cancer Institute; 1959.

    Google Scholar 

  10. Pohar-Perme M, Stare J, Estève J. On estimation in relative survival. Biometrics. 2012;68:113–20.

    Article  Google Scholar 

  11. Rebolj Kodre A, Pohar-Perme M. Informative censoring in relative survival. Stat Med. 2013;32:4791–802.

    Article  PubMed  Google Scholar 

  12. Komukai S, Hattori S. Doubly robust estimator for net survival rate in analyses of cancer registry data. Biometrics. 2016;73:124–33.

    Article  PubMed  Google Scholar 

  13. Kogevinas M, Marmot MG, Fox AJ, et al. Socioeconomic differences in cancer survival. J Epidemiol Community Health. 1991;45:216–9.

    Article  CAS  PubMed  Google Scholar 

  14. Feinstein JS. The relationship between socioeconomic status and health: a review of the literature. Millbank Q. 1993;71:279–322.

    Article  CAS  Google Scholar 

  15. Woods LM, Rachet B, Riga M, et al. Geographical variation in life expectancy at birth in England and Wales is largely explained by deprivation. J Epidemiol Community Health. 2005;59:115–20.

    Article  PubMed  Google Scholar 

  16. Dickman PW, Auvinen A, Voutilainen, et al. Measuring social class differences in cancer patient survival: is it necessary to control for social class differences in general population mortality? A Finnish population-based study. J Epidemiol Community Health. 1998;52:727–34.

    Article  CAS  PubMed  Google Scholar 

  17. Grafféo N, Jooste V, Giorgi R. The impact of additional life-table variables on excess mortality estimates. Stat Med. 2012;31:4219–30.

    Article  PubMed  Google Scholar 

  18. Morfeld P, McCunney RJ. Bayesian bias adjustments of the lung cancer SMR in a cohort of German carbon black production workers. J Occup Med Toxicol. 2010;5:23.

    Article  PubMed  Google Scholar 

  19. Jeffreys M, Stevanovic V, Tobias M, et al. Ethnic inequalities in cancer survival in New Zealand: linkage study. Am J Public Health. 2005;95:834–7.

    Article  PubMed  Google Scholar 

  20. Morris M, Woods LM, Rachet B. A novel ecological methodology for constructing ethnic-majority life tables in the absence of individual ethnicity information. J Epidemiol Community Health. 2015;69:361–7.

    Article  PubMed  Google Scholar 

  21. Eloranta S, Lambert PC, Cavalli-Björkman N, et al. Does socioeconomic status influence the prospect of cure from colon cancer - a population-based study in Sweden 1965–2000. Eur J Cancer. 2010;46:2965–72.

    Article  CAS  PubMed  Google Scholar 

  22. Rachet B, Ellis L, Maringe C, et al. Socioeconomic inequalities in cancer survival in England after the NHS cancer plan. Br J Cancer. 2010;103:446–53.

    Article  CAS  PubMed  Google Scholar 

  23. Ellis L, Coleman MP, Rachet B. The impact of life tables adjusted for smoking on the socio-economic difference in net survival for laryngeal and lung cancer. Br J Cancer. 2014;111:195–202.

    Article  CAS  PubMed  Google Scholar 

  24. Bower H, Andersson TM-L, Crowther MJ, et al. Adjusting expected mortality rates using information from a control population: an example using socioeconomic status. Am J Epidemiol. 2018;187:828–36.

    Article  PubMed  Google Scholar 

  25. Cheuvart B, Ryan L. Adjusting for age-related competing mortality in long-term cancer clinical trials. Stat Med. 1991;10:65–77.

    Article  CAS  PubMed  Google Scholar 

  26. Touraine C, Grafféo N, Giorgi R, et al. More accurate cancer-related excess mortality through correcting background mortality for extra variables. Stat Methods Med Res. 2020;29(1):122–36. https://doi.org/10.1177/0962280218823234.

    Article  CAS  PubMed  Google Scholar 

  27. Kunst AE, Looman CWN, Mackenbach JP. Outdoor air temperature and mortality in the Netherlands: a time series analysis. Am J Epidemiol. 1993;137:331–41.

    Article  CAS  PubMed  Google Scholar 

  28. Ulm K. A statistical method for assessing a threshold in epidemiological studies. Stat Med. 1991;10:341–9.

    Article  CAS  PubMed  Google Scholar 

  29. Rigby RA, Stasinopoulos DM. Detecting break points in the hazard function in survival analysis. In: van der Heijden PGM, Jansen W, Francis B, et al., editors. Statistical Modelling. Amsterdam: Elsevier Science Publishers B.V; 1992. p. 303–11.

    Google Scholar 

  30. Küchenhoff H. An exact algorithm for estimating breakpoints in segmented generalized linear models. Comput Stat. 1997;12:235–47.

    Google Scholar 

  31. Barry D, Hartigan JA. A Bayesian analysis for change point problems. J Am Stat Assoc. 1993;88:309–19.

    Google Scholar 

  32. Gössl C, Küchenhoff H. Bayesian analysis of logistic regression with an unknown change point and covariate measurement error. Stat Med. 2001;20:3109–21.

    Article  PubMed  Google Scholar 

  33. Braun JV, Braun RK, Müller HG. Multiple changepoint fitting via quasi likelihood, with application to DNA sequence segmentation. Biometrika. 2000;87:301–14.

    Article  Google Scholar 

  34. Molinari N, Daures J-P, Durand J-F. Regression splines for threshold selection in survival data analysis. Stat Med. 2001;20:237–47.

    Article  CAS  PubMed  Google Scholar 

  35. Bessaoud F, Daures J-P, Molinari N. Free knot splines for logistic models and threshold selection. Comput Methods Prog Biomed. 2005;77:1–9.

    Article  CAS  Google Scholar 

  36. Muggeo MR. Estimating regression models with unknown breakpoints. Stat Med. 2003;22:3055–71.

    Article  PubMed  Google Scholar 

  37. Goodman MS, Li Y, Tiwari RC. Detecting multiple change points in piecewise constant hazard functions. J Appl Stat. 2011;38:2523–32.

    Article  PubMed  Google Scholar 

  38. Mudholkar GS, Srivastava DK, Kollia GD. A generalization of the Weibull distribution with application to the analysis of survival data. J Am Stat Assoc. 1996;91:1575–83.

    Article  Google Scholar 

  39. Dejardin O, Remontet L, Bouvier AM, et al. Socioeconomic and geographic determinants of survival of patients with digestive cancer in France. Br J Cancer. 2006;95:944–9.

    Article  CAS  PubMed  Google Scholar 

  40. Lejeune C, Sassi F, Ellis L, et al. Socio-economic disparities in access to treatment and their impact on colorectal cancer survival. Int J Epidemiol. 2010;39:710–7.

    Article  PubMed  Google Scholar 

  41. Programme Cosmop: surveillance de la mortalité par cause selon l’activité professionnelle. Analyse de la mortalité et des causes de décès par secteur d’activité de 1976 à 2005. Saint-Maurice: Santé publique France. 2018. p 49.

  42. Levêque-Morlais N, Tual S, Clin B, et al. The AGRIculture and CANcer (AGRICAN) cohort study: enrolment and causes of death for the 2005–2009 period. Int Arch Occup Environ Health. 2015;88:61–73.

    Article  PubMed  Google Scholar 

  43. Lemarchand C, Tual S, Levêque-Morlais N, et al. Cancer incidence in the AGRICAN cohort study (2005-2011). Cancer Epidemiol. 2017;49:175–85.

    Article  PubMed  Google Scholar 

  44. Baillargeon J. Characteristics of the healthy worker effect. Occup Med. 2001;16:359–66.

    CAS  PubMed  Google Scholar 

  45. Rubio FJ, Rachet B, Giorgi R, et al. On models for the estimation of the excess mortality hazard in case of insufficiently stratified life tables. Biostatistics. 2019. Epub ahead of print. https://doi.org/10.1093/biostatistics/kxz017.

Download references

Acknowledgments

This work was granted access to the HPC resources of Aix-Marseille Université financed by the project Equip@Meso (ANR-10-EQPX-29-01) of the program “Investissements d’Avenir” supervised by the Agence Nationale de la Recherche. The authors thank Jean Iwaz for the revision of the final draft of this manuscript.

Funding

The project that led to this publication has received funding from A*Midex (Aix-Marseille Initiative of excellence, a French “Investissements d’Avenir” program).

Author information

Authors and Affiliations

Authors

Consortia

Contributions

RG and RDM designed the study and drafted the article. RDM performed the programming for the simulation study. JAG and NG contributed to methods and materials, analysis, interpretation of the findings and revised the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Robert Darlin Mba.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1.

Mismatches in the life table used for simulations in patients under 65 years old.

Additional file 2.

Performance criteria stemming from the simulation study with Scenarios A to E.

Additional file 3.

Performance criteria stemming from the simulation study with Scenario F.

Additional file 4.

Percentage of times each model was retained on the basis of its AIC.

Additional file 5.

Boxplots of the estimates from the simulation study with Scenarios A to E.

Additional file 6.

Boxplots of the estimates from the simulation study with Scenario F.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Mba, R.D., Goungounga, J.A., Grafféo, N. et al. Correcting inaccurate background mortality in excess hazard models through breakpoints. BMC Med Res Methodol 20, 268 (2020). https://doi.org/10.1186/s12874-020-01139-z

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12874-020-01139-z

Keywords