 Research article
 Open Access
 Open Peer Review
 Published:
Comparison of modelbuilding strategies for excess hazard regression models in the context of cancer epidemiology
BMC Medical Research Methodology volume 19, Article number: 210 (2019)
Abstract
Background
Large and complex populationbased cancer data are becoming broadly available, thanks to purposeful linkage between cancer registry data and health electronic records. Aiming at understanding the explanatory power of factors on cancer survival, the modelling and selection of variables need to be understood and exploited properly for improving modelbased estimates of cancer survival.
Method
We assess the performances of wellknown model selection strategies developed by Royston and Sauerbrei and Wynant and Abrahamowicz that we adapt to the relative survival data setting and to test for interaction terms.
Results
We apply these to all male patients diagnosed with lung cancer in England in 2012 (N = 15,688), and followedup until 31/12/2015. We model the effects of age at diagnosis, tumour stage, deprivation, comorbidity and emergency presentation, as well as interactions between age and all of the above. Given the size of the dataset, all model selection strategies favoured virtually the same model, except for a nonlinear effect of age at diagnosis selected by the backwardbased selection strategies (versus a linear effect selected otherwise).
Conclusion
The results from extensive simulations evaluating varying model complexity and sample sizes provide guidelines on a model selection strategy in the context of excess hazard modelling.
Background
Populationbased cancer datasets have become richer in recent years. Improved completeness of key variables and additional information from linkages with other datasets (secondary care management data, specialised registry data, treatment data) have both contributed to enhance the quality and utility of data. Furthermore, longstanding datasets make possible the analysis of longterm trends and survival probabilities can be estimated further away from the date of diagnosis.
Analysis of populationbased cancer survival has greatly benefitted from this data enrichment. However, when modelling the effect of covariates on survival, special care should be taken when assuming or relaxing assumptions of a linear effect or an effect constant in time (the proportional hazards PH assumption). Thus, a modelling strategy is required. Aside from the timetoevent setting, many strategies are developed for variable selection and tests for nonlinearity of continuous variables, traditionally based on backward, forward or stepwise algorithms. In the timetoevent field in general, and in populationbased cancer survival analyses in particular, less attention has been devoted on the selection of the functional form of predictor variables [1, 2]. Indeed, the effects of variables are commonly assumed linear and constant in time, assumptions likely violated for many predictors of cancer survival, especially with longterm followup.
Machine learning algorithms have focussed on variables selection in scenarios where tens or thousands of variables are available [3]. These methods mainly focus on factor analysis and random survival forests [4]. In the context of populationbased data, the number of variables remains low or moderate, but the functional forms of their effects (nonlinear and/or timedependent), as well as their possible interactions need to be carefully examined. Model building fits within three different purposes: descriptive, explanatory and predictive [5]. Our aim here is to describe, measure and quantify accurately the effects of relevant (active) variables while excluding spurious effects.
Some authors [6,7,8] have shown the importance of taking account as well as testing both nonlinearity and timedependency of effects simultaneously, when modelling timetoevent data, in order to get accurate modelbased estimates of survival.
We identify two modelbuilding strategies, developed relatively recently, that offer a systematic and comprehensive approach to the selection of predictors’ effects for survival data. One is devised by Sauerbrei and colleagues using fractional polynomials (MFPT) [9] and further adapted for restricted cubic splines (MVRS) [10] and for the inclusion of interactions (MFPI and MFPIgen) [11, 12]. The second one is proposed by Wynant and Abrahamowicz [13], and will be referred to as W&A. These strategies are formulated and tested in the general timetoevent context, in which overall mortality patterns are modelled. Aiming to identify predictors of cancer survival, we focus here on modelling the excess hazard, which is the main quantity of interest in populationbased cancer studies [14,15,16].
Our first aim is to compare and illustrate the use of these modelbuilding strategies (namely MVRS, W&A), in the context of excess hazard regression models. We also propose an extension of those two strategies (called adapted MVRS, aMVRS and adapted W&A, aW&A) for handling interactions between prognostic factors, and compare them to MFPIgen, intended for use with observational data. The performance of these strategies is evaluated in a simulation study mimicking the cancer survival experience of 2000 lung cancer patients diagnosed in 2012 and followed up to the 31/12/2015. We model the effects of explanatory factors on lung cancer survival for the whole cohort of patients diagnosed with lung cancer in 2012. We provide some guidelines over variable and effect selection, based on the simulations.
Methods

a.
The study context: modelling excess mortality
Our focus is on the excess mortality hazard and the corresponding net survival. The excess mortality hazard is the hazard experienced by cancer patients over and above their background (i.e. expected) mortality hazard due to causes other than the cancer under study. Net survival is derived from the excess hazard and represents the survival experienced by cancer patients under the assumption that they could only die from cancer [17]. Net survival therefore does not depend on the other causes of death, and it is of interest for comparison purposes between countries or periods within a country [18]. In the absence of reliable information on the cause of death, the expected mortality is estimated by the mortality observed in the general population from which patients come from (aka relative survival setting). These life tables are typically defined by age, sex, calendar period, but can also include additional variables such as socioeconomic status and ethnicity. Net survival can be estimated nonparametrically [17] or through semiparametric [19] or fully parametric [20,21,22,23,24] excess hazard regression models. Parametric and nonparametric approaches have their own advantages and disadvantages. For the latter, when net survival needs to be estimated in subgroups, it reduces precision and may lead to unstable estimates. Although there is no assumption relative to the functional forms of effects of variables, these effects cannot be estimated directly. Furthermore, the consistent estimator of net survival proposed by PoharPerme and colleagues [17] is unconstrained and thus may show a nondecreasing behaviour in the tails, violating the basic assumptions of survival models. For parametric approaches, the challenges include (a) proper modelling of the baseline excess hazard function, (b) inclusion of potential timedependent effect of categorical factors, (c) potential nonlinear and timedependent effects of the continuous variables as well as (d) interactions between prognosis factors.
Here, we will use flexible regression models with restricted cubic splines functions for modelling nonlinear and timedependent effects on the log excess hazard scale [23, 25]. The effects of the variables that define the life tables need to be included in the modelling of the excess hazard to produce consistent net survival estimates [17, 20]. Thus, at individual level, the excess mortality hazard λ_{E}(t, x) is linked to the overall λ(t, x) and expected (population) mortality hazards λ_{P}(a + t, y + t, z) as follows:
where z is a subset of the set of variables x, corresponding to the variables defining the life tables, in addition to age a + t and year y + t (a and y being the age at and year of diagnosis, respectively). The population mortality hazard is considered to be known, and we are interested in estimating λ_{E}(t, x) at time t after diagnosis.
In a general form, the excess hazard regression models considered in our work could be written as follows with two prognostic variables x_{1}, continuous, and x_{2}, categorical (with J categories, j = 1, …, J):
where λ_{0}(t) is the baseline excess hazard (defined by using a spline function on the log scale), f(x_{1}) = α_{1}x_{1} if x_{1} is modelled with a linear (L) effect, and f a spline function of x_{1} if x_{1} is modelled with a nonlinear (NL) effect; β_{1}(t) and β_{2, j}(t) are spline functions of t if x_{1} and x_{2} are modelled with timedependent (TD) effects (the more complicated case), or β_{1}(t) = β_{1} and β_{2, j}(t) = β_{2, j}, j = 1, …, J if not (i.e. PH, the simplest case). For the categorical variable x_{2}, we considered a “joint” parameterisation of its effect: either all J − 1 dummy variables are timedependent, or none. To simplify notation later, we define \( {\boldsymbol{\upbeta}}_{\mathbf{2}}(t)\ast g\left({x}_2\right)=\sum \limits_{j=2}^J{\beta}_{2,j}(t)\ast {\mathbbm{I}}_{\left\{{x}_2=j\right\}} \); lastly \( {\mathbbm{I}}_{\left\{{x}_2=j\right\}} \) defines an indicator variable (equal to 1 when x_{2} = j, 0 otherwise).
 b.
Model selection strategies
The MVRS strategy
MVRS is based on an iterative forward selection of variables and increasingly complex functional forms of effects [10]. The modelbuilding proceeds in three steps: (a) the first step focusses on the presence of a variable’s effect, and its possible nonlinearity in the case of continuous predictors, while assuming proportionality of hazards for all variables. The iterative process loops through all variables from most to least significant, until no effect is removed or added. (b) In the second step, nonproportionality of hazards is explored by restricting the followup time to the time until the median time of observed events on which step (a) is performed and additional effects may be retained. (c) The third step consists of testing the nonproportionality of all effects selected in (a) and (b) in a forward stepwise fashion. The likelihood ratio test is used for evaluating significant effects, with a prefixed significance level (usually 5%).
The W&A strategy
W&A advocate for the use of an iterative backward elimination of nonsignificant nonlinear and timedependent effects [13]. From the most complex model, including all possible nonlinear and timedependent effects, each nonlinear and timedependent effect is tested in turn using likelihood ratio test, and the effect corresponding to the highest pvalue (above 5%) is removed. From this new model, we test again each remaining nonlinear and timedependent effect in turn, and repeat those steps until all effects kept are significant. The final model is found when all tests yield pvalues under 5%.
There are several structural differences in the approaches described above. Firstly, W&A advocates for simultaneous tests of nonlinear and timedependent effects, and the effects are removed one by one, starting from the smallest. By contrast, the MVRS strategy establishes a hierarchy and investigates possible nonlinear effects prior to testing timedependency of the selected effects. The simultaneous tests of effects in W&A may influence subsequent selections of nonlinear and/or timedependent effects. In MVRS, the selection of nonlinear effects occurs in the first step, which may well influence the later selection of timedependent effects, but the selection of timedependent effects will not affect retention of nonlinear effects. Secondly, the initial models considered are different and lead to backward (in the case of W&A) or forward (MVRS) selection of variables.
Strategies in the relative survival setting
In both strategies, the main life table variables (age, sex, year and deprivation) are forced into the models, as recommended for excess hazard regression modelling [14, 17, 20]. For the nonlife table variables linearity and timedependency and overall effects are tested so the variables could be completely removed from the set of predictors.
Extensions of the strategies for testing for interactions
The authors of MVRS also consider interactions between variables retained, once the main effects have been selected [11]. MFPI and MFPIgen are defined to consider categoricalbycontinuous interactions and continuousbycontinuous interactions respectively, even though (from our understanding) they do not test for nonproportionality of the interaction terms [9].
We propose to adapt the original W&A and MVRS strategies to include tests for the form and presence of interactions in the same fashion that they already test for the functional form and inclusion of each variable.
There are three types of possible interactions: between two continuous variables, between a continuous and a categorical variable, and between two categorical variables. We focus on continuous–bycategorical interaction, and the strategies will need to test whether or not the interaction is needed and if it is timedependent.
The general form of the excess hazard model is as follows, with x_{1} continuous and x_{2} categorical (with J categories j = 1, …, J):
with all functions as defined above.
The adapted version of MVRS, aMVRS, tests for each interaction in the three steps presented earlier: (a) joint test of the interaction factors, i.e. test for β_{3} = 0; (b) In the restricted followup time (until the median time of observed events) significance test for β_{3} = 0; (c) If β_{3} ≠ 0 in either (a) or (b), test timedependence of the interaction, i.e. β_{3}(t) = β_{3}.
The adapted version of the W&A algorithm, aW&A, tests for each interaction in the same way it tests for the effects of main variables: it first tests for timedependent effect of the interaction, i.e. β_{3}(t) = β_{3}, and then, if a timefixed effect is favoured, it tests for the main effect of the interaction β_{3} = 0.
The MFPIgen algorithm only considers interactions in a final step, after selecting the main effects of variables in the usual steps (a)(c). It tests for β_{3} = 0. In all algorithms the forms of the interactions, f and g are defined by the form of the main variables x_{1} and x_{2} as they are modelled when the interaction is considered.
In the case of interactions with categorical variables, the presence of the interaction could be tested in two different ways: overall (called joint test [26]), or each level of the interaction separately. Here we only test the interactions as one effect, such that all factors relating to one interaction would be removed/included when testing for their inclusion. In the algorithms, the user specifies which interaction terms are worth investigating. Specific significance levels for the tests related to interactions may be chosen as in MVRS. Additional file 1 details how the algorithms are adapted to testing for interactions.
 c.
Simulation of biologically plausible lung cancer survival data
Data generation and simulations design
We use the observed survival time and vital status of the full cohort of lung cancer patients (N = 17,597), evaluated on the 31st December 2015, to obtain the regression coefficients of an excess hazard regression model. The large sample size enables detection and precise estimation of small effects. These coefficients are used for simulating cancer survival times, as detailed in formulas (A)(D) below. From this excess hazard regression model, the cancer survival time Tc is generated using the inverse transform method [27, 28].
For the data design, we randomly extract 2000 men diagnosed with lung cancer in England in 2012 from the English populationbased cancer registry, among those with valid information on stage at diagnosis. We kept the information on their age at diagnosis (continuous variable), their level of deprivation (categorical variable with 5 levels of increasing deprivation measured by the income domain of the Index of Multiple Deprivation [29]), and their stage of cancer at diagnosis (categorical variable with 4 levels of increasing severity based on the Tumour, Nodes, Metastasis classification [30]). The relatively small sample size for populationbased data will enable us to test the practical performances of the algorithm in a setting with low censoring rate (15%) but small number of patients (relative to standard population studies). We repeated this for a larger sample of 5000 cancer patients to study the sensitivity of the model selection strategies on the number of events. By default, all results are presented for the samples of 2000 patients, except when clearly mentioned.
We devise four simulation scenarios, representing increasingly complex excess hazard regression models (see Box 1):
 (A)
Model with linear and proportional effect of age, and proportional effects of stage and deprivation, without interaction

(B)
Model with linear and proportional effect of age, and proportional effect of stage, deprivation and an interaction between age and stage

(C)
Model with nonlinear and timedependent effects of age, timedependent effects of stage, and proportional effects of deprivation, without interaction

(D)
Model with nonlinear and timedependent effects of age, timedependent effects of stage, proportional effects of deprivation and a proportional interaction between age and stage
In the formulas above, associated to scenarios AD, f denotes a restricted cubic splines function with 2 degrees of freedom, i.e. 1 internal knot placed at the median age of the patients’ cohort, λ_{0}(ln(t)) is a restricted cubic spline function of time, with up to 3 degrees of freedom, i.e. 2 internal knots placed at the tertiles of the distribution of times to death.
Time to death from other causes Tp is generated assuming a piecewise exponential hazard obtained from general population life tables detailed by month of age, sex, calendar month and deprivation level [20]. The censoring time C is evaluated on 31/12/2015. The final observed followup time for each individual is defined as T = min (Tc, Tp, C), with the corresponding vital status indicator δ (i.e., δ = 0 for censored observations and δ = 1 for death).
For each scenario (AD), we simulate 250 datasets, and we utilise the survsim command in Stata [28] for simulating cancer survival times in the scenarios described above. Close to 90% of patients die in the first four years after diagnosis, classifying lung cancer in the poorprognosis cancers with low censoring rate.
Analysis of simulated data
The classical algorithms, W&A and MVRS, are run on scenarios A and C, while the algorithms extended to testing for interactions, aW&A and aMVRS, are run on scenarios B and D. MFPIgen is also tested on scenarios B and D. All excess hazard regression models are fitted using the strcs command in Stata [25], as described in section 2.a.
 d.
Indicators used for comparing the modelbuilding strategies
One additional binary variable not contained in the life tables and absent from the original simulation models is added when testing the modelbuilding strategies. For each scenario, we compare the models selected by each strategy to the original effects used in the simulation with the following indicators.
Firstly, we summarise the proportions of models that select each variable with their nonlinear or timedependent effects for each algorithm. We also study the confounding and selfconfounding effects: the impact of misspecifying one of the components (TD, NL, interactions) of the functional form of a covariable on its other components or on the selection of such components for other variables. We also calculate the proportion of selected models that contain or are exactly the simulated models for each strategy.
Furthermore we provide sensitivity (true positive) and specificity (true negative) values, as defined below, looking at the number of correctly selected effects and the number of correctly unselected effects over the number of active and inactive effects [31]. Both sensitivity and specificity tend to reach 1 for a good estimator:
Then, for each model building strategy we plot the average of the 250 stagespecific cohort net survival curves and compare them to the true net survival curve. We quantify this comparison by calculating the proportion of the Area Between Curves through time, pABCtime [32]. pABCtime represents the area between each individual net survival curve (or the average of the 250 net survival curves) and the true generating net survival curve (the reference function). It is expressed as a proportion of the area under the true net survival curve (area under the reference function). A pABCtime of 0 % means that the cohort net survival estimates under investigation are in perfect agreement with the true initial observed effect.
For any function f, let us assume that the true generating function \( \overset{\ast }{f} \)and the estimated function \( \hat{f} \) cross at time t^{∗}, ABCtime is defined as
and pABCtime as
pABCtime is also calculated for the excess hazard curves estimated for given patients’ factors and for the effects of age, deprivation, and stage comparing the possibly timedependent estimated HR curves to the originally simulated HR. In such instances, \( \hat{f} \) represent the excess hazard, \( \hat{f}(u)={\lambda}_E\left(u, age, stage, dep\right) \) or excess hazard ratio, \( \hat{f}(u)=\exp \left(\hat{\beta}(u)\right) \).
We also provide bias of effects, at specific time points t_{k}, which are the average bias over all samples (M = 250) between the estimated (possibly timedependent) effects of age, stage and deprivation and their simulated effects. We specify monthly t_{k}, from diagnosis through to the end of followup (4 years):
 e.
Application
We apply the five modelselection strategies (MVRS, W&A, MFPIgen, aMVRS, and aW&A) to our full cohort of 15,688 men diagnosed with nonsmall cell lung cancer in 2012 in England and followedup until 31/12/2015. All patients had a minimum potential followup of 3 years. Patient’s information on age, deprivation, survival time and vital status is enhanced by information on stage at diagnosis [33] coded using the TNM system (IIV), emergency route to diagnosis (binary variable) [34], comorbidity status defined after ascertainment of hospital episodes in the 6 months to 6 years prior to diagnosis (binary variable) [35]. The model building strategies test the main effects as well as interactions between age at diagnosis and all other covariates.
All model building strategies yield very similar models (Table 1): no main effect is removed, timedependent effects of stage, comorbidity and emergency presentation are kept, and when tested, interactions between age and comorbidity is removed by the MVRS algorithm and age and comorbidity and age and emergency presentation by the aW&A and MFPIgen algorithms. Nonlinear timedependent effects of age are retained by the W&A and aW&A algorithms in comparison to linear time dependent effects of age retained in all other model selection algorithms.
Figure 1 illustrates the impact the different selected interactions and linearity/nonlinearity of age have on the estimated net survival probabilities for two patients, aged 60 and 80 respectively with the values of other variables set (i.e. stage III, nonemergency presentation, no comorbidity, least deprived). The curves for W&A and MVRS overlap. The selection of interactions in the model impacts the estimated individual excess hazard and cancer survival: there are smaller differences in excess hazard between patients aged 60 and 80 when no interactions are modelled, compared to when interactions are considered. We superimposed the nonparametric estimator of net survival (red curves) estimated for the 165 patients aged ]50–70[ years (mean age 64) and the 130 patients aged ]75–85[ years at diagnosis (mean age 79), with nonemergency presentation, stage III disease and from the least deprived group of the population. The nonparametric net survival estimates are generally lower than all modelbased estimates from 1 year (age 80) and 2.5 years (age 60) after diagnosis. At the start of followup, the nonparametric estimates tend to resemble the modelbased estimates without interaction terms.
These differences at individual level do not however impact the overall cohort estimate of net survival as shown by the hardly distinguishable curves in Fig. 2, similar to the nonparametric estimator of net survival.
Results
The four simulated scenarios represent increasingly complex but realistic excess hazard models, derived from observed records of lung cancer patients. To assess how realistic these scenarios are, we compare the modelbased cohort estimates of net survival (using the model used for each simulated scenario) to the nonparametric PoharPerme estimates (Additional file 2) on the original, observed data. All scenarios show reasonable stagespecific cohort net survival estimates. Scenarios A and B underestimate net survival until 12–24 months for patients diagnosed at stages IIII because of the simple effects modelled. Scenarios C and D include nonproportional effects of the main factors and estimate stagespecific cohort net survival very neatly. The characteristics of the patients used in the simulations are presented in Additional file 3. Patients in stage IV comprise half of the sample. There is a decreasing average age with increasing stage at diagnosis. The distribution of patients by deprivation group is skewed towards more deprived groups, and a third of the patients have the trait of the extra binary variable.

(a)
Performances of the modelbuilding strategies in selecting variables and their effects
Original algorithms – scenarios A and C (no interaction)
In scenario (A), both algorithms led to almost identical selection of effects (Fig. 3, Table 2). The only difference is the higher proportion of timedependent effects of the extra variable, 5.6% vs. 0.8%, selected with W&A compared to MVRS. In scenario (C), albeit small there are more differences in the effects selected between MVRS and W&A. W&A tends to (rightly) select more models that include timedependent effects of stage (100% vs. 96.8%) and age (40.4% vs. 36.4%). Nonlinear effects of age are more often selected by MVRS (45.2%) than by W&A (34.4%). Overall, the effect of stage is always rightly kept in the final selected models, by all algorithms, and the extra binary variable appears (wrongly) in only 7.2–8.8% of models (Fig. 3, Additional file 4).
All selected models contain the true simulated model for scenario A but the proportions drop to 69.6% (MVRS) and 70.4% (W&A) of models that are the exact simulated model. Similarly in the slightly more complex scenario (C), 10.8% of models contain, and 8.8% of models are, the true model using MVRS model selection, vs. 6.0 and 5.2% of W&A models, respectively (Table 2). This drop in proportions between scenarios A and C reflects the high proportion of models with a time dependent effect of linear age, in other words the low proportion of models with a timedependent effect of nonlinear age. This is explained by the small sample size and the relatively small magnitude of the nonlinear and timedependent effects of age (Additional file 5). Higher number of lung cancer patients leads to higher proportions of selected models that contain or are exactly the generated model (Table 2) due to higher proportions of models capturing the nonlinearity and timedependency of age (Fig. 3).
Sensitivity and specificity are high for scenario A, and are not impacted by an increasing sample size. They are relatively high for scenario C, with a slight increase in sensitivity (0.74 to 0.78–0.80) with an increasing sample size (Table 2).
Algorithms adapted to models with interactions –scenarios B and D
The adapted (aMVRS, aW&A) and MFPIgen algorithms correctly keep the main effects in the final models (Fig. 4). 28% of models selected using aW&A identify the nonlinearity of age in D, whereas 34–40% of the aMVRS and MFPIgen algorithms retain the nonlinearity of age. The aW&A algorithm tends to keep higher proportions of timedependent effects of deprivation, of the binary variable and of interactions than the other two algorithms. aMVRS and aW&A also lead to 10–21% of interactions wrongly selected. The proportions of the interaction agestage rightly kept are at or just over 30% for scenario B and up to 71% (aW&A) for scenario D. The MFPIgen algorithm is able to keep in valid interaction in 29.6% (B) and 50.8% (D) of the final models while spurious interactions are rejected in over 94% of final models.
Nonlinearity and timedependency of age in scenario D are retained in just over a quarter of models selected by aW&A, 6–20% less than the proportions of models selected by aMVRS and MFPIgen that contain these characteristics of age. Increased sample size to N = 5000 is beneficial for raising the detection of the agestage interaction in B for aMVRS (68.3%) and aW&A (69.2%), and raising detection of nonlinearity and timedependency of age in D for all three algorithms (Fig. 4).
The proportions of models that contain the true generating model lie between 29.6% (MFPIgen) and just over 35% (aMVRS and aW&A) for scenario B, and between 2.4% (aMVRS) and 4.8% (aW&A) for scenario D. For scenario B, those proportions correspond to the proportion of models with an age by stage interaction, and therefore increase with increasing sample size for aMVRS (74.5% for B and 43.5% for D when N = 5000) and aW&A (72.3% for B and 17.4% for D when N = 5000). For scenario D, this is the proportion of models with an interaction between a nonlinear effect of age and stage. Only 14.4–14.8% (scenario B) and 1.6–2.8% (scenario D) are the exact simulated models. These proportions increase to 16–36.8% (scenario D) when small effects are not considered, due to the relatively small sample size, or when the sample size is increased to 5000.
Sensitivity and specificity are around and over 0.8 for scenario B and are stable to increased sample size. Sensitivity is just over 0.5, and specificity between 0.75 and 0.85 for scenario D, with slight improvement in sensitivity with increased sample size (Table 2).
Impact of misselection of effects on other effects
In scenario (A) and (C), W&A seems to suffer more from confounding and selfconfounding (Additional file 4). For example, when the extra binary variable is selected in (C), the proportion of models with timedependent effects of deprivation and/or age are hardly changed with MVRS, but they increase with W&A to 16.7% (+ 12.3%) and 55.6% (+ 15.2%) respectively. (Additional file 4).
There are hardly any confounding or selfconfounding effects in the MFPIgen algorithm. Misspecification of timedependent effects only has minimal confounding impact on the other effects selected using the aMVRS algorithm. This is due to the twostep structure of the algorithm (Additional file 4).
In the aW&A algorithm, selection of complex forms (e.g. timedependent effect of a variable) results on the selection of more complex effects of some other factors or additional selection of interaction terms (Additional file 4).

(b)
Accuracy of the nonlinear and timedependent effects estimated
Original algorithms – scenarios A and C (no interaction)
Figure 5 presents the effects estimated by the models selected following the MVRS or W&A algorithms together with their averaged effects (black line) compared to the true generating effect (red line). All sample sizes are N = 2000 patients.
Although there are varied sizes of effect estimated as shown by the width of the boxes (effects estimated as fixed in time) and the varied shapes of the individual effects, grey curves (timedependent effects estimated), the average effects generally agree with the generating effects for all strategies, and lead to comparable estimated effects for MVRS and W&A. For both strategies, the effects of age are well captured for scenario (A) and (C): pABCtime values are 0.3% (A), 0.3% (C, MVRS) and 0.2% (C, W&A), Table 3.
The mixture of timefixed and timedependent effects of stage estimated in the selected models for scenario (A) leads to a very good estimation of the average effect compared to the generated effect for both strategies. Note the graphs present log hazard ratios for better illustrating the differences, but pABCtime values are calculated on the areas between the hazard ratio curves. pABCtime values for the hazard ratios are very similar between algorithms, highest for stage IV (2.5%), intermediate for stage II (2.2–2.4%) and lowest for stage III (1.7%). In scenario (C) all estimated effects are timedependent, and most shapes agree with the original effect. pABCtime values are slightly lower for the W&A algorithm compared to MVRS: 2.3% vs. 2.4% at stage II vs. I, 0.9% vs. 1.2% at stage III vs. I, and 1.8% vs. 2.1% at stage IV vs. I.
The effects of deprivation are well estimated by all models selected by all algorithms: pABCtime is below 1.2% for all deprivation categories, and in both scenarios A and C.
More complex effects of the extra binary variables are captured by W&A, in both (A) and (C) leading to slightly higher pABCtime values: 0.6% vs. 0.3% (A) and 0.12% vs. 0.08% (C).
Algorithms adapted to models with interactions – scenarios B and D
Figure 6 displays the effects estimated by the selected models (250 grey curves) following the aMVRS, MFPIgen and aW&A algorithms together with their averaged effects (black line) compared to the true generating effect (red line). The effects of age are now split by stage at diagnosis, since an interaction agestage is simulated.
For all selected models, the average HRs for age seem to generally underestimate the simulated effects for stages III, in scenario B and D. These are reflected by larger stagespecific pABCtime values for age: 2.4–5.9% (stages III) versus 0.01%2.2 (stages IIIIV, Table 3). The timedependency of age, simulated in scenario D, is not very strong, hence the many models that selected a timefixed effect for age. Graphs of the nonlinear effects of age at given times after diagnosis are presented in Additional file 5.
The effects of stage, deprivation (Fig. 6) and the additional binary variable (Additional file 6) are well reproduced by the average effects obtained from the selected models. The pABCtime values can hardly distinguish between the performance of the modelselection algorithms (Table 3). The complexity of models selected by the aW&A algorithm does not impact the overall measures of effects and their adequacy to describe the true generating effects. Indeed, none of the modelled timedependent effects are strong, but the results presented here shed some light in terms of the sensitivity of the different model selection tools.

(c)
Estimation of the cohort net survival
For all modelbuilding strategies, the estimated stagespecific cohort net survival curves lie around the original estimated cohort net survival curves, for all subgroups defined by stage at diagnosis for scenarios AD (Fig. 1). All pABCtime values are below 1.7% (Table 3).
The outcome of choice – net survival – is well reproduced by models selected by each strategy and provides reassurance that the experience of cancer survival for the cohort is well captured by the models. pABCtime values calculated using the nonparametric estimator of net survival provides 0.3–8% higher values than for the modelbased survival curves (Additional file 7).
The bias reflects the varying amount of misspecification for each of the three algorithms. For example, higher proportions of timedependent effect of the binary variables using W&A and aW&A lead to higher standardised bias for that variable and that algorithm (Additional file 8). The minimum in the timevarying bias is reached at around 6 months after diagnosis for all effects, when most timedependent effects cross the true effect. At that point, the value reached reflects the amount of bias due to the estimated fixed effects.
Discussion
Motivated by the growing access to data on explanatory factors of cancer survival, we compared the practical use of several model selection strategies. We adapted wellrecognised algorithms to the context of excess hazard models, including extensions to deal with 2way interactions. Simulations, based on observed realistic scenarios, showed the ability of all strategies to yield proper estimation of the cohort net survival curve despite varying forms of the retained and estimated effects.
Several aspects of model selection deserve further discussion. Additionally, we aim to provide some guidelines for variable selection in the context of cancer survival epidemiology.
Subject matter knowledge
A breadth of modelling strategies exists, but very few strategies have been compared as highlighted by STRATOS Topic Group 2 [37]. We aimed here to look at the impact that model selection strategies may have on inference based on the final selected model. Subject matter knowledge is needed all through model building, such as in decisions relative to the selection of the variables that will be tested, and the allowed forms of these variables [38], as well as how strict we are on keeping/dropping a variable or functional form. In observational studies, we acknowledge it is almost impossible to state all aspects of a model ahead of data exploration, and model selection remains necessary. In our comparison, we concentrate on the modelbuilding algorithms per se and assume both benefited from a similar amount of subject matter knowledge.
Timedependent effects
A timedependent effect is modelled if the effect of a variable, measured at diagnosis, varies with time since diagnosis, i.e. that effect is not constant with followup time. In the context of cancer survival, most factors such as stage at diagnosis, deprivation, emergency presentation [39] tend to have strong effects in the months that follow the diagnosis, and these effects are likely to reduce or disappear as time passes [39]. When testing timedependency of different factors, a long enough followup, as well as enough information are required to detect timedependency.
Nonlinear effects
Additionally, in order to properly assess nonlinearity of the effect of a specific variable, such as age, there needs to be enough information on that variable about its own effect on the time to event: e.g. patients’ age need to cover a reasonable range of all possible ages, rather than be grouped in a small part of the age distribution.
Censoring and lethality of cancer (number of events)
Lung cancer data contain relatively high proportions of events (80% 4 years after diagnosis) compared to other cancers that do not experience such high lethality. Model building strategies and variable selections are highly sensitive both to the number of events and levels of censoring. This is due to the rapidly increasing complexity of the models tested, especially when the backwardbased W&A and aW&A are run. For example, in the context of lung cancer, there was nonconvergence of the Stata algorithms in around 10% of the samples. Changing the starting values or running initial univariate selections did not help in reaching convergence.
It has recently been shown that 40–50 events per variable are necessary to ensure accurate estimation of coefficients [40] in the competing risk setting. In the most complex models (fitted on lung cancer) which include all interactions and timedependent effects, i.e. 48 parameters, there was an average of 36 events per parameters in a sample of 2000 patients. When these modelbuilding strategies were run on cancers with lower lethality, such as laryngeal cancer, with 60% censoring at 5 years, the algorithms did not converge for a larger proportion of samples, up to 20% (results not shown). In addition, after convergence, some estimated hazard ratios were unbelievably large: there was an average of only 16 events per parameters (N = 2000 patients) for the most complex models fitted on laryngeal cancer data.
In the relative survival data setting, a competing risk framework, competing deaths (i.e. from other causes, provided by general population life tables) are subtracted from observed events (death from any cause). This reduces further the power for detecting and retaining effects. This is not so problematic when studying lung cancer as 95% of deaths are due to lung cancer [39], i.e. 1675 lung cancer deaths among the 1765 deaths in the 2000 data samples, leading to 34 events per parameter. Less lethal cancers will see the actual numbers of cancerrelated deaths be a smaller proportion of all deaths, leading to smaller number of events per variable.
Prior to running any model building strategies, we recommend that the censoring rate and the number of events are carefully examined in relation to the complexity of the models fitted. Further clinical considerations and background knowledge are helpful prior to variable selection to ensure significance tests are used with sparsity.
Sample size, model complexity
The W&A strategies tend to favour timedependent effects and interactions, leading to complex models. This is due to the backward selection of effects. Model misspecification of some variables leads to selfconfounding and confounding, which would provide wrong inference on the effects of some variables. On the other hand, the MVRS strategy leads to simpler models with additional variables wrongly selected in about 5% of models overall. However, in three out of four scenarios (B, C and D), all model selection strategies select models containing the true models in a relatively poor proportion (always below 15%). This is largely due to the size of the effects that the algorithms were trying to capture and the number of patients included in the analyses, 2000. Indeed, some effects such as nonlinearity or nonproportionality of age could not be retrieved in the final selected models, due to lack of power. Releasing one or several of these small effects translates in larger proportions of models that nearly contain the generating models. More importantly, increasing the sample size to 5000 patients leads to improved detection power and higher selection proportions of the true generating model.
The adapted MVRS and W&A algorithms testing for interactions show similar properties as the original algorithms for the selection of linear/nonlinear and timedependent main effects. They show equivalent results to the MFPIgen strategy for the selection of interaction terms.
Investigating the effect of many variables of known prognostic value in a large populationbased cohort of lung cancer patients, all modelbuilding strategies lead to similar selection of effects. As expected W&A and aW&A only differed from R&S and aR&S in the shape of the effect of age, which has virtually no impact on cohortwide net survival estimates.
Although the modelbuilding strategies may not tend to select the same final models, and the proportion of models that do select the true generating model vary with the sample size, the number of events and the size of the effects, there is no impact on the estimation of cohort net survival, by stage at diagnosis. Estimation of cohort net survival can best be done nonparametrically as there is no assumption on the form of the association between the exposure variables and survival time. We show that on average the modelbased estimates are equivalent to the nonparametric estimates of net survival. When nonparametric estimates of cohort survival can be produced, it is good practice to use them to validate modelbased estimates.
The variables whose effects are tested in the models, are only mildly correlated with a coefficient of correlation below 0.2. Another challenge in modelling nonlinear effects of a variable is the potential collinearity of some spline basis (such as cubic splines). A possible solution for this, adopted here, consists of orthogonalizing the splines basis. However, high correlation between two variables may have a negative effect on the model selection strategies studied here as they are based on stepwise methods and are thus dependent on the order of testing.
Epidemiological aim of models
The ultimate aim of building exploratory models in our context is to describe variables effects on the survival experience of a cohort of cancer patients. In the simulations, the large variety of models selected by the different modelbuilding strategies leads to varying estimations of main effects and varying levels of individual excess hazard and net survival estimates, which has implications in terms of epidemiological interpretation. Nonetheless all generated effects are well captured by the variable selection strategies, whatever their complexity. This is verified graphically and looking at the area between each estimated effect and the generated effect.
Forwardbased model building strategies tend to favour simpler models, which may be a useful feature in contexts with less information (e.g. low EVP, or high censoring, or relatively small sample sizes) in order to avoid inclusion of spurious effects. Conversely, backwardbased strategies tend favour more complex models, which may be useful to detect small effects in cases with larger samples and low censoring. Nonetheless, the comparison of the final models selected with different strategies may be useful in order to assess any differences on the corresponding net survival curves, and to identify potential reasons for these differences (if any) based on our previous discussion.
The strategies presented here are based on likelihood ratio tests performed in a hierarchical order. Thus, they rely on significance testing and, consequently, are prone to multiple testing as well as Type I and Type II errors. Nonetheless, all strategies let the user decide what significance level should be used for the selection of effect. We use here the conventional 5%, and test for the impact of keeping the main effects in. One could consider choosing more conservative thresholds [41] and evaluating the impact of varying thresholds on the models selected.
Model building strategy is in line with the ‘data modelling culture’ and is based on the idea that a true model generating the data does exist [42]. Although not all important variables may be available, or the true model is likely to not be among the considered models, the aim is to get as close as possible to this true model by including the relevant variables and by flexibly modelling the effect of the available ones. Shrinkage techniques (LASSO [43], Ridge, Elastic Nets [44]) could be considered, but these methods are not yet available in our relative survival context. Still in the machine learning field, methodological developments are of great interest. For example, model averaging [45] and more generally ensemble learning techniques [46] are possible avenues though interpretability of the results can be challenging, hence more appropriate outside of the descriptive modelling field.
Model selection approaches based on Information Criteria [45] (e.g. AIC and BIC) or crossvalidation of the selected models, instead of likelihood ratio testing, could prove useful for selecting the proper functional forms of effects. In the context of prediction, one tends to select and use a simple model in order not to over fit the training data [47]. Following work on the topic of predictions would involve additional statistical measures for assessing predictive accuracy of the selected model for a given strategy. Measures such as discrimination and calibration would then be useful [48, 49]. However, in this work, which was mainly exploratory rather than predictive, all strategies lead to similar modelbased estimates of net survival.
Large datasets and information on many factors are motivations for using complex excess hazard models. Model selection methods are essential to make sure all models are considered in a systematic fashion. Nonetheless, several aspects of the data (such as sample size, censoring, NL and TD effects) and the models (such as complexity, assumptions) deserve full consideration ahead of model selection.
Availability of data and materials
The data that support the findings of this study – both in simulations and application – are available from Public Health England but restrictions apply to the availability of these data, which were used under the necessary statutory and ethical approvals for the current study, and so are not publicly available.
Abbreviations
 AIC:

Akaike Information Criteria
 BIC:

Bayesian Information Criteria
 LASSO:

Least Absolute Shrinkage and Selection Operator
 MFPI:

Multivariable fractional polynomial (interaction)
 MFPIgen:

Multivariable fractional polynomial (general interaction)
 MFPT:

Multivariable fractional polynomial (time)
 MVRS:

Multivariable regression splines
 NL:

Non linear
 pABCtime:

Proportion of the Area Between Curves through time
 PH:

Proportional Hazards
 TD:

Time dependent
 W&A:

Wynant and Abrahamowicz
References
 1.
Sauerbrei W, Abrahamowicz M, Altman DG, le Cessie S, Carpenter J. STRengthening analytical thinking for observational studies: the STRATOS initiative. Stat Med. 2014;33(30):5413–32.
 2.
Mallett S, Royston P, Dutton S, Waters R, Altman DG. Reporting methods in studies developing prognostic models in cancer: a review. BMC Med. 2010;8(1):20.
 3.
Guyon I, Elisseeff A. An introduction to variable and feature selection. J Mach Learn Res. 2003;3(Mar):1157–82.
 4.
Ishwaran H, Kogalur UB, Gorodeski EZ, Minn AJ, Lauer MS. Highdimensional variable selection for survival data. J Am Stat Assoc. 2010;105(489):205–17.
 5.
Shmueli G. To explain or to predict. Stat Sci. 2010;25(3):289–310.
 6.
Abrahamowicz M, MacKenzie TA. Joint estimation of timedependent and nonlinear effects of continuous covariates on survival. Stat Med. 2007;26(2):392–408.
 7.
Sauerbrei W, Royston P, Binder H. Selection of important variables and determination of functional form for continuous predictors in multivariable model building. Stat Med. 2007;26(30):5512–28.
 8.
Wynant W, Abrahamowicz M. Flexible estimation of survival curves conditional on nonlinear and timedependent predictor effects. Stat Med. 2016;35(4):553–65.
 9.
Sauerbrei W, Royston P, Look M. A new proposal for multivariable modelling of timevarying effects in survival data based on fractional polynomial timetransformation. Biom J. 2007;49(3):453–73.
 10.
Royston P, Sauerbrei W. Multivariable modeling with cubic regression splines: a principled approach. Stata J. 2007;7:45–70.
 11.
Royston P, Sauerbrei W. A new approach to modelling interactions between treatment and continuous covariates in clinical trials by using fractional polynomials. Stat Med. 2004;23(16):2509–25.
 12.
Sauerbrei W, Royston P, Zapien K. Detecting an interaction between treatment and a continuous covariate: a comparison of two approaches. Comput Stat Data Anal. 2007;51(8):4054–63.
 13.
Wynant W, Abrahamowicz M. Impact of the modelbuilding strategy on inference about nonlinear and timedependent covariate effects in survival analysis. Stat Med. 2014;33(19):3318–37.
 14.
Esteve J, Benhamou E, Croasdale M, Raymond L. Relative survival and the estimation of net survival: elements for further discussion. Stat Med. 1990;9(5):529–38.
 15.
Mariotto AB, Noone AM, Howlader N, Cho H, Keel GE, Garshell J, et al. Cancer survival: an overview of measures, uses, and interpretation. J Natl Cancer Inst Monogr. 2014;2014(49):145–86.
 16.
Belot A, Ndiaye A, LuqueFernandez MA, Kipourou DK, Maringe C, Rubio FJ, et al. Summarizing and communicating on survival data according to the audience: a tutorial on different measures illustrated with populationbased cancer registry data. Clin Epidemiol. 2019;11:53–65.
 17.
Pohar Perme M, Stare J, Esteve J. On estimation in relative survival. Biometrics. 2012;68(1):113–20.
 18.
Pohar Perme M, Esteve J, Rachet B. Analysing populationbased cancer survival  settling the controversies. BMC Cancer. 2016;16(1):933.
 19.
Pohar Perme M, Henderson R, Stare J. An approach to estimation in relative survival regression. Biostatistics. 2009;10(1):136–46.
 20.
Danieli C, Remontet L, Bossard N, Roche L, Belot A. Estimating net survival: the importance of allowing for informative censoring. Stat Med. 2012;31(8):775–86.
 21.
Remontet L, Bossard N, Belot A, Estève J. An overall strategy based on regression models to estimate relative survival and model the effects of prognostic factors in cancer survival studies. Stat Med. 2007;26(10):2214–28.
 22.
Giorgi R, Abrahamowicz M, Quantin C, Bolard P, Esteve J, Gouvernet J, et al. A relative survival regression model using Bspline functions to model nonproportional hazards. Stat Med. 2003;22(17):2767–84.
 23.
Lambert PC, Royston P. Further development of flexible parametric models for survival analysis. Stata J. 2009;9:265–90.
 24.
Rubio FJ, Remontet L, Jewell NP, Belot A. On a general structure for hazardbased regression models: an application to populationbased cancer research. Stat Methods Med Res. 2019;28(8):2404–17.
 25.
Bower H, Crowther MJ, Lambert PC. Strcs: a command for fitting flexible parametric survival models on the loghazard scale. Stata J. 2016;16(4):989–1012.
 26.
Royston P, Sauerbrei W. Multivariable modelbuilding: a pragmatic approach to regression analysis based on fractional polynomials for modelling continuous variables [chapter 7: interactions]. UK: Wiley; 2008.
 27.
Crowther MJ, Lambert PC. Simulating biologically plausible complex survival data. Stat Med. 2013;32(23):4118–34.
 28.
Crowther MJL. P.C. Simulating complex survival data. Stata J. 2012;12(4):674–87.
 29.
Department for Communities and Local Government. The English indices of deprivation 2007. London; 2008.
 30.
Sobin LH, Gospodarowicz M, Wittekind C. TNM classification of malignant Tumours. 7th ed. New York: John Wiley & Sons; 2009.
 31.
Wang Z, Ma S, Zappitelli M, Parikh C, Wang CY, Devarajan P. Penalized count data regression with application to hospital stay after pediatric cardiac surgery. Stat Methods Med Res. 2016;25(6):2685–703.
 32.
Buchholz A, Sauerbrei W, Royston P. A measure for assessing functions of timevarying effects in survival analysis. Open J Stat. 2014;4:977–98.
 33.
BenitezMajano S, Fowler H, Maringe C, Di Girolamo C, Rachet B. Deriving stage at diagnosis from multiple populationbased sources: colorectal and lung cancer in England. Br J Cancer. 2016;115:391.
 34.
EllissBrookes L, McPhail S, Ives A, Greenslade M, Shelton J, Hiom S, et al. Routes to diagnosis for cancer – determining the patient journey using multiple routine data sets. Br J Cancer. 2012;107:1220.
 35.
Maringe C, Fowler H, Rachet B, LuqueFernandez MA. Reproducibility, reliability and validity of populationbased administrative health data for the assessment of cancer nonrelated comorbidities. PLoS One. 2017;12(3):e0172814.
 36.
Wilson EB. Probable inference, the law of succession, and statistical inference. J Am Stat Assoc. 1927;22(158):209–12.
 37.
Sauerbrei W, Perperoglou A, Schmid M, Abrahamowicz M, Becher H, Binder H, et al. Stateoftheart in selection of variables and functional forms in multivariable analysis  outstanding issues 2019. Available from: https://arxiv.org/abs/1907.00786.
 38.
Heinze G, Wallisch C, Dunkler D. Variable selection – A review and recommendations for the practicing statistician. Biom J. 2018;60(3):431–49.
 39.
Maringe C, Pohar Perme M, Stare J, Rachet B. Explained variation of excess hazard models. Stat Med. 2018;37(14):2284–300.
 40.
Austin PC, Allignol A, Fine JP. The number of primary events per variable affects estimation of the subdistribution hazard competing risks model. J Clin Epidemiol. 2017;83:75–84.
 41.
Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers EJ, Berk R, et al. Redefine statistical significance. Nat Hum Behav. 2018;2(1):6–10.
 42.
Breiman L. Statistical modeling: the two cultures (with comments and a rejoinder by the author). Stat Sci. 2001;16(3):199–231.
 43.
Zou H. The adaptive Lasso and its Oracle properties. J Am Stat Assoc. 2006;101(476):1418–29.
 44.
Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B. 2005;67(2):301–20.
 45.
Burnham KP, Anderson DR. Model selection and multimodel inference: a practical informationtheoretic approach. New York: Springer Science & Business Media; 2003.
 46.
Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. Data mining, inference, and prediction. 2nd ed. New York: SpringerVerlag; 2009.
 47.
Clayton MK, Geisser S, Jennings DE. In: Goel PK, Zellner A, editors. A comparison of several model selection procedures. New York: Elservier; 1986.
 48.
Graf E, Schmoor C, Sauerbrei W, Schumacher M. Assessment and comparison of prognostic classification schemes for survival data. Stat Med. 1999;18(17–18):2529–45.
 49.
Harrell FE, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy and measuring and reducing errors. Stat Med. 1996;15(4):361–87.
Acknowledgments
Not applicable.
Disclaimer
The findings and conclusions in this report are those of the authors and do not necessarily represent the views of Cancer Research UK.
Funding
This research was supported by Cancer Research UK (grant number C7923/A18525).
Author information
Affiliations
Contributions
CM and BR developed the concept and CM, BR, AB and FJR developed the design of the simulation study. CM was involved in the data preparation and the data linkage, carried out the data analysis and wrote the manuscript. All authors interpreted the data, drafted and critically revised the manuscript. All authors read and approved the final version of the manuscript.
Corresponding author
Correspondence to Camille Maringe.
Ethics declarations
Ethics approval and consent to participate
Consent would not be feasible due to the large number of patients involved, of which many would be deceased. We are secondary users of this confidential populationbased data and as such we hold statutory approval from the Confidentiality Advisory Group (PIAG 1–05(c)/2007) which grants us permission to process the data for the purpose of this study.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Maringe, C., Belot, A., Rubio, F.J. et al. Comparison of modelbuilding strategies for excess hazard regression models in the context of cancer epidemiology. BMC Med Res Methodol 19, 210 (2019) doi:10.1186/s1287401908309
Received
Accepted
Published
DOI
Keywords
 Excess hazard models
 Interactions
 Nonlinearity
 Nonproportionality
 Variable selection