 Research article
 Open Access
 Published:
Robust estimation of the expected survival probabilities from highdimensional Cox models with biomarkerbytreatment interactions in randomized clinical trials
BMC Medical Research Methodology volume 17, Article number: 83 (2017)
Abstract
Background
Thanks to the advances in genomics and targeted treatments, more and more prediction models based on biomarkers are being developed to predict potential benefit from treatments in a randomized clinical trial. Despite the methodological framework for the development and validation of prediction models in a highdimensional setting is getting more and more established, no clear guidance exists yet on how to estimate expected survival probabilities in a penalized model with biomarkerbytreatment interactions.
Methods
Based on a parsimonious biomarker selection in a penalized highdimensional Cox model (lasso or adaptive lasso), we propose a unified framework to: estimate internally the predictive accuracy metrics of the developed model (using double crossvalidation); estimate the individual survival probabilities at a given timepoint; construct confidence intervals thereof (analytical or bootstrap); and visualize them graphically (pointwise or smoothed with spline). We compared these strategies through a simulation study covering scenarios with or without biomarker effects. We applied the strategies to a large randomized phase III clinical trial that evaluated the effect of adding trastuzumab to chemotherapy in 1574 early breast cancer patients, for which the expression of 462 genes was measured.
Results
In our simulations, penalized regression models using the adaptive lasso estimated the survival probability of new patients with low bias and standard error; bootstrapped confidence intervals had empirical coverage probability close to the nominal level across very different scenarios. The double crossvalidation performed on the training data set closely mimicked the predictive accuracy of the selected models in external validation data. We also propose a useful visual representation of the expected survival probabilities using splines. In the breast cancer trial, the adaptive lasso penalty selected a prediction model with 4 clinical covariates, the main effects of 98 biomarkers and 24 biomarkerbytreatment interactions, but there was high variability of the expected survival probabilities, with very large confidence intervals.
Conclusion
Based on our simulations, we propose a unified framework for: developing a prediction model with biomarkerbytreatment interactions in a highdimensional setting and validating it in absence of external data; accurately estimating the expected survival probability of future patients with associated confidence intervals; and graphically visualizing the developed prediction model. All the methods are implemented in the R package biospear, publicly available on the CRAN.
Background
Thanks to the advances in genomics and targeted treatments, an increasing interest is being devoted to develop prediction models with biomarkers, called treatmenteffect modifiers (for which the relative treatment effect varies according to the biomarker values), to predict how much benefit individual patients would derive from specific treatments. This aims at taking the therapeutic decision that fits the best each individual patient. Recently, there have been some attempts to identify prediction models that are associated with higher efficacy of particular treatments such as an 8gene and a 14gene signatures (i.e. prediction model) to evaluate the degree of trastuzumab benefit in early breast cancer patients on diseasefree survival [1, 2]. Statistical methodology has been proposed for the development of prediction models and their validation [3, 4] and to estimate individual predictions in a survival setting [5, 6]; however, no clear guidance has been yet reached and evaluated in a highdimensional setting.
Results from randomized clinical trials are often difficult to translate into predictions for individual patients, but the estimated absolute risk reductions from large randomized trials do still provide the best guidance [7]. Therefore, in addition to the challenges coming from the identification of treatmenteffect modifiers in a highdimensional setting, it is also important to identify prognostic biomarkers (i.e. associated with the clinical outcome and independently of treatment) to adjust for the main effect of established clinical and genomic variables in order to obtain individual survival probabilities. Indeed, the clinical impact of a treatment can be judged only with the knowledge of the prognosis of a patient. It is thus of importance to reliably predict the prognosis of patients to assist treatment counseling [8].
The aim of the present study was to propose a unified framework for developing and validating a highdimensional Cox model [9] in a randomized clinical trial to estimate the expected treatment effect according to different values of the biomarkers included in the model, and to estimate survival probabilities for individual patients with associated confidence intervals. We propose different approaches that can be used in this context. We present a simulation study including null (i.e. with no treatmenteffect modifier) and alternative scenarios (i.e. with at least one treatmenteffect modifier) to evaluate the performance of the proposed approaches. We also illustrate the methods in a large randomized clinical trial of breast cancer patients. Finally, we discuss the findings.
Methods
In the present section we investigate several approaches to: (i) identify a parsimonious Cox regression model in a highdimensional setting, (ii) estimate internally the predictive accuracy metrics of the selected model, (iii) estimate the expected survival probability for patients, (iv) construct confidence intervals thereof, and (v) visualize them graphically. A schematic representation of the framework is provided in Fig. 1.
Identification of a prediction model with treatmenteffect modifiers
The objective of developing a prediction model is to identify a model allowing the computation of a prognostic and a treatmenteffect modifying score for each patient. From a statistical viewpoint, Rothwell put forward that the most reliable statistical approach for identifying treatmenteffect modifiers is to test the interactions between the biomarkers and the treatment effect [7]. Hereinafter, we consider the proportional hazards model
where α, β = (β _{1}, …, β _{ p })^{T} and γ = (γ _{1}, …, γ _{ p })^{T} are the regression coefficients for: the treatment T coded as +0.5 and −0.5 for the experimental and control arm, respectively; the main effects of the p standardized biomarkers X = (X _{1}, …, X _{ p })^{T}; and the p biomarkerbytreatment interactions, i.e. the product between the standardized biomarkers X and the treatment T. In other words, the first sum in (1) corresponds to an average prognostic component to the total score and the second sum in (1), corresponds to the treatmenteffect modifying component. For simplicity, we do not consider clinical covariates in the presentation of the framework.
To overcome the nonidentifiability of the models due to the highdimensional setting, we previously reviewed and proposed several methods to identify a parsimonious model starting from a large number of candidate biomarkers in a randomized clinical trial and to predict the magnitude of the relative treatment effect for future patients [10]. Most of these methods are based on penalized regression, which maximizes the penalized partial loglikelihood l _{ p }(α, β, γ, λ, T,X) that is the partial loglikelihood l(α, β, γ, T, X) of model (1) minus a penalty:
In the present work, we implemented two penalties: the lasso [11] which corresponds to θ _{ j } = ϑ _{ j } = 1, and the adaptive lasso [12, 13] which corresponds to \( {\theta}_j=1/\left{\widehat{\beta}}_j^{\mathrm{R}}\right \), \( {\vartheta}_j=1/\left{\widehat{\gamma}}_j^{\mathrm{R}}\right \), with \( {\widehat{\beta}}_j^{\mathrm{R}} \) and \( {\widehat{\gamma}}_j^{\mathrm{R}} \) the regression coefficients estimated in a preliminary step through the model (1) subject to the ridge penalty. As compared to the popular lasso penalty, the adaptive lasso penalty was initially proposed as it fulfilled the oracle property [14] in the prognostic setting. However, the performance of the oracle property is less known in highdimensional samples with relatively low number of events. In a timetoevent setting with many candidate biomarkerbytreatment interactions, the adaptive lasso penalty was one of the best performing methods for selecting these interactions in a simulation study in [10]. For both the lasso and adaptive lasso, the tuning parameter λ was determined via a k _{1}fold crossvalidation scheme (called 1CV for the rest of the article) by maximizing the crossvalidated loglikelihood criterion of the Cox model [15]:
with k _{1} = 5 and where \( {\widehat{\alpha}}_{ k} \), \( {\widehat{\boldsymbol{\beta}}}_{ k} \), \( {\widehat{\boldsymbol{\gamma}}}_{ k} \) and the tuning parameter are estimated in the training data excluding the subsample k: T _{− k } and X _{− k }. Penalized regression coefficients are estimated through the Rpackage glmnet [16, 17]. We also investigated the “refit adaptive lasso” and the “refit lasso” for which the selected nonzero regression coefficients are reestimated in an unpenalized Cox regression model.
Internal assessment of predictive accuracy of the prediction model
To assess the performance of a prediction model, some predictive accuracy measures (see the section “Simulation study” for more details) can be used. However, computing these metrics on the same data used to develop the model could lead to overoptimistic predictive accuracy measures and thus, external validation is crucial [18]. In the absence of external validation data set, a strategy suggested by Simon et al. [19] and illustrated by Matsui et al. [3] consists in performing a double crossvalidation (2CV) to mimic an external validation. Indeed, patients are first divided into k _{2} subsamples (in this work k _{2} = 5); then, for each fold, the data in the fold are put apart and the data in the remaining k _{2} − 1 folds are used to estimate the tuning parameter λ through k _{1}fold cross validation and to estimate the prediction scores of the leftout patients. The process is iterated for all the k _{2} folds to estimate the crossvalidated prediction scores of the complete data set. Of note, these scores obtained using the 2CV are only used to compute these prediction accuracy metrics, whereas the 2CV is not used elsewhere in this study.
Point estimates of the expected survival probability
From the estimated model (1), the estimation of the expected survival probability of each patient i = (1, …, n) at time t can be obtained by plugging the estimated parameters in the survival function:
with Ĥ _{0}(t) the cumulative baseline hazard at time t, that is commonly estimated through the nonparametric Breslow estimator [20].
Estimation of confidence intervals of the expected survival probability
We compared two strategies to estimate the confidence intervals at level 1 – θ of the expected survival probabilities: one based on analytical expressions deriving from the normal approximation of the estimator and one based on a nonparametric bootstrap approach.
The analytical approach [21] consists in estimating the variance of the cumulative risk Ĥ _{ i }(t) = Ĥ(t, T _{ i }, X _{ i }) based on the Breslow estimator [22] in the Cox model. Thus, the confidence interval of Ŝ _{ i }(t) is approximated as
with \( {q}_{\alpha}\left({\widehat{S}}_i(t)\right)= \exp \left({\widehat{H}}_i(t)+{z}_{\alpha}\sqrt{\widehat{\mathrm{var}}\left({\widehat{H}}_i(t)\right)}\right). \) This can be directly implemented via the Rpackage survival [23]. In the case that penalized coefficients are taken without reestimation, a practical solution to estimate their standard errors can be to compute the Hessian matrix based on the estimated parameter values.
The nonparametric bootstrap approach consists in generating B bootstrap samples of the original data (B = 200 in this work) and in estimating the model (1) for each of them. Thus, B models ĥ _{1}(t, T, X), …, ĥ _{ B }(t, T, X) are obtained, allowing to estimate B times the expected survival probability for each patient i at time t, noted Ŝ _{ i,boot}(t) = {Ŝ _{ i,1}(t), …, Ŝ _{ i,B }(t)}. Then, the nonparametric confidence interval of the Ŝ _{ i }(t) based on the empirical percentiles q _{ α }(⋅) of the distribution of Ŝ _{ i, boot}(t) is given by
Of note, both the analytical and bootstrap confidence intervals are based on the model (1) which is estimated using a penalty parameter determined via a single crossvalidation (1CV).
Graphical visualization
We visualize the expected survival probability of the patients according to their treatmenteffect modifying score \( {\widehat{\eta}}_i = {\displaystyle \sum_{j=1}^p}{\widehat{\gamma}}_j{X}_{i j} \) at a given horizon τ. A toy example is used for illustration in Fig. 2.
For the graphical visualization, the scores \( \widehat{\boldsymbol{\eta}}=\left({\widehat{\eta}}_1,\dots, {\widehat{\eta}}_n\right) \) are scaled so that the 2.5% (i.e. \( {q}_{0.025}\left(\widehat{\boldsymbol{\eta}}\right) \)) and 97.5% (i.e. \( {q}_{0.975}\left(\widehat{\boldsymbol{\eta}}\right) \)) quantiles equal 0 and 1 respectively in the training set:
In the simplest case in which only one treatmenteffect modifier and no prognostic biomarker (\( {\widehat{\beta}}_j=0,\ \forall j \)) are included in the model, the relationship between \( {\widehat{\eta}}_i \) and Ŝ _{ i }(t) (and its confidence interval bounds) is strictly monotone (Fig. 2a). However, in the setting of many treatmenteffect modifiers and no prognostic biomarker, two patients with equal scores \( {\widehat{\eta}}_i \) have the same Ŝ _{ i }(t), but their confidence intervals may be different (Fig. 2b), due to different biomarker values. In order to obtain smoothed average confidence bounds, we considered constrained basis splines (Bsplines) spl(t) [24]. Splines are numerical functions that can be used for curvefitting by approximately fitting the data at particular nodes. The number of nodes was estimated through the Akaike Information Criterion (AIC, default criterion in the chosen Rpackage cobs [25]). We also forced the constraint that 0 ≤ spl(t) ≤ 1.
When the model contains also prognostic biomarkers, patients with the same treatmenteffect modifying score \( {\widehat{\eta}}_i \) may have different survival probability Ŝ _{ i }(t) due to different prognostic scores \( {\widehat{\varphi}}_i = {\displaystyle \sum_{j=1}^p}{\widehat{\beta}}_j{X}_{i j}. \) Due to the possible large heterogeneity of Ŝ _{ i }(t) for equal scores \( {\widehat{\eta}}_i \), we stratified the treatmenteffect modifying plot into four groups according to the prognostic score, using the percentiles proposed by Cox [26]: 16.4%, 33.6%, 33.6% and 16.4%. For our toy example, Fig. 2c and 2d represent the relationship between Ŝ _{ i }(t) and \( {\widehat{\eta}}_i \) for the best and worst prognostic risk groups, respectively. Heterogeneity within groups is still present for both point estimates of the expected survival probability and confidence interval bounds; therefore we also considered smoothing the point estimates through Bsplines. Even though categorization reduces the available information [27], it leads to a more meaningful graphical representation of the prediction model.
In the rest of the study, we investigated both the spline strategy with prognostic categorization and the pointwise strategy.
Simulation study
In this section, we present a simulation study covering several null (i.e. with no treatmenteffect modifier) and alternative (i.e. with at least one true treatmenteffect modifier) scenarios to evaluate the operating characteristics of the different approaches.
Data generation and scenarios
For each simulated data set, we generated p = 500 Gaussian unitvariance variables, including prognostic biomarkers and/or treatmenteffect modifiers. Data were generated for a total of n = 1500 patients, randomly assigned to the experimental or control group with equal probability of 0.5. We generated random survival times t with constant hazard. Table 1 shows the three null and three alternative scenarios considered in the simulations. Simulation parameters such as the desired baseline (i.e. x = (0, …, 0)^{T}) survival probability S _{0}(⋅) at the horizon τ (5 years in our simulations) and the regression parameters α, β and γ were chosen on the basis of the early breast cancer trial presented in the section “Application”. Thus, we fixed S _{0}(5) = 77%. All the scenarios included censoring and an autoregressive correlation structure [28] between biomarkers \( \left({\rho}_{j{ j}^{\hbox{'}}}={0.8}^{\left j{j}^{\hbox{'}}\right}\right) \) within 25biomarker blocks. As sensitivity analyses, we also investigated other values for: the baseline survival probability S _{0}(5) (i.e. 50%), the number of biomarkers p (i.e. 100 and 1000) and the empirical censoring rate (i.e. between 36% and 62%) to assess their impact on the results. For each scenario, 250 data sets were generated and 250 additional ones were generated for external validation, each with the same parameters as each of the training data sets.
Evaluation criteria
All the evaluation criteria have been measured at τ = 5 years.
Integrated brier score. As a measure of overall prediction error of the models, we used the integrated Brier score (iBrier score, [29]). The timedependent Brier score is a quadratic score based on the predicted timedependent survival probability defined by
and takes censoring into account through Ŝ _{ C }(t _{ i }) = P(C > t _{ i }), the KaplanMeier estimate of the censoring distribution with t _{ i } the survival time of patient i. The integration of the Brier score can be done by over time t ∈ [0, τ] with respect to some weight function W(t) for which a natural choice is (1 − Ŝ(t))/(1 − Ŝ(τ)) [29]. The lower the iBrier score, the larger the prediction accuracy is.
Uno’s Cstatistic. To evaluate the discrimination of the prediction model, we also measured the concordance between \( {\widehat{\pi}}_i \), the linear predictor of (1), and the survival time based on the Uno’s Cstatistic, one of the least biased concordance statistic estimator in the presence of censoring [30]. For the entire model, we computed
The larger the C(τ), the higher the overall discrimination is. ΔUno’s Cstatistic. For the treatmenteffect modifying component \( {\widehat{\eta}}_i \), we computed the absolute difference of the armspecific Uno’s Cstatistics as we proposed previously [10] to evaluate the biomarkerbytreatment interaction strength:
where \( UnoC\left(\widehat{\boldsymbol{\eta}},\ \tau; \boldsymbol{T}= T\right) \) is the Uno’s Cstatistic computed within the arm T only. The larger the ΔC(τ), the higher the interaction strength is.
As a comparator, we also computed these prediction accuracy metrics (iBrier, Uno and ΔUno) for the “oracle model” that is the unpenalized Cox proportional hazards model fitted to the truly related biomarkers in the training set and applied to the validation set.
Bias and precision of the expected survival probability
To assess reliability of the estimation of the expected survival probability, we compared for each patient i, its estimated survival probability Ŝ _{ i }(τ) to its theoretical survival probability S _{ i }(τ). The theoretical survival probability is computed from the baseline survival and the regression parameters of the simulation model. We evaluated the accuracy of Ŝ _{ i }(τ) through the mean bias
and its precision through its standard error
Finally, to evaluate the accuracy of the developed confidence intervals, we also evaluated their empirical coverage probability as
Results
The results of the simulation study are summarized in Tables 2 and 3. These tables summarize both the predictive accuracy metrics and the accuracy and precision of survival estimates of the timetoevent regression models subject to the adaptive lasso penalty. As the lasso penalty provided slightly worse coverage of the confidence intervals and obtained slightly less good biomarkerselection performance in some scenarios [10], its results are given in the Additional files 1 and 2.
The models identified through the adaptive lasso penalty provided predictive accuracy (Table 2) relatively close to the oracle models. Computed on an external validation data set, the iBrier score was quite low (varying from 0.099 to 0.109) and the Uno’s Cstatistic was quite large in presence of prognostic effects (scenarios 2 and 6, varying from 0.586 to 0.670). The ΔUno’s Cstatistic was also large in presence of true treatmenteffect modifiers (scenarios 4–6, varying from +0.207 to +0.229). When evaluating these criteria on the training set used to compute the prediction model, we observed overrated performances of the models as compared to results in external data (training 1cv vs. validation: from +0.077 to +0.155 for C and from +0.062 to +0.107 for ΔC). The double crossvalidation technique (training 2cv) gave internal results close to those of an external validation (validation).
When focusing on the estimation of the survival probability (Table 3), the mean bias was close to zero in all scenarios. Nonetheless, for S _{0} ~ 50% (Additional file 3), a slight negative bias was observed (from −0.007 to −0.020). In terms of precision, the standard error of the survival point estimate was low in absence of biomarker effects (null scenarios 1–2, from 0.05 to 0.06) and was higher in presence of prognostic biomarkers (null scenario 3, 0.09) and even higher with treatmenteffect modifiers (scenarios 4–5, from 0.10 to 0.11). The variability was the largest when both prognostic and treatmentmodifiers were present (scenario 6, SE from 0.13). No relevant difference in terms of accuracy and precision of survival point estimate was observed between the pointwise and spline strategy (Table 3). The standard error was inflated, too, when refitting the selected regression model in a second step (“refit adaptive lasso”) to obtain unpenalized coefficients (e.g. scenario 6: SE from 0.16 to 0.17, Additional file 4).
For the computation of the 95% confidence interval of the estimated survival probability, the nonparametric bootstrap approach outperformed the analytical approach in terms of empirical coverage (Table 3). Despite the former (through the pointwise strategy) produced slightly overly conservative confidence intervals (CP from 0.96 to 0.97), the latter was more biased and in the opposite direction (CP from 0.91 to 0.93 in null scenarios and 0.88 in alternative scenarios). These probabilities for the analytical approach were even lower when using the refit model (CP from 0.68 to 0.73, Additional file 4), whereas remained unchanged for the bootstrap approach.
The categorization of the prognostic scores into risk groups and the use of splines slightly increased the coverage probability of the confidence intervals for the bootstrap approach (increase from 0.96–0.97 to 0.96–1.00) whereas the coverage probability remained unchanged for the analytical approach.
In sensitivity analyses, the overall number of biomarkers p and the empirical censoring rate slightly impacted the results. Indeed, for smaller p and lower censoring rate, the prediction accuracy metrics were closer to the oracle models and the overfitting issue was smaller (Additional files 5 and 6). Also, the precision of the expected survival estimates was higher for this setting (Additional files 7 and 8).
Application
A retrospective biomarker study was performed on tumor samples from n = 1574 patients in an early breast cancer randomized clinical trial comparing chemotherapy plus adjuvant trastuzumab (arm C + T, n = 779) or not (arm C, n = 795). The comprehensive description of this data set is provided in PogueGeile et al. [1]. Gene expression data had been collected for p = 462 genes and normalized as in the original publication. Clinicopathological covariates such as ER and nodal status, and tumor size were also available. Median followup time for distantrecurrence free survival (DRFS) was 7.1 years and the censoring rate was 73% (i.e. 431 events for DRFS). The armspecific 5year DRFS was 84% (CI95%: 81%–86%) and 64% (CI95%: 61%–68%) for patients in arm C + T and C, respectively (Fig. 3). In the 1574 patients, adjuvant trastuzumab led to significant better DRFS as compared to chemotherapy alone (Hazard Ratio = 0.46 [95% CI: 0.38–0.56]).
The clinicogenomic prediction model, established through the model (1) subject to the adaptive lasso penalty, is presented in Table 4. It contains 102 prognostic variables (4 clinical and 98 genomic variables) and 24 treatmenteffect modifiers. Interestingly, some prognostic biomarkers have already been identified in the biomedical literature, such as SOX4 [31] or CSNK1D [32]. In addition, some immune genes were also identified in the treatmenteffect modifying component: CD9 and CCL21, which is consistent with some articles highlighting the involvement of immune pathways in the efficacy of trastuzumab [33, 34]. Fig. 4 provides a graphical visualization of the model representing the 5year DRFS according to the treatmenteffect modifying score for different prognostic risk groups. Confidence intervals were estimated through the nonparametric bootstrap approach (B = 200) and were smoothed via Bsplines using two or three nodes estimated by AIC. The prognostic score was categorized through the distribution proposed by Cox [26], i.e. group 1 (\( \widehat{\boldsymbol{\varphi}} \) < 0.20), group 2 (0.20 ≤ \( \widehat{\boldsymbol{\varphi}} \) < 0.99), group 3 (0.99 ≤ \( \widehat{\boldsymbol{\varphi}} \) < 1.85) and group 4 (1.85 ≥ \( \widehat{\boldsymbol{\varphi}} \)). The entire prediction model (i.e. 126 variables) has a moderate ability to discriminate patients according to their survival probability (C = 0.67 for a double crossvalidation). Regarding the treatmenteffect modifying component of the model, it only slightly discriminates patients for their treatment benefit (ΔC = +0.02 for a double crossvalidation): the lowest the treatmenteffect modifying score, the highest the benefit of the trastuzumab is. As expected, when performing a single crossvalidation the prediction measures are higher (C = 0.80 and ΔC = 0.23). In terms of point estimates, the effect of treatment with trastuzumab on absolute DRFS seems higher for the low treatmenteffect modifying scores. However, the width of the confidence intervals is extremely large and the confidence intervals largely overlap between the two arms. The selected model subject to the lasso penalty identified only the SIAH2 gene as treatmenteffect modifier (Additional file 9). Some recent articles have been proposed to reduce the number of selected biomarkers by selecting a more stringent tuning parameter. We previously proposed the pcvl criterion [35] that reduces the number of selected penalized biomarkers in the present application from 122 to 64 for the adaptive lasso (Additional file 10).
Discussion
Individual predictions for future patients are the ultimate goal of the stratified medicine era [36]. Individual outcome predictions from population risks are most often affected by an inherently large amount of uncertainty. Nevertheless, the expected survival probabilities under different treatment alternatives can be a valuable tool to inform therapeutic decisionmaking.
In the present study, we proposed a unified framework for developing, evaluating and visualizing a prediction tool for future patients in the setting of randomized clinical trials with a survival endpoint in a highdimensional space. We performed a simulation study covering null and alternative scenarios to evaluate different approaches. Based on simulations, the models identified either with the lasso or the adaptive lasso penalties give close results, with accurate and precise estimations of the expected survival probability. The lasso penalty give slightly worse results than the adaptive lasso penalty for the construction of confidence intervals through the analytical approach; however, for both penalized models, the coverage probabilities of their nonparametric 95% confidence intervals are close to the nominal level.
The simulations also highlighted that refitting the regression model after biomarker selection to obtain unpenalized coefficients reduced the precision of the expected survival probability estimations. In case no external data is available, a double crossvalidation is useful to reduce the overoptimism of the measures computed internally (prediction error through the iBrier score and discrimination through the Uno’s Cstatistics). Other resampling techniques such as bootstrap or jackknife could have been also proposed, but are known to be more time consuming. In addition, a previous study comparing resampling methods highlighted that the crossvalidation performed well for predictive accuracy [37]. Regarding the construction of confidence intervals, the nonparametric (bootstrap) confidence intervals outperform the analytical approach in our simulation study. A possible explanation is that standard errors of penalized regression coefficients are difficult to estimate, we used the Hessian matrix based on the estimated parameter values in this study. Furthermore, this approach does not take into account the upstream variable selection process, which adds variability and could explain why the coverage probability is lower than the nominal level.
Finally, we also considered to possibly use smoothing splines and to categorize the prognostic score into four risk groups to obtain a more meaningful graphical visualization of the model. Of course, the cutoffs are arbitrary and should be predefined by warranting a sufficient number of patients per group. Even if the categorization of the prognostic scores slightly reduces the information, this is counterbalanced by a more straightforward graphical representation of the model predictions. In particular, this reduction in coverage probability is minimal with the bootstrap approach.
To the best of our knowledge, no simulation study has previously investigated the accuracy and precision of expected survival probabilities from a penalized Cox regression model investigating in highdimensional spaces both prognostic biomarkers and treatmenteffect modifiers in a randomized clinical trial. Matsui et al. [3] showed in a real data example how to obtain individual estimations from prediction models based on both prognostic biomarkers and treatmenteffect modifiers. However, they did not mention the problem of computing confidence intervals and selected the biomarkers through a univariate approach, which is suboptimal in a highdimensional setting [10]. The inference of survival probabilities is a topic increasingly discussed in the literature. Two recent articles [5, 6] focused on a prognostic setting, mainly in lowdimensional spaces. Sinnott and Cai [5] proposed a twostep ensemble voting approach and Lin and Halabi [6] proposed a perturbation approach to better estimate the standard errors, but did not investigate the impact on expected survival probabilities.
We also applied the proposed strategies to a randomized clinical trial of early breast cancer. Interestingly, the identified model contains some biomarkers already known in the biomedical literature but the interaction strength seemed low based on double crossvalidation. Due to the low number of events of the complete data set, we did not separate it into training and validation sets. Of note, the large width of the confidence intervals of the survival predictions translates the high uncertainty around these measures, as also discussed by [5]. The identified model contains a very large number of biomarkers, but parsimonious models could be preferred. Some strategies have been proposed to reduce the number of selected biomarkers by selecting a more stringent tuning parameter such as the pcvl criterion [35]. Finally, a remaining open question concerns the integration of clinical and genomic variables. Ongoing research focuses on this topic, but no consensus has been reached yet [38, 39]. In the application shown, we left the clinical variables unpenalized because they are known to have a strong prognostic main effect in the early breast cancer setting.
Conclusions
In this paper, we propose a unified framework for developing and validating a prediction model with treatmenteffect modifiers from highdimensional survival data in a randomized clinical trial. We suggest to: internally estimate the performance of the model through a double crossvalidation; estimate the expected survival probabilities at a given horizon for future patients and construct confidence intervals thereof using bootstrap; and visualize them using different plots for different prognostic risk groups with smoothed splines to obtain a meaningful graphical visualization.
Abbreviations
 Anly:

Analytical approach
 Boot:

Bootstrap approach
 Bsplines:

Basis splines
 C:

Uno’s Cstatistic
 CI:

Confidence interval
 CP:

Coverage probability
 CV:

Crossvalidation
 DRFS:

Distantrecurrence free survival
 iBrier:

Integrated Brier score
 MB:

Mean bias
 SE:

Standard error
 ΔC:

Difference in Uno’s Cstatistic between the two treatment arms
References
PogueGeile KL, Kim C, Jeong JH, Tanaka N, Bandos H, Gavin PG, et al. Predicting degree of benefit from adjuvant trastuzumab in NSABP trial B31. J Natl Cancer Inst. 2013;105:1782–8.
Perez EA, Thompson EA, Ballman KV, Anderson SK, Asmann YW, Kalari KR, et al. Genomic analysis reveals that immune function genes are strongly linked to clinical outcome in the North Central Cancer Treatment Group n9831 Adjuvant Trastuzumab Trial. J Clin Oncol. 2015;33:701–8.
Matsui S, Simon R, Qu P, Matsui S, Shaughnessy JD, Barlogie B, Crowley J. Developing and Validating Continuous Genomic Signatures in Randomized Clinical Trials for Predictive Medicine. Clin Cancer Res. 2012;18:6065–73.
Yang H, Tang R, Hale M, Huang J. A visualization method measuring the performance of biomarkers for guiding treatment decisions. Pharm Stat. 2016;15:152–64.
Sinnott JA, Cai T. Inference for survival prediction under the regularized Cox model. Biostatistics 2016;17:692–707.
Lin C, Halabi S. A Simple Method for Deriving the Confidence Regions for the Penalized Cox’s Model via the Minimand Perturbation. Commun Stat  Theory Methods. 2016.
Rothwell PM. Subgroup analysis in randomised controlled trials: importance, indications, and interpretation. Lancet. 2005;365:176–86.
Windeler J. Prognosis  what does the clinician associate with this notion? Stat Med. 2000;19:425–30.
Cox DR. Regression models and lifetables (with discussion). J R Stat Soc Ser B. 1972;34:187–220.
Ternès N, Rotolo F, Heinze G, Michiels S. Identification of biomarkerbytreatment interactions in randomized clinical trials with survival outcomes and highdimensional spaces. Biometrical J. doi: 10.1002/bimj.201500234. [Epub ahead of print]
Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B. 1996;58:267–88.
Zou H. The adaptive lasso and its oracle properties. J Am Stat Assoc. 2006;101:1418–29.
Zhang HH, Lu W. Adaptive lasso for Cox’s proportional hazards model. Biometrika. 2007;94:691–703.
Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc. 2001;96:1348–60.
Verweij PJ, van Houwelingen HC. Crossvalidation in survival analysis. Stat Med. 1993;12:2305–14.
Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33:1–22.
Friedman J, Hastie T, Simon N, Tibshirani R. Lasso and ElasticNet Regularized Generalized Linear Models. Rpackage version 2.05. 2016. https://cran.rproject.org/package=glmnet. Accessed 12 Dec 2016.
Harrell FEJ, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med. 1996;15:361–87.
Simon RM, Subramanian J, Li MC, Menezes S. Using crossvalidation to evaluate predictive accuracy of survival risk classifiers based on highdimensional data. Brief Bioinform. 2011;12(3):203–14.
Breslow NE. Contribution to the discussion of the paper by DR Cox. J R Stat Soc Ser B. 1972;34:216–7.
Therneau TM, Grambsch PM. Modeling survival data: extending the Cox model. New York: Springer Science & Business Media; 2000.
Breslow N. Covariance analysis of censored survival data. Biometrics. 1974;30:89–99.
Therneau TM, Lumley T. survival: Survival Analysis. Rpackage version 2.401. 2016. https://cran.rproject.org/package=survival. Accessed 12 Dec 2016.
Ng P, Maechler M. A fast and efficient implementation of qualitatively constrained quantile smoothing splines. Stat Model. 2007;7:315–28.
Ng P, Maechler M. Qualitatively Constrained (Regression) Smoothing Splines via Linear Programming and Sparse Matrices. Rpackage version 1.31. 2015. https://cran.rproject.org/package=cobs. Accessed 12 Dec 2016.
Cox DR. Note on grouping. J Am Stat Assoc. 1957;52:543–7.
Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression : a bad idea. Stat Med. 2006;127–141.
Pang H, Tong T, Zhao H. Shrinkagebased diagonal discriminant analysis and its applications in highdimensional data. Biometrics. 2009;65:1021–9.
Graf E, Schmoor C, Sauerbrei W, Schumacher M. Assessment and comparison of prognostic classification schemes for survival data. Stat Med. 1999;18:2529–45.
Uno H, Cai T, Pencina MJ, D’Agostino RB, Wei LJ. On the Cstatistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Stat Med. 2011;30:1105–17.
Song GD, Sun Y, Shen H, Li W. SOX4 overexpression is a novel biomarker of malignant status and poor prognosis in breast cancer patients. Tumor Biol. 2015;36:4167–73.
Abba MC, Sun H, Hawkins KA, Drake JA, Hu Y, Nunez MI, et al. Breast cancer molecular signatures as determined by SAGE: correlation with lymph node status. Mol Cancer Res. 2007;5:881–90.
Andre F, Dieci MV, Dubsky P, Sotiriou C, Curigliano G, Denkert C, et al. Molecular pathways: involvement of immune pathways in the therapeutic response and outcome in breast cancer. Clin Cancer Res. 2013;19:28–33.
Loi S, Michiels S, Salgado R, Sirtaine N, Jose V, Fumagalli D, et al. Tumor infiltrating lymphocytes is prognostic and predictive for trastuzumab benefit in early breast cancer: results from the FinHER trial. Ann Oncol. 2014;25:1544–50.
Ternès N, Rotolo F, Michiels S. Empirical extensions of the lasso penalty to reduce the false discovery rate in high dimensional Cox regression models. Stat Med. 2016;35:2561–73.
Hayes DF, Markus HS, Leslie RD, Topol EJ. Personalized medicine: risk prediction, targeted therapies and mobile health technology. BMC Med. 2014;12:37.
Molinaro AM, Simon R, Pfeiffer RM. Prediction error estimation: a comparison of resampling methods. Bioinformatics. 2005;21:3301–7.
Bøvelstad HM, Nygård S, Borgan Ø. Survival prediction from clinicogenomic modelsa comparative study. BMC Bioinformatics. 2009;10:413.
De Bin R, Sauerbrei W, Boulesteix A. Investigating the prediction ability of survival models based on both clinical and omics data : two case studies. Stat Med. 2014;33:5310–29.
Acknowledgments
Nils Ternès acknowledges financial support by Foundation Philanthropia LombardOdier. The authors also acknowledge the National Surgical Adjuvant Breast and Bowel Project (NSABP) investigators of the B31 trial who submitted data from the original study to dbGaP (dbGaP Study Accession: phs000826.v1.p1) and the NIH data repository. The B31 trial was supported by: National Cancer Institute, Department of Health and Human Services, Public Health Service, Grants U10CA12027, U10CA69651, U10CA37377, and U10CA69974, and by a grant from the Pennsylvania Department of Health.
Funding
This work was supported by the Foundation Philanthropia LombardOdier as a PhD scholarship. The funding sources had no role in study design, data collection, data analysis, data interpretation, or manuscript writing.
Availability of data and materials
The R code implementing all the different approaches is provided in the R package biospear, publicly available on the CRAN.
Authors’ contributions
NT performed the statistical analyses and drafted the manuscript. NT, FR and SM jointly contributed to writing the study protocol, interpreting and discussing results, and to writing the article. NT, FR and SM read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Author information
Affiliations
Corresponding author
Additional files
Additional file 1:
Prediction measures of the selected models by the lasso penalty. Additional results of the simulation study. (DOCX 16 kb)
Additional file 2:
Accuracy and precision of the survival probabilities, and coverage probability of their 95% confidence intervals of the selected models by the lasso penalty. Additional results of the simulation study. (DOCX 16 kb)
Additional file 3:
Accuracy and precision of the survival probabilities, and coverage probability of the associated 95% confidence intervals of the selected models by the adaptive lasso penalty (scenarios with S _{0} ~ 50%). Additional results of the simulation study. (DOCX 17 kb)
Additional file 4:
Accuracy and precision of the survival probabilities, and coverage probability of the associated 95% confidence intervals of the selected models by the refit adaptive lasso penalty. Additional results of the simulation study. (DOCX 16 kb)
Additional file 5:
Prediction measures of the selected models by the adaptive lasso penalty (scenarios with p = 100 and 1000). Additional results of the simulation study. (DOCX 21 kb)
Additional file 6:
Prediction measures of the selected models by the adaptive lasso penalty (scenarios with lower censoring rate). Additional results of the simulation study. (DOCX 18 kb)
Additional file 7:
Accuracy and precision of the survival probabilities, and coverage probability of the associated 95% confidence intervals of the selected models by the adaptive lasso penalty (scenarios with p = 100 and 1000). Additional results of the simulation study. (DOCX 18 kb)
Additional file 8:
Accuracy and precision of the survival probabilities, and coverage probability of the associated 95% confidence intervals of the selected models by the adaptive lasso penalty (scenarios with lower censoring rate). Additional results of the simulation study. (DOCX 18 kb)
Additional file 9:
Developed clinicogenomic model through the full biomarkerbytreatment interaction Cox model subject to the lasso penalty. Additional results of the breast cancer application. (DOCX 15 kb)
Additional file 10:
Developed clinicogenomic model through the full biomarkerbytreatment interaction Cox model subject to the adaptive lasso penalty with λ selected through the pcvl criterion. Additional results of the breast cancer application. (DOCX 15 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Ternès, N., Rotolo, F. & Michiels, S. Robust estimation of the expected survival probabilities from highdimensional Cox models with biomarkerbytreatment interactions in randomized clinical trials. BMC Med Res Methodol 17, 83 (2017). https://doi.org/10.1186/s1287401703540
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1287401703540
Keywords
 Cox model
 Penalized regression
 Prediction model
 Treatmenteffect modifiers
 Prognostic biomarkers
 Survival estimation
 Confidence intervals
 Precision medicine
 Highdimensional data