Added predictive value of omics data: specific issues related to validation illustrated by two case studies
 Riccardo De Bin^{1},
 Tobias Herold^{2, 3} and
 Anne-Laure Boulesteix^{1}
https://doi.org/10.1186/1471-2288-14-117
© De Bin et al.; licensee BioMed Central Ltd. 2014
Received: 6 June 2014
Accepted: 18 September 2014
Published: 28 October 2014
Abstract
Background
In recent years, the importance of independent validation of the prediction ability of a new gene signature has been widely recognized. Recently, with the development of gene signatures that integrate rather than replace the clinical predictors in the prediction rule, the focus has shifted to the validation of the added predictive value of a gene signature, i.e. to verifying that including the new gene signature in a prediction model improves its prediction ability.
Methods
The high-dimensional nature of the data from which a new signature is derived raises challenging issues and necessitates the modification of classical methods to adapt them to this framework. Here we show how to validate the added predictive value of a signature derived from high-dimensional data and critically discuss the impact of the choice of methods on the results.
Results
The analysis of the added predictive value of two gene signatures developed in two recent studies on the survival of leukemia patients allows us to illustrate and empirically compare different validation techniques in the high-dimensional framework.
Conclusions
The issues related to the high-dimensional nature of the omics predictor space affect the validation process. An analysis procedure based on repeated cross-validation is suggested.
Background
In the last 15 years numerous signatures derived from high-dimensional omics data such as gene expression data have been suggested in the literature. Bitter disillusionment followed the enthusiasm of the first years, as researchers realized that the predictive ability of most signatures failed to be validated when evaluated on independent datasets. This issue is now widely recognized, and validation is considered most important in omics-based prediction research by both quantitative scientists such as statisticians or bioinformaticians and medical doctors [1–6]; see also topic 18 of the recently published checklist for the use of omics-based predictors in clinical trials [7].
A validation dataset can be generated by randomly splitting the available dataset into a training set and a validation set. This type of validation does not yield information on the potential performance of the signature on patients recruited in different places or at different times. The training and validation patients are drawn from the same population and are thus expected to be similar with respect to all features relevant to the outcome. In this case, validation can be seen as an approach to correct for all optimization procedures taking place while deriving the signature from the training data [8, 9]. External and temporal validations, in contrast, consider patients from a different place or recruited at a later time point, respectively. They give information on the potential performance of the signature when applied to patients in clinical settings in the future. No matter how the validation dataset is chosen, the evaluation of prediction models using validation data is known to yield more pessimistic results than the evaluation performed on the training dataset using cross-validation or related techniques [6]. This is especially true when high-dimensional data are involved, since they are more affected by overfitting issues.
George [3] states that “the purpose of validation is not to see if the model under study is ‘correct’ but to verify that it is useful, that it can be used as advertised, and that it is fit for purpose”. To verify that the model is useful, validation of the predictive ability of the omics model is not sufficient, as the clinical interest centers around the added value compared to previous existing models [10]. To verify that the new model is useful, one also needs to validate the added predictive value. This concept is not trivial from a methodological point of view and one may think of many different procedures for this purpose. While the problem of added predictive value has long been addressed in the literature for low-dimensional models, literature on added predictive value of signatures derived from high-dimensional data is scarce [11], although the high dimension of the predictor space adds substantial difficulties that have to be addressed by adapting classical methods.
In this paper we focus on this latter case, aiming to provide a better understanding of the process of validation of the added predictive value of a signature derived from high-dimensional data. We tackle this issue from an empirical perspective, using exemplary studies on the prediction of survival in leukemia patients which use high-dimensional gene expression data. Our goal is threefold: (i) to demonstrate the use of different methods related to the validation of added predictive value, (ii) to show the impact of the choice of the method on the results, and (iii) to suggest an analysis approach based on our own experience and previous literature.
To shed more light on the methodological issues and the actual use of the validation methods, we take advantage of two leukemia datasets which are paradigm cases in biomedical practice. In particular, their relatively small effective sample size (number of events) is typical of this kind of study. It is worth noting, however, that a statistical comparison whose results could be generalized needs a large number of studies [12] or convincing evidence from simulations, and therefore the two examples would have been meant as illustrative even if they had had a larger effective sample size. Furthermore, these studies allow us to pursue our goals in two different situations: one, ideal from a statistical point of view, in which the omics data are gathered in the same way in both the training and the validation sets, and one in which they are gathered with different techniques, making training and validation observations not directly comparable. In particular, in the first dataset we start from the work of Metzeler and colleagues [13], and we illustrate alternative approaches to study the added predictive value of their score, in addition to the validation strategy they performed, which is based on the p-value of a significance test in the Cox model. The second dataset, instead, gives us better insight into the approaches available in a situation in which a measurement error (in a broad sense, including the use of different techniques to measure the gene expressions) makes the validation process more complicated. This is not uncommon in biomedical practice, especially since specific technologies, such as TaqMan Low Density Array, enable rapid validation of the differential expressions of a subset of relevant genes previously detected with a more labor-intensive technique [14]. Therefore, it is worth considering this situation from a methodological point of view.
It is worth noting that the validation of the added predictive value concerns only the gene signature computed with data collected following the technique used in the validation set, not its version based on the training data. When the training and the validation data are not directly comparable, any analysis must be performed using only the information present in the validation set. In particular, a possible bad performance of the signature, in this case, would not mean an overall absence of added predictive value, but its lack of usefulness when constructed with data obtained with the latter technique.
Acute myeloid leukemia: REMARK-like profile of the analysis performed on the dataset
a) Patients, treatment and variables
Study and marker  Remarks
Marker  OS = 86-probe-set gene-expression signature
Further variables  v1 = age, v2 = sex, v3 = NPM1, v4 = FLT3-ITD
Reference  Metzeler et al. (2008)
Source of the data  GEO (reference: GSE12417)
Patients  n  Remarks
Training set, assessed for eligibility  163  Disease: acute myeloid leukemia; patient source: German AML Cooperative Group 1999-2003
Training set, excluded  0
Training set, included  163  Treatment: following AMLCG-1999 trial; gene expression profiling: Affymetrix HG-U133 A&B microarrays
Training set, with outcome events  105  Overall survival: death from any cause
Validation set, assessed for eligibility  79  Disease: acute myeloid leukemia; patient source: German AML Cooperative Group 2004
Validation set, excluded  0
Validation set, included  79  Treatment: 62 following AMLCG-1999 trial, 17 intensive chemotherapy outside the study; gene expression profiling: Affymetrix HG-U133 plus 2.0 microarrays
Validation set, with outcome events  33  Overall survival: death from any cause
Relevant differences between training and validation sets
Data source  Same research group, different time (see above)
Follow-up time  Much shorter in the validation set (see text)
Survival rate  Higher in the validation set (see Figure 2)
b) Statistical analyses of survival outcomes
Analysis  n  e  Variables considered  Results/remarks
A: Preliminary analysis (separately on training and validation sets)
A1: univariate  training: 163; validation: 79  training: 105; validation: 33  v1 to v4  Kaplan-Meier curves (Figure 1)
B: Evaluating clinical model and combined model on validation data (models fitted on training set, evaluated on validation set)
Sample sizes for B1-B3, n (e): training 163 (105); validation 79 (33). Variables considered: OS, v1 to v4
B1: overall prediction  Prediction error curves (Figure 5); integrated Brier score (text)
B2: discriminative ability  Comparison of Kaplan-Meier curves for risk groups: medians as cutpoints (Figure 6), K-means clustering (data not shown, see text); C-index (text); K-statistic (text)
B3: calibration  Kaplan-Meier curve vs average individual survival curves for risk groups (Figure 7); calibration slope (text)
C: Multivariate testing of the omics score in the validation data (only validation set involved)
C1: significance  79  33  OS, v1 to v4  Multivariate Cox model (Table 3)
D: Comparison of the predictive accuracy of clinical and combined models through cross-validation in the validation data (only validation set involved)
D1: overall prediction  79  33  OS, v1 to v4  Prediction error curves based on repeated cross-validation (Figure 8); prediction error curves based on repeated subsampling (data not shown, see text); prediction error curves based on repeated bootstrap resampling (data not shown, see text); integrated Brier score based on cross-validation (text)
E: Subgroup analysis (E1-E3 based on training and validation sets, E4 and E5 only on validation set; for all, separate analysis for female and male population)
Subgroup sizes for E1-E5, n (e): female, training 88 (54), validation 46 (16); male, training 74 (51), validation 33 (17). Variables considered: OS, v1 to v4
E1: overall prediction  Prediction error curves (Figure 9)
E2: discriminative ability  C-index (text); K-statistic (text)
E3: calibration  Calibration slope (text)
E4: significance  Multivariate Cox model (text)
E5: overall prediction  Prediction error curves based on cross-validation (Figure 10)
Chronic lymphocytic leukemia: REMARK-like profile of the analysis performed on the dataset
a) Patients, treatment and variables
Study and marker  Remarks
Marker  OS = 8-probe-set gene-expression signature
Further variables  v1 = age, v2 = sex, v3 = FISH, v4 = IGVH
Reference  Herold et al. (2011)
Source of the data  GEO (reference: GSE22762)
Patients  n  Remarks
Training set, assessed for eligibility  151  Disease: chronic lymphocytic leukemia; patient source: Department of Internal Medicine III, University of Munich (2001-2005)
Training set, excluded  0
Training set, included  151  Criteria: sample availability; gene expression profiling: 44 Affymetrix HG-U133 A&B microarrays, 107 Affymetrix HG-U133 plus 2.0 microarrays
Training set, with outcome events  41  Overall survival
Validation set, assessed for eligibility  149  Disease: chronic lymphocytic leukemia; patient source: Department of Internal Medicine III, University of Munich (2005-2007)
Validation set, excluded  18  Due to missing clinical information
Validation set, included  131  Criteria: sample availability; gene expression profiling: 149 qRT-PCR (only selected genes)
Validation set, with outcome events  40  Overall survival
Relevant differences between training and validation sets
Data source  Same institution, different time (see above)
Measurement of gene expressions  Affymetrix HG-U133 vs. TaqMan LDA (see text)
Survival rate  Lower in the validation set (see Figure 4)
b) Statistical analyses of survival outcomes
Analysis  n  e  Variables considered  Results/remarks
F: Preliminary analysis (separately on training and validation sets)
F1: univariate  training: 151; validation: 131  training: 41; validation: 40  v1 to v4  Kaplan-Meier curves (Figure 3)
G: Multivariate testing of the omics score in the validation data (only validation set involved)
G1: significance  131  40  OS, v1 to v4  Multivariate Cox model (Table 5)
H: Comparison of the predictive accuracy of clinical and combined models through cross-validation in the validation data (only validation set involved)
H1: overall prediction  131  40  OS, v1 to v4  Prediction error curves based on cross-validation (Figure 11); integrated Brier score based on cross-validation (text)
Data
Acute myeloid leukemia
The first dataset comes from a study conducted by Metzeler and colleagues [13] on patients with cytogenetically normal acute myeloid leukemia (AML). As one of the main results of the study, the authors suggest a signature based on the expression of 86 probe sets for predicting the event-free and overall survival time of the patients. In this paper we focus on the latter of the two outcomes, which is defined as the time interval between entry into the study and death. The signature was derived using the “supervised principal component” [16] technique, which in this study leads to a signature involving 86 probe sets. The supervised principal component technique consists of applying principal component analysis to the set of predictors most strongly correlated with the outcome; in this specific case, the authors used the univariate Cox scores as a measure of correlation, and they selected those predictors with absolute Cox score greater than a specific threshold derived by a 10-fold cross-validation procedure.
The 86-probe-set signature was derived using the omics information contained in a training set of 163 patients, with 105 events (patients deceased) and 58 right-censored observations. The validation set included 79 patients, with 33 events and 46 right-censored observations. Gene expression profiling was performed using Affymetrix HG-U133 A&B microarrays for the training set and Affymetrix HG-U133 plus 2.0 microarrays for the validation set. Both sets are available in the Gene Expression Omnibus (reference: GSE12417). Our starting point is the data as provided in the Web depository; see Table 1 for a brief description. For further details concerning specimen and assay issues, in accordance with the criteria developed by the US National Cancer Institute [7], we refer to the original paper [13]. We stress the importance, for clinical applicability of an omics-based predictor, of following the checklist provided by McShane and colleagues [7, 17].
Chronic lymphocytic leukemia
The second dataset comes from a study conducted by Herold and colleagues [19] on patients with chronic lymphocytic leukemia (CLL). The main goal of this study is also to provide a signature based on gene expression which can help to predict time-to-event outcomes, namely the time to treatment and the overall survival time. We again focus on the overall survival, as the authors did. The signature developed in this study is based on the expression of eight genes and was obtained using the “supervised principal component” technique, similarly to the previous study. In this study, however, the selection of the relevant gene expression predictors is more complex. The univariate Cox regressions measuring the strength of the association between survival time and each of the candidate predictors are not simply conducted on the whole dataset as in the previous study, but are instead repeated in 5000 randomly drawn bootstrap samples. In each of these samples, the association between each predictor and the outcome was computed, and the predictors with a significant association were selected. The 17 genes most frequently selected across the 5000 bootstrap replications were considered in a further step, which was necessary to discard highly correlated genes. The expressions of the 8 genes surviving this further selection were finally used to construct the prognostic signature. The use of a procedure based on bootstrap sampling is motivated by the need to increase stability and potentially reduce the influence of outliers [20].
For this study, there was also a training set that was used to derive the signature, and an independent validation set that was used to evaluate its accuracy. The former contains clinical and omics information on 151 patients, with 41 events and 110 right-censored observations. Among the 149 patients from the validation set, 18 were discarded due to missing values, resulting in a sample size of 131, with 40 events and 91 censored observations. The gene expression data are available in the Gene Expression Omnibus with reference number GSE22762. Further information about the omics data, as provided in the Web depository, is collected in Table 2. For this dataset as well, we refer the reader to the original paper [19] for additional details on the preliminary steps of data collection/preparation (and their compliance with the US National Cancer Institute’s criteria for the clinical applicability of an omics-based predictor [7]).
The peculiarity of this study is that the gene expressions were collected using a different technique for the training set than for the validation set. The training set gene expressions were measured using Affymetrix HG-U133 (44 Affymetrix HG-U133 A&B, 107 Affymetrix HG-U133 plus 2.0), while for the validation patients a low-throughput technique (TaqMan Low Density Array, LDA) was used to measure only those genes involved in the signature. The validation procedures, therefore, are restricted to use only the validation data and cannot take into consideration the training set.
The considered clinical predictors were age (considered continuous as in the previous study), sex, fluorescent in situ hybridization (FISH) and immunoglobulin variable region (IGVH) mutation status. FISH and IGVH are two widely used predictors in CLL studies [21]. The former is an index based on a hierarchical model proposed by Döhner and colleagues [18] that includes the possible deletion or duplication of some chromosomal regions (17p13, 11q22-23, 13q14, 12q13), and has 5 categories (0 = deletion of 13q14 only, 1 = deletion of 11q22-23 but no deletion of 17p13, 2 = deletion of 17p13, 3 = trisomy 12q13 but no deletion of 17p13 or 11q22-23, 4 = no previously mentioned chromosomal aberration), while the latter indicates whether IGVH is mutated or not.
Methods
Scores
where the abbreviation OS stands for omics score and the other abbreviations are the names of the involved genes. This score is linear, but in general scores may also show a more complex structure. In some cases they do not even have a simple closed form, for example when they are derived using machine learning tools like random forests.
Strategies
Regardless of the algorithm used to derive the omics score from the training data, its usefulness as a predictor for prognosis purposes has to be evaluated using a set of patients that have not been considered until now: the validation data. We now focus on this part of the analysis, with special emphasis on the question of the added predictive value given other well-established clinical predictors. The underlying idea is that the new omics score is relevant for clinical practice only if it improves the prediction accuracy [22] that one would obtain from existing predictors. An exception where the omics score may be useful even if it does not improve prediction accuracy is when it is, say, cheaper or faster to measure. We assume that this is not the case in most applications and that the question of the added predictive value is an important issue.
Here we consider the following situation: we have at our disposal the clinical data (predictors ${Z}_{1},\dots ,{Z}_{q}$) and the omics data (predictors ${X}_{1},\dots ,{X}_{p}$) for both the training and the validation sets. Furthermore, we know how the omics score can be calculated from the omics data. In the case of linear scores like those suggested in the two considered leukemia studies, it means that we know the coefficients and the name of each involved gene, either from a table included in a paper or from a software object. In the rest of this paper, the function used to calculate the omics score from the omics predictors ${X}_{1},\dots ,{X}_{p}$ will be denoted by ${\widehat{f}}^{\mathcal{T}}({X}_{1},\dots ,{X}_{p})$, where the hat and the superscript indicate that this function was estimated based on the training set.
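As an illustration of how a linear omics score is evaluated once its coefficients are known, the following minimal Python sketch computes a weighted sum of expression values. The gene names, the coefficients and the function name `linear_omics_score` are hypothetical placeholders introduced here; they are not the values of either leukemia signature.

```python
from typing import Mapping


def linear_omics_score(expression: Mapping[str, float],
                       coefficients: Mapping[str, float]) -> float:
    """Weighted sum of the expression values of the genes in the signature.

    Raises KeyError if a gene required by the signature was not measured.
    """
    return sum(weight * expression[gene] for gene, weight in coefficients.items())


# Hypothetical coefficients (the role played by the estimated function f-hat).
coefficients = {"GENE_A": 0.5, "GENE_B": -1.25, "GENE_C": 2.0}

# Hypothetical expression profile of one validation patient; genes that are
# not part of the signature (here GENE_D) are simply ignored.
patient = {"GENE_A": 2.0, "GENE_B": 4.0, "GENE_C": 1.5, "GENE_D": 9.9}

score = linear_omics_score(patient, coefficients)
print(score)
```

Only the genes listed in the coefficient table need to be measured in the validation patients, which is the property exploited when a low-throughput technique is used for validation.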
A. Evaluating the clinical model and the combined model on validation data. The most direct approach to the validation of the added predictive value of an omics score consists of (i) fitting two models to the training data: one involving clinical predictors only and one combining clinical predictors and the omics score of interest, and (ii) evaluating their prediction accuracy on the validation set. The added predictive value can then be considered validated if the prediction accuracy of the combined score (i.e., the score involving both the clinical predictors and the omics score) is superior to the prediction accuracy of the clinical score (i.e., the one based only on clinical predictors). This general approach has to be further specified with respect to
1) the procedure used to derive a combined prediction score;
2) the evaluation scheme used to compare the prediction accuracy of the clinical and combined prediction scores on the validation set.
is computed in the same way, without taking into account the omics information.
Regarding issue 2), we need to specify how we measure the prediction accuracy of the prognostic rules based on the clinical and the combined prediction scores. This involves a graphical or numerical investigation of their discriminative ability and calibration, either separately or simultaneously. We will focus later on this issue 2 in a dedicated section, “Evaluation criteria”. In the meantime, we want to stress that, within this strategy (strategy A), the measure of the prediction accuracy is computed in the validation set. There is a major issue related to this approach: the omics score, fitted to the training data, tends to (strongly) overfit these data and to consequently dominate the clinical predictors. This is because the training set is used twice: first for the estimation of ${\widehat{f}}^{\mathcal{T}}$ and then for the estimation of ${\widehat{\beta}}_{1}^{\mathcal{T}},\dots ,{\widehat{\beta}}_{q}^{\mathcal{T}},{\widehat{\beta}}_{\ast}^{\mathcal{T}}$. This issue will be discussed further when examining the application to our two exemplary datasets.
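For reference, the two models referred to in the text as (1) and (2) can be sketched as follows. This is a reconstruction under the assumption that both are standard Cox proportional hazards models, as the surrounding text implies; the baseline hazard symbol $\lambda_0(t)$ is notation introduced here, and the coefficients of the two models are estimated separately even though the same symbols ${\beta}_{j}$ are used.

$$
\begin{aligned}
\lambda(t \mid Z_1,\dots,Z_q,\text{OS}) &= \lambda_0(t)\,\exp\Bigl(\sum_{j=1}^{q}\beta_j Z_j + \beta_{\ast}\,\text{OS}\Bigr), && (1)\\
\lambda(t \mid Z_1,\dots,Z_q) &= \lambda_0(t)\,\exp\Bigl(\sum_{j=1}^{q}\beta_j Z_j\Bigr), && (2)
\end{aligned}
\qquad \text{with } \text{OS} = {\widehat{f}}^{\mathcal{T}}(X_1,\dots,X_p).
$$

Under strategy A, the coefficients of both models are estimated on the training set, which is the same set already used to estimate ${\widehat{f}}^{\mathcal{T}}$.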
B. Multivariate testing of the omics score in the validation data. To address this overfitting issue, model (1) can also be fitted on the validation data, yielding the estimates ${\widehat{\beta}}_{j}^{\mathcal{V}}$ (for $j=1,\dots ,q$) and ${\widehat{\beta}}_{\ast}^{\mathcal{V}}$ for the clinical predictors ${Z}_{1},\dots ,{Z}_{q}$ and the omics score OS, respectively. Here the superscript stresses the fact that the estimates are computed using the validation data. By fitting the model on the validation data, we do not face the overfitting issues mentioned above, because different sets are used to derive OS and to fit the coefficients of model (1). In this approach the clinical predictors of the training set are not used.
A test can then be performed to test the null hypothesis ${\beta}_{\ast}=0$, for instance a score test, a Wald test or a likelihood ratio test. The p-value can be used as a simple and familiar measure of association between the score and the outcome. However, the p-value is more related to the explained variability than to the prediction error, and a small p-value can also be found if the omics score hardly adds anything to the predictive value [11]. For example, the p-value gets smaller simply by increasing the sample size, even if the predictive ability of the model does not change [11]. Therefore, the use of p-values for the validation of the added predictive value of an omics score is not sensible.
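The dependence of the p-value on the sample size can be made concrete with a small numerical sketch. It assumes a fixed, hypothetical coefficient estimate of 0.1 and a standard error that shrinks as one over the square root of the number of events; both are simplifying assumptions made for illustration, not quantities taken from the two studies.

```python
import math


def wald_p_value(beta_hat: float, std_error: float) -> float:
    """Two-sided p-value of a Wald test of beta = 0 (normal approximation)."""
    z = abs(beta_hat) / std_error
    return math.erfc(z / math.sqrt(2.0))


beta_hat = 0.1  # hypothetical, deliberately small effect of the omics score

p_values = {}
for n_events in (40, 400, 4000):
    # Assumed scaling: the standard error decreases like 1 / sqrt(events).
    std_error = 1.0 / math.sqrt(n_events)
    p_values[n_events] = wald_p_value(beta_hat, std_error)

for n_events, p in p_values.items():
    print(n_events, p)
```

The estimated effect, and hence the contribution of the score to the predictions, is identical in the three scenarios; only the amount of data changes, and the p-value decreases accordingly.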
C. Comparison of the predictive accuracy of the models with and without omics score through cross-validation in the validation data. To focus on predictive ability, one option consists of evaluating the combined model (1) and the model based on clinical data only (2) through cross-validation (or a related procedure) on the validation set. The main reason to perform this procedure is to avoid the overfitting issues related to the aforementioned double use of the training data for variable selection and parameter estimation. The cross-validation procedure mimics the ideal situation in which three sets are available: one to construct the omics score, one to estimate the parameters and one to test the model. This is performed by splitting the validation set into k subsets (folds): in each of the k iterations, the outcome in one fold (the “test set”) is predicted using, in turn, the clinical and the combined models fitted on the remaining k − 1 folds (the “learning set”). Comparing these predictions with the actual values of the outcome in the test fold, we can compute a measure of prediction accuracy. As already stated for strategy A, the prediction accuracy of the prognostic rules based on the clinical and the combined prediction scores can be measured in terms of discriminative ability, calibration, or these two properties simultaneously. The details are explained in the dedicated section. Since in each cross-validation step parameter estimation and measurement of the prediction accuracy are performed on independent sets, we do not face overfitting issues. The averages of the results (in terms of prediction accuracy) obtained in the k iterations for the two models allow the assessment of the added predictive value of the omics score.
Note that for this approach standard multivariate Cox regression may be replaced by any other prediction method if appropriate, for example a method which deals better with the collinearity of the clinical predictors ${Z}_{1},\dots ,{Z}_{q}$ and the omics score.
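The repeated cross-validation scheme of strategy C can be sketched generically as follows. To keep the example self-contained, it uses a toy continuous outcome with squared error rather than a Cox model with a survival-specific criterion; the function names, the simulated data and the two toy "models" (a baseline that ignores the extra predictor and one that uses it) are all assumptions made for illustration.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Row = Tuple[float, float]  # (predictor, outcome)


def k_fold_indices(n: int, k: int, rng: random.Random) -> List[List[int]]:
    """Randomly partition the indices 0..n-1 into k folds."""
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def repeated_cv_error(data: Sequence[Row], fit: Callable, error: Callable,
                      k: int = 5, repeats: int = 10, seed: int = 0) -> float:
    """Average test-fold error over `repeats` random k-fold partitions."""
    rng = random.Random(seed)
    errors = []
    for _ in range(repeats):
        for test_idx in k_fold_indices(len(data), k, rng):
            held_out = set(test_idx)
            learning = [row for i, row in enumerate(data) if i not in held_out]
            test = [data[i] for i in test_idx]
            model = fit(learning)              # fitted on the k - 1 learning folds
            errors.append(error(model, test))  # evaluated on the left-out fold
    return mean(errors)


# Toy stand-in for the model without the new predictor: predicts the mean.
def fit_mean(rows):
    return mean(y for _, y in rows)

def mse_mean(model, rows):
    return mean((y - model) ** 2 for _, y in rows)

# Toy stand-in for the model with the new predictor: least-squares line.
def fit_line(rows):
    mx, my = mean(x for x, _ in rows), mean(y for _, y in rows)
    slope = (sum((x - mx) * (y - my) for x, y in rows)
             / sum((x - mx) ** 2 for x, _ in rows))
    return my - slope * mx, slope

def mse_line(model, rows):
    intercept, slope = model
    return mean((y - (intercept + slope * x)) ** 2 for x, y in rows)


# Simulated data in which the predictor is truly informative.
noise = random.Random(1)
data = [(i / 10, 2 * (i / 10) + noise.gauss(0, 1)) for i in range(50)]

err_without = repeated_cv_error(data, fit_mean, mse_mean)
err_with = repeated_cv_error(data, fit_line, mse_line)
print(err_without, err_with)
```

Because both calls use the same seed, the two models are compared on identical partitions, which removes one source of variability from the comparison.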
D. Subgroup analysis. Subgroup analyses may be helpful in the context of added predictive value for different reasons. Firstly, biological reasoning may be available. If there are few existing predictors, examining the performance of the omics score in all possible subgroups defined by the existing predictors is a direct approach to determine its added predictive value, i.e. whether it can discriminate between patients when existing predictors cannot (since they have the same values for all predictors). Secondly, even if there are too many combinations of existing predictors to apply this direct approach, applying the methods described in the above sections to subgroups may yield interesting results, for instance that the omics score has more added predictive value in a particular subgroup. The most important drawbacks of such subgroup analyses are related to sample size (each subgroup being smaller than the whole dataset) and multiple testing issues (if several subgroups are investigated in turn). Care is required in assessing the value of subgroup analyses.
Evaluation criteria
In the description of the different strategies, we have seen that a relevant aspect of validating the added predictive value of an omics score is how to measure the prediction accuracy of a prognostic rule. As we stated above, this can be done by investigating, either separately or in combination, the discriminative ability and the calibration. Specifically, the former describes the ability to discriminate between observations with and without the outcome or, in the case of a continuous outcome, to rank their values correctly: in the case of survival data, for example, to predict which observations have the higher risk. Since in this paper we focus on survival analysis, we refer only to those methods that handle time-to-event data. This is true also for the calibration, which, in this context, can be seen as a measure describing the agreement between the predicted and the actual survival times.
Discriminative ability: In the context of survival curves, the discriminative ability is, in principle, reflected by the distance between the survival curves for individuals or groups [23]. Therefore, a graphical comparison between the Kaplan-Meier curves can be used to assess this property: the best rule, indeed, is the one which leads to the most separated curves. In practice, we can split the observations into two groups, assembled considering the estimates of the linear predictors ${\eta}_{\text{comb}}=\sum _{j=1}^{q}{\beta}_{j}\cdot {Z}_{j}+{\beta}_{\ast}\cdot \text{OS}$ and ${\eta}_{\text{clin}}=\sum _{j=1}^{q}{\beta}_{j}\cdot {Z}_{j}$, for example, using their medians as cutpoints. In this way, we define a low- and a high-risk group for both cases (using ${\eta}_{\text{comb}}$ and ${\eta}_{\text{clin}}$), and we can plot the resulting four Kaplan-Meier curves. If the two curves related to the groups which are derived using ${\widehat{\eta}}_{\text{comb}}$ are much more separated than those related to the groups derived using ${\widehat{\eta}}_{\text{clin}}$, then we can assert the presence of added predictive value. In principle, more prognostic groups can be constructed, reflecting a division more meaningful from a medical point of view. Nevertheless, for the illustration purpose of this graphic, the two-group split is sufficient. In the same vein, the choice of the cutpoint is also not relevant, and we would expect similar results with different (reasonable) cutpoints.
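The median split and the Kaplan-Meier curves of the resulting risk groups can be sketched in a few lines of plain Python. The survival times, event indicators and linear predictor values below are made-up toy numbers, and the function names are introduced here for illustration only.

```python
from statistics import median
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float],
                 events: Sequence[int]) -> List[Tuple[float, float]]:
    """Return the (event time, survival probability) steps of the KM estimator."""
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        leaving = sum(1 for ti in times if ti == t)  # deaths and censorings at t
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def median_split(scores: Sequence[float]) -> List[bool]:
    """True for patients whose score exceeds the median (high-risk group)."""
    cut = median(scores)
    return [s > cut for s in scores]


# Toy data: follow-up times, event indicators (1 = death, 0 = censored)
# and hypothetical values of an estimated linear predictor.
times = [2, 3, 4, 5, 6, 7, 8, 9]
events = [1, 1, 0, 1, 1, 0, 1, 0]
eta = [2.1, 1.8, 1.5, 1.9, 0.4, 0.2, 0.6, 0.1]

high = median_split(eta)
high_curve = kaplan_meier([t for t, h in zip(times, high) if h],
                          [e for e, h in zip(events, high) if h])
low_curve = kaplan_meier([t for t, h in zip(times, high) if not h],
                         [e for e, h in zip(events, high) if not h])
print(high_curve)
print(low_curve)
```

Repeating the split once with the clinical linear predictor and once with the combined one gives the four curves described above; the wider the gap between the low- and high-risk curves, the better the discrimination.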
Numerical criteria, instead, can be based on the estimation of the concordance probability or on the prognostic separation of the survival curves. The most popular index which exploits the former idea is probably the C-index [24]. It consists of computing the proportion of all the “usable” pairs of patients for which the difference between the predicted outcomes and the difference between the true outcomes have the same sign. Here “usable” means that censoring does not prevent the ordering of the pair. This limitation shows the dependence of this index on the censoring scheme, which may compromise its performance. To cope with this issue, in this paper we use the version of the C-index described in Gerds and colleagues [25]. Moreover, for the same reason, we also consider the alternative index proposed by Gönen & Heller [26], which relies on the proportional hazards assumption and is applicable when a Cox model is used. For both indexes, the highest value denotes the best rule (on a scale from 0 to 1).
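The idea of counting concordant "usable" pairs can be illustrated with the basic form of the concordance index. Note that this sketch is the simple pair-counting version, not the censoring-adjusted estimator of Gerds and colleagues [25] used in this paper; the data are toy values and the function name is introduced here.

```python
from typing import Sequence


def basic_c_index(times: Sequence[float], events: Sequence[int],
                  risk_scores: Sequence[float]) -> float:
    """Proportion of usable pairs ordered correctly by the risk score.

    A pair (i, j) is usable if patient i has an observed event strictly
    before the observed time of patient j. Ties in the score count 1/2.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                usable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable


# Toy data: a higher risk score should correspond to a shorter survival time.
times = [2, 4, 6, 8]
events = [1, 0, 1, 1]
risk = [0.9, 0.7, 0.3, 0.5]

c_index = basic_c_index(times, events, risk)
print(c_index)
```

In this toy example four pairs are usable and three of them are ordered correctly; the censored second patient never serves as the earlier member of a pair, which is how the censoring pattern enters the index.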
Calibration: The calibration can also be evaluated graphically. A simple method consists of comparing the Kaplan-Meier curve (observed survival function) computed in the validation set with the average of the predicted survival curves of all the observations of the validation sample [23]. The closer the predicted curve is to the Kaplan-Meier curve, the better the calibration of the prognostic rule. Under the proportional hazards assumption, a numeric result can be obtained via the “calibration slope”. This particular approach consists of fitting a Cox model with the prognostic score as the only predictor. Good calibration leads to an estimate of the regression coefficient close to 1. It is worth pointing out that this procedure focuses on the calibration aspect and does not constitute itself, as sometimes claimed in the literature, a validation of the prediction model [23]. Calibration is often considered less important than discriminative ability, because a recalibration procedure can be applied whenever appropriate.
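The calibration slope amounts to a Cox regression with a single covariate, which is simple enough to sketch without a survival library. The following minimal example maximises the partial likelihood by Newton-Raphson; it assumes no tied event times, caps the step size as a simple safeguard, and uses toy data, so it is a sketch of the idea rather than a substitute for an established implementation.

```python
import math
from typing import Sequence


def calibration_slope(times: Sequence[float], events: Sequence[int],
                      score: Sequence[float], iterations: int = 50) -> float:
    """Cox partial-likelihood estimate of the coefficient of `score`."""
    beta = 0.0
    for _ in range(iterations):
        gradient = 0.0
        information = 0.0
        for i, (t_i, d_i) in enumerate(zip(times, events)):
            if not d_i:
                continue  # censored observations contribute only via risk sets
            risk_set = [j for j, t_j in enumerate(times) if t_j >= t_i]
            weights = [math.exp(beta * score[j]) for j in risk_set]
            total = sum(weights)
            m1 = sum(w * score[j] for w, j in zip(weights, risk_set)) / total
            m2 = sum(w * score[j] ** 2 for w, j in zip(weights, risk_set)) / total
            gradient += score[i] - m1        # score function contribution
            information += m2 - m1 ** 2      # weighted variance in the risk set
        if information == 0.0:
            break
        step = max(-1.0, min(1.0, gradient / information))  # capped Newton step
        beta += step
        if abs(step) < 1e-10:
            break
    return beta


# Toy validation data and a hypothetical prognostic score.
times = [1, 2, 3, 4, 5, 6, 7, 8]
events = [1, 1, 1, 1, 1, 1, 1, 1]
score = [1.2, 0.3, 0.9, 1.5, -0.2, 0.4, -0.8, 0.1]

slope = calibration_slope(times, events, score)
slope_doubled = calibration_slope(times, events, [2 * s for s in score])
print(slope, slope_doubled)
```

The second call illustrates why the slope is read against the value 1: a score that is twice as extreme as it should be, a typical symptom of overfitting, yields half the slope.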
Overall performance: a measure of the overall performance of a prognostic rule should incorporate both discrimination and calibration. The integrated Brier score [27, 28] is such a measure. It summarizes in a single index the time-dependent information provided by the Brier score [29] (which measures the prediction error at a specific time t), by integrating it over time. The best prediction rule is the one which leads to the smallest value of the integrated Brier score. The Brier score can also be plotted as a function of time to provide the prediction error curve, which can be used to graphically evaluate the prediction ability of the model: the lower the curve, the better the prediction rule. The integrated Brier score corresponds to the area under this curve.
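The link between the Brier score, the prediction error curve and the integrated Brier score can be illustrated with a small sketch. It deliberately assumes that no observation is censored, so that the Brier score reduces to a mean squared difference; with censored data a weighting scheme such as the one in [29] would be required. The function names and the trapezoidal rule are assumptions for illustration.

```python
def brier_score(t, event_times, predicted_surv_at_t):
    """Brier score at time t, assuming NO censoring (simplifying assumption)."""
    total = 0.0
    for time, p in zip(event_times, predicted_surv_at_t):
        still_event_free = 1.0 if time > t else 0.0
        total += (still_event_free - p) ** 2
    return total / len(event_times)

def integrated_brier_score(grid, scores):
    """Area under the prediction error curve, approximated by the trapezoidal rule."""
    area = 0.0
    for k in range(1, len(grid)):
        area += 0.5 * (scores[k] + scores[k - 1]) * (grid[k] - grid[k - 1])
    return area

# Hypothetical prediction error curve evaluated at three time points.
print(integrated_brier_score([0.0, 1.0, 2.0], [0.0, 0.2, 0.1]))
```

Some definitions additionally divide the area by the length of the time interval; the sketch follows the description in the text and returns the plain area.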
We note that, in order to compute these measures, different levels of information from the training set are needed [23]. For example, the baseline hazard function is necessary to assess calibration, while it is not needed to evaluate the discriminative ability via Kaplan-Meier curves.
Characteristics of the measures implemented to evaluate the prediction ability of a model
Aspect  Measure  Characteristics

Discriminative ability  Kaplan-Meier curves for risk groups  Better with greater distance between the Kaplan-Meier curves for the low- and high-risk groups
Discriminative ability  C-index  Estimates the concordance probability, i.e. the probability that the score correctly orders two patients with respect to their survival time; higher values correspond to better prediction
Discriminative ability  K-statistic  Alternative to the C-index; works only under the proportional hazards assumption
Calibration  Survival curves  Compares the observed survival function with the average predicted curve
Calibration  Calibration slope  Computes the regression coefficient of the prognostic score as the only predictor; the best values are those close to 1; related to overfitting issues
Overall prediction  Prediction error curves  Plots the Brier score versus time; the closer the curves are to the x-axis, the better the prediction
Overall prediction  Integrated Brier score  Computes the area under the prediction error curve; the smaller the value, the better the prediction
Results
Acute myeloid leukemia
In this subsection we illustrate the application of different methods and their impact on the results by using the acute myeloid leukemia dataset. For a summary of the analyses performed, we refer to the profile provided in Table 1.
Acute myeloid leukemia: estimates of the log-hazard in a multivariate Cox model fitted on the validation data, with the standard deviations and the p-values related to the hypothesis of nullity of the coefficients (simple null hypothesis)
Variable  Coeff  Sd(coeff)  P-value

Omics score  0.523  0.243  0.0312
Age (continuous)  0.022  0.015  0.1340
Sex (male)  0.643  0.404  0.1114
FLT3-ITD  0.436  0.440  0.3220
NPM1 (mutated)  0.377  0.404  0.3497
Possible sources of overfitting. As stated by Steyerberg et al. (2010) [28], calibration-in-the-large and calibration slope issues are common in the validation process, and they reflect the overfitting problem [38] that we mentioned previously in the Methods section. With particular regard to the calibration slope, the overfitting issue can be related to the need for shrinkage of the regression coefficients [28, 39, 40]. If we go back to Figure 7 and shrink the regression coefficients toward 0, we obtain good calibration (dotted lines, almost indistinguishable from the black one). In the clinical model, the shrinkage is performed by applying a factor of 0.92 to all four regression coefficients: the small amount of shrinkage necessary to move the average predicted curve close to the observed one reveals the relatively small effect of overfitting in a model constructed with low-dimensional predictors. To obtain the same result with the combined model, instead, we applied a relatively large shrinkage factor, 0.5, to the regression coefficient of the omics score only, leaving those of the clinical predictors unchanged. This reflects the typical situation of a model containing a predictor derived from high-dimensional data: since this predictor (the omics score) has been constructed (variable selection and weight estimation) and its regression coefficient estimated on the same set (the training set), overfitting largely affects the combined model. The fact that we need to apply the shrinkage factor only to the regression coefficient of the omics score, moreover, is a clear signal of how much the omics score, inasmuch as it is derived from high-dimensional data, dominates the clinical predictors. This may explain the large distance between the red and the green (continuous) lines in Figure 7.
As a result, the effect of the (possibly overfitting) omics score may hide the contribution of the clinical predictors when estimated on the same training set, so that in the validation step we in fact mostly evaluate the predictive value of the omics score. The fact that overfitting largely affects the calibration of the models, moreover, may influence the analyses based on a direct computation of the Brier score (strategy A), and a more refined approach (strategy C) may be required.
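The effect of shrinking only the omics coefficient can be sketched as follows. The coefficient magnitudes are those printed in the training column of the table of log-hazard ratios; their signs may have been lost in this version of the text, and the patient values and variable names are hypothetical, so the numbers serve purely as an illustration of the mechanics.

```python
def prognostic_index(coefs, patient, shrinkage=None):
    """Linear predictor of a Cox model, with optional per-variable shrinkage factors."""
    shrinkage = shrinkage or {}
    return sum(coefs[name] * shrinkage.get(name, 1.0) * patient[name] for name in coefs)

# Magnitudes as printed in the training column (signs not verified).
combined_coefs = {"omics_score": 0.642, "age": 0.021, "sex_male": 0.024,
                  "flt3_itd": 0.448, "npm1_mutated": 0.370}

# Hypothetical patient.
patient = {"omics_score": 1.0, "age": 60, "sex_male": 1, "flt3_itd": 1, "npm1_mutated": 0}

unshrunk = prognostic_index(combined_coefs, patient)
# Factor 0.5 applied to the omics score only, as described in the text.
shrunk = prognostic_index(combined_coefs, patient, shrinkage={"omics_score": 0.5})
print(unshrunk, shrunk)
```

Because the factor applies to a single coefficient, the prognostic index changes only through the omics term, while the clinical contributions stay exactly as estimated on the training set.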
Acute myeloid leukemia: differences in the estimates of the log-hazard ratio when the combined model is fitted on the training (first column) or on the validation (second column) data
Variable  Log-hazard ratio (training)  Log-hazard ratio (validation)

Omics score  0.642 (0.172)  0.523 (0.243)
Age (continuous)  0.021 (0.008)  0.022 (0.015)
Sex (male)  0.024 (0.208)  0.643 (0.404)
FLT3-ITD  0.448 (0.253)  0.436 (0.440)
NPM1 (mutated)  0.370 (0.215)  0.377 (0.404)
B. Multivariate testing of the omics score in the validation data. The combined multivariate model previously fitted on the training set can be further used to derive the p-value corresponding to the null hypothesis that the coefficient of the omics score is zero, by estimating its regression coefficients on the validation set. The results are reported in Table 4, and are in line with those presented in the original paper [13]. More precisely, the authors used only age, FLT3-ITD and NPM1 as clinical predictors, while here we also consider sex. Nevertheless, since the effect of sex is weak (p-value of 0.111), the p-value of the score that we are interested in is hardly affected by this additional predictor (here p-value = 0.031; in the original paper, 0.037). Since these values lie in a borderline area between the most commonly used significance levels of 0.01 and 0.05, we cannot clearly confirm the added predictive value of the omics score. Most importantly, this significance testing approach within the multivariate model does not provide any information on prediction accuracy, an aspect that is considered in the next section.
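As a plausibility check on the p-values discussed above, one can compute a two-sided Wald-type p-value from a coefficient and its standard deviation. The text does not state which test produced the values in Table 4, so the Wald form is an assumption; with the rounded entries of the table it gives values close to, but not identical with, the reported ones.

```python
import math

def wald_p_value(coef, sd):
    """Two-sided p-value for H0: coefficient = 0, using the normal approximation."""
    z = coef / sd
    # 2 * (1 - Phi(|z|)) written with the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2.0))

# Rounded values taken from Table 4 (validation data).
for name, coef, sd in [("Omics score", 0.523, 0.243), ("Sex (male)", 0.643, 0.404)]:
    print(name, round(wald_p_value(coef, sd), 4))
```

The calculation also makes the limitation stated in the text visible: the p-value depends only on the ratio of the coefficient to its standard deviation and says nothing about how much prediction accuracy is gained.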
Note that other resampling techniques for accuracy estimation might be used in place of 10-fold cross-validation. The respective advantages and pitfalls of these techniques have been the topic of a large body of literature [44, 45]. As a sensitivity analysis, we also performed our analysis using a 3-fold cross-validation procedure (repeated 100 times): the results, however, are very similar (data not shown). The repeated cross-validation is also very similar to the repeated subsampling procedure, which has often been used in the context of high-dimensional data analysis [46–48]. The latter considers at each iteration only one of the k cross-validation splits into learning and test sets. For a large number of subsampling iterations or a large number of cross-validation repetitions, respectively, both procedures are known to yield similar results [49], which was corroborated by our preliminary analyses (data not shown). Another alternative is the bootstrap: in each bootstrap iteration, the models can be fitted on a bootstrap sample (i.e., a sample randomly drawn with replacement from the validation set) and then evaluated using those observations that are not included in the bootstrap sample. Using the “0.632+” version of the bootstrap introduced by Efron and Tibshirani [50], based on 1000 bootstrap replications, we obtain results very similar to those obtained by the aforementioned techniques (data not shown).
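The resampling schemes mentioned above differ only in how the validation set is split into learning and test parts. The sketch below generates repeated k-fold splits and bootstrap/out-of-bag splits and averages the difference in prediction error between a clinical and a combined model. The model-fitting and error functions are placeholders supplied by the caller, and averaging over all folds and repetitions is an assumption about the aggregation; the “0.632+” correction is not implemented.

```python
import random

def repeated_kfold_indices(n, k=10, repeats=100, seed=0):
    """Yield (learning, test) index lists for `repeats` random k-fold partitions."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        folds = [idx[f::k] for f in range(k)]
        for f in range(k):
            learning = [i for g in range(k) if g != f for i in folds[g]]
            yield learning, folds[f]

def bootstrap_split(n, rng):
    """Draw a bootstrap sample; the test part is made of the observations left out."""
    boot = [rng.randrange(n) for _ in range(n)]
    chosen = set(boot)
    return boot, [i for i in range(n) if i not in chosen]

def repeated_cv_difference(data, fit_clinical, fit_combined, error,
                           k=10, repeats=100, seed=0):
    """Average of (clinical error - combined error) over all folds and repetitions."""
    diffs = []
    for learning, test in repeated_kfold_indices(len(data), k, repeats, seed):
        learn_set = [data[i] for i in learning]
        test_set = [data[i] for i in test]
        clinical = fit_clinical(learn_set)
        combined = fit_combined(learn_set)
        diffs.append(error(clinical, test_set) - error(combined, test_set))
    return sum(diffs) / len(diffs)
```

With an error measure such as the integrated Brier score, a positive average difference would indicate that including the omics score reduces the prediction error on the validation data.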
In any case, an unexpected relation between sex and the omics score seems to be present. A different way to investigate this relation consists of fitting a multivariate Cox model on the validation set, considering also the interaction between these two predictors. Although the p-value, as we stressed in the Methods section, is more related to the ability of the predictor to explain the outcome variability than to the predictive ability, its value for the interaction term (0.0499) seems to support the existence of an interaction. This result is hard to explain. Nothing in the medical literature seems to confirm such a strong interaction between sex and gene expression for leukemia (there are only rare cases of specific gene deletions known to be related to sex, but they are not considered here). This is in contrast to the case, for example, of the interaction between the omics score and FLT3-ITD, which is well known and was clearly stated in the original paper by Metzeler and colleagues [13]. This interaction could possibly be shown by performing the subgroup approach on the sample split between those patients with and those without FLT3-ITD: unfortunately, the small number of patients without FLT3-ITD does not allow us to use this variable to illustrate the subgroup analysis. The total independence between sex and FLT3-ITD in the sample (if we test the hypothesis of independence through a Fisher exact test, we obtain a p-value equal to 1) allows us to exclude the presence of spurious correlation. Moreover, we note that in a multivariate Cox model which includes the interaction term score*sex, the effect of the omics score is more significant (p-value 0.0035) than in the model without the interaction term (p-value 0.031, see Table 4). If we consider the interaction FLT3-ITD*score in the Cox model, instead, the p-value of the omics score is high (0.4189), showing that all its explanatory ability lies in the interaction with FLT3-ITD (p-value = 0.0020).
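The Fisher exact test used above to check the independence of sex and FLT3-ITD can be sketched for a 2x2 table as follows. The cell counts of the actual cross-tabulation are not given in the text, so the example tables are hypothetical; the two-sided p-value is computed by summing the probabilities of all tables that are no more likely than the observed one, which is one common convention.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables with probability not larger than the observed one
    # (a small relative tolerance guards against floating-point ties).
    p = sum(prob(x) for x in range(lowest, highest + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)

# Hypothetical, perfectly balanced table: the observed table is the most likely one.
print(fisher_exact_two_sided(5, 5, 5, 5))
```

A p-value of 1 arises whenever the observed table is the most probable one under independence, which is how a result such as the one reported for sex and FLT3-ITD can occur.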
It is worth noting, however, that the effective sample size (in survival analysis only those observations where an event occurs should be considered relevant) in the subgroup analysis is small (16 events for women, 17 for men). The results may thus be affected by peculiar characteristics of the sample, such as a specific pattern in the censoring scheme. To support this idea, we report that the K-statistic computed in the two subpopulations (male and female) gives results completely different from the C-index: its value, indeed, is increased by the inclusion of the omics score in the prognostic index both in the female (from 0.684 for the clinical model to 0.694 for the combined model) and in the male (from 0.631 to 0.665) subgroups. We would like to stress that the provided interpretations should be understood as illustrative, and not as a conclusion for the leukemia study.
Chronic lymphocytic leukemia
Here we show how the added predictive value can be validated in a dataset where the gene expressions are measured differently in the training and validation sets. We refer to the profile provided in Table 2 for a summary of the analyses performed.
A. Evaluating the clinical model and the combined model on validation data. The most notable peculiarity of this dataset is the different measurements of the gene expressions in the training and validation sets. Part of the advantage of the signature proposed in Herold et al. [19], indeed, lies in the relatively small number of involved genes (eight), which allows the practitioner to use a cheaper and more convenient platform to collect the data needed to compute the omics score. Nevertheless, the different measurements affect the validation strategy to be used for assessing the added predictive value of the omics score. In particular, it makes no sense to estimate a model which includes clinical predictors and omics score based on the training data and to apply this model to the validation data. Since the goal is to validate the added predictive value of the omics score when the gene expressions are collected with the technique used in the validation set, it is necessary to fit the considered models based on the validation data. This is what we do when applying the methods discussed below.
Chronic lymphocytic leukemia: estimates of the log-hazard in a multivariate Cox model fitted on the validation data, with the standard deviations and the p-values related to the hypothesis of nullity of the coefficients (simple null hypothesis)
Variable  Coeff  Sd(coeff)  P-value

Omics score  0.589  0.150  8.65×10^{-05}
Age (continuous)  0.113  0.023  6.82×10^{-07}
Sex (female)  0.157  0.343  0.6472 
FISH =1  0.171  0.459  0.7092 
FISH =2  1.352  0.590  0.0219 
FISH =3  0.195  0.665  0.7694 
FISH =4  0.459  0.427  0.2823 
IGVH (mutated)  0.695  0.416  0.0949 
Discussion
In this paper we deliberately focused on the case of the validation of omics scores fitted on training data in the context of survival analysis in the presence of a few clinical predictors. Other situations may be encountered in practice. Firstly, the omics score may be taken from a previous study, in which case the overfitting issue leading to an overestimation of its effect is no longer relevant and the omics score can be treated as any other candidate biomarker. Secondly, there may be situations where a validation set is not available (typically because the available dataset is not large enough to be split). In this case, other (resampling-based) approaches may be taken to test predictive value and assess the gain in predictive accuracy [51, 52]. Thirdly, the outcome of interest may be something other than the survival time. Binary outcomes (e.g., responder vs. non-responder) are common. The evaluation criteria used to assess predictive accuracy are of course different in this case. Fourthly, one may also consider the added predictive value of a high-dimensional set of predictors versus another high-dimensional set of predictors. This situation is becoming more common with the multiplication of high-throughput technologies generating, for example, gene expression data, copy number variation data, or methylation data. Data integration is currently a hot topic in statistical bioinformatics and prediction methods handling this type of data are still in their infancy.
Furthermore, we did not address in our paper the problem of the construction of the omics score. We simply assumed that it was estimated based on the training data with an appropriate method. The construction of such an omics score is of course not trivial and has indeed been the subject of numerous publications in biostatistics and bioinformatics in the last decade. From the point of view of predictive accuracy it may be advantageous to construct the omics score while taking the clinical predictors into account [47, 53, 54] in order to focus on the residual variability, a fact that we did not consider in this paper but plan to investigate in a subsequent study. The two omics scores analyzed here, indeed, were constructed without this expedient, and optimized to take the place of the clinical predictors rather than focusing on the added predictive value of the omics data.
Finally, we point out that, even in the case considered in our paper (validation of omics scores fitted on training data in the context of survival analysis in the presence of a few clinical predictors), further approaches are conceivable. For example, other evaluation criteria for prediction models may be considered; see [23] for a recent overview in the context of external validation. When considering combined prediction models we focused on the multivariate Cox model with clinical predictors and omics score as covariates and with linear effects only. Of course further methods could be considered in place of the Cox model with linear effects, including models with time-varying coefficients, parametric models or non-linear transformations of the predictors such as fractional polynomials.
As soon as one “tries out” many procedures for assessing added predictive value, however, there is a risk of conscious or subconscious “fishing for significance” or, in this case, “fishing for added predictive value”. To avoid such pitfalls, it is important that the choice of the method used in the final analyses presented in the paper is not driven by the significance of its results. If several sensible analysis strategies are adopted successively by the data analysts, they should consider reporting all results, not just the most impressive in terms of added predictive value.
Here we have summarized all our analyses in REMARK-type profile tables (namely, Tables 1 and 2), in order to increase transparency and to allow the reader to easily go through the study. Transparency is an important issue, and was also highlighted in the US National Cancer Institute’s criteria for the clinical applicability of an omics-based predictor [7, 17]. Among the 30 points listed in this checklist, one is clearly devoted to the validation of the omics-based predictor: the validation should be analytically and statistically rigorous. These papers also stress the importance of reproducibility of the analysis: in this vein, we provide all the R code used to obtain the results presented in this paper at http://www.ibe.med.unimuenchen.de/organisation/mitarbeiter/070_drittmittel/de_bin/index.html.
Conclusion
In this paper we illustrated and critically discussed the application of various methods with the aim of assessing the added predictive value of omics scores through the use of a validation set. In a nutshell, our study based on two recent leukemia datasets showed that:

When testing is performed for a multivariate model on the validation data, the omics score may have a significant p-value but show poor or no added predictive value when measured using criteria such as the Brier score. This is because a test in multivariate regression tests whether the effect of the omics score is zero but does not assess how much accuracy can be gained through its inclusion in the model.

To gain information on predictive value, and to “validate” it, it is necessary to apply models with and without the omics score to the validation data. There are essentially two ways to do that.

The first approach (denoted “Evaluating the clinical model and the combined model on validation data” in this paper) consists of fitting a clinical model and a combined model on the training data and comparing the prediction accuracy of both models on the validation data. This is essentially the most intuitive way to proceed in low-dimensional settings. The problem in high-dimensional settings is that the omics score is likely to overfit the training data. As a result, its effect might be overestimated when its regression coefficient is estimated on the same set that was used for its construction. We have seen how this leads to serious problems, especially in terms of poor calibration. Furthermore, this approach is not applicable when the omics data have been measured with different techniques in the training and validation sets, as in the CLL data.

The second approach, which we recommend in high-dimensional settings, consists of using a cross-validation-like procedure to compare models with and without the omics score using the validation set. By using the validation set only, we avoid the overfitting problem described above. When using this approach, we recommend performing as many repetitions of the cross-validation as computationally feasible, and averaging the results over the repetitions, in order to achieve more stable results.

Alternatively, one could also fit the models on the validation set and use an additional third set to assess them. This approach would avoid the use of cross-validation procedures that are known to be affected by a high variance, especially in high-dimensional settings. However, the opportunity to assess the models based on a third set is rarely given in the context of omics data, since datasets are usually too small to be split.

In any case, it is important that training and validation sets are completely independent. The practice of evaluating the prediction ability of a model, correctly fitted only on the training set, on the whole dataset obtained by merging the training and validation sets is not appropriate. This would result in an over-optimistic estimate of prediction accuracy: the optimism arising from the evaluation on the training data is only partially mitigated by the correct estimate obtained on the independent validation data [7, 55].

All in all, our procedures are in line with the recommendations given in a recent paper by Pepe and colleagues [22]. This paper suggests that, in the case of a binary outcome, all the tests based on the equality between the discriminative abilities of the clinical and the combined scores refer to the same null hypothesis, namely the nullity of the coefficient of a predictor in a regression model. Assuming that this statement also roughly applies to the survival analysis framework considered in our paper, it would mean that we can rely on the likelihood test performed on the regression coefficient of the omics score in the combined Cox model to test the difference in performance of the models with and without omics predictors. However, the same authors also claim that estimating the magnitude of the improvement in the prediction ability is much more important than testing its presence [22]. This cannot be done by looking at the regression coefficient of the omics score, as often discussed in the literature [56, 57] and illustrated through our AML data example. In this paper we have seen some procedures to quantify the improvement in prediction accuracy of a model containing an omics score derived from high-dimensional data, in order to validate its added predictive value.

Subgroup analyses might give valuable insights into the predictive value of the score, and were therefore illustrated through the example of the AML dataset. Normally, a subgroup analysis should be motivated by a clear biological reason and, importantly, performed only as far as allowed by the sample sizes. However, one should keep in mind that these analyses are possibly affected by multiple testing issues. Their results should be considered from an explorative perspective.
Based on our experience with the analysis of the two considered leukemia datasets and further similar datasets (data not shown), we recommend comparing the predictive accuracy of the models with and without the omics score through a resampling-based approach on the validation data. The repeated cross-validation procedure is the natural candidate, but we have seen that alternative methods can be implemented.
Declarations
Acknowledgements
RDB was financed by grant BO3139/41 from the German Science Foundation (DFG) to ALB. The authors wish to thank all the participants of the AMLCG trials and recruiting centers and especially Wolfgang Hiddemann, Thomas Büchner, Wolfgang E. Berdel and Bernhard J. Woermann. Further, we would like to thank the Laboratory for Leukemia Diagnostics for providing the microarray data and Karsten Spiekermann, Klaus Metzeler and Vindi Jurinovic for their advice and for collaboration regarding the datasets. Finally, we thank Willi Sauerbrei for pointing to the REMARK-type profile and for comments on various issues, Rory Wilson for his help in writing the paper and the three referees for the useful comments which improved the manuscript.
References
Simon R: Development and validation of therapeutically relevant multigene biomarker classifiers. J Nat Cancer Inst. 2005, 97: 866-867. 10.1093/jnci/dji168.
Buyse M, Loi S, Van’t Veer L, Viale G, Delorenzi M, Glas AM, d’Assignies MS, Bergh J, Lidereau R, Ellis P, Harris A, Bogaerts J, Therasse P, Floore A, Amakrane M, Piette F, Rutgers E, Sotiriou C, Cardoso F, Piccart MJ: Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. J Nat Cancer Inst. 2006, 98: 1183-1192. 10.1093/jnci/djj329.
George S: Statistical issues in translational cancer research. Clin Cancer Res. 2008, 14: 5954-5958. 10.1158/10780432.CCR074537.
Ioannidis JPA: Expectations, validity, and reality in omics. J Clin Epidemiol. 2010, 63: 960-963. 10.1016/j.jclinepi.2009.09.006.
Mischak H, Allmaier G, Apweiler R, Attwood T, Baumann M, Benigni A, Bennett SE, Bischoff R, Bongcam-Rudloff E, Capasso G, Coon JJ, D’Haese P, Dominiczak AF, Dakna M, Dihazi H, Ehrich JH, Fernandez-Llama P, Fliser D, Frokiaer J, Garin J, Girolami M, Hancock WS, Haubitz M, Hochstrasser D, Holman RR, Ioannidis JP, Jankowski J, Julian BA, Klein JB, Kolch W, et al: Recommendations for biomarker identification and qualification in clinical proteomics. Sci Trans Med. 2010, 2: 42
Castaldi PJ, Dahabreh IJ, Ioannidis JP: An empirical assessment of validation practices for molecular classifiers. Brief Bioinformatics. 2011, 12: 189-202. 10.1093/bib/bbq073.
McShane LM, Cavenagh MM, Lively TG, Eberhard DA, Bigbee WL, Williams PM, Mesirov JP, Polley MYC, Kim KY, Tricoli JV, Taylor JMG, Shuman DJ, Simon RM, Doroshow JH, Conley BA: Criteria for the use of omics-based predictors in clinical trials. Nature. 2013, 502: 317-320. 10.1038/nature12564.
Daumer M, Held U, Ickstadt K, Heinz M, Schach S, Ebers G: Reducing the probability of false positive research findings by pre-publication validation: experience with a large multiple sclerosis database. BMC Med Res Methodol. 2008, 8: 18. 10.1186/14712288818.
Boulesteix AL, Strobl C: Optimal classifier selection and negative bias in error rate estimation: an empirical study on high-dimensional prediction. BMC Med Res Methodol. 2009, 9: 85. 10.1186/14712288985.
Pencina MJ, D’Agostino Sr RB, D’Agostino Jr RB, Vasan RS: Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat Med. 2008, 27: 157-172. 10.1002/sim.2929.
Boulesteix AL, Sauerbrei W: Added predictive value of high-throughput molecular data to clinical data and its validation. Brief Bioinformatics. 2011, 12: 215-229. 10.1093/bib/bbq085.
Boulesteix AL: On representative and illustrative comparisons with real data in bioinformatics: response to the letter to the editor by Smith et al. Bioinformatics. 2013, 29: 2664-2666. 10.1093/bioinformatics/btt458.
Metzeler KH, Hummel M, Bloomfield CD, Spiekermann K, Braess J, Sauerland MC, Heinecke A, Radmacher M, Marcucci G, Whitman SP, Maharry K, Paschka P, Larson RA, Berdel WE, Büchner T, Wörmann B, Mansmann U, Hiddemann W, Bohlander SK, Buske C: An 86-probe-set gene-expression signature predicts survival in cytogenetically normal acute myeloid leukemia. Blood. 2008, 112: 4193-4201. 10.1182/blood200802134411.
Abruzzo LV, Lee KY, Fuller A, Silverman A, Keating MJ, Medeiros LJ, Coombes KR: Validation of oligonucleotide microarray data using microfluidic low-density arrays: a new statistical method to normalize real-time RT-PCR data. Biotechniques. 2005, 38: 785-792. 10.2144/05385MT01.
Altman DG, McShane LM, Sauerbrei W, Taube SE: Reporting recommendations for tumor marker prognostic studies (REMARK): explanation and elaboration. BMC Med. 2012, 10: 51. 10.1186/174170151051.
Bair E, Tibshirani R: Semi-supervised methods to predict patient survival from gene expression data. PLoS Biol. 2004, 2: 108. 10.1371/journal.pbio.0020108.
McShane LM, Cavenagh MM, Lively TG, Eberhard DA, Bigbee WL, Williams PM, Mesirov JP, Polley MYC, Kim KY, Tricoli JV, Taylor JMG, Shuman DJ, Simon RM, Doroshow JH, Conley BA: Criteria for the use of omics-based predictors in clinical trials: explanation and elaboration. BMC Med. 2013, 11: 220. 10.1186/1741701511220.
Döhner H, Stilgenbauer S, Benner A, Leupolt E, Kröber A, Bullinger L, Döhner K, Bentz M, Lichter P: Genomic aberrations and survival in chronic lymphocytic leukemia. N Engl J Med. 2000, 343: 1910-1916. 10.1056/NEJM200012283432602.
Herold T, Jurinovic V, Metzeler K, Boulesteix AL, Bergmann M, Seiler T, Mulaw M, Thoene S, Dufour A, Pasalic Z, Schmidberger M, Schmidt M, Schneider S, Kakadia PM, Feuring-Buske M, Braess J, Spiekermann K, Mansmann U, Hiddemann W, Buske C, Bohlander SK: An eight-gene expression signature for the prediction of survival and time to treatment in chronic lymphocytic leukemia. Leukemia. 2011, 25: 1639-1645. 10.1038/leu.2011.125.
Sauerbrei W, Boulesteix AL, Binder H: Stability investigations of multivariable regression models derived from low- and high-dimensional data. J Biopharm Stat. 2011, 21: 1206-1231. 10.1080/10543406.2011.629890.
Hallek M, Cheson BD, Catovsky D, Caligaris-Cappio F, Dighiero G, Döhner H, Hillmen P, Keating MJ, Montserrat E, Rai KR, Kipp TJ: Guidelines for the diagnosis and treatment of chronic lymphocytic leukemia: a report from the international workshop on chronic lymphocytic leukemia updating the national cancer institute–working group 1996 guidelines. Blood. 2008, 111: 5446-5456. 10.1182/blood200706093906.
Pepe MS, Kerr KF, Longton G, Wang Z: Testing for improvement in prediction model performance. Stat Med. 2013, 32: 1467-1482. 10.1002/sim.5727.
Royston P, Altman DG: External validation of a Cox prognostic model: principles and methods. BMC Med Res Methodol. 2013, 13: 33. 10.1186/147122881333.
Harrell F, Lee KL, Mark DB: Tutorial in biostatistics multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med. 1996, 15: 361-387. 10.1002/(SICI)10970258(19960229)15:4<361::AIDSIM168>3.0.CO;24.
Gerds TA, Kattan MW, Schumacher M, Yu C: Estimating a time-dependent concordance index for survival prediction models with covariate dependent censoring. Stat Med. 2013, 32: 2173-2184. 10.1002/sim.5681.
Gönen M, Heller G: Concordance probability and discriminatory power in proportional hazards regression. Biometrika. 2005, 92: 965-970. 10.1093/biomet/92.4.965.
Binder H, Schumacher M: Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models. BMC Bioinformatics. 2008, 9: 14. 10.1186/14712105914.
Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, Obuchowski N, Pencina MJ, Kattan MW: Assessing the performance of prediction models: a framework for some traditional and novel measures. Epidemiology. 2010, 21: 128. 10.1097/EDE.0b013e3181c30fb2.
Graf E, Schmoor C, Sauerbrei W, Schumacher M: Assessment and comparison of prognostic classification schemes for survival data. Stat Med. 1999, 18: 2529-2545. 10.1002/(SICI)10970258(19990915/30)18:17/18<2529::AIDSIM274>3.0.CO;25.
Royston P, Sauerbrei W: A new measure of prognostic separation in survival data. Stat Med. 2004, 23: 723-748. 10.1002/sim.1621.
Zheng Y, Cai T, Pepe MS, Levy WC: Time-dependent predictive values of prognostic biomarkers with failure time outcome. J Am Stat Assoc. 2008, 103: 362-368. 10.1198/016214507000001481.
Pencina MJ, D’Agostino RB, Steyerberg EW: Extensions of net reclassification improvement calculations to measure usefulness of new biomarkers. Stat Med. 2011, 30: 11-21. 10.1002/sim.4085.
Zheng Y, Parast L, Cai T, Brown M: Evaluating incremental values from new predictors with net reclassification improvement in survival analysis. Lifetime Data Anal. 2013, 19: 350-370. 10.1007/s109850129239z.
Vickers AJ, Elkin EB: Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making. 2006, 26: 565-574. 10.1177/0272989X06295361.
Vickers AJ, Cronin AM, Elkin EB, Gonen M: Extensions to decision curve analysis, a novel method for evaluating diagnostic tests, prediction models and molecular markers. BMC Med Inform Decis Making. 2008, 8: 53. 10.1186/14726947853.
Hielscher T, Zucknick M, Werft W, Benner A: On the prognostic value of survival models with application to gene expression signatures. Stat Med. 2010, 29: 818-829. 10.1002/sim.3768.
Crowson CS, Atkinson EJ, Therneau TM: Assessing calibration of prognostic risk scores. Stat Methods Med Res. 2013, doi:10.1177/0962280213497434
Harrell FE: Regression Modeling Strategies: with Applications to Linear Models, Logistic Regression, and Survival Analysis. 2001, New York: Springer
Copas JB: Regression, prediction and shrinkage. J R Stat Soc Ser B (Methodological). 1983, 45: 311-354.
Van Houwelingen J, Le Cessie S: Predictive value of statistical models. Stat Med. 1990, 9: 1303-1325. 10.1002/sim.4780091109.
van Houwelingen HC: Validation, calibration, revision and combination of prognostic survival models. Stat Med. 2000, 19: 3401-3415. 10.1002/10970258(20001230)19:24<3401::AIDSIM554>3.0.CO;22.
 Martinez JG, Carroll RJ, Müller S, Sampson JN, Chatterjee N: Empirical performance of crossvalidation with oracle methods in a genomics context. Am Stat. 2011, 65: 223228. 10.1198/tas.2011.11052.View ArticlePubMedPubMed CentralGoogle Scholar
 Boulesteix AL, Richter A, Bernau C: Complexity selection with crossvalidation for lasso and sparse partial least squares using highdimensional data. Algorithms from and for Nature and Life. 2013, Switzerland: Springer, 261268.View ArticleGoogle Scholar
 Molinaro AM, Simon R, Pfeiffer RM: Prediction error estimation: a comparison of resampling methods. Bioinformatics. 2005, 21: 33013307. 10.1093/bioinformatics/bti499.View ArticlePubMedGoogle Scholar
 Dougherty ER, Sima C, Hanczar B, BragaNeto UM: Performance of error estimators for classification. Curr Bioinformatics. 2010, 5: 5367. 10.2174/157489310790596385.View ArticleGoogle Scholar
 Bøvelstad HM, Nygård S, Størvold HL, Aldrin M, Frigessi A, Lingjærde OC, Borgan Ø: Predicting survival from microarray data  a comparative study. Bioinformatics. 2007, 23: 20802087. 10.1093/bioinformatics/btm305.View ArticlePubMedGoogle Scholar
 Bøvelstad HM, Nygård S, Borgan Ø: Survival prediction from clinicogenomic models  a comparative study. BMC Bioinformatics. 2009, 10: 41310.1186/1471210510413.View ArticlePubMedPubMed CentralGoogle Scholar
 Daye ZJ, Jeng XJ: Shrinkage and model selection with correlated variables via weighted fusion. Comput Stat Data Anal. 2009, 53: 12841298. 10.1016/j.csda.2008.11.007.View ArticleGoogle Scholar
 Boulesteix AL, Strobl C, Augustin T, Daumer M: Evaluating microarraybased classifiers: an overview. Cancer Inform. 2008, 6: 77PubMedPubMed CentralGoogle Scholar
 Efron B, Tibshirani R: Improvements on crossvalidation: the 632+ bootstrap method. J Am Stat Assoc. 1997, 92: 548560.Google Scholar
 Van De Wiel MA, Berkhof J, Van Wieringen WN: Testing the prediction error difference between 2 predictors. Biostatistics. 2009, 10: 550560. 10.1093/biostatistics/kxp011.View ArticlePubMedGoogle Scholar
 Boulesteix AL, Hothorn T: Testing the additional predictive value of highdimensional molecular data. BMC Bioinformatics. 2010, 11: 7810.1186/147121051178.View ArticlePubMedPubMed CentralGoogle Scholar
 Nevins JR, Huang ES, Dressman H, Pittman J, Huang AT, West M: Towards integrated clinicogenomic models for personalized medicine: combining gene expression signatures and clinical factors in breast cancer outcomes prediction. Hum Mol Genet. 2003, 12: 153157. 10.1093/hmg/ddg287.View ArticleGoogle Scholar
 Stephenson AJ, Smith A, Kattan MW, Satagopan J, Reuter VE, Scardino PT, Gerald WL: Integration of gene expression profiling and clinical variables to predict prostate carcinoma recurrence after radical prostatectomy. Cancer. 2005, 104: 290298. 10.1002/cncr.21157.View ArticlePubMedPubMed CentralGoogle Scholar
 McIntosh M, Anderson G, Drescher C, Hanash S, Urban N, Brown P, Gambhir SS, Coukos G, Laird PW, Nelson B, Palmer C: Ovarian cancer early detection claims are biased. Clin Cancer Res. 2008, 14: 7574View ArticlePubMedGoogle Scholar
 Altman D, Royston P: What do we mean by validating a prognostic model?. Stat Med. 2000, 19: 453473. 10.1002/(SICI)10970258(20000229)19:4<453::AIDSIM350>3.0.CO;25.View ArticlePubMedGoogle Scholar
 Pepe MS, Janes H, Longton G, Leisenring W, Newcomb P: Limitations of the odds ratio in gauging the performance of a diagnostic, prognostic, or screening marker. Am J Epidemiol. 2004, 159: 882890. 10.1093/aje/kwh101.View ArticlePubMedGoogle Scholar
Prepublication history
The prepublication history for this paper can be accessed here: http://www.biomedcentral.com/14712288/14/117/prepub
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.