A measure of the impact of CV incompleteness on prediction error estimation with application to PCA and normalization

Background: In applications of supervised statistical learning in the biomedical field it is necessary to assess the prediction error of the respective prediction rules. Often, data preparation steps are performed on the dataset in its entirety before training/test set based prediction error estimation by cross-validation (CV), an approach referred to as "incomplete CV". Whether incomplete CV can result in an optimistically biased error estimate depends on the data preparation step under consideration. Several empirical studies have investigated the extent of bias induced by performing preliminary supervised variable selection before CV. To our knowledge, however, the potential bias induced by other data preparation steps has not yet been examined in the literature. In this paper we investigate this bias for two common data preparation steps: normalization and principal component analysis (PCA) for dimension reduction of the covariate space. Furthermore, we obtain preliminary results for the following steps: optimization of tuning parameters, variable filtering by variance, and imputation of missing values.

Methods: We devise the easily interpretable and general measure CVIIM ("CV Incompleteness Impact Measure") to quantify the extent of bias induced by incomplete CV with respect to a data preparation step of interest. This measure can be used to determine whether a specific data preparation step should, as a general rule, be performed in each CV iteration or whether an incomplete CV procedure is acceptable in practice. We apply CVIIM to large collections of microarray datasets to answer this question for normalization and PCA.

Results: Performing normalization on the entire dataset before CV did not result in a noteworthy optimistic bias in any of the investigated cases. In contrast, when performing PCA before CV, medium to strong underestimates of the prediction error were observed in multiple settings.

Conclusions: While the investigated forms of normalization can safely be performed before CV, PCA has to be performed anew in each CV split to protect against optimistic bias.

Electronic supplementary material: The online version of this article (doi:10.1186/s12874-015-0088-9) contains supplementary material, which is available to authorized users.


A Simulation study for the example of supervised variable selection
As described in the section 'Simulation study' of the main paper, we conducted a simulation study to investigate basic statistical properties of $\mathrm{CVIIM}_{s,n,K}$.

Simulation setup
To make our simulation design representative of biomedical data, we use the transcriptomic dataset ProstatecTranscr to estimate realistic parameters for the data-generating process. Datasets of sizes $n = 50$ and $n = 100$ are generated with $p = 2000$ continuous variables and a binary target variable with equal class proportions.
We consider two different settings for the mean and covariance structure $MeanCov$ of the classes:

1. Scenario with strong signal: 2000 correlated variables are generated, of which 200 are informative, these having class-specific means and covariances. As a first step towards obtaining the simulation setup we preselect 2000 of the 12625 variables of ProstatecTranscr, namely those yielding the smallest p-values from two-sample t-tests between the observations from the two classes. From these we again select the 200 variables corresponding to the smallest p-values: we take these as the informative variables, with the remaining 1800 as the non-informative ones. For each informative variable we calculate the difference between the mean of the observations belonging to class 2 and that of the observations belonging to class 1, resulting in the vector $\delta$ of length 200. Furthermore, we calculate the empirical covariance matrix of the informative variables separately for classes 1 and 2: $\Sigma_{\text{class1}}$ and $\Sigma_{\text{class2}}$. In the simulation, the vector of the informative variables is drawn from $N(\mathbf{0}_{200}, \Sigma_{\text{class1}})$ for observations from class 1 and from $N(\delta, \Sigma_{\text{class2}})$ for observations from class 2. We could theoretically draw the values of all 1800 non-informative variables at once from a multivariate normal distribution with covariance matrix estimated from the data; however, this is computationally intractable. Therefore we split the 1800 non-informative variables into blocks of 200 variables, and for each of these blocks we calculate the empirical covariance matrix $\Sigma_{0j}$, $j = 1, \ldots, 9$. In the simulation we draw a vector from $N(\mathbf{0}_{200}, \Sigma_{0j})$ for $j = 1, \ldots, 9$ and subsequently combine these vectors. Thus, the covariance matrix of the non-informative variables is block-diagonal with a block size of 200 (see the code sketch following the second scenario below).

2. Scenario with weak signal: This scenario is conceptually equivalent to the previous one, with the following differences: only 100 variables are informative, the entries in the mean vector $\delta$ for class 2 from scenario 1 corresponding to the 100 informative variables are multiplied by 0.7, and the block size for the non-informative variables is reduced from 200 to 100.
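To make the data-generating process concrete, the following minimal Python sketch mimics the strong-signal scenario. This is our own illustration: all parameter values are placeholders, whereas in the study they are the empirical estimates derived from ProstatecTranscr ($\delta$, $\Sigma_{\text{class1}}$, $\Sigma_{\text{class2}}$ and the block covariances $\Sigma_{01}, \ldots, \Sigma_{09}$).

```python
# Sketch of the strong-signal data-generating process with placeholder parameters.
import numpy as np

rng = np.random.default_rng(1)

p_inf, block, n_blocks = 200, 200, 9        # informative vars, block size, blocks

# Placeholder parameters (identity covariances instead of empirical estimates).
delta = np.full(p_inf, 0.5)                 # class-2 mean shift, length 200
Sigma_class1 = np.eye(p_inf)
Sigma_class2 = np.eye(p_inf)
Sigma_0 = [np.eye(block) for _ in range(n_blocks)]

def simulate_dataset(n):
    """Draw n observations (p = 2000 variables) with equal class proportions."""
    y = np.repeat([1, 2], n // 2)
    rows = []
    for yi in y:
        mean = delta if yi == 2 else np.zeros(p_inf)
        cov = Sigma_class2 if yi == 2 else Sigma_class1
        informative = rng.multivariate_normal(mean, cov)
        # Non-informative variables: one draw per block, i.e. a block-diagonal
        # covariance structure with blocks of size 200.
        noise = np.concatenate(
            [rng.multivariate_normal(np.zeros(block), S) for S in Sigma_0])
        rows.append(np.concatenate([informative, noise]))
    return np.vstack(rows), y

X, y = simulate_dataset(50)                  # the n = 50 setting
```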
For the supervised variable selection we consider two different numbers $p_{\text{sel}}$ of selected variables: $p_{\text{sel}} = 10$ and $p_{\text{sel}} = 1000$. The variables yielding the smallest p-values from two-sample t-tests between the observations from the two classes are selected. Linear Discriminant Analysis is used as the classification method with $p_{\text{sel}} = 10$ variables and Diagonal Linear Discriminant Analysis with $p_{\text{sel}} = 1000$. Again we consider the following commonly used splitting ratios between the sizes of the training and test sets: 2:1 (3-fold CV), 4:1 (5-fold CV) and 9:1 (10-fold CV). $K$ will again denote the number of folds in the CV.
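A corresponding sketch of the selection-plus-classification step, again in Python; scikit-learn's LDA stands in for the implementation actually used (an assumption on our part):

```python
# Supervised variable selection by smallest two-sample t-test p-values,
# followed by Linear Discriminant Analysis on the selected variables.
import numpy as np
from scipy.stats import ttest_ind
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def select_variables(X, y, p_sel):
    """Indices of the p_sel variables with the smallest t-test p-values."""
    _, pvals = ttest_ind(X[y == 1], X[y == 2], axis=0)
    return np.argsort(pvals)[:p_sel]

# Illustrative data; in the simulation X, y come from simulate_dataset above.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2000))
y = np.repeat([1, 2], 25)

sel = select_variables(X, y, p_sel=10)
clf = LinearDiscriminantAnalysis().fit(X[:, sel], y)
```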
We perform the simulation for each possible combination of $MeanCov$, $n$, $p_{\text{sel}}$ and $K$, leading to 24 simulation settings in total. For each setting we simulate 2000 datasets and for each dataset we calculate the estimate $\mathrm{CVIIM}_{s,n,K}$. As with our real-data analyses, we repeat the full and incomplete CV procedures and average the resulting error estimates before computing the measure.

Results
Supplementary Figure 1 shows boxplots of the $\mathrm{CVIIM}_{s,n,K}$-values for all simulation settings. The bias with respect to the true measure values $\mathrm{CVIIM}_{P,n,K}$ is negligible in all settings. However, the variance around the true values is relatively large in many of the considered settings. Further note that when computing CVIIM over multiple datasets, as one would in a more extensive real-data analysis, the measured variability comprises both the variability within the given distribution (as examined in this simulation study) and the variability across datasets.
The dependency of $\mathrm{CVIIM}_{P,n,K}$ on the individual simulation parameters can be better assessed by examining Supplementary Figure 2. The number of observations $n$ has a negative effect on CVIIM in all cases. An important and slightly surprising observation is that our results suggest no or only a slight dependence on the number of folds $K$. We observe higher values of CVIIM when selecting 1000 variables, but this should not be over-interpreted since it may result from the specific simulation design; in our supervised variable selection analyses on real datasets we did not make this observation. The influence of the mean-covariance structure $MeanCov$ depends on $p_{\text{sel}}$ (see Supplementary Figure 1). For $p_{\text{sel}} = 10$ we observe smaller $\mathrm{CVIIM}_{s,n,K}$-values for the scenario with weak effects compared to that with strong effects; for $p_{\text{sel}} = 1000$ it is the reverse. This might be explained by the fact that in the scenario with weak effects there are only 100 informative variables: when selecting 1000 variables, more noise variables are selected, which impacts $E[e_{\text{full},K}(S)]$, making this error larger, much more than it impacts $E[e_{\text{incompl},K}(S)]$.
The dependency of the variance of $\mathrm{CVIIM}_{s,n,K}$ on the simulation parameters and on $\mathrm{CVIIM}_{P,n,K}$ is visualized in Supplementary Figure 3. Unsurprisingly, the variance decreases with increasing $n$ and increases with the number of folds $K$. The latter can be explained as follows: $\mathrm{CVIIM}_{s,n,K}$ involves the ratio of two CV estimators which, with increasing $K$, become increasingly dataset dependent (because the training sets share more observations with the entire dataset) and therefore more variable. In the scenario with the stronger effects we generally observe larger variances. When selecting only $p_{\text{sel}} = 10$ variables the variances are also much higher than for $p_{\text{sel}} = 1000$. For the scenario with $p_{\text{sel}} = 10$ we observe a strong dependency of the variance on the true value of the measure, with smaller measure values leading to smaller variances. This dependency cannot be seen as clearly in the case of $p_{\text{sel}} = 1000$; a possible explanation is that the measure values are in general higher in this setting, obscuring the dependence at the relatively smaller values.

Prediction rules, prediction errors and their estimation
Let $\mathcal{X} \subset \mathbb{R}^p$ denote the predictor space and $\mathcal{Y} = \{1, 2\}$ the space of the response variable. Note that this notation also allows for categorical covariates (e.g., after suitable coding). Let $S = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ denote a random sample of $n$ observations drawn from the distribution $P$. Most importantly, in our paper $x \in \mathcal{X}$ denotes the "raw" data, meaning that these predictors may be subject to data preparation steps (possibly modifying their number or scale of measurement) before being used as predictors in classification. We consider a classification function $g: \mathcal{X} \to \mathcal{Y}$, $x \mapsto g(x)$, that takes the vector $x$ as an argument and returns a prediction $y$ of the value of the response variable. For example, consider a classification task where, based on microarray samples, patients are classified as having cancer or not using the Nearest Shrunken Centroids approach [1]. Then the corresponding function $g$ would take the not-yet-normalized expression values as an argument, perform normalization and classify the sample using a certain value of the shrinkage parameter. These steps are assumed to be performed in an ideal way, where all occurring parameters are estimated or optimized using a hypothetical dataset with sample size tending to infinity.
In practice $g$ is estimated from the available data. We therefore define $\hat{g}_S: \mathcal{X} \to \mathcal{Y}$, $x \mapsto \hat{g}_S(x)$, as the classification function estimated from $S$. In the example outlined above, this means that the parameters involved in the normalization procedure, as well as the averages and variances involved in the Nearest Shrunken Centroids classifier, are estimated from $S$, and that the shrinkage parameter is also chosen based on $S$. The estimated classification function $\hat{g}_S$ can then be used to predict $y$ for a new observation.
Note that, as already outlined in the section 'Addon procedures' of the main paper, depending on the procedures involved in the estimation it is not always straightforward to construct such a function $\hat{g}_S$ that can be applied to predict independent data. From now on we will, however, assume that the necessary addon procedures (see the section 'Addon procedures' of the main paper) are available and that we can thus construct the function $\hat{g}_S$. It is important to assess the prediction error of $\hat{g}_S$, which is defined as

$$\varepsilon[\hat{g}_S] = E\big[L\big(Y, \hat{g}_S(X)\big)\big], \quad (1)$$

where $(X, Y) \sim P$ is drawn independently of $S$ and $L(\cdot, \cdot)$ is an appropriate loss function, for example the indicator loss yielding the misclassification error rate, as used in the main paper. The error defined in Eq. (1) is commonly termed "conditional" because it refers to the specific sample $S$. The average error over all samples following $P^n$ is referred to as the unconditional error and denoted by $\varepsilon(n)$. Let $s = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ denote a realization of the random sample $S$. If we had a large independent sample at hand we could estimate $\varepsilon[\hat{g}_s]$ directly by comparing the true values of the response variable in this data to the predictions made by $\hat{g}_s$. Having only $s$ at hand, a naive approach would be to estimate $\varepsilon[\hat{g}_s]$ using $s$ itself as test data. This approach yields the so-called "apparent error" or "resubstitution error", which is well known to be downwardly biased (i.e., too optimistic) as an estimator of $\varepsilon[\hat{g}_s]$, since estimation of the classification function and error estimation are conducted on the same data. Resampling-based error estimation can be performed to address this issue: the sample $s$ is iteratively split into non-overlapping training and test datasets; in each iteration the function $g$ is estimated based on the training set, and the error of this estimated function is assessed based on the test set. We examine K-fold cross-validation, the most widely used of these resampling-based approaches.
Given a random partition of the dataset $s$ into $K$ approximately equally sized folds $s_1, \ldots, s_K$, the K-fold cross-validation error estimate is given as

$$e_K(s) = \sum_{k=1}^{K} \frac{\# s_k}{n} \cdot \frac{1}{\# s_k} \sum_{(x, y) \in s_k} L\big(y, \hat{g}_{s \setminus s_k}(x)\big),$$

where $\#$ denotes the cardinality, $s \setminus s_k$ is the training set in iteration $k$ and $s_k$ is the test set. Since this estimate depends (strongly) on the considered random partition of the sample $s$ into $K$ folds, it is recommended to repeat this procedure $B > 1$ times and average the error estimates over the $B$ repetitions. With $s_{b1}, \ldots, s_{bK}$ denoting the folds considered in the $b$-th repetition, the repeated K-fold cross-validation error estimate is given as

$$e_K(s) = \frac{1}{B} \sum_{b=1}^{B} \sum_{k=1}^{K} \frac{\# s_{bk}}{n} \cdot \frac{1}{\# s_{bk}} \sum_{(x, y) \in s_{bk}} L\big(y, \hat{g}_{s \setminus s_{bk}}(x)\big).$$

If, for simplicity, we assume that the $s_{b1}, \ldots, s_{bK}$ ($b = 1, \ldots, B$) are equally sized, with training set size $n_{\text{train},K} := \#(s \setminus s_{bk})$ for $b \in \{1, \ldots, B\}$ and $k \in \{1, \ldots, K\}$, it can easily be seen that $e_K(s)$ is an unbiased estimator of $\varepsilon(n_{\text{train},K})$ and therefore an upwardly biased estimator of $\varepsilon(n)$. This bias is called the "inherent bias" of CV in [2]. Note that the notation $e_K(s)$ does not reflect the fact that the repeated K-fold cross-validation error estimate depends on the random partitions in the $B$ repetitions. For our purposes we assume $B$ to be chosen large enough that this dependency can be ignored.
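As an illustration of the estimator just defined, a minimal Python sketch of repeated K-fold CV; the name `fit` is an illustrative placeholder for the full estimation of $g$, including all preparation steps performed on the training data:

```python
# Repeated K-fold CV error estimate e_K(s): average the fold-weighted
# misclassification rates over K folds and B random partitions.
import numpy as np

def repeated_cv_error(X, y, fit, K, B, seed=0):
    rng = np.random.default_rng(seed)
    n, estimates = len(y), []
    for _ in range(B):
        folds = np.array_split(rng.permutation(n), K)   # K ~equally sized folds
        err = 0.0
        for test in folds:
            train = np.setdiff1d(np.arange(n), test)
            g_hat = fit(X[train], y[train])             # estimate g on s \ s_bk
            # (#s_bk / n) * mean fold error, i.e. the weighted inner sum above
            err += len(test) / n * np.mean(g_hat.predict(X[test]) != y[test])
        estimates.append(err)
    return float(np.mean(estimates))                    # average over B repetitions
```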

B.2 Incomplete versus full cross-validation
With the issue of incomplete cross-validation in mind, we introduce the notation $\hat{g}_{a_1, a_2}$ to denote a classification function that is estimated partly based on a sample $a_2$ and partly based on a possibly smaller subsample $a_1$ (i.e., one or several steps may be performed on a bigger sample). Returning to the example of microarray-based classification, it is common practice to run the normalization procedure, and often also the parameter tuning, based on the whole dataset $s$, but to perform the training of the classifier within cross-validation, i.e., based only on the training set $s \setminus s_{bk}$ in each iteration $k$ of each repetition $b$. In this scenario $a_2$ would be the whole dataset $s$ and in each CV iteration $a_1$ would be the training set $s \setminus s_{bk}$. With $a_1 = s \setminus s_{bk}$ and $a_2 = s$ for $b = 1, \ldots, B$ and $k = 1, \ldots, K$, we obtain the incomplete CV error estimate, which is downwardly biased as an estimator of $\varepsilon(n_{\text{train},K})$:

$$e_{\text{incompl},K}(s) = \frac{1}{B} \sum_{b=1}^{B} \sum_{k=1}^{K} \frac{\# s_{bk}}{n} \cdot \frac{1}{\# s_{bk}} \sum_{(x, y) \in s_{bk}} L\big(y, \hat{g}_{s \setminus s_{bk},\, s}(x)\big),$$

where the index "incompl" indicates that the whole sample $s$ is used for at least part of the data analysis steps required for the estimation of $g$, and that the resulting CV procedure is thus incomplete. The estimator $e_{\text{incompl},K}(s)$ is unbiased as an estimator of the average incomplete-CV error $\varepsilon_{\text{incompl}}(n_{\text{train},K}; n)$, with $\{(X_1, Y_1), \ldots, (X_{n_{\text{train},K}}, Y_{n_{\text{train},K}})\}$ playing the role of an arbitrary training set and $(X_{n_{\text{train},K}+1}, Y_{n_{\text{train},K}+1})$ playing the role of an arbitrary test set observation. We assume exchangeability of the random observations in $S$.
Furthermore, since by definition $\hat{g}$
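The distinction between full and incomplete CV can be summarized in a schematic sketch. Here `prepare` and `apply_prep` are hypothetical stand-ins for an arbitrary preparation step and its addon procedure, and `folds` is a list of (train, test) index pairs:

```python
# Full CV: the preparation step is re-estimated on each training set s \ s_bk
# and carried over to the test set via its addon procedure. Incomplete CV: the
# step is performed once on the whole sample s before the CV loop starts.
import numpy as np

def cv_error(X, y, prepare, apply_prep, fit, folds, incomplete=False):
    n, err = len(y), 0.0
    if incomplete:
        prep_whole = prepare(X, y)              # uses the WHOLE dataset s
    for train, test in folds:
        prep = prep_whole if incomplete else prepare(X[train], y[train])
        g_hat = fit(apply_prep(X[train], prep), y[train])
        err += len(test) / n * np.mean(
            g_hat.predict(apply_prep(X[test], prep)) != y[test])
    return err
```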

B.3 Behavior of $\mathrm{CVIIM}_{s,n,K}$ for small $\varepsilon_{\text{full}}(n_{\text{train},K})$-values
For very small values of $\varepsilon_{\text{full}}(n_{\text{train},K})$, extreme CVIIM estimates can occur (either zero or very high values). For very small values of $e_{\text{full},K}(s)$, the CVIIM estimate is highly sensitive to relatively small differences between $e_{\text{incompl},K}(s)$ and $e_{\text{full},K}(s)$, which may be due, at least partly, to random fluctuations. For example, suppose that $e_{\text{full},K}(s) = 0.01$ and $e_{\text{incompl},K}(s) = 0.001$; then we would have $\mathrm{CVIIM}_{s,n,K} = 0.9$. Note, however, that such extremely large results are expected to be rare due to a mechanism related to regression toward the mean: considering the high variance of CV estimates, in many cases very small values of $e_{\text{full},K}(s)$ are an underestimation of $\varepsilon_{\text{full}}(n_{\text{train},K})$. In this case it is unlikely that $\varepsilon_{\text{incompl}}(n_{\text{train},K}; n)$ is considerably more affected by underestimation, and thus it is unlikely that $e_{\text{incompl},K}(s)$ is much smaller than $e_{\text{full},K}(s)$. Instead, the incomplete CV error estimator $e_{\text{incompl},K}(s)$ is more likely to be closer to its mean than $e_{\text{full},K}(s)$, thereby preventing an overly large CVIIM estimate.

Researchers applying CV to optimize tuning parameters in the way described here may be tempted to use, as an error estimate of the prediction rule, the CV error estimate obtained during optimization for the ultimately chosen tuning parameter value, i.e., the smallest one. The optimistic bias of this estimate has been studied in the literature [2, 8, 9]. Given a large enough number of repetitions of the CV used in the optimization, this optimistic bias becomes equivalent to the bias resulting from performing the optimization before CV, as studied here. This is because the dependence of the CV error estimates from the optimization process on the specific training/test set divisions diminishes with an increasing number of repetitions. As a result, the additional distortion of the estimate studied by Varma and Simon [2] and Bernau et al. [8], which is due to the impact of optimally selecting the smallest error estimate over and above the distortion of the incomplete CV estimate studied here, decreases in the same way.
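The sensitivity described in the first paragraph above is easy to reproduce. The following snippet computes CVIIM from a pair of CV error estimates, using the formula that the numerical example implies ($1 - 0.001/0.01 = 0.9$), with truncation at zero as in the main paper; the handling of a full-CV error of exactly zero is our convention:

```python
def cviim(e_full, e_incompl):
    """CVIIM estimate from the full and incomplete CV error estimates."""
    if e_full == 0:
        return 0.0            # convention for a degenerate full-CV error of zero
    return max(1.0 - e_incompl / e_full, 0.0)

print(cviim(0.01, 0.001))     # 0.9: a tiny absolute difference, extreme estimate
print(cviim(0.20, 0.19))      # 0.05: a similar absolute difference barely registers
```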

D Imputation of missing values
We perform k-Nearest-Neighbors imputation [11], a procedure that is commonly used in the analysis of high-dimensional microarray data. Prior to imputation the variables are centered and scaled, and the estimated means and standard deviations are stored. After imputation, the values are rescaled using the stored standard deviations, and the stored means are added to transform the data back to the original level. The result of the imputation may depend critically on the number $k$ of nearest neighbors considered. Therefore, to optimize this parameter on the grid $\{1, 2, 3, 5, 10, 15\}$, we again employ 3-fold CV. For correct addon imputation, in addition to using the means and standard deviations estimated from the training data, we also only search the training data for the $k$ nearest neighbors.
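A hedged sketch of this addon logic in Python, with scikit-learn's KNNImputer standing in for the kNN imputation of [11] (an assumption on our part; KNNImputer searches for neighbors among the samples seen during fit, i.e., the training data only, which matches the addon requirement):

```python
# Addon kNN imputation: centering/scaling parameters and neighbor donors are
# derived from the training data only and then applied to new (test) data.
import numpy as np
from sklearn.impute import KNNImputer

def fit_imputer(X_train, k):
    mu = np.nanmean(X_train, axis=0)            # stored means
    sd = np.nanstd(X_train, axis=0)             # stored standard deviations
    imputer = KNNImputer(n_neighbors=k).fit((X_train - mu) / sd)
    return mu, sd, imputer

def addon_impute(X_new, mu, sd, imputer):
    # impute on the standardized scale, then rescale and re-add the means
    return imputer.transform((X_new - mu) / sd) * sd + mu
```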
For this illustrative study, we first consider a collection of five datasets, GenitInfCow, containing measurements on 51 cows, of which 36 suffer from a major genital infection. Each dataset (IDs 33–37 in Table 1 in the main paper) contains measurements from a specific week and comprises between 21 and 27 variables. These datasets were obtained via personal communication with Michael Schmaußer. Random Forests are used as the classification method here. Our second example is the dataset ProstatecMethyl (ID 38 in Table 1 in the main paper).

F Combination of several steps
We considered the following combination of steps: imputation of missing values, supervised variable selection, and optimization of tuning parameters. For imputation, the algorithm described in Appendix D was used. For supervised variable selection, we chose the 10 variables with the smallest p-values from Wilcoxon two-sample tests. As the classification method, Random Forests with mtry as a tuning parameter was used; mtry was optimized over the grid $\{1, 2, \ldots, 5\}$ through internal CV as described in the section 'Study design' of the main paper.
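A sketch of this tuning step under assumed Python analogues: scikit-learn's max_features plays the role of randomForest's mtry, and the number of internal folds and trees are our placeholders, since the exact internal CV setup is specified in the main paper:

```python
# Tune mtry (max_features) over {1, ..., 5} by internal CV on the training data.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def fit_tuned_forest(X_train, y_train, inner_folds=3):
    search = GridSearchCV(
        RandomForestClassifier(n_estimators=500, random_state=0),
        param_grid={"max_features": [1, 2, 3, 4, 5]},
        cv=inner_folds)
    return search.fit(X_train, y_train).best_estimator_
```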
In addition to $e_{\text{full},K}(s)$ (where all three preparation steps are conducted on the training datasets and addon procedures are used to prepare the test datasets) and $e_{\text{incompl},K}(s)$ (where all three steps are conducted on the whole dataset), for each step we also calculate the "step-specific incomplete CV error" (abbreviated stspincCVerr), obtained when the considered step is performed using the whole dataset and the other steps within CV. The stspincCVerrs are calculated to investigate the contribution of the individual steps to the discrepancy between $e_{\text{full},K}(s)$ and $e_{\text{incompl},K}(s)$.
In the following we describe the procedures performed to obtain the stspincCVerrs. For the first step (imputation), the stspincCVerr is simply computed by performing the considered step on the whole dataset and then starting the CV repetitions, i.e., all subsequent steps are included in the repeated CV. For steps that are not performed at the start of the preparation we have to proceed differently, since each preceding step changes the data: we cannot employ ordinary CV in the calculation of the stspincCVerrs, because it is not possible to perform the considered step only once on the whole dataset. Instead, to obtain a stspincCVerr value in these cases, we apply the following CV-like procedure, using a random partition of the dataset into $K$ equally sized folds as in ordinary CV (a code sketch follows this list). For $k \in \{1, \ldots, K\}$:

1. Perform all steps up to the considered step on the $k$-th training set, each time adjusting the corresponding test set with the appropriate addon procedure.
2. Merge training and test set and perform the considered step on the resulting concatenated dataset.
3. Split the result of step 2 into the original division of training and test data again.
4. Perform all remaining data preparation steps on the training set, again each time adjusting the test set with the appropriate addon procedure.
5. Calculate the misclassification error of the prediction rule on the test set.

We stress that the only goal of this procedure is to assess the impact of the individual steps within the combination. It has no other meaningful application in error estimation in practice.
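The CV-like procedure above can be summarized in pseudocode-flavoured Python. All names (`steps`, `fit_apply_joint`, `fit`, `apply`) are illustrative stand-ins, with `steps[j]` denoting the considered step:

```python
# Step-specific incomplete CV error: step j is performed on the merged
# training + test data (steps 2-3 above); all other steps are performed on the
# training set only, with addon adjustment of the test set (steps 1 and 4).
import numpy as np

def stsp_incomplete_cv_error(X, y, steps, j, fit, folds):
    n, err = len(y), 0.0
    for train, test in folds:
        Xtr, Xte = X[train], X[test]
        for i, step in enumerate(steps):
            if i == j:
                merged = step.fit_apply_joint(          # considered step on the
                    np.vstack([Xtr, Xte]),              # concatenated dataset
                    np.concatenate([y[train], y[test]]))
                Xtr, Xte = merged[:len(train)], merged[len(train):]  # re-split
            else:
                prep = step.fit(Xtr, y[train])          # train-set-based step
                Xtr, Xte = step.apply(Xtr, prep), step.apply(Xte, prep)
        g_hat = fit(Xtr, y[train])
        err += len(test) / n * np.mean(g_hat.predict(Xte) != y[test])  # step 5
    return err
```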
As in the analyses of single steps, we also perform $B = 300$ repetitions of the CV(-like) procedures. We used the dataset ProstatecMethyl.
The $\mathrm{CVIIM}_{s,n,K}$-values were 0.2846, 0.3121 and 0.3442 for $K = 3$, $K = 5$ and $K = 10$, respectively. When calculating the step-specific errors (Supplementary Figure 13), it was interesting to see that performing only the supervised variable selection on the whole dataset yielded an error virtually identical to that obtained when performing all steps on the whole dataset. Correspondingly, incomplete CV based on each of the other steps individually gave results almost indistinguishable from full CV. In this example, therefore, the supervised variable selection was almost solely responsible for the difference between the fully correct and the fully incomplete CV procedure. We conclude that individual influential steps can play a dominating role when considering combinations of different steps. Note, however, that the analysis presented in this section should be considered an illustration; the results observed here cannot be generalized, since they can depend strongly on the specific setting and dataset used.

Supplementary Figure 13: Step-specific incomplete CV errors for the data preparation combination of imputation, variable selection and tuning.