Re-evaluation of the comparative effectiveness of bootstrap-based optimism correction methods in the development of multivariable clinical prediction models

Multivariable prediction models are important statistical tools for providing diagnostic and prognostic algorithms based on multiple patient characteristics. Their apparent discriminant and calibration measures usually have overestimation biases (known as 'optimism') relative to the actual performance in external populations. Existing statistical evidence and guidelines suggest that three bootstrap-based bias correction methods are preferable in practice, namely Harrell's bias correction and the .632 and .632+ estimators. Although Harrell's method has been widely adopted in clinical studies, simulation-based evidence indicates that the .632+ estimator may perform better than the other two methods. However, the evidence is limited, and the actual comparative effectiveness of these methods is still unclear. In this article, we conducted extensive simulations to compare the effectiveness of these methods, particularly for the following modern regression models: conventional logistic regression, stepwise variable selections, Firth's penalized likelihood method, ridge, lasso, and elastic-net. Under relatively large sample settings, the three bootstrap-based methods were comparable and performed well. However, all three methods had biases under small sample settings, and the directions and sizes of the biases were inconsistent. In general, the .632+ estimator is recommended, but we provide several notes concerning the operating characteristics of each method.

their performances under rare event settings. They also considered Firth's penalized likelihood method [8,9] for estimating regression coefficients, and concluded that the .632+ estimator performed especially well. Several other modern effective estimation methods have been widely adopted in clinical studies. Representative approaches include regularized estimation methods such as ridge [10], lasso (least absolute shrinkage and selection operator) [11], and elastic-net [12]. Also, conventional stepwise variable selections are still widely adopted in current practice [13]. Note that the previous studies of these various estimation methods did not assess the comparative effectiveness of the bootstrapping estimators. Also, their simulation settings were not very comprehensive because their simulations were conducted to assess various methods and outcome measures, and comparisons of the bootstrap estimators comprised only part of their analyses.
In this article, we conducted extensive simulation studies to provide statistical evidence concerning the comparative effectiveness of the bootstrapping estimators, as well as to provide recommendations for their practical use. In particular, we evaluated these methods using multivariable prediction models that were developed with various model building strategies: conventional logistic regression (ML), stepwise variable selection, Firth's penalized likelihood, ridge, lasso, and elastic-net. We considered extensive simulation settings based on real-world clinical study data from the GUSTO-I trial [14,15]. Note that we focused on the evaluation of the C-statistic [16] in this article, since it is the most popular discriminant measure for clinical prediction models and covering all performance measures would make the extensive simulations unwieldy; generalizations to other performance measures are addressed in the Discussion section.

Logistic regression model for a multivariable prediction model
At first, we briefly introduce the estimating methods for the multivariable prediction models. We consider constructing a multivariable prediction model for a binary outcome. A logistic regression model is widely used as a regression-based prediction model [2,17]: for patient $i$ $(i = 1, 2, \ldots, n)$ with outcome $Y_i$ and predictor vector $\boldsymbol{x}_i = (x_{i1}, \ldots, x_{ip})^\top$, $p_i = \Pr(Y_i = 1 \mid \boldsymbol{x}_i) = 1/[1 + \exp\{-(\beta_0 + \boldsymbol{\beta}^\top \boldsymbol{x}_i)\}]$. Plugging an appropriate estimate of $(\beta_0, \boldsymbol{\beta})$ into the above model, the estimated probability $\hat{p}_i$ $(i = 1, 2, \ldots, n)$ is defined as the risk score of individual patients, and this score is used as the criterion score to determine the predicted outcome.
The ML estimates of $(\beta_0, \boldsymbol{\beta})$ are obtained by maximizing the log likelihood function $\ell(\beta_0, \boldsymbol{\beta}) = \sum_{i=1}^{n} \{ y_i \log p_i + (1 - y_i) \log(1 - p_i) \}$. The ordinary ML estimation is easily implemented by standard statistical software and has been widely adopted in practice [2,17]. However, the ML-based modelling strategy is known to have several finite sample problems when applied to a small or sparse dataset. In particular, the regression coefficient estimators may have overestimation bias [18,19] and can produce optimistic estimates of discriminant measures. Also, the model estimation can become unstable when the predictor effects are large or sparse [9,20]: the estimated probabilities tend to be close to 0 or 1 and can perfectly discriminate between events and non-events. These problems diminish with increasing events per variable (EPV), defined as the ratio of the number of events to the number of predictor variables ($p$) of the prediction model. EPV has been formally adopted as a sample size criterion for model development; in particular, EPV ≥ 10 is a widely adopted criterion [21], although recent studies showed that the validity of this criterion depends on case-by-case conditions [4,13,22]. As noted below, the following shrinkage estimation methods can moderate these problems.
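To make the EPV definition concrete, the following minimal Python sketch computes the events-per-variable ratio for a dataset; the function name and the example counts are our own illustrations, not taken from the paper.

```python
# Events per variable (EPV): the ratio of the number of outcome events
# to the number of candidate predictors p. Function name and counts are
# our own illustration, not from the paper.
def events_per_variable(outcomes, n_predictors):
    """outcomes: iterable of 0/1 outcome indicators."""
    return sum(outcomes) / n_predictors

# 150 events with 17 candidate predictors gives EPV ~ 8.8, below the
# conventional EPV >= 10 criterion:
print(round(events_per_variable([1] * 150 + [0] * 2000, 17), 1))  # 8.8
```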
For variable selection, several automated algorithms are available. A representative example is stepwise selection [23]. Forward and backward strategies can be adopted, with the latter generally recommended for the development of prediction models [23]. For the stopping rule, a significance level criterion (a conventional threshold is P < .050), the Akaike Information Criterion (AIC) [24], or the Bayesian Information Criterion (BIC) [25] can be adopted.
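The AIC and BIC stopping rules can be written out explicitly. The short Python sketch below (with hypothetical log-likelihood values of our own) shows how a backward step is accepted when the criterion decreases.

```python
import math

# AIC and BIC stopping rules for stepwise selection. The log-likelihood
# values below are hypothetical, for illustration only.
def aic(loglik, n_params):
    return -2.0 * loglik + 2.0 * n_params

def bic(loglik, n_params, n_obs):
    return -2.0 * loglik + n_params * math.log(n_obs)

# A backward step (dropping one predictor, 8 -> 7 parameters) is kept
# when the criterion decreases:
full, reduced = aic(-420.0, 8), aic(-420.6, 7)
print(reduced < full)  # True
```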

Firth's logistic regression
In order to reduce the first-order finite sample bias of the ML estimator, Firth's penalized likelihood logistic regression was developed [8]. In Firth's method, the regression coefficients are estimated by maximizing a modified log likelihood function with a penalty term, $\ell^{*}(\beta_0, \boldsymbol{\beta}) = \ell(\beta_0, \boldsymbol{\beta}) + \frac{1}{2} \log |I(\beta_0, \boldsymbol{\beta})|$, where $I(\beta_0, \boldsymbol{\beta})$ is the Fisher information matrix [9]. Owing to the penalty term, the modified log likelihood can have a maximum even under (quasi-)complete separation, which causes infinite ML estimates [9]. Moreover, the penalty term shrinks the estimates towards 0 based on the information contained in each predictor variable. As a result, the regression coefficient estimates are stable even for small or sparse datasets [9]. However, if the Fisher information matrix is degenerate, the resultant estimator can be unstable.

Ridge regression
Ridge regression was originally proposed to deal with collinearity among predictor variables; the ML estimate can be unstable if strong collinearity exists [10]. The ridge estimate is a shrinkage estimate obtained by maximizing the following penalized log likelihood function: $\ell_{\mathrm{ridge}}(\beta_0, \boldsymbol{\beta}) = \ell(\beta_0, \boldsymbol{\beta}) - \lambda \sum_{j=1}^{p} \beta_j^2$, where $\lambda \ge 0$ is a tuning parameter that controls the degree of shrinkage. Note that the predictors are usually standardized in advance to have mean 0 and variance 1. The penalty term for ridge regression is the sum of squares of the regression coefficients, and the resultant regression coefficient estimates are shrunk towards zero, which can reduce overfitting [26]. Several methods, such as CV, have been proposed for selecting the tuning parameter [17,26,27].

Lasso regression
Lasso regression is another well-known shrinkage estimation method. The penalized log likelihood function of lasso [11] is defined as $\ell_{\mathrm{lasso}}(\beta_0, \boldsymbol{\beta}) = \ell(\beta_0, \boldsymbol{\beta}) - \lambda \sum_{j=1}^{p} |\beta_j|$, where $\lambda > 0$ is a tuning parameter similar to that in ridge regression. The penalty term for lasso is the sum of the absolute values of the regression coefficients. Note that the predictors are usually standardized in advance to have mean 0 and variance 1. In contrast to ridge regression, lasso can shrink some regression coefficients to exactly 0; therefore, lasso can perform shrinkage estimation and variable selection simultaneously. However, lasso is known to perform poorly if highly correlated predictors exist [12].

Elastic-net regression
Elastic-net [12] is an effective method of shrinkage regression that was proposed to overcome the disadvantage of lasso while still retaining the variable selection property.
The penalty of elastic-net is constructed by combining the penalties of lasso and ridge: $\lambda \{ \alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \}$, where $\lambda \ge 0$ is the tuning parameter that determines the degree of shrinkage and $\alpha$ $(0 \le \alpha \le 1)$ is the weight of the lasso and ridge penalties. Even when there are strong correlations among the predictor variables, the strongly correlated predictor variables can remain in the model together under elastic-net regression. Also, similar to lasso, some coefficients shrink to exactly 0 when the modified log likelihood function is maximized [12]. The predictors should also be standardized in advance to have mean 0 and variance 1.
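The three penalty terms can be compared side by side. The following Python sketch evaluates each penalty for a small coefficient vector; the values of `beta`, `lam`, and `alpha` are hypothetical, and note that some software (e.g., glmnet) scales the elastic-net penalty slightly differently.

```python
# Side-by-side evaluation of the ridge, lasso, and elastic-net penalty
# terms for a coefficient vector (values are hypothetical).
def ridge_penalty(beta, lam):
    return lam * sum(b * b for b in beta)

def lasso_penalty(beta, lam):
    return lam * sum(abs(b) for b in beta)

def elastic_net_penalty(beta, lam, alpha):
    # alpha = 1 recovers the lasso penalty; alpha = 0 recovers ridge
    return lam * (alpha * sum(abs(b) for b in beta)
                  + (1.0 - alpha) * sum(b * b for b in beta))

beta = [0.5, -1.0, 0.0, 2.0]
print(round(ridge_penalty(beta, 0.1), 4))             # 0.525
print(round(lasso_penalty(beta, 0.1), 4))             # 0.35
print(round(elastic_net_penalty(beta, 0.1, 0.5), 4))  # 0.4375
```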

Software packages
In numerical analyses in the following sections, all of the above methods were performed in R ver. 3.5.1 [28]. The ordinary logistic regression was fitted by the glm function. The stepwise variable selections were performed using the stats and logistf packages [29]. Firth's logistic regression was also conducted using the logistf package [29].
The ridge, lasso, and elastic-net regressions were performed using the glmnet package [30]; the tuning parameters were consistently determined using 10-fold CV of the deviance.

C-statistic
Secondly, we review the internal validation methods used in the numerical studies. We focused especially on the C-statistic in our numerical evaluations, as it is most frequently used in clinical studies as a summary measure of the discrimination of prediction models [2]. The C-statistic is defined as the empirical probability that a randomly selected patient who has experienced an event has a larger risk score than a patient who has not experienced the event [16]. The C-statistic also corresponds to the area under the curve (AUC) of the empirical receiver operating characteristic (ROC) curve for the risk score.
The C-statistic ranges from 0.5 to 1.0, with larger values corresponding to superior discriminant performance.
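The pairwise-comparison definition above translates directly into code. Below is a minimal Python sketch of the empirical C-statistic (ties counted as 1/2, as is conventional); the O(n²) loop is kept for clarity rather than efficiency, and the scores and outcomes are illustrative.

```python
# The C-statistic as an empirical probability: the proportion of
# event/non-event pairs in which the event patient has the higher risk
# score (ties counted as 1/2).
def c_statistic(scores, outcomes):
    events = [s for s, y in zip(scores, outcomes) if y == 1]
    nonevents = [s for s, y in zip(scores, outcomes) if y == 0]
    concordant = 0.0
    for se in events:
        for sn in nonevents:
            if se > sn:
                concordant += 1.0
            elif se == sn:
                concordant += 0.5
    return concordant / (len(events) * len(nonevents))

# 5 of the 6 event/non-event pairs are concordant:
print(round(c_statistic([0.9, 0.8, 0.7, 0.4, 0.2], [1, 1, 0, 1, 0]), 3))  # 0.833
```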

Harrell's bias correction method
The most widely applied method for bootstrap optimism correction is Harrell's bias correction [1], which is obtained by the conventional bootstrap bias correction method: the optimism is estimated as $\frac{1}{B} \sum_{b=1}^{B} (\hat{C}_{\mathrm{app}}^{(b)} - \hat{C}_{\mathrm{orig}}^{(b)})$, where $\hat{C}_{\mathrm{app}}^{(b)}$ is the apparent C-statistic of the model refitted to the $b$th bootstrap sample and $\hat{C}_{\mathrm{orig}}^{(b)}$ is the C-statistic of that model evaluated on the original sample, and the corrected estimate is the apparent C-statistic minus the estimated optimism. In the following numerical studies, the number of bootstrap resamples $B$ was consistently set to 2000. Also, for the model-building methods involving variable selection (e.g., stepwise regression) and the shrinkage methods that require tuning parameter selection (ridge, lasso, and elastic-net), all estimation processes were included in the bootstrapping analyses in order to appropriately reflect their uncertainty.
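The resampling loop of Harrell's bias correction can be sketched as follows. This is a schematic Python illustration with generic `fit`/`score` callables and a toy threshold "model" of our own; it is not the paper's implementation, and a real analysis would refit the full model-building procedure (including variable and tuning parameter selection) inside the loop, as described above.

```python
import random

# Schematic sketch of Harrell's bootstrap optimism correction with
# generic fit/score callables. The toy threshold 'model' below is our
# own illustration, not the paper's prediction model.
def harrell_corrected(data, fit, score, n_boot=200, seed=1):
    rng = random.Random(seed)
    apparent = score(fit(data), data)            # apparent performance
    optimism = 0.0
    for _ in range(n_boot):
        boot = [rng.choice(data) for _ in data]  # resample with replacement
        model = fit(boot)                        # refit on the bootstrap sample
        # optimism: bootstrap-apparent performance minus performance of
        # the same model on the original sample
        optimism += score(model, boot) - score(model, data)
    return apparent - optimism / n_boot

# Toy example: data are (score, outcome) pairs; the 'model' is the
# training-sample mean used as a classification threshold.
gen = random.Random(7)
data = [(gen.gauss(y, 1.0), y) for y in [0, 1] * 30]
fit = lambda d: sum(x for x, _ in d) / len(d)
score = lambda t, d: sum((x > t) == (y == 1) for x, y in d) / len(d)
print(round(harrell_corrected(data, fit, score), 3))
```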

Real-data example: Applications to the GUSTO-I trial
Since we consider the GUSTO-I trial as a model example for the simulation settings, we first illustrate the prediction model analyses for this clinical trial. The GUSTO-I dataset has been adopted by many performance evaluation studies of multivariable prediction models [4,13,27], and we specifically used the West region dataset [23]. GUSTO-I was a comparative clinical trial to assess four treatment strategies for acute myocardial infarction [14]. Here we adopted death within 30 days as the outcome variable (binary).
A summary of the outcome and the 17 covariates is presented in Table 1. Of the 17 covariates, two variables (height and weight) are continuous, one variable (smoking) is ordinal, and the remaining 14 variables are binary; age was dichotomized at 65 years. For smoking, which is a three-category variable (current smokers, ex-smokers, and never smokers), we generated two dummy variables (ex-smokers vs. never smokers, and current smokers vs. never smokers) and used them in the analyses.
We considered two modelling strategies: (1) 8-predictor models (age >65 years, female gender, diabetes, hypotension, tachycardia, high risk, shock, and no relief of chest pain), which was adopted in several previous studies [4,32]; and (2) 17-predictor models which included all variables presented in Table 1. The EPVs for these models were 16.9 and 7.5, respectively. For the two modelling strategies, we constructed multivariable logistic prediction models using the estimating methods in the Methods section (ML, Firth, ridge, lasso, elastic-net) and backward stepwise methods with AIC and statistical significance (P < 0.05). We also calculated the C-statistics and the bootstrap-based optimism-corrected estimates for these prediction models. The results are presented in Table 2.
For the 8-predictor models, the lasso and elastic-net models selected all eight variables. Also, the two backward stepwise methods selected the same six variables for the 8-predictor models; only 'diabetes' and 'no relief of chest pain' were excluded.
Further, for the 17-predictor models, the lasso and elastic-net models selected the same 12 variables, while 'diabetes', 'height', 'hypertension history', 'ex-smoker', 'hypercholesterolemia', and 'family history of MI' were excluded. For the backward stepwise selections, the AIC method selected nine variables and the significance-based method selected seven variables; see Table 2 for the selected variables.
Overall, comparison of the apparent C-statistics showed that those of the 17-predictor models were larger than those of the 8-predictor models. Although the apparent C-statistics of all the prediction models were not very different for the 8-predictor models, the C-statistic of the backward stepwise selection (P < 0.05) was slightly smaller than those of the others for the 17-predictor models. The optimism-corrected C-statistics were smaller than the apparent C-statistics for all models, and certain biases were discernible.
However, among the three bootstrapping methods, there were no substantial differences for any of the prediction models. Note that the optimism-corrected C-statistics of the 8- and 17-predictor models were not very different. These results indicate that the 17-predictor model had greater optimism, possibly due to the inclusion of noise variables.

Data generation and simulation settings
As described in the previous section, we conducted extensive simulation studies to assess the comparative effectiveness of the bootstrap-based optimism correction methods. The predictor variables were generated as random numbers based on the parameters estimated from the GUSTO-I dataset. Three continuous variables (height, weight, and age) were generated from a multivariate normal distribution with the same mean vector and covariance matrix as in the GUSTO-I data; the age variable was dichotomized at 65 years, as in the real-data example. For smoking, an ordinal variable, random numbers were generated from a multinomial distribution using the same proportions as in the GUSTO-I data; this variable was converted to two dummy variables before being incorporated into the prediction models. In addition, the remaining binary variables were generated from a multivariate binomial distribution [33] using the same marginal probabilities and correlation coefficients estimated from the GUSTO-I dataset.
We used the mipfp package [34] for generating the correlated binomial variables. The event occurrence probability $p_i$ $(i = 1, 2, \ldots, n)$ was determined from the generated predictor variables through the logistic regression model $p_i = 1/[1 + \exp\{-(\beta_0 + \boldsymbol{\beta}^\top \boldsymbol{x}_i)\}]$.
The outcome variable was generated from a Bernoulli distribution with success probability $p_i$.
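The outcome-generation step can be sketched as follows in Python; the intercept and regression coefficients below are hypothetical placeholders, not the values estimated from the GUSTO-I data.

```python
import math
import random

# Sketch of the outcome-generation step: compute the event probability
# from the logistic model and draw a Bernoulli outcome. The intercept
# and coefficients here are hypothetical placeholders, not the GUSTO-I
# estimates.
def generate_outcome(x, beta0, beta, rng):
    lin = beta0 + sum(b * xj for b, xj in zip(beta, x))
    p = 1.0 / (1.0 + math.exp(-lin))   # p_i = 1 / (1 + exp(-(b0 + b'x)))
    return 1 if rng.random() < p else 0

rng = random.Random(42)
x = [1.0, 0.0, 1.0]                    # one simulated predictor vector
y = generate_outcome(x, -2.0, [0.8, 0.5, 0.3], rng)
print(y)  # 0 or 1
```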
The actual prediction performances of the developed models were assessed by 500,000 independently generated external test samples; the empirical C-statistics for the test datasets, which we refer to as the 'external' C-statistics, were used as the estimands.
The number of simulations was consistently set to 2,000 for all scenarios. Based on the generated derivation cohort dataset, the multivariable prediction models were constructed using the seven modelling strategies (ML, Firth, ridge, lasso, elastic-net, and backward stepwise selections with AIC and P < 0.05). The C-statistics for the derivation cohort were estimated by the apparent, Harrell, .632, and .632+ methods. As evaluation measures, we used the biases and root mean squared errors (RMSEs) of the estimated C-statistics relative to the external C-statistic computed from the 500,000 test samples.
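The two evaluation measures reduce to a few lines of code. The following Python sketch computes the bias and RMSE of a set of C-statistic estimates against a fixed external C-statistic; all numbers are illustrative, not simulation results.

```python
import math

# Bias and RMSE of C-statistic estimates against the 'external'
# C-statistic used as the estimand (all numbers are illustrative).
def bias(estimates, estimand):
    return sum(e - estimand for e in estimates) / len(estimates)

def rmse(estimates, estimand):
    return math.sqrt(sum((e - estimand) ** 2 for e in estimates) / len(estimates))

estimates = [0.82, 0.79, 0.85, 0.80]   # e.g., optimism-corrected estimates
external = 0.78                        # estimand from the test samples
print(round(bias(estimates, external), 4))   # 0.035
print(round(rmse(estimates, external), 4))   # 0.0418
```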

Simulation results
The overall trends of the external C-statistics were consistent with previous studies [17]. The external C-statistics for ML estimation were relatively small when EPV was small; those of Firth regression were slightly larger than those of ML estimation in these cases. Ridge, lasso, and elastic-net showed slightly larger external C-statistics than Firth regression. Similar trends were observed under event fraction = 0.0625, but the external C-statistics were relatively large, around 0.75, under EPV = 3-5. These results were driven by the total sample sizes: the event fraction of the latter scenario was smaller, so the total sample sizes were larger for the same EPV. Also, the external C-statistics of the 17-predictor models were smaller than those of the 8-predictor models, especially when EPV was small. The external C-statistics of the stepwise regressions were relatively small, and the shrinkage estimation methods provided larger external C-statistics than ML estimation when EPV was small.
Under small EPV settings, the external C-statistics of ridge, lasso, and elastic-net were slightly larger than those of Firth regression; in particular, for the 8-predictor models, the external C-statistics of ML estimation were relatively small when EPV was small.
In Figures 3 and 4, we present the empirical biases of the estimators of the C-statistics relative to the external C-statistics. Under all settings and methods, the apparent C-statistics had relatively large overestimation biases; in particular, the biases were larger under smaller EPV settings. For the same EPV settings, the biases were smaller when the event fraction was smaller (i.e., the total sample sizes were larger). For event fraction = 0.5, the biases of the apparent C-statistics of ML estimation were larger than those of the other methods for the 8-predictor models. The biases of the stepwise regressions were the smallest, but their external C-statistics were also smaller, as noted above. Also, the biases of ridge, lasso, and elastic-net were comparable and slightly smaller than those of the Firth method. For the 17-predictor models, the overall trends were similar to those of the 8-predictor models, but lasso showed slightly smaller biases than ridge and elastic-net.

Discussion
In this article, we conducted extensive simulation studies to assess these methods under a wide range of situations. In particular, we assessed their properties in the context of prediction models developed by the modern shrinkage methods (ridge, lasso, and elastic-net), which are becoming increasingly popular. We also evaluated stepwise selections, which are additional standard methods of variable selection, taking into consideration the uncertainties of the variable selection processes.
Conventionally, the rule-of-thumb criterion for sample size determination in prediction model studies is EPV ≥ 10 [21]. In our simulations, the internal validation methods generally worked well under these settings. However, several counterexamples were reported in previous studies [4,13,22], so this should not be an absolute criterion.
Also, the external C-statistics of the stepwise selections were smaller than those of ML estimation under certain situations, as previously discussed [13]; thus, variable selection methods might not be recommended in practice. Moreover, the shrinkage regression methods (ridge, lasso, and elastic-net) provided larger C-statistics than ML estimation, and the relative overfitting rate was overestimated in these cases, since even a small number of events was discriminated well by the less overfitted models. This tendency was clearest for the ML method, which has the strongest overfitting risk.
Although the bias of the .632+ estimator was relatively small, its RMSE was comparable to, and sometimes larger than, those of the other two methods. Since the .632+ estimator adjusts the weights of the apparent and out-of-sample performances using the relative overfitting rate, it is subject to variation from the variability of the estimated models under small sample settings. Also, the RMSE of the .632+ estimator was particularly large for the shrinkage estimation methods; the penalty parameters are usually selected by 5- or 10-fold CV, and we adopted the latter in our simulations. Since 10-fold CV is unstable with small samples [35], the overfitting rate often has large variation. We attempted to use leave-one-out CV instead of 10-fold CV, and this decreased the RMSE of the .632+ estimator (see e-Table 2 in the Supplementary Materials).
On the other hand, the RMSE in the Harrell method became larger. These results indicate that the performances of the optimism-corrected estimators depend on the methods of penalty parameter selections.
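The instability of k-fold CV with rare events is easy to illustrate: splitting a small dataset into 10 folds can leave some folds with very few events. A minimal Python sketch (the event and sample counts are illustrative, not the simulation settings of the paper):

```python
import random

# Illustration of 10-fold CV instability with rare events: count the
# number of events in each fold of a small dataset (counts are
# hypothetical, not the paper's simulation settings).
def fold_event_counts(outcomes, k, seed=0):
    idx = list(range(len(outcomes)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]        # k roughly equal folds
    return [sum(outcomes[i] for i in fold) for fold in folds]

# 24 events among 160 patients: each fold holds only ~2 events on
# average, so fold-level performance estimates are highly variable.
counts = fold_event_counts([1] * 24 + [0] * 136, 10)
print(len(counts), sum(counts))  # 10 folds, 24 events in total
```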
In this paper, we conducted simulation studies based on the GUSTO-I study. We considered a wide range of settings by varying several factors to investigate the detailed operating characteristics of the bootstrapping methods. A limitation of the study is that the parameter settings were based on the GUSTO-I study, but we took care to cover settings similar to those considered in previous large simulation studies [4,17]. In addition, we assessed only the C-statistic in this study. Other measures, such as the Brier score and calibration slope, can also be considered for the evaluation of optimism corrections. However, in previous simulation studies, these measures showed similar trends [4].

Conclusions
In conclusion, the three bootstrap-based methods were comparable and performed well under relatively large sample settings, but all of them had biases under small sample settings, and the directions and sizes of the biases were inconsistent. In general, the .632+ estimator is recommended, with attention to the operating characteristics of each method noted above.

e-Appendix B: Supplementary tables
In the Simulations section, we mentioned that the final estimated models by lasso, elastic-net, and stepwise selections degenerated to intercept-only models at certain frequencies. The actual proportions of intercept-only models obtained from these variable selection methods are presented in e-Table 1.
In addition, we discussed the instability of the C-statistic estimators of lasso prediction models under small sample settings. This may be caused by the tuning parameter selection via 10-fold CV: the small datasets were split into 10 subgroups, so the resultant training datasets can involve only a small number of events. To assess the sensitivity of this estimating method, we conducted supplementary simulations using leave-one-out CV. e-Table 2 shows the supplementary results for lasso with the eight-predictor model at EPV = 3.

There were no intercept-only models at EPV = 20 or EPV = 40. † S1: Scenario 1, S2: Scenario 2.

e-Table 2. RMSE of the apparent and optimism-corrected C-statistics for lasso with the eight-predictor model based on 10-fold and leave-one-out CV at EPV = 3