## The relative importance of variables in predictive models: bootstrapping, p-values, and sensible modelling strategies

Ewout Steyerberg, ErasmusMC - University Medical Center Rotterdam, the Netherlands

16 December 2009

The development and validation of prediction models pose many challenges. Determining the relative importance of variables for inclusion in such models is an extremely tough research question, which was addressed recently by Beyene et al [1].

Previous work on the relative importance of predictors in a model has shown very worrying results. For example, simulations with stepwise selection methods showed that the specific set of predictors in a model was very unstable, and that the rank order of importance of predictors in a selected model was even more unstable [2].

There are several major problems with the procedures that the authors of the recent BMC paper propose, apart from generally suboptimal modelling choices such as dichotomizing all predictors [3]. First, the authors use bootstrap resampling to study the frequency of selection of predictors with stepwise methods. We agree that the bootstrap is an important tool in prediction modelling, for example to quantify the optimism in the predictive performance of a model. The bootstrap can also warn us against overinterpretation of a specific model that is selected by stepwise methods [4, 5]. The chances of identifying exactly the same model among bootstrap resamples are disappointingly low in many cases. Bootstrap selection frequencies of variables often provide no new information, since they simply reflect the overall p-values. When bootstrap selection frequencies do differ from what the overall p-values suggest, this reflects correlations between variables. This makes the bootstrap frequency of a predictor meaningless, just as statistical tests of univariate association between a predictor and the outcome are rendered meaningless in the presence of confounding variables. For example, the effect of age as a single predictor, the effect of age adjusted for sex, and the effect of age adjusted for sex plus T-cell immunophenotype are all different analyses of the effect of age [1]. We may label them all 'effect of age', but the meaning is different in each analysis.
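The instability of selection order can be illustrated with a small simulation (a hypothetical sketch of our own, not the analysis from the paper): two correlated predictors of roughly equal true strength trade places as the 'first selected' variable across bootstrap resamples, so their selection frequencies say more about their correlation than about their relative importance.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: two correlated predictors of about equal true strength.
n = 200
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + 0.7 * rng.normal(size=n)
y = x1 + x2 + 2.0 * rng.normal(size=n)

def first_selected(x1, x2, y):
    """Return which predictor has the stronger marginal correlation with y,
    i.e. which one a forward selection would pick up first."""
    c1 = abs(np.corrcoef(x1, y)[0, 1])
    c2 = abs(np.corrcoef(x2, y)[0, 1])
    return 1 if c1 >= c2 else 2

# Frequency with which x1 would enter first, across bootstrap resamples
# of the same single data set.
B = 500
freq_x1 = sum(
    first_selected(x1[b], x2[b], y[b]) == 1
    for b in (rng.integers(0, n, n) for _ in range(B))
) / B
print(f"x1 selected first in {freq_x1:.0%} of resamples")
```

Because neither predictor dominates, the 'winner' flips between resamples; reading the resulting frequency as a measure of importance would be misleading.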

Second, the proposed cross-validation procedure is misleading. By splitting the data set 50:50 and requiring significance in both halves, the effective significance level for selection is 0.05 * 0.05 = 0.0025. Hence, a much lower p-value is effectively used for selection of predictors. This reduces the Type I error, but at the price of limited power to detect true effects. The latter is known to be most important for the predictive performance of a model [6].
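The 0.05 * 0.05 arithmetic is easy to verify by simulation (a hypothetical sketch of our own, not code from the paper): under the null hypothesis a p-value is uniform on [0, 1], so the chance that a pure noise variable clears the 0.05 threshold in two independent halves is about 0.0025.

```python
import random

random.seed(0)

# Under the null hypothesis, a p-value is uniformly distributed on [0, 1].
# Requiring p < 0.05 in BOTH halves of a 50:50 split (assuming the two
# halves behave independently) selects a noise variable with probability
# close to 0.05 * 0.05 = 0.0025.
n_sim = 200_000
alpha = 0.05
selected = sum(
    random.random() < alpha and random.random() < alpha
    for _ in range(n_sim)
)
rate = selected / n_sim
print(f"effective selection level: {rate:.4f}")
```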

Third, the authors use the term generalizability while they only study one particular data set from one particular setting. Generalizability is synonymous with external validation, and should be reserved for the situation in which a prediction model is tested in a new, independent setting, different in time and/or place [7]. The current proposal even lacks internal validation; this might require a double bootstrap procedure, in which the whole modelling process is repeated in bootstrap resamples of the original data set.
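As a sketch of what such internal validation could look like (a hypothetical toy example with linear regression and a simple correlation-based selection step, chosen by us for brevity, not the procedure from the paper), the key point is that the selection step is repeated inside every bootstrap resample, so the optimism of the apparent fit is estimated honestly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 20 candidate predictors, only two with real effects.
n, p, k = 200, 20, 3
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

def select_and_fit(X, y, k):
    """The whole modelling process: pick the k predictors most correlated
    with the outcome, then fit ordinary least squares on those alone."""
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    idx = np.argsort(corr)[-k:]
    beta, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
    return idx, beta

def r2(X, y, idx, beta):
    resid = y - X[:, idx] @ beta
    return 1 - resid.var() / y.var()

idx, beta = select_and_fit(X, y, k)
apparent = r2(X, y, idx, beta)

# Bootstrap internal validation: repeat selection AND fitting in each
# resample, then compare the resample fit with the fit on the original
# data; the average difference estimates the optimism.
B = 100
optimism = 0.0
for _ in range(B):
    b = rng.integers(0, n, n)
    bi, bb = select_and_fit(X[b], y[b], k)
    optimism += r2(X[b], y[b], bi, bb) - r2(X, y, bi, bb)
optimism /= B

print(f"apparent R2 {apparent:.3f}, corrected {apparent - optimism:.3f}")
```

Validating only the final frozen model, as is often done, would miss the optimism introduced by the selection step itself.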

In sum, we have severe doubts about the proposals made concerning the relative importance of predictors. No simulation studies or other support of validity is provided for the claim that 'stable and reproducible models with good performances' are obtained with the proposed strategy, nor for the claim that the 'methods [are] a good tool for validating a predictive model'. Stepwise methods in whatever variant lack a reasonable scientific foundation. Many drawbacks are known, including instability of selection, biased estimation of effects, underestimation of variability, and inflation of p-values [4, 5]. We should hence avoid stepwise methods as much as possible, and may better resort to penalization methods such as the lasso [8].
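To make the penalization alternative concrete, here is a minimal sketch using scikit-learn (an assumed library choice; the synthetic data and settings are ours, not from any of the cited papers). An L1 penalty shrinks coefficients continuously and sets weak ones exactly to zero, avoiding the all-or-nothing testing of stepwise selection.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical data: 10 candidate predictors, only the first two matter.
n = 500
X = rng.normal(size=(n, 10))
lin = 1.0 * X[:, 0] + 0.7 * X[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-lin)))

# L1 (lasso) penalized logistic regression: C controls the penalty
# strength (smaller C = stronger shrinkage); in practice C would be
# tuned, e.g. by cross-validation.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(X, y)
coef = model.coef_.ravel()
print(np.round(coef, 2))
```

The strong true effect survives the penalty, while weak coefficients are shrunk toward or exactly to zero, yielding a sparse model without repeated significance testing.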

If one were to judge the relative importance of predictors, more sensible strategies can readily be imagined. A simple solution is to study the Wald statistics in the full model, which approximate the decrease in fit when a predictor is omitted from the model. The advantage of using the full prediction model is that the effects of the predictors are adjusted for each other. The GUSTO-I prediction model was presented with such an ANOVA table [9]. An alternative approach is to study the improvement in fit contributed by each predictor. For example, 26 predictors of outcome after traumatic brain injury were studied with partial Nagelkerke R2 statistics, including the contribution of each predictor in univariate analysis and in fully adjusted analyses [10]. As expected, the relative importance of a predictor was smaller in a larger model, since various positive correlations between predictors were present.
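The drop-in-fit approach is straightforward to implement. A minimal linear-regression sketch (hypothetical data of our own, not the GUSTO-I or IMPACT analyses): fit the full model, then refit with each predictor left out and record the loss in R2. Correlated predictors share credit, so each one's partial contribution is smaller than its univariate contribution.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: x1 and x2 are positively correlated true predictors,
# x3 is pure noise.
n = 1000
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + 0.8 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2, x3])  # column 0 is the intercept
y = x1 + x2 + rng.normal(size=n)

def r2(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

full = r2(X, y)
drops = {}
for j, name in [(1, "x1"), (2, "x2"), (3, "x3")]:
    # Partial contribution: loss in R2 when this predictor is omitted
    # from the otherwise full model.
    drops[name] = full - r2(np.delete(X, j, axis=1), y)
    print(f"{name}: partial R2 = {drops[name]:.3f}")
```

All contributions are adjusted for the other predictors, which is exactly what univariate selection frequencies fail to provide.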

We hope that the latter approaches will be followed by others involved in prediction modelling rather than the proposals made in the recent paper.

Authors:

E.W. Steyerberg, PhD

Department of Public Health

Erasmus MC

PO Box 2040

3000 CA Rotterdam

The Netherlands

e.steyerberg@erasmusmc.nl

F.E. Harrell, PhD

Department of Biostatistics

Vanderbilt University

S-2323 Medical Center North

Nashville, TN 37232-2158

f.harrell@vanderbilt.edu

References

1. Beyene J, Atenafu EG, Hamid JS, To T, Sung L: Determining relative importance of variables in developing and validating predictive models. BMC Medical Research Methodology 2009, 9:64.

2. Derksen S, Keselman H: Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. Br J Math Stat Psychol 1992, 45:265-282.

3. Royston P, Altman DG, Sauerbrei W: Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med 2006, 25(1):127-141.

4. Harrell FE: Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. New York: Springer; 2001.

5. Steyerberg EW: Clinical prediction models: a practical approach to development, validation, and updating. New York: Springer; 2009.

6. Steyerberg EW, Eijkemans MJ, Harrell FE, Jr., Habbema JD: Prognostic modeling with logistic regression analysis: in search of a sensible strategy in small data sets. Med Decis Making 2001, 21(1):45-56.

7. Justice AC, Covinsky KE, Berlin JA: Assessing the generalizability of prognostic information. Ann Intern Med 1999, 130(6):515-524.

8. Tibshirani R: Regression shrinkage and selection via the lasso. J R Stat Soc Ser B 1996, 58(1):267-288.

9. Lee KL, Woodlief LH, Topol EJ, Weaver WD, Betriu A, Col J, Simoons M, Aylward P, Van de Werf F, Califf RM: Predictors of 30-day mortality in the era of reperfusion for acute myocardial infarction. Results from an international trial of 41,021 patients. GUSTO-I Investigators. Circulation 1995, 91(6):1659-1668.

10. Murray GD, Butcher I, McHugh GS, Lu J, Mushkudiani NA, Maas AI, Marmarou A, Steyerberg EW: Multivariable prognostic analysis in traumatic brain injury: results from the IMPACT study. J Neurotrauma 2007, 24(2):329-337.

## Competing interests

No competing interests.