Determining relative importance of variables in developing and validating predictive models
© Beyene et al. 2009
Received: 21 November 2008
Accepted: 14 September 2009
Published: 14 September 2009
Skip to main content
© Beyene et al. 2009
Received: 21 November 2008
Accepted: 14 September 2009
Published: 14 September 2009
Multiple regression models are used in a wide range of scientific disciplines and automated model selection procedures are frequently used to identify independent predictors. However, determination of relative importance of potential predictors and validating the fitted models for their stability, predictive accuracy and generalizability are often overlooked or not done thoroughly.
Using a case study aimed at predicting children with acute lymphoblastic leukemia (ALL) who are at low risk of Tumor Lysis Syndrome (TLS), we propose and compare two strategies, bootstrapping and random split of data, for ordering potential predictors according to their relative importance with respect to model stability and generalizability. We also propose an approach based on relative increase in percentage of explained variation and area under the Receiver Operating Characteristic (ROC) curve for developing models where variables from our ordered list enter the model according to their importance. An additional data set aimed at identifying predictors of prostate cancer penetration is also used for illustrative purposes.
Age is chosen to be the most important predictor of TLS. It is selected 100% of the time using the bootstrapping approach. Using the random split method, it is selected 99% of the time in the training data and is significant (at 5% level) 98% of the time in the validation data set. This indicates that age is a stable predictor of TLS with good generalizability. The second most important variable is white blood cell count (WBC). Our methods also identified an important predictor of TLS that was otherwise omitted if relying on any of the automated model selection procedures alone. A group at low risk of TLS consists of children younger than 10 years of age, without T-cell immunophenotype, whose baseline WBC is < 20 × 109/L and palpable spleen is < 2 cm. For the prostate cancer data set, the Gleason score and digital rectal exam are identified to be the most important indicators of whether tumor has penetrated the prostate capsule.
Our model selection procedures based on bootstrap re-sampling and repeated random split techniques can be used to assess the strength of evidence that a variable is truly an independent and reproducible predictor. Our methods, therefore, can be used for developing stable and reproducible models with good performances. Moreover, our methods can serve as a good tool for validating a predictive model. Previous biological and clinical studies support the findings based on our selection and validation strategies. However, extensive simulations may be required to assess the performance of our methods under different scenarios as well as check their sensitivity to a random fluctuation in the data.
Regression models have long been used in medicine, clinical epidemiology, health services research, pharmaceutical research, social sciences and business studies. Their application has greatly increased in the past few decades [1–3]. In particular, regression models such as logistic and Cox models have become standard statistical methods in medicine and epidemiology . Regression models can also be used to combine data from different sources. In cancer research, for example, predictive models based on wide range of data such as clinical, gene expression and single nucleotide polymorphism (SNP) are frequently developed to assess risk factors with the ultimate aim of developing better prevention, diagnostic and treatment strategies [5–8]. In recent years, clinical data have been integrated with genomic and other types of data for developing more accurate predictive models [9–11]. For instance, clinico-genomic models were used in breast cancer outcomes prediction where genomic data combined with clinical and demographic characteristics were used for improving predictive accuracy .
Despite the fact that predictive models are commonly used in wide range of scientific disciplines, they are frequently presented inadequately [3, 12, 13]. A recently published systematic review discusses the gravity of this issue . The authors investigated 2,234 papers published in 104 journals in areas of obstetrics and gynecology. A considerable percentage (34.2%) of the studies used logistic regression. Among those that utilized logistic regression, most (96%) reported significance (in the form of p-values), however, only 3.6% reported goodness of fit. None of them considered model validation of any sort. The authors, however, did not distinguish between etiologic and predictive models. The empirical results, therefore, might well include both types of models. Concerns regarding such inadequate presentation of predictive models in other areas of medicine such as cancer research have also been reported [3, 13–15].
Validity of predictive models as well as their application in the general population highly depends on their goodness of fit [3, 4, 16, 17]. Assessing goodness of fit and model validation are, therefore, essential in any model fitting task in general and in constructing predictive models in particular.
There are different methods for developing predictive models and it is very challenging to maintain balance between including too many variables (and risk loss of precision) and omitting important variables (and risk biased prediction) . Automated variable selection methods are frequently used in regression analysis for selecting important predictors, however, these approaches may result in spurious noise variables being mistakenly identified as independent predictors of the outcome . Moreover, predictive models developed using such automated algorithms tend to be non-reproducible and hence are not reliable in practice. Use of good strategies for model selection along with adequate performance and goodness of fit measures are, therefore, needed in developing accurate prediction.
The ultimate goal of developing a predictive model is to use the model for predicting future outcomes in a much larger, broader and heterogeneous population. It is important to note that accurate models that perform extremely well may perform poorly outside the data set they are built on. In fact, the performance of a predictive model is overestimated if determined on data used to construct the model and may not be reliable for the larger population . For more details regarding model selection and the tradeoff between model over-fitting and under-fitting can be found in [20, 21].
In this paper, we propose two different approaches for ordering potential predictors according to their relative importance with respect to model stability and generalizability. We use the concept of explained variation and area under the Receiver Operating Characteristic (ROC) curve for developing models and in determining how many predictors to include in final model. We illustrate the approaches using a study conducted by Truong and colleagues that aimed at developing a prediction rule to identify patients with Acute Lymphoblastic Leukemia (ALL) at lower risk of Tumor Ly sis Syndrome (TLS). A publicly available data set is also used for additional illustration.
To illustrate model selection and validation methods presented in this paper, data and statistical results from previous two studies are considered. Here we give brief descriptions of the original studies. For more details about the data and the statistical analyses performed, we refer the reader to the original papers [5, 22]. We focus on the first case study and illustrate our methods using this data set. However, data from a prostate cancer study is also considered for additional and supplementary illustration.
With the long-term aim of a risk-stratified approach to the prevention of Tumor Lysis Syndrome (TLS), the study was originally performed to describe the prevalence and predictors of TLS in childhood Acute Lymphoblastic Leukemia (ALL) and develop a sensitive prediction rule to identify patients at low risk of TLS .
In this paper, we consider the primary outcome, i.e., presence or absence of TLS. In total, 328 patients (of which 74 are TLS cases) were included in the study. Potential predictors included in the model consist of 1) Laboratory features: white blood cell count (WBC ≥ 20 × 109/L) and presence of T-cell immunophenotype 2) Clinical indicators: presence of mediastinal mass, hepatomegaly(palpable liver ≥ 3 cm) and splenomegaly (palpable spleen ≥ 2 cm), and 3) Demographic characteristics: age (≥ 10 years) and sex.
The main objective of the original study was to determine whether variables measured at a baseline exam can be used to predict whether the tumor has penetrated the prostate capsule. A total of 380 subjects are considered of which 153 had a cancer that penetrated the prostate capsule. Variables included in the study are age, race, results of digital rectal exam(DPROS), detection of capsular involvement in rectal exam(DCAPS), prostate specific antigen (PSA), tumor volume from ultrasound (VOLUME) and total Gleason score. We refer the reader to for more details about this data set.
We propose two approaches for ordering variables according to their relative importance with respect to model stability and generalizability. A predictive model is then constructed using explained variation and area under the Receiver Operating Characteristic (ROC) curve where variables from our ordered list enter the model according to their relative importance. Increase in explained variation as well as the area under the ROC curve is used to determine the number of variables included in the final model.
Bootstrapping is a modern, computer-intensive re-sampling approach which allows simulation of test data sets that mimic the initial original data set and the technique has been used successfully in a wide range of applications . An excellent comprehensive tutorial paper that focuses on the application of the bootstrap method for constructing confidence intervals is provided in . Bootstrap re-sampling along with an automated selection procedure have been used to develop parsimonious models [25, 26]. It has been demonstrated that bootstrap methods can be used to assess the strength of evidence that an identified variable truly is an important predictor. Sauerbrei and Schumacher used bootstrap sampling to assess the distribution of an indicator variable denoting the inclusion of a specific predictor for identifying strong and weak factors for predicting survival . Bootstrapping has also been used to show that automated model selection procedures are likely to identify noise variables as important predictors and hence produce unstable models in a logistic regression setup .
Here we propose the use of bootstrapping for ordering potential predictors according to their importance. We took one thousand bootstrap samples from the original data. On each bootstrap sample, we applied three most commonly used automated model selection procedures (forward, backward and stepwise) where all candidate variables are included at the initial stage of the analysis. Candidate variables are, then, ordered according to their importance, where the variable chosen most frequently is ranked first.
After the initial ordering of the candidate predictors using bootstrapping, our final predictive model is constructed using a stepwise procedure where the top ranked variable from our ordered list enters the model first. Variables are then sequentially added to the model according to their relative importance. At each stage, we calculate the improvement in the predictive accuracy of the model, measured by the percentage of explained variation and area under the ROC curve.
The bootstrap approach described above addresses the issue of variable stability. However, model reproducibility is also another important criterion since models are often developed with the aim of predicting future outcomes. We considered variables to be reproducible or generalizable if their predictive power can be generalized to new data set. That is, the model consisting of these variables not only have a good predictive power for the data set the model is built on, but also its performance (with respect to predictive power) can be generalized in predicting future outcomes. Here we propose an approach for model selection that not only considers model stability but also generalizability and reproducibility. We randomly split the original data evenly into training and validation data sets. This step is repeated one thousand times. We use the training data sets to build the models and validate the selected models using the validation data sets. All candidate variables are considered and potential predictors are selected using forward, backward and stepwise selection procedures using the training data sets. Moreover, for each random split of data, we checked if the selected predictors are significant in the validation data set. The variables are then ordered according to their relative importance, where the importance of a variable is measured by the proportion (s) it is selected as a predictor and as well as the proportion (v) at which it is significant in the validation data set. We propose the product sv as a measure of relative importance incorporating stability and generalizability. This product is equivalent to the proportion at which a variable is both selected in the final model and is significant using the validation data set.
That is, if we define two events such that A = variable is selected using an automated model selection procedure based on the training data set, B = variable is significant using the validation data set. Then, A & B = variable is selected using the training data set and it is significant in the validation data set. What we are interested in is the probability of A & B. This probability, which incorporates stability and generalizability measures, can be estimated as the proportion at which a given variable is selected and is significant simultaneously. It is possible to show that this proportion is equivalent to sv.
Automated model selection procedures: backward elimination, forward selection and stepwise selection, are the three most commonly used variable selection procedures. In backward elimination, variables are eliminated from the full model based on a certain criteria mostly based on a decrease in R2or deviance. In forward selection, however, we start with an empty model and variables are added sequentially where, at each step, a variable that brings the largest increase in R2or deviance will be added in the model. Stepwise procedure is a variation of forward selection where variables are allowed to be eliminated from the model. The stopping rule, which is usually pre-specified, for all the three automated model selection procedures is based on significance at a specified level. For comprehensive overviews of automated model selection procedures, we refer the reader to [28, 29]. For variable selection from a Bayesian perspective we refer the reader to .
We use two measures of explained variation appropriate for binary outcome - direct and indirect . The direct measure is based on residual from the fit and the indirect index is related to standard measure of information. Suppose the estimates from a logistic model without a covariate (i.e., unconditional) are denoted by and let the estimates from a model with covariate xi (i.e., conditional) be . Define and . The explained variation based on the direct estimates is given by . Similarly, let and . The explained variation based on the indirect estimates is calculated as .
We performed 5-fold cross validation based on the original TLS data to check the performance of our approaches. That is, we randomly divided the data into five parts, estimate model parameters based on 4 of the 5 data sets and use the remaining one fold for validation. The randomization is repeated 5 times and prediction error is averaged over the five cross validations and the number of randomizations. By doing repeated cross-validation, it is possible to reduce variability and provide more stable estimates.
Proportion (s) at which each of the variables entered the final model and average p-value for three automated model selection procedures based on one thousand bootstrap samples
Cumulative percentage of explained variation and area under ROC obtained by including the variables in the model according to the order of importance as given in Table 1.
under ROC curve
A model with the first four variables explained 27.18% with area under ROC curve 81.3%. The increases in percentage variation as well as area under ROC curve are negligible after the first four variables have been included in the model. A final model with a good predictive accuracy (area under ROC = 81.3%) can be constructed using the first 4 variables. Group of children at low risk of TLS can, therefore, be defined as absence of age, baseline WBC, splenomegaly and T-cell immunophenotype. That is, children younger than 10 with absence of T-cell immunophenotype, whose baseline WBC < 20 × 109/L and a palpable spleen < 2 cm, are at low risk for TLS. It is important to note that this model is different from the one constructed in the original study where the predictors in the final model are age, baseline WBC, splenomegaly and mediastinal mass. The last variable is the second last variable in our ordering.
As can be seen from Table 2, the percentage of explained variation increases sharply in the beginning. For instance, a model with only age explained almost 9% of the variation with area under ROC 64.3, whereas age and baseline WBC combined explained 19% of the variation with area under ROC 75.7.
The proportion at which a given variable is selected in the training set and is significant in the validation set after randomly splitting the data 1000 times
Percentage of explained variation and area under ROC obtained by including the variables in the model according to the order of importance given in Table 3.
under ROC curve
The results presented in Table 4 indicate that about 25% of the variation is explained by the first three variables and the area under the ROC curve for a predictive model consisting of only these three variables is 79.1. Table 4 show s that the increase in the percentage of variation as well as the area under the ROC curve is very small after including the first 3 variables.
The model with the first 5 variables explains about 28% of the variation and the area under the ROC curve is 81.5. However, the contribution of mediastinal mass towards the total variation as well as area ROC is very small once T-cell immunophenotype is included in the model. This might be an indication that these two variables are highly correlated and hence some of the contribution of mediastinal mass is already accounted for by T-cell immunophenotype. We used a chi-square test to investigate the relationship between these two variables and the results indeed show a very strong association (p-value < 0.0001). As a result, including both of the variables in the final model is not only unnecessary but also introduces multicolinearity to the model. From Tables 1 and 4, we can see that T-cell immunophenotype is more important than mediastinal mass according to importance measures related to model stability and generalizability. We, therefore, keep age, baseline WBC, T-cell immunophenotype and splenomegaly in the final model. In practice, however, further biological/clinical investigation might be required to assess the biological and/or clinical importance of the variables involved before we decide to exclude variables from the final model. According to our statistically derived and validated model, a group at low risk of developing TLS consists of children younger than 10 years, without of T-cell immunophenotype, whose baseline WBC < 20 × 109/L and palpable spleen < 2 cm. Note that this low risk group is similar with the group selected using the bootstrap approach.
It is also important to note that, T-cell immunophenotype, which is selected as the third most important variable using the random split and the fourth important variable in the bootstrapping was excluded from the model proposed in the original paper .
We used 5-fold cross validation to assess the performance of the models elected based on the bootstrapping and random split approaches. Recall that both approaches produced the same model - a predictive model consisting of age, baseline WBC, T-cell immunophenotype and splenomegaly. We also consider the model proposed in the original paper for comparison purposes. Their model is based on age, baseline WBC, Mediastinal mass and Splenomegaly .
The results from our analysis show that the model selected using our approach has a slightly less prediction error (0.17) than the model proposed in the original paper (0.19). The corresponding ROC values for the two models are 81.3% and 81.1%, respectively. This indicates that the model proposed in the original paper performs well in general and has good generalizability. However, an important variable was omitted from the model as discussed earlier. Moreover, the original model consisting of age, baseline WBC, mediastinal mass and splenomegaly, was not validated although some goodness of fit properties were reported. The approaches proposed in this paper and the statistical results from the analysis can, therefore, be used as a confirmation for the performance of the proposed model and its appropriateness for practical use.
Since the data is relatively larger than case study 1 and also has a larger event rate, we divided the data set in to training and validation where we set aside 1/3rdof the data for validation purposes. For the training set, a model consisting of the first two important variables, Gleason score and digital rectal exam, gave area under the ROC = 84.2%. This value increased to just 85.3%when PSA is added to the model. When testing using the validation data(the 1/3 rd data set aside in the beginning), the area under the ROC increased from 74.2% to 75.6% indicating that although PSA seems to be an independent predictor for this particular data set, it may not be a good predictor for future data and hence non reproducible. This finding is supported by previous studies that have shown that, although PSA is negatively associated with cancer penetration, it was not able to predict pathological features in clinically localized prostate cancer in the overall population [33, 34].
Predictive models are used frequently in medical research and automated model selection procedures are traditionally used for identifying predictors. However, assessing model fit and validation of the selected models are rarely performed or not done thoroughly, if done at all. Moreover, models selected using automated procedures often include noise variables that have little or no relationship with the outcome variable. Similarly, important variables might be omitted.
In this paper we proposed two approaches for ordering candidate variables according to measures incorporating model stability and generalizability. In our first approach, we used bootstrapping and defined variable importance as the proportion a given variable is selected and hence leads to models that are more stable than models selected using only automated procedures. This method is similar to that of Austin and Tu [26, 27]. What distinguishes our approach from theirs is that we use d measures of importance such as relative increase in percentage of explained variation and area under the ROC curve in determining the number of variables included in the final model. In the second approach, we performed one thousand random split of data into training and testing data sets. Variables are then ordered according to the frequency they were selected using the training data set at the same time statistically significant in the validation data set. Our second approach, thus, leads to a ranking that not only incorporated model stability but also takes generalizability and reproducibility into account. We would like to note that random split, like cross validation, involves random sampling. The underlying population is, therefore, the same. When available, a more rigorous test based on a completely different sample from another population can be used to assess model generalizability. How to do the random split is also an important issue that has been addressed by many in classification/prediction literature. The amount of data to set aside often depends on the sample size and also predictive ability of the variables included in the analyses. For a small sample size, for instance, 50% split as done in our manuscript might lead to poor predictive model (classifier) since the model is built on insufficient data set. Under such circumstances, setting 1/3rd of data aside for testing might give better results. One could also use k-fold cross validation (where k > 2) when the sample size is small.
We used the concept of explained variation and area under ROC where the most important variable enters the model first and variables are added until the improvement in explained variation as well as area under the curve is negligible. This is particularly important when the number of candidate variables is very large and one has to select the first top most important variables. The issue of how small the improvement has to be, in order to decide the number of variables entering the final model, is a very challenging task. In this paper, we used an approximate cutoff point using plots similar to scree plot in principal component analysis which allows us to visually assess the point where the improvement becomes negligible. However, it might be interesting to develop a test statistic based on the percentage increase in explained variation and/or area under the ROC curve. One can, thus, test if the improvement is statistically significant or not. This is a topic we would like to investigate in future research.
It is important to note that, the original childhood leukemia study presented in this paper is well designed where candidate predictors were selected after a thorough investigation of their clinical relationship with the outcome variable. Consequently, not much difference was observed between the performance of the original model and the model constructed using our approaches. However, the results in this paper indicate that even in such well designed study an important variable could be omitted if relying on automated model selection procedures alone. This indicates that researchers should be more cautious when using automated procedures and they are encouraged to use re-sampling techniques along with these procedures.
Application of our approaches to the prostate cancer data set identified two variables as important predictors of whether tumor has penetrated the prostate capsule. These variables have been previously identified as powerful indicators of tumor penetration and prostate cancer death supporting our findings. Moreover, prostate specific antigen (PSA), which was ranked as the third important variable based on the bootstrap approach, was not identified as an important variable in using our random split approach. This indicates that PSA, although associated with prostate cancer, is not a reproducible predictor of tumor penetration and hence not helpful for predicting for patients outside the study population. This finding is also supported by previous biological and clinical studies.
In conclusion, automated model selection procedures may result in the inclusion of noise variables and unstable models. Our proposed methods and analytical strategies can be used to assess variable stability and reproducibility and hence may lead to more robust models with favorable predictive performance. However, extensive simulation is needed to study the operating characteristic of our approaches in a more general setting. We plan to pursue this in future research.
We would like to gratefully acknowledge a seed grant support provided by the Research Institute of the Hospital for Sick Children for the establishment of the Biostatistics Methodology Unit(BMU).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.