The search for stable prognostic models in multiple imputed data sets

Background: In prognostic studies, model instability and missing data can be troubling factors. Proposed methods for handling these situations are bootstrapping (B) and multiple imputation (MI). The authors examined the influence of these methods on model composition.
Methods: Models were constructed using a cohort of 587 patients who consulted their general practitioner with a shoulder problem between January 2001 and January 2003 in the Netherlands (the Dutch Shoulder Study). Outcome measures were persistent shoulder disability and persistent shoulder pain. Potential predictors included socio-demographic variables, characteristics of the pain problem, physical activity and psychosocial factors. Model composition and performance (calibration and discrimination) were assessed for models derived using complete case analysis, MI, bootstrapping, or both MI and bootstrapping.
Results: Model composition varied between models as a result of how missing data were handled, and bootstrapping provided additional information on the stability of the selected prognostic model.
Conclusion: In prognostic modeling, missing data should be handled by MI, and bootstrap model selection is advised in order to provide information on model stability.


Background
In healthcare, predicting how long it takes for an episode of musculoskeletal pain to resolve can be difficult. Outcome varies between patients and over time. Although clinicians can be relatively good "prognosticians" [1,2], clinical judgment and intuition can be incorrect and are difficult to quantify or make explicit. To understand the ingredients that contribute to a correct prognosis and to improve upon clinical judgment, clinical prediction rules can be useful. These provide a quantitative estimate of the absolute risk of particular outcomes of interest for individual patients, which may subsequently be used to support treatment decisions. Several clinical prediction rules have already been developed in the field of musculoskeletal pain, for example to estimate the outcome of low back [2][3][4], knee [5] or shoulder pain [6].
In the development of clinical prediction models, researchers frequently use regression analysis with a backward or forward selection strategy. However, this methodology may result in overoptimistically estimated regression coefficients, omission of important predictors and random selection of less important predictors. As a result, derived models may be unstable [7]. Incorporating a bootstrap resampling procedure in model development has been suggested to provide information on model stability [8][9][10][11]. Since bootstrapping mimics the sampling variation in the population from which the sample was drawn, it is expected to produce a model that better represents the underlying population [9][10][11].
Another problem occurring in prognostic studies is missing data. Multiple imputation (MI), which uses all observed information, has been shown to be superior to other imputation techniques such as single regression imputation [12,13]. However, MI is not yet frequently used in predictive modelling, and model stability is hardly ever accounted for in MI approaches. It has been shown for low back pain that extending MI with a bootstrapping procedure provides accurate model selection and information on model stability [14]. However, the generalizability of this method has never been tested in other patient data sets.
Therefore, the objective of our research was to examine the influence of bootstrapping and multiple imputation on model composition and stability in a shoulder pain data set with missing values.

Study population
We used data from the Dutch Shoulder Study (DSS) [15]. This cohort consists of 587 patients who consulted their general practitioner (GP) with a new episode of shoulder disorders. The inclusion criterion was no GP consultation or treatment received for the afflicted shoulder in the preceding three months. Exclusion criteria were: dementia and severe psychiatric or physical conditions (i.e. fractures or dislocation in the shoulder region, rheumatic diseases, neoplasms, neurological or vascular disorders). The ethics review board of the VU University medical centre approved the study protocol.

Outcome measures
We focused on two outcome measures: persistent shoulder disability (16-item SDQ; 0-100) [16] and persistent shoulder pain intensity (Numeric Rating Scale; 0-10) [17]. To define 'persistence', baseline scores were subtracted from follow-up scores. An optimal cut-off point was defined by studying the relationship between the change scores and a secondary outcome measure, 'patient perceived recovery' [6]. Patients were denoted as recovered when they characterized their complaints as 'fully recovered' or 'very much improved'. By constructing Receiver Operating Characteristic (ROC) curves with patient perceived recovery as the external criterion, the optimal cut-off point (i.e. the point yielding the lowest overall misclassification) was determined [18]. According to this analysis, a 50% decrease in disability and pain intensity compared to baseline was considered a minimal important change, and was used as a cut-off value to dichotomize both outcome measures. Patients who improved less than 50% were denoted as having persistent pain or disability. Outcomes were measured three months after enrolment by postal questionnaire.
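The cut-off search described above can be sketched as follows. This is a minimal Python illustration (the study itself used R), and the change scores and recovery labels shown are hypothetical: candidate cut-offs on the percentage-change score are scanned, and the one with the lowest overall misclassification against the external recovery criterion is kept.

```python
def optimal_cutoff(pct_change, recovered):
    """Return the cut-off on the percentage-change score with the lowest
    overall misclassification against the external criterion
    (patient perceived recovery), as read off an ROC analysis."""
    best_cut, best_errors = None, float("inf")
    for cut in sorted(set(pct_change)):
        # classify a patient as 'recovered' when improvement >= cut
        errors = sum((p >= cut) != r for p, r in zip(pct_change, recovered))
        if errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut

# Hypothetical change scores (% improvement) and perceived-recovery labels
pct = [10, 20, 30, 50, 60, 70, 80, 90]
rec = [False, False, False, True, True, True, True, True]
print(optimal_cutoff(pct, rec))  # -> 50
```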

Prognostic factors
Based on a systematic review of the literature [19], a set of candidate predictors was selected, including demographic variables, characteristics of the shoulder pain problem, and physical and psychological factors (see Table 1). The following questionnaires were used to gather information on psychological factors: the Pain Coping and Cognition List (PCCL [20]: pain coping, catastrophizing, internal and external locus of control), the 4 Dimensional Symptom Questionnaire (4DSQ [21]: anxiety, depression, somatisation, distress), the Fear-Avoidance Beliefs Questionnaire (FABQ [22]: fear-avoidance) and the Tampa Scale for Kinesiophobia (TSK [23,24]: kinesiophobia). Within 10 days after consulting the GP, participants completed a baseline questionnaire to assess potential predictors.

Analysis
For all continuous predictors the linearity assumption was checked. When the relationship between a variable and the outcome did not resemble linearity, the variable was categorized (3 categories) or dichotomized. Although this causes loss of information [25], these procedures were retained because they are part of the standard statistical methodology frequently used in predictive modeling. Variables were checked for (multi)collinearity using Pearson's r, given that correlated variables can disturb variable selection in multivariable regression [26]. In case of correlation (r ≥ 0.5), the variable that could most easily be obtained by the physician in clinical practice was retained.
To reduce the initial number of variables, a univariable pre-selection analysis (P < 0.157) was performed in both the imputed and unimputed data sets; all analyses described below were preceded by this pre-selection. The subsequent analyses were all based on a multivariable analysis with a backward selection strategy and a stopping rule of α = 0.157. This significance level is available in many statistical software packages, and results have been shown to be comparable with the more complex Akaike Information Criterion (AIC) [27]. The number of events per variable (EPV) was calculated for each method to check whether the analysis was sufficiently powered (EPV > 10) [28]. The checklist proposed by Harrell [29] for multivariable modeling was followed where possible. To study the effect of missing data and model stability on model composition, the following four methods were compared:
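The backward selection loop with the α = 0.157 stopping rule can be sketched generically as follows. This is an illustrative Python skeleton, not the authors' R code; `pvalue_of` stands in for the p-value a regression routine would return for each predictor, and the toy predictor names and p-values are invented.

```python
def backward_select(predictors, pvalue_of, alpha=0.157):
    """Generic backward elimination: repeatedly drop the least significant
    predictor until every remaining p-value is at or below the stopping
    rule alpha. `pvalue_of(predictor, model)` is a placeholder for the
    p-value obtained from refitting the multivariable model."""
    model = list(predictors)
    while model:
        # p-value of each predictor in the current model
        pvals = {x: pvalue_of(x, model) for x in model}
        worst = max(pvals, key=pvals.get)
        if pvals[worst] <= alpha:
            break  # all remaining predictors meet the stopping rule
        model.remove(worst)
    return model

# Toy example with fixed (hypothetical) p-values
fixed_p = {"age": 0.02, "duration": 0.10, "sport": 0.40, "workload": 0.25}
print(backward_select(fixed_p, lambda x, m: fixed_p[x]))  # -> ['age', 'duration']
```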

1) Complete Case Analysis (CCA)
To handle missing data, subjects with missing values on any of the variables were omitted, and only subjects with information on all variables in the model were included in the analysis.

2) Multiple imputation (MI-5)
Missing values were imputed using the Multivariate Imputation by Chained Equations (MICE) procedure with predictive mean matching as the imputation method [30]. All available data, including the outcome measures, were used in the imputation model [13]. We generated five imputed data sets (MI-5).
Multivariable regression was applied to each of the five imputed data sets. From these five models, predictors that appeared in at least two models (an inclusion fraction of ≥40%) qualified for the final model. Whether these predictors contributed significantly to the final model was tested using a likelihood ratio test [31] with a critical P-value of 0.157. Predictors with a non-significant likelihood ratio test (P > 0.157) were dropped from the final model.
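The inclusion-fraction rule for pooling the five per-imputation models can be sketched as follows (a Python illustration with invented predictor names; the actual analysis was done in R):

```python
from collections import Counter

def qualify_predictors(models, n_imputations=5, min_fraction=0.40):
    """Pool backward-selected models from each imputed data set: a
    predictor qualifies for the final model when its inclusion fraction
    (the share of imputed data sets whose model contains it) is >= 40%,
    i.e. at least 2 of 5 models."""
    counts = Counter(p for model in models for p in model)
    threshold = min_fraction * n_imputations
    return sorted(p for p, c in counts.items() if c >= threshold)

# Hypothetical selected models from 5 imputed data sets
models = [
    {"age", "duration"},
    {"age", "somatisation"},
    {"age", "duration", "somatisation"},
    {"duration"},
    {"age"},
]
print(qualify_predictors(models))  # -> ['age', 'duration', 'somatisation']
```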

3) Bootstrapping (B)
A two-step bootstrap model selection procedure [9,11] was applied to provide information on model stability. First, 500 samples were drawn with replacement from the complete case data set, and a multivariable model was built in each sample. To be consistent with the MI-5 method, predictors that appeared in ≥40% of these models qualified for the second step. In the second step, 500 new complete case samples were drawn, and in each a multivariable model was built using the predictors from the first step. These 500 models provided information on model stability (i.e. which combination of predictors was most frequently selected).
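The two-step procedure can be sketched as below. This is a Python outline under stated assumptions: `select(sample)` is a placeholder for the full backward-selection fit on one bootstrap sample, and the demo `select` that always returns the same predictor set is purely illustrative.

```python
import random
from collections import Counter

def bootstrap_model_selection(data, select, n_boot=500, min_fraction=0.40):
    """Two-step bootstrap model selection (sketch).
    Step 1: keep predictors selected in >= 40% of bootstrap models.
    Step 2: refit on fresh bootstrap samples restricted to those
    predictors and tally how often each predictor combination occurs."""
    # Step 1: which predictors are stable across resamples?
    counts = Counter()
    for _ in range(n_boot):
        sample = random.choices(data, k=len(data))  # resample with replacement
        counts.update(select(sample))
    candidates = {p for p, c in counts.items() if c / n_boot >= min_fraction}
    # Step 2: model selection frequencies among the candidates
    combos = Counter()
    for _ in range(n_boot):
        sample = random.choices(data, k=len(data))
        combos[frozenset(select(sample) & candidates)] += 1
    return candidates, combos.most_common(1)[0]

# Demo with a trivial 'select' stand-in that always picks the same predictors
cands, (best, count) = bootstrap_model_selection(
    list(range(30)), lambda s: {"age", "duration"}, n_boot=50)
print(sorted(cands), count)  # -> ['age', 'duration'] 50
```

With a real selection routine the most-common combination and its frequency quantify model stability, exactly the information the second step is meant to provide.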

4) Multiple imputation + bootstrapping (MI-5+B)
Missing data were imputed using the MICE procedure, and five imputed data sets were created. In each of the five imputed data sets, the two-step bootstrap model selection procedure described above was applied. Information on model stability was obtained by studying which combination of predictors occurred most frequently across the resulting 2500 (5 × 500) data sets.

Internal validation
The apparent performance of a predictive model is typically better in the data set in which the model was developed than in another similar data set [32]. This phenomenon is called overoptimism. Using a bootstrap procedure with n = 200 samples for internal validation [33], the performance of each derived model was tested in populations similar to the derivation sample. This method was used to estimate the overoptimism of the derived models and to adjust the measures of performance.
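The optimism correction can be outlined as follows (a hedged Python sketch of the general bootstrap-validation idea; `fit` and `perform` are placeholders for the modeling and performance routines, which in the study were R functions):

```python
import random

def optimism_corrected(data, fit, perform, n_boot=200):
    """Bootstrap internal validation (sketch): refit the model in each
    bootstrap sample; optimism is the average gap between the refitted
    model's performance in the bootstrap sample (apparent) and in the
    original data (test). The apparent performance of the original model
    is then shrunk by this average optimism."""
    apparent = perform(fit(data), data)
    optimism = 0.0
    for _ in range(n_boot):
        sample = random.choices(data, k=len(data))  # resample with replacement
        model = fit(sample)
        optimism += perform(model, sample) - perform(model, data)
    return apparent - optimism / n_boot

# Degenerate demo: a fixed model with constant performance has zero optimism
score = optimism_corrected(list(range(50)),
                           fit=lambda d: None,
                           perform=lambda model, d: 0.7,
                           n_boot=20)
print(score)  # -> 0.7
```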

Model evaluation
Derived models were evaluated by comparing model composition (the combination of predictors). Next, several measures of predictive performance were considered. Discrimination refers to how well a model distinguishes between patients with and without persistent symptoms and is quantified by the c-index, which, for binary outcomes, is identical to the area under the ROC curve (AUC) [34]. The c-index varies between 0.5 and 1, with 0.5 indicating no discrimination above chance and 1 indicating perfect discrimination. The agreement between predicted and observed probabilities is called calibration and was measured by computing the slope of the calibration plot (predicted probabilities against observed frequencies). Well-calibrated models have a slope of 1. As a measure of explained variance, Nagelkerke's R² was computed.
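For a binary outcome the c-index has a simple pairwise interpretation, which the following Python sketch makes concrete (the predicted probabilities shown are hypothetical):

```python
from itertools import combinations

def c_index(pred, outcome):
    """Concordance index for a binary outcome: among all patient pairs
    with different outcomes, the fraction in which the patient with the
    event received the higher predicted probability (ties count half).
    For binary outcomes this equals the area under the ROC curve."""
    concordant = comparable = 0.0
    for (p_i, y_i), (p_j, y_j) in combinations(zip(pred, outcome), 2):
        if y_i == y_j:
            continue  # pairs with the same outcome are not comparable
        comparable += 1
        # predicted probability of the patient who had the event vs. not
        p_event, p_none = (p_i, p_j) if y_i else (p_j, p_i)
        if p_event > p_none:
            concordant += 1
        elif p_event == p_none:
            concordant += 0.5
    return concordant / comparable

# Hypothetical predictions and observed persistent-symptom outcomes
print(c_index([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]))  # -> 0.75
```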

Software
All analyses were performed using the R statistical software (version 2.4.0). The R Design package was used for the CCA, the MICE package was used for the MI, and additional routines were developed for applying the bootstrap.

Results
The baseline patient characteristics are listed in Table 1. After three months, 517 patients (88%) returned the follow-up questionnaire. Subjects lost to follow-up were younger (mean difference of 7 years) and more often had an acute onset (47% versus 37%). Due to non-response, the percentage of missing data was largest for the outcome measures (shoulder disability 12.3%; shoulder pain intensity 12.9%). Other (baseline) variables had missing values in the range of 0 to 9.2%. The combination of missing values resulted in the exclusion of 24.7% (disability model) and 28.8% (pain intensity model) of participants in the CCA.
In the CCA, 12 variables showed a univariable association with persistent disability and 16 with persistent pain, resulting in an EPV of 17 for shoulder disability and 11.9 for pain intensity. In the five imputed data sets, the EPV varied between 19.1 and 19.6 for disability and between 13.5 and 13.8 for pain intensity. This means that the analyses were sufficiently powered (with a sufficient number of cases) to reliably estimate the associations between predictors and outcome.

Model composition
For all presented models, the directions of the associations (i.e. regression coefficients) between the selected predictors and outcome were the same for both disability and pain (data not presented). Tables 2 and 3 show that for both outcome measures, model composition was influenced by missing data (CCA vs. MI-5). When models were derived from imputed data, model composition diverged from the CCA model. For both outcome measures, predictors with lower predictive ability in the CCA (i.e. rank order according to regression coefficient estimates) were not included in the MI-5 model (e.g. concomitant lower extremity pain for shoulder disability; sporting activities and higher physical workload for pain intensity). Conversely, predictors that were not part of the CCA model (e.g. duration of complaints, somatisation, external locus of control and age) were included in the MI-5 model for persistent shoulder disability. Tables 4, 5, 6 and 7 show the results of assessing model stability by the bootstrap model selection procedure. For both outcome measures, the CCA and MI-5 models were not identified as the most frequently occurring combination of predictors (Tables 4, 5, 6). Only the MI-5 method was identical to its bootstrap-enhanced version (Table 7). Model selection frequencies for the most frequently selected models were uniformly low (ranging from 3.6% to 24.0%).
This indicates a large variability in model composition within the bootstrap replicate data sets. When fewer potential predictors were retained after the first step of the bootstrap model selection procedure, this variability seemed to decrease and the model selection frequency increased. Table 8 presents the performance of the models derived with the four methods for both outcome measures. The slopes of the calibration plots ranged from 0.973 to 1.077, which indicates good calibration. Explained variance ranged from 8.8% to 12.0% for disability and from 13.5% to 18.8% for pain. The apparent c-indices varied between 0.645 and 0.667 for disability and between 0.684 and 0.717 for pain intensity. CCA models were more optimistic than the other models.

Discussion
Prognostic research aims at identifying the ingredients that contribute to a correct prognosis for a specific subgroup of patients. However, finding a stable set of predictors that can consistently be used in a broad patient population proves to be difficult. Several methodological issues (missing data and model stability) that are not accounted for by standard statistical methodology are expected to complicate this matter. We showed that accounting for missing data by MI and providing information on model stability by bootstrapping are instructive methods when deriving a prognostic model.
In standard statistical methodology, the use of a backward or forward selection strategy has been criticized. It may result in overoptimistically estimated regression coefficients, omission of important predictors and random selection of less important predictors. Derived models may therefore be unstable. Research has focused on how to derive stable models. One frequently used method is the bootstrapping approach suggested by Austin and Tu [35]. It considers the strength of evidence that identified variables are truly important predictors in re-sampled data. Although this approach is often claimed to reduce model instability [8,10,14,35,36], separating strong from weak predictors was shown to perform comparably to automated backward elimination in identifying the true regression model [37]. Furthermore, this approach is of limited use when there is a large number of potential prognostic factors. For these situations a modified bootstrapping procedure was suggested [11]. Our study showed that the application of this two-step bootstrap model selection procedure provides valuable information on model stability.
As frequently described, model size and model composition are also affected by missing data, especially in standard statistical methodology, where subjects with missing values on any of the recorded variables are omitted from analysis. When missing data do not depend on observed or unobserved measurements (Missing Completely At Random, MCAR), this leads to loss of costly gathered information and decreased statistical power; when the MCAR assumption does not hold, it can also alter associations between predictors and therefore change model composition [12,13,38-41]. In this context our study findings formed no exception. Model composition varied depending on whether cases with missing data were omitted from the analyses (CCA) or whether the missing values were estimated using MI. Since the missing values appeared to be related to other observed information, the MCAR condition did not hold and the CCA was expected to be biased. Most of the missing data occurred in the outcome measures because participants did not consent to follow-up. As subjects lost to follow-up more often had an acute onset (47% versus 37%) and were younger (mean difference of 7 years), and the variable age is included in the MI model for persistent shoulder disability, it is plausible to assume that these missing values are Missing At Random (MAR). For that reason, accounting for missing data by MI using five imputed data sets was, in our multivariate data setting, the optimal choice to reduce the uncertainty in model derivation caused by missing values. The use of even more data sets in the imputation routine is possible (up to 20), but five has been shown to be a sufficient number to obtain stable results [30]. Yet the addition of a bootstrap model selection procedure showed that the MI-5 model might still be unstable. A possible source of this instability might be the suboptimal variable selection procedure applied in the MI-5 procedure.
However, how to optimally perform variable selection in multiple imputed data is still a subject of discussion [42]. As illustrated by our study, the bootstrap model selection procedure may provide valuable additional information on model stability when deriving a prognostic model in multiple imputed data.
To study the effects of accounting for missing data and incorporating model stability, we used a large clinical data set in which we empirically evaluated different methods of deriving a prognostic model. In doing so, we illustrated the uncertainties researchers commonly face when knowledge of the true predictors of outcome is lacking. Furthermore, the practical utility of the additional information provided by the bootstrap model selection procedure in prognostic modeling was demonstrated. However, results need to be interpreted with caution, as our approach does not allow us to identify a superior methodology. Although performance parameters for each derived model are presented, these played no role in deciding on the superiority of a certain method; they only show that the performance of all derived models was comparable to that of existing clinical prediction rules for shoulder pain [6,15]. To decide on the superiority of a certain method, a simulation study in which true predictors and noise variables are assigned would be needed. Such data are not provided by this study.

Conclusions
Our study showed that in this particular data set of shoulder pain patients, model composition varied as a result of how missing data were handled. Furthermore, the bootstrap model selection routine provided additional information on model stability.