Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study
- Andrea Marshall^{1, 2},
- Douglas G Altman^{1},
- Patrick Royston^{3} and
- Roger L Holder^{4}
https://doi.org/10.1186/1471-2288-10-7
© Marshall et al; licensee BioMed Central Ltd. 2010
Received: 19 August 2009
Accepted: 19 January 2010
Published: 19 January 2010
Abstract
Background
There is no consensus on the most appropriate approach to handle missing covariate data within prognostic modelling studies. Therefore a simulation study was performed to assess the effects of different missing data techniques on the performance of a prognostic model.
Methods
Datasets were generated to resemble the skewed distributions seen in a motivating breast cancer example. Multivariate missing data were imposed on four covariates using four different mechanisms: missing completely at random (MCAR), missing at random (MAR), missing not at random (MNAR) and a combination of all three mechanisms. Five amounts of incomplete cases from 5% to 75% were considered. Complete case analysis (CC), single imputation (SI) and five multiple imputation (MI) techniques available within the R statistical software were investigated: a) data augmentation (DA) approach assuming a multivariate normal distribution, b) DA assuming a general location model, c) regression switching imputation, d) regression switching with predictive mean matching (MICE-PMM) and e) flexible additive imputation models. A Cox proportional hazards model was fitted and appropriate estimates for the regression coefficients and model performance measures were obtained.
Results
Performing a CC analysis produced unbiased regression estimates, but inflated standard errors, which affected the significance of the covariates in the model with 25% or more missingness. Using SI underestimated the variability, resulting in poor coverage even with 10% missingness. Of the MI approaches, applying MICE-PMM produced, in general, the least biased estimates and better coverage for the incomplete covariates and better model performance for all mechanisms. However, this MI approach still produced biased regression coefficient estimates for the incomplete skewed continuous covariates when 50% or more cases had missing data imposed with a MCAR, MAR or combined mechanism. When the missingness depended on the incomplete covariates, i.e. MNAR, estimates were biased with more than 10% incomplete cases for all MI approaches.
Conclusion
The results from this simulation study suggest that performing MICE-PMM may be the preferred MI approach provided that less than 50% of the cases have missing data and the missing data are not MNAR.
Background
The prognostic ability of clinical factors is often assessed using a Cox proportional hazards model [1]. However, missing covariate data occur frequently and complicate the analysis [1]. A review of published prognostic studies [1] found that on average 13% of cases had incomplete data (range 0 - 60%) in 39 studies where this information could be obtained. In addition, 27% of values, on average, were missing within a single variable (range 0 - 72%) in 55 studies [1]. Simply using the cases with complete covariate data, i.e. performing a complete case (CC) analysis, loses information and hence efficiency, and may lead to biased regression coefficients if the missingness is related to the outcome [2, 3]. Sophisticated likelihood based techniques can explicitly handle missing covariate data in analyses of survival (time to event) data (e.g. [4–6]). However, these generally require problem-specific programs to be written and hence may not be readily available.
Imputing the missing data poses a suitable alternative that uses all the data and can be performed using easily accessible methods. Multiple imputation (MI), where each missing value is replaced with a set of m (>1) independent values [7] to give m separate complete datasets, incorporates uncertainty of the missing data that cannot be achieved with single imputation (m = 1). The m completed datasets are analysed individually using standard statistical methods and the results combined into one summary estimate using simple rules devised by Rubin [7]. The parameter estimates of interest are averaged and a variance estimate is obtained that incorporates both the within and between imputation variability. There are many different techniques for performing MI, but most approaches assume the missing data to be at least missing at random (MAR), where the probability of missingness is only associated with the observed and not the unobserved data [8]. MI approaches are generally based on an imputation model from which plausible values for the missing data are obtained. The imputation model should contain all variables to be subsequently analysed, which for prognostic modelling studies includes the outcome and all potential covariates, but also any variables that help to explain the missing data [9]. The more compatible the imputation and analysis models are, the more successful the MI approach will be [10]. However, the use of MI in the published medical literature remains limited [11].
Simulation studies provide a framework to compare the performance of different approaches for handling missing data with a variety of missing data mechanisms, as the true value is known. Several simulation studies have investigated the effects of missing data using different MI approaches, but these have primarily imposed missingness only on the outcome variable (e.g. [11, 12]). These studies demonstrated that model based imputation approaches for an incomplete outcome variable were better than ad hoc imputation procedures and were fairly robust to some model departures [13]. Furthermore, when a fully parametric imputation model correctly fitted the data, it performed better than alternative techniques such as predictive mean matching [14]. Conversely, fully parametric methods performed worse when the imputation model did not fit [15]. Few simulation studies have considered missing covariate data (e.g. [3]), especially situations where missingness was imposed on more than one covariate (e.g. [11, 16]). Only a limited number of these studies included survival as the outcome and these have only considered a CC analysis [3] or maximum likelihood based approaches (e.g. [17, 18]) and not MI techniques. There remains a lack of evidence about the effects of missing covariate data and its handling on the performance of the survival models and no consensus on the most appropriate MI techniques to use with a survival outcome.
In addition, no definitive guidelines appear to exist on the allowable proportion of missing data to validly apply MI techniques [19]. With a single incomplete covariate or outcome, Harrell [20] suggested using imputation rather than a CC analysis with 5% missingness, although Barzi and Woodward [21] suggested that a CC analysis may still be suitable with up to 10% missingness. For MAR data, MI performed well up to 25% missingness, and adequately with 50% missingness [22]. However, with more than 60% missingness, the extreme levels of uncertainty about the imputed values resulted in high standard deviations and convergence problems of the imputation procedure with MI [21]. With missing not at random (MNAR) data, where the probability of missingness is associated with the unobserved values [8], variance estimates were affected when more than 5% of the data were missing [22]. All of these findings relate to an incomplete outcome or a single covariate and not to the situation with multiple incomplete covariates, where the missingness could relate to the level of an individual covariate or to the proportion of cases that have incomplete data for at least one covariate.
This paper reports the results of an extensive simulation study that aimed to assess the effects of applying different standard approaches to handle missing data in more than one covariate when fitting a Cox proportional hazards model to the full set of covariates. This simulation study investigated how the performance of the model was affected by varying amounts of missingness and different missing data mechanisms. We aimed to determine the maximum allowable proportion of missingness to validly apply these missing data techniques.
Methods
Details of the simulation procedures used within this simulation study are provided below. All simulations were performed using the freely available R statistical software [23], thus allowing all researchers access to any suitable methods identified.
Generating the datasets
Table 1. Data structure for the breast cancer dataset and associated means and standard deviations (SDs) after suitable transformation
Covariate | Variable Type | Groupings/Measurement | Label | X | Mean (SD) |
---|---|---|---|---|---|
Age | Continuous | Years | Age | X_{1} | 53.05 (10.12) |
Lymph nodes | Continuous | Number of | LN | X_{2} | 1.16 (0.94) |
Progesterone receptor | Continuous | fmol | PGR | X_{3} | 3.35 (1.93) |
Oestrogen receptor | Continuous | fmol | ER | X_{4} | 3.35 (1.84) |
Hormonal treatment | Binary | 1 = Yes, 0 = No | TRT | X_{5} | 0.36 (0.48) |
Menopausal status | Binary | 0 = Pre, 1 = Post | MENO | X_{6} | 0.58 (0.49) |
Tumour group | Binary | 0 = Grade I, 1 = Grade II/III | TG | X_{7} | 0.88 (0.32) |
Tumour size | Continuous variable categorised | 1 = ≤20 mm, 2 = 21-30 mm, 3 = >30 mm | TS | X_{8} | 3.27 (0.46) |
For simplicity, the covariate data were generated using an underlying multivariate normal distribution [25] with the covariate means and covariance matrix obtained from the German breast cancer study data after suitable transformations (Table 1, Figure 1b). A log transformation was used for the continuous covariates X_{2} and X_{8} and a log (X + 1) transformation used for X_{3} and X_{4} to avoid taking logs of zero. The generated covariate data were back transformed onto their original scales, e.g. using exponential transformations, prior to any analyses being performed. A cut-point of 0.5 was used to obtain the three binary covariates (X_{5}, X_{6}, X_{7}) and the same cut-points as in the original data were used for the categorical covariate (X_{8}). Two dummy variables for X_{8} were created to indicate values of 21-30 mm or not and >30 mm or not respectively. Continuous covariates were truncated using their upper observed limits to produce realistic values and reasonable estimates for the mean and standard deviations that were not too dissimilar from the original dataset.
All continuous covariates were assumed to have a linear effect on the log relative hazard. An uncensored survival time was generated for each case assuming an exponential distribution with a hazard rate of 0.00027, which approximated the hazard seen in the breast cancer dataset, and their associated linear predictor [26]. A censored time was also generated for each case using an exponential distribution with a hazard rate of 0.0002 to give approximately 35% censored observations. A smaller censoring rate than that seen in the breast cancer dataset ensured there were a sufficient number of events to fit a prognostic model for all levels of missingness. The required survival time was then defined for each case as the minimum of the uncensored and censored survival times and the event status determined accordingly.
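The survival time generation described above can be illustrated with a minimal Python sketch. The study itself used R; this is not the authors' code. The two hazard rates are taken from the text, while the seed, the function name `simulate_survival` and the normally distributed linear predictors are hypothetical stand-ins for the values that would come from the simulated covariates and the true regression coefficients.

```python
import math
import random

BASELINE_HAZARD = 0.00027   # event hazard rate quoted in the text
CENSORING_HAZARD = 0.0002   # censoring hazard rate quoted in the text


def simulate_survival(linear_predictor, rng):
    """Return (observed_time, event_indicator) for one case.

    The uncensored time is exponential with rate
    BASELINE_HAZARD * exp(linear_predictor); the censoring time is
    exponential with rate CENSORING_HAZARD. The observed time is the
    minimum of the two, and event_indicator is 1 if the event came first.
    """
    event_rate = BASELINE_HAZARD * math.exp(linear_predictor)
    event_time = rng.expovariate(event_rate)
    censor_time = rng.expovariate(CENSORING_HAZARD)
    observed_time = min(event_time, censor_time)
    event_indicator = 1 if event_time <= censor_time else 0
    return observed_time, event_indicator


rng = random.Random(2010)  # arbitrary seed, for reproducibility only
# Hypothetical linear predictors (assumption): in the study these are
# computed from each case's covariates and the true coefficients.
cases = [simulate_survival(rng.gauss(0.0, 0.5), rng) for _ in range(1000)]
censored_fraction = 1 - sum(event for _, event in cases) / len(cases)
```

The proportion of censored cases depends on the distribution of the linear predictor, so the value obtained from this sketch need not match the figure reported for the study.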
A sample size of 1000 cases was used for all simulations, which represented the average sample size observed in a literature review of 100 reported prognostic factor studies [1].
Number of simulations
The whole simulation process was repeated 1000 times, which enabled the smallest regression coefficient for X_{4} to be estimated with at least 20% accuracy and all remaining regression coefficients estimated to within 10% accuracy of their true values [25]. The true values were obtained from fitting a Cox proportional hazards model to the motivating breast cancer data. Independent random samples were generated using different starting seeds that were separated by at least the sample size [25].
Imposing missing data mechanisms
Missingness was imposed on four covariates: X_{3}, X_{2}, X_{5} and X_{8}. A case was said to be incomplete if it had at least one missing covariate, and each case could have up to four covariates missing. Five overall rates of missingness of 5, 10, 25, 50 and 75% per case were considered to explore the effects with small, medium and large amounts of missingness. A moderately independent simulation strategy [25] was adopted, utilising the same set of 1000 datasets each time but with different values randomly deleted through using different starting seeds. This approach strengthens the comparison between different methods as it eliminates any sampling variability leaving all methods striving for exactly the same results, whilst allowing variability to exist between simulated datasets and amounts of missingness.
Data for four incomplete covariates (performance status, albumin, grade and residual disease) from an ovarian cancer study [27] provided empirical evidence of realistic patterns and frequencies of missing data and associations between the missingness of each covariate. The amounts of missingness imposed on the four covariates X_{3}, X_{2}, X_{5} and X_{8} were approximately 70%, 55%, 20% and 10%, respectively, of the overall number of cases with any missing data. Dependencies between the missingness indicators for the incomplete covariates were generated such that 35% of incomplete cases were missing both X_{2} and X_{3}, 10% were missing X_{5} and X_{2}, and 5% were missing X_{8} and X_{3}.
Table 2. Specification of the missing data mechanisms to be imposed
Mechanism | X_{3} (PGR) | X_{2} (LN) | X_{5} (TRT) | X_{8} (TS) |
---|---|---|---|---|
MCAR | β_{0} | β_{0} + ln(OR)M_{X3} | β_{0} + ln(OR)M_{X2} | β_{0} + ln(OR)M_{X3} |
MAR | ln(0.8)X_{4} | ln(3)X_{1} | ln(0.7)ln(t) | ln(7)X_{7} |
MNAR | ln(1.3)X_{3} | ln(0.6) X_{2} | ln(8)X_{5} | ln(0.9)X_{8} |
COMBINED | ln(0.7)ln(t) + ln(0.3)X_{5} | ln(3)X_{1} | ln(0.9)X_{8} |
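The entries in the table above are terms in a logistic model for the probability that a value is missing. As a rough illustration, the following Python sketch imposes a MAR mechanism on one covariate using the MAR term listed for X_{3}, ln(0.8)X_{4}. Several parts are assumptions that are not stated in the source: the intercept is calibrated by bisection to reach a target proportion, the 25% target applies to this single covariate, and the covariate values are independent normal draws rather than the study's correlated data. All function names are hypothetical.

```python
import math
import random


def expit(z):
    """Numerically stable inverse logit."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def calibrate_intercept(driver, log_or, target, lo=-30.0, hi=30.0):
    """Bisection search for the intercept beta0 that makes the mean
    missingness probability over all cases equal to `target`."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        mean_p = sum(expit(mid + log_or * d) for d in driver) / len(driver)
        if mean_p < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def impose_mar(incomplete, driver, odds_ratio, target, rng):
    """Set values of `incomplete` to None with probability
    expit(beta0 + ln(odds_ratio) * driver), so that missingness depends
    only on the fully observed `driver` covariate (MAR)."""
    log_or = math.log(odds_ratio)
    beta0 = calibrate_intercept(driver, log_or, target)
    result = []
    for value, d in zip(incomplete, driver):
        p_missing = expit(beta0 + log_or * d)
        result.append(None if rng.random() < p_missing else value)
    return result


rng = random.Random(7)  # arbitrary seed
# Hypothetical independent draws on the transformed scale (assumption).
x4 = [rng.gauss(3.35, 1.84) for _ in range(1000)]
x3 = [rng.gauss(3.35, 1.93) for _ in range(1000)]
x3_mar = impose_mar(x3, x4, odds_ratio=0.8, target=0.25, rng=rng)
missing_fraction = sum(v is None for v in x3_mar) / len(x3_mar)
```

An MNAR version of the same sketch would pass the incomplete covariate itself as the `driver` argument, so that the probability of missingness depends on the values being deleted.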
Table 3. Odds ratios (OR) to be specified in the missing data mechanisms given in Table 2
Mechanism for | OR for | 5% missingness | 10% missingness | 25% missingness | 50% missingness | 75% missingness |
---|---|---|---|---|---|---|
X_{2} | M_{X3} | 101.12 | 45.5 | 15.68 | 5.50 | 2.17 |
X_{5} | M_{X2} | 42.04 | 20.78 | 7.41 | 3.00 | 1.51 |
X_{8} | M_{X3} | 45.14 | 14.23 | 5.44 | 1.92 | 0.92 |
Analysis and outcomes of interest
A Cox proportional hazards model including all eight covariates was fitted to each dataset. A linear relationship was assumed for all continuous covariates as used in the data generation process.
The outcomes of interest were the regression coefficients, associated standard errors (SE) and the significance of the covariates in the regression model. The average regression coefficient estimates over all simulations were assessed using the bias from the true value [12], the percentage bias and the coverage [28]. The effect of the missingness on the overall model performance was assessed using the likelihood ratio chi-square test [20], the model's predictive ability using Nagelkerke's R^{2} statistic [20], the prognostic separation D statistic [29] and the 2-year predicted survival probability.
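As an illustration of how the bias, percentage bias and coverage can be summarised over simulations, here is a minimal Python sketch. It assumes Wald-type 95% confidence intervals based on the normal quantile 1.96 and takes percentage bias as bias relative to the true value. The function name and the toy inputs are hypothetical and are not results from the study.

```python
import statistics


def summarise_performance(estimates, std_errors, true_value, z=1.96):
    """Summarise one regression coefficient over repeated simulations.

    estimates  : coefficient estimate from each simulated dataset
    std_errors : corresponding standard errors
    true_value : value the estimates are compared against
    """
    mean_estimate = statistics.fmean(estimates)
    bias = mean_estimate - true_value
    # Count the simulations whose Wald interval contains the true value.
    covered = sum(
        1
        for b, se in zip(estimates, std_errors)
        if b - z * se <= true_value <= b + z * se
    )
    return {
        "bias": bias,
        "percentage_bias": 100.0 * bias / true_value,
        "coverage": 100.0 * covered / len(estimates),
        "empirical_se": statistics.stdev(estimates),
    }


# Toy inputs (hypothetical), only to show the call.
summary = summarise_performance(
    estimates=[0.9, 1.0, 1.1, 1.2],
    std_errors=[0.1, 0.1, 0.1, 0.1],
    true_value=1.0,
)
```

In this toy example three of the four intervals contain the true value, so the coverage is 75%, and the mean estimate of 1.05 corresponds to a percentage bias of 5%.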
The bias introduced from maximising the partial likelihood estimator and not the full estimator when fitting a Cox regression model [30] in addition to any bias due to the data generation process impedes the detection of any additional bias incurred due to the missing data and its handling. Hence the average regression coefficient estimates and associated empirical SE (i.e. the standard deviation of the estimates across simulations) from performing a large simulation study with no missingness involving 20,000 replications formed the true values against which the missing data simulations were compared.
Missing data methods
Table 4. Summary of the missing data methods investigated
Method Label | Method Description | Library used within R statistical software | Number of iterations |
---|---|---|---|
CC | Complete case analysis: Analyses only cases with complete data for all covariates | - | |
SI | Single imputation performed using PMM | 'pmm' function in 'mice' | 20 |
MI-NORM | Multiple imputation (MI) using data augmentation approach [31] with a multivariate normal assumption for all variables | 'norm' [41] | 100 |
MI-MIX | MI using data augmentation approach using a general location model | 'mix' [42] | 100 |
MI-MIX-no truncating | MI using data augmentation approach using a general location model, but imputed values are not truncated to within plausible range | 'mix' [42] | 100 |
MI-MICE | MI using regression switching imputation [9]. Linear models are used for continuous covariates and logistic models for binary covariates and for the dummy variables representing categorical covariates | 'mice' [43] | 20 |
MI-MICE-PMM | MI using MICE with PMM | 'pmm' function in 'mice' [43] | 20 |
MI-MICE-PMM-no transformation | MI using MICE with PMM without transforming the incomplete covariates | 'pmm' function in 'mice' [43] | 20 |
MI-Aregimpute | MI using flexible additive imputation models [20] with PMM | 'aregImpute' function in 'Hmisc' [44] | 1 |
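Several of the methods in the table above rely on predictive mean matching (PMM). The idea can be sketched in Python as follows. This is a deliberately simplified version and not the algorithm implemented in the 'mice' or 'Hmisc' libraries: it uses a single predictor, ordinary least squares, and a fixed pool of nearest donors, and it omits the random draw of the regression parameters that proper imputation would normally include. All names are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pmm_impute(y, x, rng, donors=5):
    """Fill in None entries of y by predictive mean matching on x.

    For each missing case, the observed cases whose predicted values are
    closest to the missing case's predicted value form the donor pool, and
    the observed value of one randomly chosen donor is used as the imputation.
    """
    observed = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    intercept, slope = fit_line([o[0] for o in observed],
                                [o[1] for o in observed])
    observed_pred = [(intercept + slope * xi, yi) for xi, yi in observed]
    completed = []
    for xi, yi in zip(x, y):
        if yi is not None:
            completed.append(yi)
            continue
        target = intercept + slope * xi
        pool = sorted(observed_pred, key=lambda p: abs(p[0] - target))[:donors]
        completed.append(rng.choice(pool)[1])
    return completed


rng = random.Random(1)  # arbitrary seed
x = [float(i) for i in range(20)]
y = [2.0 * xi + 1.0 for xi in x]
y_missing = list(y)
y_missing[3] = None
y_missing[15] = None
completed = pmm_impute(y_missing, x, rng, donors=3)
```

Because every imputed value is copied from an observed case, imputations always lie within the observed range, which is also why donor pools can become thin in small samples or for rare categories.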
For all imputation approaches, the imputation model included all eight covariates in addition to the survival time and event status, indicating whether a case had the event or was censored at the time of analysis [9]. A logarithmic transformation was used for survival time and the incomplete continuous covariates to make the assumption of normality more applicable [9]. All imputed values were rounded to plausible values, where necessary. Twenty imputations were performed for each MI approach to provide a relative efficiency of at least 96% [7] compared to having an infinite number of imputations for the five amounts of missingness to be imposed from 5% to 75%.
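The choice of twenty imputations can be checked against Rubin's relative efficiency formula, RE = 1 / (1 + γ/m), where m is the number of imputations and γ is the fraction of missing information. The short Python sketch below evaluates it. Taking γ as 0.75 is an assumption made here to mirror the largest proportion of incomplete cases; the fraction of missing information and the proportion of incomplete cases are not the same quantity.

```python
def relative_efficiency(gamma, m):
    """Relative efficiency of m imputations compared with an infinite
    number, for a fraction of missing information gamma."""
    return 1.0 / (1.0 + gamma / m)


# gamma = 0.75 is an assumed worst case, not a value from the source.
efficiency_20 = relative_efficiency(gamma=0.75, m=20)
efficiency_5 = relative_efficiency(gamma=0.75, m=5)
```

Under this assumption twenty imputations give a relative efficiency of about 96.4%, whereas five imputations would give about 87%.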
Combining estimates of the outcomes of interest after MI and over all simulations
Estimates of the outcomes of interest after MI were combined following proposed guidelines [32]. Rubin's Rules [7] were used to combine each of the regression coefficient estimates, the prognostic separation D statistic and the predicted survival estimates after a complementary log-log transformation. An overall MI p-value from the Wald test for assessing the significance of each covariate in the regression model was also determined using Rubin's Rules [7]. An overall significance estimate for the likelihood ratio statistic was obtained using the method for combining χ^{2} statistics [33]. The median and inter-quartile ranges of the m Nagelkerke's R^{2} statistics [32] were calculated for each of the simulated datasets. Any deficiencies in the model performance measures and approaches for combining these estimates after MI should be similar across missing data methods and therefore still allow a valid and worthwhile comparison. After performing all 1000 simulations, the outcomes of interest were summarised, in general using the average value over all simulations or using the median value, where appropriate.
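Rubin's Rules for a single scalar estimate, such as one regression coefficient, can be sketched in Python as follows. The pooled estimate is the mean of the m estimates, and the total variance is the within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name and the toy numbers are hypothetical, and the degrees of freedom needed for the Wald test are omitted to keep the sketch short.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Combine m completed-data estimates and their variances
    (squared standard errors) using Rubin's Rules."""
    m = len(estimates)
    pooled = statistics.fmean(estimates)       # average of the m estimates
    within = statistics.fmean(variances)       # within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance (m - 1 denominator)
    total = within + (1.0 + 1.0 / m) * between
    return {
        "estimate": pooled,
        "within": within,
        "between": between,
        "total_variance": total,
        "se": math.sqrt(total),
    }


# Toy inputs (hypothetical): three imputations of one coefficient.
pooled = pool_rubin(estimates=[1.0, 1.2, 1.4], variances=[0.04, 0.04, 0.04])
```

For quantities that are not approximately normal, such as a predicted survival probability, the same function would be applied on a transformed scale (the complementary log-log scale in this study) before back-transforming.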
Results
The results from performing MI using MICE, NORM and MIX were indistinguishable for all mechanisms and therefore only the results using MICE are presented. Firstly, the results from imposing a multivariate MAR mechanism are reported for all missing data methods.
Results from imposing a multivariate MAR mechanism
Regression coefficient estimates from the Cox proportional hazards model
For the highly skewed continuous covariates X_{2}, X_{3} and X_{4}, the least biased regression coefficient estimates were produced when MI was performed using MICE-PMM without transformations. In contrast, more bias was seen for the regression estimates for X_{1}, X_{5}, and X_{6} using this approach. When the imputed values were not truncated to within a plausible range (MIX-no truncating), all regression coefficient estimates tended to be slightly more extreme than with all other MI approaches.
SE of the regression coefficient estimates
Coverage of the true value
The coverage using the different MI approaches remained around the nominal 95% level irrespective of the amount of missingness for all covariates except the highly skewed covariates of X_{2}, X_{3} and X_{4} (Figure 4). The coverage for X_{2} and X_{4} fell below 90% with 75% missingness for all MI approaches, except using MICE-PMM without transformations for X_{2}, which still had coverage of 93% with 75% missingness. The coverage for X_{3}, the covariate with a highly skewed distribution and the most missingness imposed, fell below 90% with 50% missingness using MICE-PMM without transformations and the 'aregImpute' function, but fell below 90% with only 25% missingness for all other MI approaches.
Significance of covariates in the prognostic model
With MI and SI, none of the covariates changed their significance in the model at the 5% level (Figure 5). However, the binary covariate X_{5} and the dummy variable for X_{8} representing group 3 (>30 mm) became borderline significant with increasing amounts of missingness.
Model performance measures
Results from imposing other missingness mechanisms
No apparent differences from the above results for a multivariate MAR mechanism were seen in the results after imposing a multivariate MCAR or combined missing data mechanism. The similarity of results for the multivariate MAR mechanism and the combined mechanism may have occurred because the MNAR mechanism was imposed on the covariate with the smallest amount of missingness and hence this mechanism had the least effect on the overall results.
Discussion
Using a real dataset to provide a suitable structure for simulating the datasets, as in this study, simplifies the data generation procedures, avoids arbitrary choices and can aid the generalisability of the results. The simulated data were not an exact replica of the original, but provided sufficiently strong similarities to the original data to not warrant using more complicated semi-continuous distributions for PGR level (X_{3}), and ER level (X_{4}). Approximating the patterns of missingness seen in an incomplete dataset provided a realistic framework for simulating the missing data. The levels of missingness imposed reflected those seen in a review of prognostic modelling studies [1]. The effects of using MI when fitting prognostic models were unknown. Therefore this simulation study restricted the modelling process to including all covariates in the prognostic model and assuming linear relationships for all continuous covariates, both in the generation of the survival times and in the fitting of the prognostic model. Extensions of this research could include investigating the possible approaches for performing variable selection and fitting fractional polynomials after MI (e.g. [34]).
In this simulation study, performing a CC analysis with any multivariate missingness produced less biased regression coefficient estimates with better coverage rates than using SI or MI. However, this benefit was at the expense of larger standard errors and hence loss of efficiency due to the reduction in the sample being analysed [8]. This loss in efficiency affected the significance of the marginal prognostic covariates in the model, making them appear non-significant with 25% or more missingness when in fact they were prognostically important. It is advisable to use a CC analysis only when fitting a Cox proportional hazards model with a reasonably small amount of missing data. Previous evidence [2, 3] suggested that imposing univariate MAR missingness associated with outcome would result in biased regression coefficient estimates when using CC analysis. Demissie et al [3] found large biases when the MAR mechanisms were associated with longer survival times or event status and the covariates had large hazard ratios for survival, but not with a hazard ratio of 1 or when the missingness was associated with shorter survival times. Relatively unbiased results were found in our simulations with multivariate MAR missingness where the missingness of X_{5}, the binary covariate with 20% of the total missingness, was associated with shorter survival times and the hazard ratio for X_{5} in the prognostic model was 0.7. Biased regression estimates may be more evident with more missing covariate data that is highly dependent on the outcome, especially longer survival times and event status, and with large hazard ratios.
With multivariate missing covariate data, using SI is not recommended with more than 10% missingness, due to its underestimation of the variability and corresponding detrimental effect on the coverage rates.
The results from performing simulations based on the German breast cancer dataset showed some bias, especially with over 25% missingness, for all mechanisms and MI approaches. The average SEs for all MI approaches and mechanisms were similar. They fell between the inflated estimate from the CC analysis and the underestimated SE from SI, as expected from previous research [8, 35]. The coverage was unaffected and remained around the nominal 95% level for all mechanisms and covariates, except for the skewed covariates. Tang et al [36] also found that the coverage may be poorer after MI for highly skewed data. Better coverage rates were seen using MICE-PMM without transformations or the 'aregImpute' function than with the other MI approaches and also when a MAR mechanism was imposed rather than a MNAR mechanism.
Researchers have suggested that MI approaches are fairly robust to departures from normality due to the separation of the imputation and analysis phases [12, 13, 37]. Any deficiencies in the assumptions and implementation of the imputation model will only affect the incomplete component of the dataset and not the whole sample [38]. Having skewed continuous data and an outcome of survival time as in this simulation study may have affected the performance of the methods under investigation, especially those which assumed an underlying normal distribution for the continuous covariates, e.g. NORM, MIX and MICE. This study highlighted the problems that can exist when the imputation and analysis models differ and the model assumptions may not be fully satisfied. The bias seen in this simulation study even when the MAR mechanism assumption was correct may be an artefact of the transformations used in the imputation process [39]. Not only are the incomplete covariates transformed for imputation and then back-transformed prior to analysis, but the survival times are also transformed in the imputation model and then fitted using an alternative model. Imputing without transformation can reduce the bias in the mean estimate but distort other aspects of the distributional shape [39]. Log transformations were used for the continuous covariates in the data generation process. However, as the simulated data were then truncated to resemble the real data, applying the same transformations in the imputation process failed to satisfy normality. No other simple power transformations sufficiently improved normality or provided more plausible imputations.
The inclusion in the imputation model of survival time after log transformation and event status may not be the optimal choice to account for the censoring process and thus may have also introduced bias into the results. Using the Nelson-Aalen estimate of the cumulative hazard of the survival time may be more appropriate in future [40]. If the hazard rates for the survival and censoring times differ then it may be sensible to consider these times separately in the imputation model.
The MI procedures using MICE-PMM or the 'aregImpute' function, which rely on the distributional assumptions only to match complete and incomplete responders, performed better for all missingness mechanisms than the other MI approaches. This confirmed the results from Faris et al [15] that with incomplete skewed data, MICE-PMM would be preferred to other MI Markov Chain Monte Carlo type approaches. However, caution is needed when using the 'aregImpute' function, especially when the missingness is highly related to survival, as although the estimates for the incomplete covariates may exhibit little bias, the estimates for other prognostically important covariates may display more bias than seen with other MI approaches. Both MICE-PMM and the 'aregImpute' function identify suitable matches from the observed data and therefore additional caution is required with small samples and with covariates with rare events as there may be a limited number of available cases to be selected as imputed values. With skewed data, values of a few cases have a lot of leverage that may distort the imputations and influence the results. Therefore it is essential to examine the distributions of the covariates requiring imputation to determine whether transformations are likely to provide reasonable estimates for the data to be analysed. With MICE-PMM, transforming the continuous covariates produced worse estimates than simply using the covariate values on their original skewed scale. Our findings suggest that if suitable transformations do not improve normality it is better to use MICE-PMM without transformations. With a fully observed normally distributed outcome and more normally distributed incomplete covariates and hence compatible imputation and analysis models, MICE-PMM may not remain the best MI approach.
In this simulation study, truncating imputed values for the continuous covariates to within the plausible range produced less bias than allowing implausible values. Schafer [31] suggested rounding values for the incomplete binary covariate to the observed values. In these simulations, where only 20% of the total missingness was imposed on the binary covariate, that approach did not produce any more bias than using the correct distributional assumption, e.g. fitting logistic regression models. Biases may be more apparent when the binary covariate has an uneven split or greater missingness [39].
From this simulation study, with incomplete skewed data, MI using MICE-PMM without transformations produced precise unproblematic estimates [12] within the allowable 10% accuracy with up to 25% missingness, but would not be recommended with 50% or more missingness for any missing data mechanism. Furthermore, with a MNAR mechanism, MI performed poorly with 25% or more overall missingness. Including variables in the imputation model that help to explain the missingness, or that are highly associated with the incomplete covariates themselves, can reduce the effect of a MNAR missing data mechanism [8]. With less enriched imputation models, and datasets where there is little correlation between variables, the results from the MNAR mechanism may be even more extreme than seen here. Further research is required to assess whether alternative MI procedures or fully Bayesian approaches that can model the skewness of the covariate distribution and the missing data mechanism may be more appropriate when there is more than 25% missingness.
The true performance of the various missing data methods is likely to vary in relation to the underlying distribution of the covariates, the correlations between these variables as well as with different missing data mechanisms and associations between the outcome and the covariates with missing data. Therefore, the generalisability of the results from this simulation study, however rigorous, is limited due to reflecting the data from a single real prognostic study and imposing a restricted number of missing data mechanisms. Confirmatory investigations are required to examine the extent to which these findings are consistent across alternative populations, distributions and clinical contexts.
Conclusions
For approximately 10% or less missingness, it remains unclear whether the benefits of MI, including efficiency and utilising all data, outweigh the simplicity of a CC analysis. With increasing amounts of missingness, the benefits of MI over a CC analysis become clearer. When some data are skewed, as in this simulation study, MICE-PMM may be the preferred MI approach provided that less than 50% of the cases have missing data and the missing data are not MNAR.
Declarations
Acknowledgements
Andrea Marshall (née Burton) was supported by a Cancer Research UK project grant. Douglas G Altman is supported by Cancer Research UK. Patrick Royston is supported by the UK Medical Research Council.
References
1. Burton A, Altman DG: Missing covariate data within cancer prognostic studies: a review of current reporting and proposed guidelines. British Journal of Cancer. 2004, 91 (1): 4-8. 10.1038/sj.bjc.6601907.
2. Vach W, Blettner M, Armitage P, Colton T: Missing data in epidemiologic studies. Encyclopedia of Biostatistics. 1998, New York: John Wiley & Sons, 2641-2654.
3. Demissie S, LaValley MP, Horton NJ, Glynn RJ, Cupples LA: Bias due to missing exposure data using complete-case analysis in the proportional hazards regression model. Statistics in Medicine. 2003, 22 (4): 545-557. 10.1002/sim.1340.
4. Lipsitz SR, Ibrahim JG: Using the EM-algorithm for survival data with incomplete categorical covariates. Lifetime Data Analysis. 1996, 2 (1): 5-14. 10.1007/BF00128467.
5. Lipsitz SR, Ibrahim JG: Estimating equations with incomplete categorical covariates in the Cox model. Biometrics. 1998, 54 (3): 1002-1013. 10.2307/2533852.
6. Meng XL, Schenker N: Maximum likelihood estimation for linear regression models with right censored outcomes and missing predictors. Computational Statistics & Data Analysis. 1999, 29 (4): 471-483. 10.1016/S0167-9473(98)00074-7.
7. Rubin DB: Multiple Imputation for Nonresponse in Surveys. 2004, New York: John Wiley and Sons.
8. Little RJA, Rubin DB: Statistical Analysis with Missing Data, Second edition. 2002, New York: John Wiley and Sons.
9. van Buuren S, Boshuizen HC, Knook DL: Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine. 1999, 18 (6): 681-694. 10.1002/(SICI)1097-0258(19990330)18:6<681::AID-SIM71>3.0.CO;2-R.
10. Meng XL: Multiple-imputation inferences with uncongenial sources of input. Statistical Science. 1994, 9 (4): 538-558.
11. Hu M, Salvucci S, Weng S, Cohen MP: Evaluation of Proc Impute and Schafer's imputation software. Proceedings of the Survey Research Methods Section of the American Statistical Association. Chicago, Illinois. 1996, 287-292.
12. Schafer JL, Graham JW: Missing data: our view of the state of the art. Psychological Methods. 2002, 7 (2): 147-177. 10.1037/1082-989X.7.2.147.
13. Schafer J, Ezzati-Rice T, Johnson W, Khare M, Little R, Rubin D: The NHANES III multiple imputation project. Proceedings of the Survey Research Methods Section of the American Statistical Association. Chicago, Illinois. 1996, 28-37.
14. Schenker N, Taylor JMG: Partially parametric techniques for multiple imputation. Computational Statistics & Data Analysis. 1996, 22 (4): 425-446. 10.1016/0167-9473(95)00057-7.
15. Faris PD, Ghali WA, Brant R, Norris CM, Galbraith PD, Knudtson ML: Multiple imputation versus data enhancement for dealing with missing data in observational health care outcome analyses. Journal of Clinical Epidemiology. 2002, 55 (2): 184-191. 10.1016/S0895-4356(01)00433-4.
16. Greenland S, Finkle WD: A critical look at methods for handling missing covariates in epidemiologic regression analyses. American Journal of Epidemiology. 1995, 142 (12): 1255-1264.
17. Chen HY: Double-semiparametric method for missing covariates in Cox regression models. Journal of the American Statistical Association. 2002, 97 (458): 565-576. 10.1198/016214502760047096.
18. Herring AH, Ibrahim JG, Lipsitz SR: Non-ignorable missing covariate data in survival analysis: a case-study of an International Breast Cancer Study Group trial. Journal of the Royal Statistical Society Series C-Applied Statistics. 2004, 53 (2): 293-310. 10.1046/j.1467-9876.2003.05168.x.
19. Oostenbrink R, Moons KGM, Bleeker SE, Moll HA, Grobbee DE: Diagnostic research on routine care data prospects and problems. Journal of Clinical Epidemiology. 2003, 56 (6): 501-506. 10.1016/S0895-4356(03)00080-5.
20. Harrell FE: Regression Modeling Strategies with Applications to Linear Models, Logistic Regression, and Survival Analysis. 2001, New York: Springer-Verlag.
21. Barzi F, Woodward M: Imputations of missing values in practice: Results from imputations of serum cholesterol in 28 cohort studies. American Journal of Epidemiology. 2004, 160 (1): 34-45. 10.1093/aje/kwh175.
22. Scheffer J: Dealing with missing data. Research Letters in the Information and Mathematical Sciences. 2002, 3: 153-160.
23. R Development Core Team: R: A language and environment for statistical computing. 2004, Vienna, Austria: R Foundation for Statistical Computing.
24. Sauerbrei W, Royston P, Bojar H, Schmoor C, Schumacher M: Modelling the effects of standard prognostic factors in node-positive breast cancer. German Breast Cancer Study Group (GBSG). British Journal of Cancer. 1999, 79 (11-12): 1752-1760. 10.1038/sj.bjc.6690279.
25. Burton A, Altman DG, Royston P, Holder RL: The design of simulation studies in medical statistics. Statistics in Medicine. 2006, 25 (24): 4279-4292. 10.1002/sim.2673.
26. Bender R, Augustin T, Blettner M: Generating survival times to simulate Cox proportional hazards models. Statistics in Medicine. 2005, 24 (11): 1713-1723. 10.1002/sim.2059.
27. Clark TG, Stewart ME, Altman DG, Gabra H, Smyth JF: A prognostic model for ovarian cancer. British Journal of Cancer. 2001, 85 (7): 944-952. 10.1054/bjoc.2001.2030.
28. Collins LM, Schafer JL, Kam CM: A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods. 2001, 6 (4): 330-351.
29. Royston P, Sauerbrei W: A new measure of prognostic separation in survival data. Statistics in Medicine. 2004, 23 (5): 723-748. 10.1002/sim.1621.
30. Kong FH: Adjusting regression attenuation in the Cox proportional hazards model. Journal of Statistical Planning and Inference. 1999, 79 (1): 31-44. 10.1016/S0378-3758(98)00178-5.
31. Schafer JL: Analysis of Incomplete Multivariate Data. 1997, New York: Chapman and Hall.
32. Marshall A, Altman D, Holder R, Royston P: Combining estimates of interest in prognostic modelling studies after multiple imputation: current practice and guidelines. BMC Medical Research Methodology. 2009, 9 (1): 57. 10.1186/1471-2288-9-57.
33. Li KH, Meng XL, Raghunathan TE, Rubin DB: Significance levels from repeated p-values with multiply-imputed data. Statistica Sinica. 1991, 1 (1): 65-92.
34. Wood AM, White IR, Royston P: How should variable selection be performed with multiply imputed data? Statistics in Medicine. 2008, 27 (17): 3227-3246. 10.1002/sim.3177.
35. Rubin DB, Schenker N: Multiple imputation in health-care databases: an overview and some applications. Statistics in Medicine. 1991, 10 (4): 585-598. 10.1002/sim.4780100410.
36. Tang LQ, Song JW, Belin TR, Unutzer J: A comparison of imputation methods in a longitudinal randomized clinical trial. Statistics in Medicine. 2005, 24 (14): 2111-2128. 10.1002/sim.2099.
37. Rubin DB: Multiple imputation after 18+ years. Journal of the American Statistical Association. 1996, 91 (434): 473-489. 10.2307/2291635.
38. Schafer JL, Olsen MK: Modelling and imputation of semicontinuous survey variables. 2000, The Methodology Center, Penn State University, USA.
39. Horton NJ, Lipsitz SR, Parzen M: A potential for bias when rounding in multiple imputation. American Statistician. 2003, 57 (4): 229-232. 10.1198/0003130032314.
40. White I, Royston P: Imputing missing covariate values for the Cox model. Statistics in Medicine. 2009, 28 (15): 1982-1998. 10.1002/sim.3618.
41. Schafer JL, Novo AA: norm: Analysis of multivariate normal datasets with missing values. 2002, R package version 1.0.9.
42. Schafer JL: mix: Estimation/multiple Imputation for Mixed Categorical and Continuous Data. 2003, R package version 1.0.4.
43. van Buuren S, Oudshoorn CGM: mice: Multivariate Imputation by Chained Equations library. 2005, R package version 1.13.1.
44. Harrell FE: Hmisc: Harrell Miscellaneous library for R statistical software. 2004, R package 2.2-3.
Pre-publication history
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/10/7/prepub
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.