Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study
© Marshall et al. 2010
Received: 19 August 2009
Accepted: 19 January 2010
Published: 19 January 2010
Skip to main content
© Marshall et al. 2010
Received: 19 August 2009
Accepted: 19 January 2010
Published: 19 January 2010
There is no consensus on the most appropriate approach to handle missing covariate data within prognostic modelling studies. Therefore a simulation study was performed to assess the effects of different missing data techniques on the performance of a prognostic model.
Datasets were generated to resemble the skewed distributions seen in a motivating breast cancer example. Multivariate missing data were imposed on four covariates using four different mechanisms; missing completely at random (MCAR), missing at random (MAR), missing not at random (MNAR) and a combination of all three mechanisms. Five amounts of incomplete cases from 5% to 75% were considered. Complete case analysis (CC), single imputation (SI) and five multiple imputation (MI) techniques available within the R statistical software were investigated: a) data augmentation (DA) approach assuming a multivariate normal distribution, b) DA assuming a general location model, c) regression switching imputation, d) regression switching with predictive mean matching (MICE-PMM) and e) flexible additive imputation models. A Cox proportional hazards model was fitted and appropriate estimates for the regression coefficients and model performance measures were obtained.
Performing a CC analysis produced unbiased regression estimates, but inflated standard errors, which affected the significance of the covariates in the model with 25% or more missingness. Using SI, underestimated the variability; resulting in poor coverage even with 10% missingness. Of the MI approaches, applying MICE-PMM produced, in general, the least biased estimates and better coverage for the incomplete covariates and better model performance for all mechanisms. However, this MI approach still produced biased regression coefficient estimates for the incomplete skewed continuous covariates when 50% or more cases had missing data imposed with a MCAR, MAR or combined mechanism. When the missingness depended on the incomplete covariates, i.e. MNAR, estimates were biased with more than 10% incomplete cases for all MI approaches.
The results from this simulation study suggest that performing MICE-PMM may be the preferred MI approach provided that less than 50% of the cases have missing data and the missing data are not MNAR.
Assessing the prognostic ability of clinical factors using a Cox proportional hazards model is often performed . However, missing covariate data complicates the analysis, but often occurs . A review of published prognostic studies  found that on average 13% of cases had incomplete data (range 0 - 60%) in 39 studies where this information could be obtained. In addition, 27% of values, on average, were missing within a single variable (range 0 - 72%) in 55 studies . Simply using the cases with complete covariate data, i.e. performing a complete case (CC) analysis, loses information and hence efficiency, and may lead to biased regression coefficients if the missingness is related to the outcome [2, 3]. Sophisticated likelihood based techniques can explicitly handle missing covariate data in analyses of survival (time to event) data (e.g. [4–6]). However, these generally require problem-specific programs to be written and hence may not be readily available.
Imputing the missing data poses a suitable alternative that uses all the data and can be performed using easily accessible methods. Multiple imputation (MI), where each missing value is replaced with a set of m (>1) independent values  to give m separate complete datasets, incorporates uncertainty of the missing data that cannot be achieved with single imputation (m = 1). The m completed datasets are analysed individually using standard statistical methods and the results combined into one summary estimate using simple rules devised by Rubin . The parameter estimates of interest are averaged and a variance estimate is obtained that incorporates both the within and between imputation variability. There are many different techniques for performing MI, but most approaches assume the missing data to be at least missing at random (MAR), where the probability of missingness is only associated with the observed and not the unobserved data . MI approaches are generally based on an imputation model from which plausible values for the missing data are obtained. The imputation model should contain all variables to be subsequently analysed, which for prognostic modelling studies includes the outcome and all potential covariates, but also any variables that help to explain the missing data . The more compatible the imputation and analysis models are, the more successful the MI approach will be . However, the use of MI in the published medical literature remains limited .
Simulation studies provide a framework to compare the performance of different approaches for handling missing data with a variety of missing data mechanisms, as the true value is known. Several simulation studies have investigated the effects of missing data using different MI approaches, but these have primarily imposed missingness only on the outcome variable (e.g. [11, 12]). These studies demonstrated that model based imputation approaches for an incomplete outcome variable were better than ad hoc imputation procedures and were fairly robust to some model departures . Furthermore, when a fully parametric imputation model correctly fitted the data, it performed better than alternative techniques such as predictive mean matching . Conversely, fully parametric methods performed worse when the imputation model did not fit . Few simulation studies have considered missing covariate data (e.g. ), especially situations where missingness was imposed on more than one covariate (e.g. [11, 16]). Only a limited number of these studies included survival as the outcome and these have only considered a CC analysis  or maximum likelihood based approaches (e.g. [17, 18]) and not MI techniques. There remains a lack of evidence about the effects of missing covariate data and its handling on the performance of the survival models and no consensus on the most appropriate MI techniques to use with a survival outcome.
In addition, no definitive guidelines appear to exist on the allowable proportion of missing data to validly apply MI techniques . With a single incomplete covariate or outcome, Harrell  suggested using imputation rather than a CC analysis with 5% missingness, although Barzi and Woodward  suggested that a CC analysis may still be suitable with up to 10% missingness. For MAR data, MI performed well up to 25% missingness, and adequately with 50% missingness . However with more than 60% missingness, the extreme levels of uncertainty about the imputed values resulted in high standard deviations and convergence problems of the imputation procedure with MI . With missing not at random (MNAR) data, where the probability of missingness is associated with the unobserved values , variance estimates were affected when more than 5% of the data were missing . All of these findings relate to an incomplete outcome or a single covariate and not to the situation with multiple incomplete covariates, where the missingness could relate to the level of an individual covariate or to the proportion of cases that have incomplete data for at least one covariate.
This paper reports the results of an extensive simulation study that aimed to assess the effects of applying different standard approaches to handle missing data in more than one covariate when fitting a Cox proportional hazards model to the full set of covariates. This simulation study investigated how the performance of the model was affected by varying amounts of missingness and different missing data mechanisms. We aimed to determine the maximum allowable proportion of missingness to validly apply these missing data techniques.
Details of the simulation procedures used within this simulation study are provided below. All simulations were performed using the freely available R statistical software , thus allowing all researchers access to any suitable methods identified.
Data structure for the breast cancer dataset and associated means and standard deviations (SDs) after suitable transformation
1 = Yes,
0 = No
0 = Pre,
1 = Post
0 = Grade I,
1 = Grade II/III
Continuous variable categorised
1 = ≤20 mm,
2 = 21-30 mm,
3 = >30 mm
For simplicity, the covariate data were generated using an underlying multivariate normal distribution  with the covariate means and covariance matrix obtained from the German breast cancer study data after suitable transformations (Table 1, Figure 1b). A log transformation was used for the continuous covariates X2 and X8 and a log (X + 1) transformation used for X3 and X4 to avoid taking logs of zero. The generated covariate data were back transformed onto their original scales, e.g. using exponential transformations, prior to any analyses being performed. A cut-point of 0.5 was used to obtain the three binary covariates (X5, X6, X7) and the same cut-points as in the original data were used for the categorical covariate (X8). Two dummy variables for X8 were created to indicate values of 21-30 mm or not and >30 mm or not respectively. Continuous covariates were truncated using their upper observed limits to produce realistic values and reasonable estimates for the mean and standard deviations that were not too dissimilar from the original dataset.
All continuous covariates were assumed to have a linear effect on the log relative hazard. An uncensored survival time was generated for each case assuming an exponential distribution with a hazard rate of 0.00027, which approximated the hazard seen in the breast cancer dataset, and their associated linear predictor . A censored time was also generated for each case using an exponential distribution with a hazard rate of 0.0002 to give approximately 35% censored observations. A smaller censoring rate than that seen in the breast cancer dataset ensured there were a sufficient number of events to fit a prognostic model for all levels of missingness. The required survival time was then defined for each case as the minimum of the uncensored and censored survival times and the event status determined accordingly.
A sample size of 1000 cases was used for all simulations, which represented the average sample size observed in a literature review of 100 reported prognostic factor studies .
The whole simulation process was repeated 1000 times, which enabled the smallest regression coefficient for X4 to be estimated with at least 20% accuracy and all remaining regression coefficients estimated to within 10% accuracy of their true values . The true values were obtained from fitting a Cox proportional hazards model to the motivating breast cancer data. Independent random samples were generated using different starting seeds that were separated by at least the sample size .
Missingness was imposed on four covariates: X3, X2, X5 and X8. A case was said to be incomplete if they had at least one missing covariate, but each case could have up to four covariates missing. Five overall rates of missingness of 5, 10, 25, 50 and 75% per case were considered to explore the effects with small, medium and large amounts of missingness. A moderately independent simulation strategy  was adopted, utilising the same set of 1000 datasets each time but with different values randomly deleted through using different starting seeds. This approach strengthens the comparison between different methods as it eliminates any sampling variability leaving all methods striving for exactly the same results, whilst allowing variability to exist between simulated datasets and amounts of missingness.
Data for four incomplete covariates (performance status, albumin, grade and residual disease) from an ovarian cancer study  provided empirical evidence of realistic patterns and frequencies of missing data and associations between the missingness of each covariate. The amount of missingness imposed on each of the four covariates, X3, X2, X5 and X8 were approximately 70%, 55%, 20% and 10%, respectively, of the overall amount of cases with any missing data. Dependencies between the missingness indicators for the incomplete covariates were generated such that 35% of incomplete cases were missing both X2 and X3, 10% were missing X5 and X2, and 5% were missing X8 and X3.
Specification of the missing data mechanisms to be imposed
β0 + ln(OR)MX3
β0 + ln(OR)MX2
β0 + ln(OR)MX3
Odds ratios (OR) to be specified in the missing data mechanisms given in Table 2
A Cox proportional hazards model including all eight covariates was fitted to each dataset. A linear relationship was assumed for all continuous covariates as used in the data generation process.
The outcomes of interest were the regression coefficients, associated standard errors (SE) and the significance of the covariates in the regression model. The average regression coefficient estimates over all simulations were assessed using the bias from the true value , the percentage bias and the coverage . The effect of the missingness on the overall model performance was assessed using the likelihood ratio chi-square test , the model's predictive ability using Nagelkerke's R2 statistic , the prognostic separation D statistic  and the 2-year predicted survival probability.
The bias introduced from maximising the partial likelihood estimator and not the full estimator when fitting a Cox regression model  in addition to any bias due to the data generation process impedes the detection of any additional bias incurred due to the missing data and its handling. Hence the average regression coefficient estimates and associated empirical SE (i.e. the standard deviation of the estimates across simulations) from performing a large simulation study with no missingness involving 20,000 replications formed the true values against which the missing data simulations were compared.
Summary of the missing data methods investigated
Library used within R statistical software
Number of iterations
Complete case analysis: Analyses only cases with complete data for all covariates
Single imputation performed using PMM
'pmm' function in 'mice'
Multiple imputation (MI) using data augmentation approach  with a multivariate normal assumption for all variables
MI using data augmentation approach using a general location model
MI using data augmentation approach using a general location model, but imputed values are not truncated to within plausible range
MI using regression switching imputation . Linear model are used for continuous covariates and logistic model for binary covariates and dummy variables for categorical covariates
MI using MICE with PMM
'pmm' function in 'mice' 
MI using MICE with PMM without transforming the incomplete covariates
'pmm' function in 'mice' 
MI using flexible additive imputation models  with PMM
'aregImpute' function in 'Hmisc' 
For all imputation approaches, the imputation model included all eight covariates in addition to the survival time and event status, indicating whether a case had the event or was censored at the time of analysis . A logarithmic transformation was used for survival time and the incomplete continuous covariates to make the assumption of normality more applicable . All imputed values were rounded to plausible values, where necessary. Twenty imputations were performed for each MI approach to provide a relative efficiency of at least 96%  compared to having an infinite number of imputations for the five amounts of missingness to be imposed from 5% to 75%.
Estimates of the outcomes of interest after MI were combined following proposed guidelines . Rubin's Rules  were used to combine each of the regression coefficient estimates, the prognostic separation D statistic and the predicted survival estimates after a complementary log-log transformation. An overall MI p-value from the Wald test for assessing the significance of each covariate in the regression model was also determined using Rubin's Rules . An overall significance estimate for the likelihood ratio statistic was obtained using the method for combining X 2 statistics . The median and inter-quartile ranges of the m Nagelkerke's R2 statistics  were calculated for each of the simulated datasets. Any deficiencies in the model performance measures and approaches for combining these estimates after MI should be similar across missing data methods and therefore still allow a valid and worthwhile comparison. After performing all 1000 simulations, the outcomes of interest were summarised, in general using the average value over all simulations or using the median value, where appropriate.
The results from performing MI using MICE, NORM and MIX were indistinguishable for all mechanisms and therefore only the results using MICE are presented. Firstly, the results from imposing a multivariate MAR mechanism are reported for all missing data methods.
For the highly skewed continuous covariates X2, X3 and X4, the least biased regression coefficient estimates were produced when MI was performed using MICE-PMM without transformations. In contrast, more bias was seen for the regression estimates for X1, X5, and X6 using this approach. When the imputed values were not truncated to within a plausible range (MIX-no truncating), all regression coefficient estimates tended to be slightly more extreme than with all other MI approaches.
The coverage using the different MI approaches remained around the nominal 95% level irrespective of the amount of missingness for all covariates except the highly skewed covariates of X2, X3 and X4 (Figure 4). The coverage for X2 and X4 fell below 90% with 75% missingness for all MI approaches, except using MICE-PMM without transformations for X2, which still had coverage of 93% with 75% missingness. The coverage for X3, the covariate with a highly skewed distribution and the most missingness imposed, fell below 90% with 50% missingness using MICE-PMM without transformations and the 'aregImpute' function, but fell below 90% with only 25% missingness for all other MI approaches.
With MI and SI, none of the covariates changed their significance in the model at the 5% level (Figure 5). However, the binary covariate X5 and the dummy variable for X8 representing group 3 (>30 mm) became borderline significant with increasing amounts of missingness.
No apparent differences from the above results for a multivariate MAR mechanism were seen in the results after imposing a multivariate MCAR or combined missing data mechanism. The similarity of results for the multivariate MAR mechanism and the combined mechanism may have occurred because the MNAR mechanism was imposed on the covariate with the smallest amount of missingness and hence this mechanism had the least effect on the overall results.
Using a real dataset to provide a suitable structure for simulating the datasets, as in this study, simplifies the data generation procedures, avoids arbitrary choices and can aid the generalisability of the results. The simulated data were not an exact replica of the original, but provided sufficiently strong similarities to the original data to not warrant using more complicated semi-continuous distributions for PGR level (X3), and ER level (X4). Approximating the patterns of missingness seen in an incomplete dataset provided a realistic framework for simulating the missing data. The levels of missingness imposed reflected those seen in a review of prognostic modelling studies . The effects of using MI when fitting prognostic models were unknown. Therefore this simulation study restricted the modelling process to including all covariates in the prognostic model and assuming linear relationships for all continuous covariates, both in the generation of the survival times and in the fitting of the prognostic model. Extensions of this research could include investigating the possible approaches for performing variable selection and fitting fractional polynomials after MI (e.g. ).
In this simulation study, performing a CC analysis with any multivariate missingness produced less biased regression coefficient estimates with better coverage rates than using SI or MI. However, this benefit was at the expense of larger standard errors and hence loss of efficiency due to the reduction in the sample being analysed . This loss in efficiency affected the significance of the covariates in the prognostic model of the marginal prognostic covariates, making them appear non-significant with 25% or more missingness when in fact they were prognostically important. It is advisable to use a CC analysis only when fitting a Cox proportional hazards model with a reasonably small amount of missing data. Previous evidence [2, 3] suggested that imposing univariate MAR missingness associated with outcome would result in biased regression coefficient estimates when using CC analysis. Demissie et al  found large biases when the MAR mechanisms were associated with longer survival times or event status and the covariates had large hazard ratios for survival, but not with a hazard ratio of 1 or when the missingness was associated with shorter survival times. Relatively unbiased results were found in our simulations with multivariate MAR missingness where the missingness of X5, the binary covariate with 20% of the total missingness, was associated with shorter survival times and the hazard ratio for X5 in the prognostic model was 0.7. Biased regression estimates may be more evident with more missing covariate data that is highly dependent on the outcome, especially longer survival times and event status, and with large hazard ratios.
With multivariate missing covariate data, using SI is not recommended with more than 10% missingness, due to its underestimation of the variability and corresponding detrimental effect on the coverage rates.
The results from performing simulations based on the German breast cancer dataset showed some bias, especially with over 25% missingness, for all mechanisms and MI approaches. The average SEs for all MI approaches and mechanisms were similar. They fell between the inflated estimate from the CC analysis and the underestimated SE from SI, as expected from previous research [8, 35]. The coverage was unaffected and remained around the nominal 95% level for all mechanisms and covariates, except for the skewed covariates. Tang et al  also found that the coverage may be poorer after MI for highly skewed data. Better coverage rates were seen using MICE-PMM without transformations or the 'aregImpute' function than with the other MI approaches and also when a MAR mechanism was imposed rather than a MNAR mechanism.
Researchers have suggested that MI approaches are fairly robust to departures from normality due to the separation of the imputation and analysis phases [12, 13, 37]. Any deficiencies in the assumptions and implementation of the imputation model will only affect the incomplete component of the dataset and not the whole sample . Having skewed continuous data and an outcome of survival time as in this simulation study may have affected the performance of the methods under investigation, especially those which assumed an underlying normal distribution for the continuous covariates, e.g. NORM, MIX and MICE. This study highlighted the problems that can exist when the imputation and analysis models differ and the model assumptions may not be fully satisfied. The bias seen in this simulation study even when the MAR mechanism assumption was correct may be an artefact of the transformations used in the imputation process . Not only are the incomplete covariates transformed for imputation and then back-transformed prior to analysis, but the survival times are also transformed in the imputation model and then fitted using an alternative model. Imputing without transformation can reduce the bias in the mean estimate but distort other aspects of the distributional shape . Log transformations were used for the continuous covariates in the data generation process. However, as the simulated data were then truncated to resemble the real data, applying the same transformations in the imputation process failed to satisfy normality. No other simple power transformations sufficiently improved normality or provided more plausible imputations.
The inclusion in the imputation model of survival time after log transformation and event status may not be the optimal choice to account for the censoring process and thus may have also introduced bias into the results. Using the Nelson-Aalen estimate of the cumulative hazard of the survival time may be more appropriate in future . If the hazard rates for the survival and censoring times differ then it may be sensible to consider these times separately in the imputation model.
The MI procedures using MICE-PMM or the 'aregImpute' function, which rely on the distributional assumptions only to match complete and incomplete responders, performed better for all missingness mechanisms than the other MI approaches. This confirmed the results from Faris et al  that with incomplete skewed data, MICE-PMM would be preferred to other MI Markov Chain Monte Carlo type approaches. However, caution is needed when using the 'aregImpute' function, especially when the missingness is highly related to survival, as although the estimates for the incomplete covariates may exhibit little bias, the estimates for other prognostically important covariates may display more bias than seen with other MI approaches. Both MICE-PMM and the 'aregImpute' function identify suitable matches from the observed data and therefore additional caution is required with small samples and with covariates with rare events as there may be a limited number of available cases to be selected as imputed values. With skewed data, values of a few cases have a lot of leverage that may distort the imputations and influence the results. Therefore it is essential to examine the distributions of the covariates requiring imputation to determine whether transformations are likely to provide reasonable estimates for the data to be analysed. With MICE-PMM, transforming the continuous covariates produced worse estimates than simply using the covariate values on their original skewed scale. Our findings suggest that if suitable transformations do not improve normality it is better to use MICE-PMM without transformations. With a fully observed normally distributed outcome and more normally distributed incomplete covariates and hence compatible imputation and analysis models, MICE-PMM may not remain the best MI approach.
In this simulation study, truncating imputed values for the continuous covariates to within the plausible range produced less bias than allowing implausible values. Schafer  suggested rounding values for the incomplete binary covariate to the observed values. In these simulations, where only 20% of the total missingness was imposed on the binary covariate, that approach did not produce any more bias than using the correct distributional assumption, e.g. fitting logistic regression models. Biases may be more apparent when the binary covariate has an uneven split or greater missingness .
From this simulation study, with incomplete skewed data, MI using MICE-PMM without transformations produced precise unproblematic estimates  within the allowable 10% accuracy with up to 25% missingness, but would not be recommended with 50% or more missingness for any missing data mechanism. Furthermore, with a MNAR mechanism, MI performed poorly with 25% or more overall missingness. Including variables in the imputation model that help to explain the missingness or are highly associated with the incomplete covariates themselves, can reduce the effect of an MNAR missing data mechanism . With less enriched imputation models, and datasets where there is little correlation between variables, the results from the MNAR may be even more extreme than seen here. Further research is required to assess whether alternative MI procedures or fully Bayesian approaches that can model the skewness of the covariate distribution and the missing data mechanism may be more appropriate when there is more than 25% missingness.
The true performance of the various missing data methods is likely to vary in relation to the underlying distribution of the covariates, the correlations between these variables as well as with different missing data mechanisms and associations between the outcome and the covariates with missing data. Therefore, the generalisability of the results from this simulation study, however rigorous, is limited due to reflecting the data from a single real prognostic study and imposing a restricted number of missing data mechanisms. Confirmatory investigations are required to examine the extent to which these findings are consistent across alternative populations, distributions and clinical contexts.
For approximately 10% or less missingness, it remains unclear whether the benefits of MI, including efficiency and utilising all data, outweigh the simplicity of a CC analysis. With increasing amounts of missingness, the benefits of MI over a CC analysis become clearer. When some data are skewed, as in this simulation study, MICE-PMM may be the preferred MI approach provided that less than 50% of the cases have missing data and the missing data are not MNAR.
Andrea Marshall (nee Burton) was supported by a Cancer Research UK project grant. Douglas G Altman is supported by Cancer Research UK.
Patrick Royston is supported by the UK Medical Research Council.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.