 Research
 Open Access
 Published:
Evaluation of multiple imputation approaches for handling missing covariate information in a casecohort study with a binary outcome
BMC Medical Research Methodology volume 22, Article number: 87 (2022)
Abstract
Background
In casecohort studies a random subcohort is selected from the inception cohort and acts as the sample of controls for several outcome investigations. Analysis is conducted using only the cases and the subcohort, with inverse probability weighting (IPW) used to account for the unequal sampling probabilities resulting from the study design. Like all epidemiological studies, casecohort studies are susceptible to missing data. Multiple imputation (MI) has become increasingly popular for addressing missing data in epidemiological studies. It is currently unclear how best to incorporate the weights from a casecohort analysis in MI procedures used to address missing covariate data.
Method
A simulation study was conducted with missingness in two covariates, motivated by a case study within the Barwon Infant Study. MI methods considered were: using the outcome, a proxy for weights in the simple casecohort design considered, as a predictor in the imputation model, with and without exposure and covariate interactions; imputing separately within each weight category; and using a weighted imputation model. These methods were compared to a complete case analysis (CCA) within the context of a standard IPW analysis model estimating either the risk or odds ratio. The strength of associations, missing data mechanism, proportion of observations with incomplete covariate data, and subcohort selection probability varied across the simulation scenarios. Methods were also applied to the case study.
Results
There was similar performance in terms of relative bias and precision with all MI methods across the scenarios considered, with expected improvements compared with the CCA. Slight underestimation of the standard error was seen throughout but the nominal level of coverage (95%) was generally achieved. All MI methods showed a similar increase in precision as the subcohort selection probability increased, irrespective of the scenario. A similar pattern of results was seen in the case study.
Conclusions
How weights were incorporated into the imputation model had minimal effect on the performance of MI; this may be due to casecohort studies only having two weight categories. In this context, inclusion of the outcome in the imputation model was sufficient to account for the unequal sampling probabilities in the analysis model.
Background
Epidemiological studies often collect large amounts of data on many individuals. Some of this information may be costly to analyse, for example biological samples. Furthermore, if there are a limited number of cases, data on all noncases may provide little additional information to that provided by a subset [1]. In this context, investigators may opt to use a casecohort study design, in which background covariate data and outcomes are collected on all participants and more costly exposures (e.g. metabolite levels) are collected on a smaller subset. An example of a cohort study adopting the casecohort design is the Barwon Infant Study (BIS). This is a populationderived cohort study with a focus on noncommunicable diseases and the biological processes driving them. Given that a number of investigations within BIS involve exposures collected through costly biomarker and metabolite analysis, for example serum vitamin D levels, the casecohort design was implemented to minimise cost [2].
In the casecohort design, a subset of the full cohort, hereafter termed the subcohort, is randomly selected from the inception cohort. This subcohort is used as the sample of controls for all subsequent investigations, with exposure data collected from the subcohort and all cases [3]. In such a study, the analysis is conducted on the subcohort and cases only, resulting in an unequal probability of selection into the analysis, with cases having probability of selection equal to1 and noncase subcohort members having a probability of selection less than 1. This unequal sampling should be accounted for in the analysis so as to avoid bias induced due to the oversampling of cases [4].
One way to view the casecohort design, and to address the unequal sampling, is to treat it as a missing data problem, where the exposure data is ‘missing by design’. Standard practice in the analysis of casecohort studies is to handle this missing exposure data using inverse probability weighting (IPW) based on the probability of selection into the analysis [1]. Additionally, casecohort studies may be subject to unintended missing data in the covariates, and multiple imputation (MI) may be applied to address this missing data.
MI is a twostage procedure in which missing values are first imputed by drawing plausible sets of values from the posterior distribution of the missing data given the observed data, to form multiple, say m > 1, complete datasets. In the second stage, the analysis is conducted on each of these m datasets as though they were fully observed, producing m estimates of the target estimands. An overall estimate for the parameter of interest is produced, along with its variance, using Rubin’s rules [5]. If the imputation model is appropriate, and the assumptions on the missing data mechanism hold, then the resulting estimates are unbiased with standard errors (SE) that reflect not only the variation of the data but also the uncertainty in the missing values [6].
When conducting MI, there are two general approaches that can be used to generate the imputed datasets when there is multivariate missingness: joint modelling, most commonly multivariate normal imputation (MVNI), and fully conditional specification (FCS). Under MVNI the missing covariates are assumed to jointly follow a multivariate normal distribution [7]. In contrast, FCS uses a series of univariate imputation models, one for each incomplete covariate, and imputes missing values for each variable by iterating through these models sequentially [8]. To obtain valid inferences, careful consideration must be made when constructing the imputation model such that it incorporates all features of the analysis model, in order to ensure compatibility between the imputation and analysis model [9, 10].
In simple terms, to achieve compatibility, the imputation model must at least include all the variables in the analysis model, and in the same form. It may also include additional variables, termed auxiliary variables, which can be used to improve the precision of the inference if the auxiliary variables are associated with the variables that have missing data. Auxiliary variables may additionally decrease bias if they are strong predictors of missingness [11]. In the context of a casecohort study where the target estimand is the coefficient for the risk ratio (RR) estimated from a logbinomial model with IPW to address unequal probability of selection, two key features should be reflected in the imputation model; 1) the assumed distribution of the outcome, given the exposure and covariates (i.e. logbinomial), and 2) the weights. It is currently unclear how best to incorporate weights into MI in the context of a casecohort analysis.
It has been previously shown that ignoring weights during MI can introduce bias into the point estimates and estimated variance produced through Rubin’s rules in the context of an IPW analysis model [12, 13]. Various approaches to incorporate weights into MI have been proposed in the literature. Previous work from Marti and Chavance [14] in a survival analysis of casecohort data suggests that simply including the outcome, as a proxy for the weights, in the imputation model may be sufficient for incorporating the weights. In the casecohort setting we are considering, there are only two distinct weights, representing the unequal selection for cases and controls, so the weighting variable is completely collinear with the outcome. An additional approach is to include weights and an interaction between the weights and all of the variables in the analysis model in the imputation model. Carpenter and Kenward [12] illustrated that this corrected for the bias seen when the weights are ignored in the imputation model. One difficulty with this approach is that it may be infeasible if there are several incomplete variables. Another drawback is the increased number of parameters to be estimated during imputation. Another potential approach is to use stratumspecific imputation, in which missing values for cases and noncase subcohort members are imputed separately. While many studies have compared MI approaches in the casecohort setting [14,15,16,17], these are in the context of a timetoevent endpoint and predominantly considered MI only to address the missing exposure due to the design. Keogh [16] considered additional missingness in the covariates, but this was in the context of a survival analysis where weights were dependent on time. While there are many approaches available, it is unclear how these preform in a simple casecohort context with missing covariate data.
The aim of this study was to compare the performance of a range of possible methods for implementing MI to handle missing covariate data in the context of a casecohort study where the target analysis uses IPW to estimate the i) RR and ii) odds ratio (OR). Whilst the common estimand in casecohort studies is the RR due to the ability to directly estimate this quantity without the raredisease assumption [18], we have chosen to additionally consider the target estimand being the coefficient for the OR as this is still a commonly used estimand. The performance of the MI approaches was explored under a range of scenarios through the use of a simulation study closely based on a case study within BIS, and application of these methods to the BIS data. The ultimate goal was to provide guidance on the use of MI for handling covariate missingness in the analysis of casecohort studies.
The paper is structured as follows. We first introduce the motivating example, a casecohort investigation within BIS, and the target analysis models used for this study. This is followed by a description of the MI approaches to be assessed and details of a simulation study designed to evaluate these approaches based on the BIS case study. We then apply these approaches to the BIS case study. Finally, we conclude with a discussion.
Methods
Motivating example
The motivating example for this study comes from a casecohort investigation within BIS [19]. A full description of BIS can be found elsewhere [2]. Briefly, it is a populationderived longitudinal birth cohort study of infants recruited during pregnancy (n = 1074). The research question focused on the effect of vitamin D insufficiency (VDI) at birth on the risk of food allergy at oneyear. Cord blood was collected and stored after birth, and the children were followed up at oneyear. During this review, the infant’s allergy status to five common food allergens (cow’s milk, peanuts, egg, sesame and cashew) was determined through a combination of a skin prick test and a food challenge. Of those who completed the oneyear review (n = 894, 83%), a random subcohort was selected (n = 274), with a probability of approximately 0.31. The exposure, VDI, was defined as 25(OH)D_{3} serum metabolite levels below 50 nM and was measured from those with a confirmed food allergy at oneyear and those who were selected into the subcohort.
The planned primary analysis of the case study was to estimate the RR for the target association using IPW in a binomial regression model adjusted for the confounding variables: family history of allergy (any of asthma, hay fever, eczema, or food allergy in an infant’s parent or sibling), Caucasian ethnicity of the infant, number of siblings, domestic pet ownership, and formula feeding at 6 and 12 months. The target analysis of this study adjusted for a slightly different set of confounders to the BIS example. A description of these variables and the amount of missing data in each is shown in Table 1.
Target analysis
In this study, we focus on two estimands from two different analysis models. Each model targets the association between VDI and food allergy at oneyear, adjusting for confounders. The first model estimates the adjusted RR using a Poisson regression model with a loglink and a robust error variance [20] to avoid the known convergence issues of the logbinomial model:
The RR of interest is exp(θ_{1}). The second target estimand is the adjusted OR for the exposureoutcome association, estimated via a logistic regression model:
The OR of interest is exp(β_{1}). Estimation for each model uses IPW, where the weights are estimated using the method outlined by Borgan [21] for stratified sampling of the cohort, noting that the oversampling of the cases is a special case of stratified sampling where stratification depends on the outcome. The weight for ith individual can be defined as w_{i} = 1 for cases, and n_{0}/m_{0} for noncases, where n_{0} is the number of noncases in the full cohort and m_{0} is the number of noncases within the subcohort.
MI methods
Below we outline the four approaches we considered in the BIS case study and simulation study for incorporating the weights into the imputation model. All MI approaches include the outcome, exposure, covariates, and auxiliary variables in the imputation model except where specified. Where imputation has been applied under FCS, binary variables have been imputed from a logistic model. For MVNI, all variables are imputed from a multivariate normal distribution, conditional on all other variables, with imputed covariates included into the analysis as is (i.e. without rounding).
Weight proxy as a main effect
Under this approach, only the analysis and auxiliary variables listed above were included in the imputation model, with the outcome acting as a proxy for the weights due to the collinearity between the outcome and weights. This approach is implemented under both the FCS and MVNI frameworks.
Weight proxy interactions
The second approach includes twoway interactions between the outcome (as a proxy for the weights) and all other analysis variables in the imputation model. Within FCS, the interactions were included as predictors, with these derived within each iteration of the imputation [22]. Within MVNI interactions were considered as ‘just another variable’ in the imputation model [22].
Stratumspecific imputation
In the casecohort setting, where there are only two weight strata, another option is to impute separately within each weight/outcome stratum. Here, the outcome is not included in the imputation model, but rather the incomplete covariates are imputed using a model including the exposure, other covariates and auxiliary variables, for cases and noncases separately.
Weighted imputation model
A final option is to impute the missing values using a weighted imputation model, where the weights are set to those used during analysis. This can only be conducted within the FCS framework.
The approaches for handling the missing covariate data are summarised in Table 2.
Simulation study
A simulation study was conducted to assess the performance of each MI approach for accommodating the casecohort weights into the imputation model, across a range of scenarios. Simulations were conducted using Stata 15.1 [23]. Cohorts of size 1000 were generated using models outlined above with parameter values based on the observed relationships in BIS (except where noted).
Complete data generation
Complete data, comprising the exposure, five confounders and two auxiliary variables, were generated sequentially using the models listed below. Models for data generation were constructed based on the plausible causal structure specified in Fig. 1. A table showing the parameter values can be found in Additional file 1.

i.
Caucasian ethnicity

ii.
Maternal age at birth, in years, (auxiliary variable)
where ε~N(0, σ^{2})

iii.
Socioeconomic Index for Areas (SEIFA) tertile, (auxiliary variable)

iv.
Family history of allergy

v.
Number of siblings

vi.
Pet ownership and antenatal vitamin D supplement usage

vii.
The exposure, VDI
Finally, the outcome, food allergy at oneyear, was generated from a Bernoulli distribution with a probability determined by either model (1) or model (2) so the target analysis was correctly specified. In these models we set θ_{1} = log(RR_{adj}) = log(1.16) and β_{1} = log(OR_{adj}) = log(1.18) as estimated from BIS. Given the weak exposureoutcome association in BIS, we also generated food allergy with an enhanced association where we set θ_{1} = β_{1} = log(2.0).
An additional extreme data generation scenario was considered as a means to stresstest the MI approaches under more extreme conditions. In this scenario, the associations between the continuous auxiliary variable of maternal age and the exposure, missing covariates, and missing indicator variables were strengthened.
Inducing Missingness
Missingness was introduced into two covariates, antenatal vitamin D usage and pet ownership. Missingness was generated such that p% of overall observations had incomplete covariate information, with \(\frac{p}{2}\%\) having missing data in just one covariate and \(\frac{p}{2}\%\) having missing data in both, where p was chosen as either 15 or 30. Three missing data mechanisms were considered: an independent missing data mechanism and two dependent missing data mechanisms as depicted in the missingness directed acyclic graph (mDAG) in Fig. 1.
Under the independent missing data mechanism, missingness in each covariate was randomly assigned to align with the desired proportions. Under the dependent missingness mechanisms, an indicator for missingness in pet ownership, M_{petown}, was initially generated from a logistic model (13), followed by an indicator for missingness in antenatal vitamin D usage, M_{antevd} (model (14)).
The parameters, ν_{0} and τ_{0}, and τ_{4} were iteratively chosen until the desired proportions of missing information were obtained. Missingness indicators were generated dependent on the outcome (a setting where the completecase analysis would be expected to be biased) and an auxiliary variable (a setting where we expect a benefit of MI over the completecase analysis), as depicted in the causal diagram in Fig. 1. The dependency between the missing indicator variables was used to simultaneously control both the overall proportion of incomplete records and the proportion with missingness in both variables.
The two dependent missing mechanisms differed in the strength of association between predictor variables and the missing indicators. The first mechanism used parameter values set to those estimated in BIS (termed Dependent Missingness – Observed, or DMO). The second used an enhanced mechanism where the parameters values were doubled (termed Dependent Missingness – Enhanced, or DME). The parameter values used to induce missingness under the dependent missingness mechanisms can be found in Additional file 1.
To mimic the casecohort design, a subcohort was then randomly selected using one of three probabilities of selection (0.20, 0.30, 0.40). The exposure, VDI, was set to missing for participants without the outcome and who had not been selected into the subcohort.
Overall, we considered 78 scenarios (2 data generation processes, 2 exposureoutcome associations, 3 missing data mechanisms, 2 incomplete covariate proportions, and 3 subcohort selection probabilities, plus another 6 scenarios under extreme conditions).
For the 6 extreme scenarios presented in the results section, the mean percentage of cases in the full cohort across the 2000 simulated datasets was 20.4% (standard deviation: 1.3%) for scenarios targeting RR estimation, and 18.4% (1.2%) for OR estimation. The average casecohort sample size ranged from 348 to 522, increasing with the probability of subcohort selection. The percentage of incomplete observations within the casecohort sample ranged from 30.5 to 32.6%, with the percentage of incomplete cases increasing as the subcohort size decreased due to the dependency between the probability of being incomplete and the outcome. Additional file 1 contains a table showing summaries for the 2000 simulated datasets for the 6 extreme scenarios.
Evaluation of MI approaches
For each scenario, the MI approaches outlined above were applied and 30 imputed datasets generated, to match the maximum proportion of missing observations. The imputed datasets were analysed using IPW with the corresponding target analysis model. A completedata analysis (with no missing data in the subcohort) and a completecase analysis (CCA) were also conducted for comparison. Performance was measured in terms of percentage bias relative to the true value (relative bias), the empirical and modelbased SE, and coverage probability of the 95% confidence interval for the target estimand, the effect of VDI on food allergy (θ_{1} in model (1) and β_{1} in model (2)). In calculating the performance measures, the true value was taken to be the value used during data generation with measures calculated using the simsum package in Stata (see [24] for details). We also report the Monte Carlo standard error (MCSE) for each performance measure. For each scenario we presented results for 2000 simulations. With 2000 simulations, the MCSE for a true coverage of 95% would be 0.49%, and we can expect the estimated coverage probability to fall between 94.0 and 96.0% [25]. Since convergence issues were expected across the methods, we generated 2200 datasets in each scenario and retained the first 2000 datasets on which all methods converged. These 2000 datasets were used to calculate all performance measures except the convergence rate, which was calculated across the 2200 simulations.
Bias in RR estimation
Incompatibility between the imputation and analysis model may arise due to the imputation of missing values from a linear or logistic model when the analysis targets the RR [26]. To explore the bias introduced into the point estimate in this context, MI was conducted on the full cohort with completely observed exposure (i.e. before data were set to missing by design) and analysed without weighting. This was conducted to understand the baseline level bias, prior to the introduction of weighting. Results for this analysis are presented in Additional file 2.
Case study
Each of the MI methods were also applied to the target analyses using BIS data. For consistency with the simulation study, the analysis was limited to observations with complete outcome and exposure data (n = 246). In the case study there were also missing values in the covariates, Caucasian ethnicity (1%) and family history of allergy (1%), and the auxiliary variable SEIFA tertiles (2%), which were imputed alongside pet ownership (1%) and antenatal vitamin D usage (23%). For the FCS approaches, all variables were imputed using a logistic model, except for SEIFA tertile, which was imputed using an ordinal logistic model. Imputed datasets were analysed under each target analysis model with weights of 1 for cases and (0.31)^{−1} for noncase subcohort members. A CCA was also conducted.
Results
Given that the pattern of results were similar across the range of scenarios, we describe the results for the 6 scenarios under extreme conditions (enhanced exposureoutcome association, 30% missing covariates under DME, and enhanced auxiliary associations). The results for the remaining scenarios are provided in Additional file 3.
Across the 2200 simulations, only FCSWX and FCSSS had convergence issues (i.e., successfully completing the analysis without nonconvergence of the imputation procedure or numerical issues in the estimation). The rate of nonconvergence was greatest for FCSWX with the smallest subcohort size (probability of selection = 0.2), with 4.0% of simulations under RR estimation and 3.2% under OR estimation failing to converge. Less than 0.2% of simulations failed to converge for FCSSS under any combination of estimand and subcohort probability of selection.
Figure 2 shows the relative bias in the estimate of the association (RR or OR) between VDI and food allergy at oneyear for each scenario considered under extreme conditions. The largest bias for the large sample completedata analysis occurred for the coefficient of the OR and the largest subcohort size at 1.5%, with the MCSE range covering a relative bias of 0%. In all scenarios shown, the CCA resulted in a large relative bias, ranging between 10 and 20%. All MI approaches reduced this bias drastically, irrespective of estimand and subcohort size. When the target estimand was the coefficient for the OR and the smallest subcohort probability was used, all MI approaches were approximately unbiased, with MVNIWX showing the largest relative bias at 1.4% and FCSWX showing the least at 0.4%. For the remaining scenarios across both estimands, the completedata analysis and MI approaches showed a positive bias, with the largest occurring with the log (OR) estimand and a probability of selection of 0.3, where the relative bias was centred around 5%. Overall, minimal differences can be seen between the MI approaches when the estimation targeted the RR. When the target estimand was the log (OR), there was a slight decrease in relative bias for FCSWX, when compared to other MI approaches, and a slight increase in the relative bias for MVNI approaches, when compared to FCS approaches.
The empirical SE and modelbased SE are shown in Fig. 3. For most scenarios, we can see the SE has been underestimated in the CCA. There was a slight underestimation of the SE when the subcohort selection probability was 0.3 and the target estimand the log (RR), however, the modelbased SE appears to fall within the MCSE intervals for the empirical SE. There appears to be no systematic deviation between the empirical SE and the modelbased SE for any scenario. An increase in the precision can be seen as the subcohort size increases, and there is an increase in precision for all MI methods compared to the CCA for any given scenario, as expected.
The estimated coverage probabilities are shown in Fig. 4. For a nominal coverage of 95%, all MI approaches have a satisfactory coverage with 95% falling within the MCSE range for all scenarios, with the exception of the smallest subcohort size with OR estimation. Under this scenario, all MI approaches produce overcoverage, as a result of the point estimate being unbiased and the SE overestimated (but with the average modelbased SE falling within the MCSE range). There is no apparent pattern in the coverage probability across the MI methods, with all methods performing similarly. Results from the CCA showed acceptable levels of coverage.
The results from the case study are shown in Fig. 5. Results are consistent with the simulation study in that there is little variation in the estimated association across the MI methods. Unlike the simulation study under extreme conditions, the estimated coefficient is similar in the CCA and the MI approaches. There is an expected recovery of information leading to an increase in precision for MI approaches compared to the CCA.
Discussion
In this study we compared a number of different approaches for accommodating unequal sampling probabilities into MI in the context of a casecohort study. We found that how the weights were included in the imputation model had minimal effect on the estimated association or performance of MI which, as expected, outperformed CCA. Results were consistent across different levels of missing covariate information, target estimand and subcohort selection probability.
While bias was seen in some scenarios, this was minimal (~ 5%) and consistent across all MI approaches. We conducted a large sample analysis to confirm the data generation process was correct, given the bias observed in the completedata analysis, which showed minimal bias. We have therefore attributed the positive bias seen in the simulation study to a finite sampling bias, which was observed for large effect sizes in similar studies [16]. The minimal difference across MI methods seen in the current study may be due to the casecohort setting having only two weight strata that are collinear with the outcome and all MI approaches including the outcome either directly or indirectly (in the case of stratumspecific imputation). The results of this study complement the work by Marti and Chavance [14] who showed that inclusion of the outcome in the imputation model was sufficient to account for the unequal sampling probabilities in the context of a casecohort survival analysis. In the case study, the MI approaches performed similarly to the CCA and we believe this is due to the observed weak associations in BIS.
Our simulation study was complicated by potential bias due to the incompatibility between the imputation and analysis model when the target analysis estimated a RR through a Poisson regression model. The same would be true if the RR was estimated using a binomial model, as in the case study. We assessed this explicitly through imputation of the full cohort prior to subcohort selection, with results shown in Additional file 2. Minimal bias was seen due to this incompatibility. This may be because we only considered missing values in the covariates, which have been generated from a logistic model. Studies that have shown bias due to this incompatibility had considered missing values in both the outcome and exposure [26].
One strength of the current study was that it was based on a real case study, with data generated under a causal structure depicted by mDAGs informed by subject matter knowledge. This simulation study also examined a range of scenarios; however, it is important to note that not all possible scenarios can be considered, and these results may not extend to scenarios with missingness dependent on unobserved data or with unintended missingness in the exposure or outcome. This study also has a number of limitations. The simulations were conducted under controlled conditions such that the analysis model was correctly specified, and the missing data mechanism was known. Under the specified missing data mechanisms, the estimand was known to be recoverable and therefore MI was expected to perform well [27]. The missing data mechanism is generally unknown in a real data setting and results may not be generalisable.
Another limitation of this study is that only covariates have been considered incomplete. Often there can be missingness in the outcome (e.g. subcohort members dropout prior to oneyear followup and outcome collection) and/or unintended missingness in the exposure (e.g. cord blood not stored for infants selected into the subcohort or with food allergy). This study has also only considered a combination of MI and IPW. There are other analytic approaches that could have been used, for example using weighting to account for the missing covariates as well as the missing data by design, or imputing the exposure in those not in the subcohort and conducting a full cohort analysis. These approaches have been explored in a timetoevent setting [14, 16, 17] but little is known on the appropriateness for the casecohort setting with a binary outcome. Furthermore, no study to date has considered the scenario of additional exposure missing by chance within the subcohort. The limitations mentioned here offer an avenue for future work.
Conclusions
When performing MI in the context of casecohort studies, how unequal sampling probabilities were accounted for in the imputation model made minimal difference in the analysis. In this setting, inclusion of the outcome in the imputation model, which is already standard practice, was a sufficient approach to account for the unequal sampling probabilities incorporated in the analysis model.
Availability of data and materials
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
Abbreviations
 BIS:

Barwon Infant Study
 CCA:

CompleteCase Analysis
 DME:

Dependent Missing – Enhanced
 DMO:

Dependent Missing – Observed
 FCS:

Fully Conditional Specification
 IPW:

Inverse Probability Weighting
 mDAG:

Missingness Directed Acyclic Graph
 MCSE:

Monte Carlo Standard Error
 MI:

Multiple Imputation
 MVNI:

Multivariate Normal Imputation
 OR:

Odds Ratio
 RR:

Risk Ratio
 SE:

Standard Error
 SEIFA:

Socioeconomic Index for Areas
 VDI:

Vitamin D Insufficiency
References
Prentice RL. A casecohort design for epidemiologic cohort studies and disease prevention trials. Biometrika. 1986;73(1):1–11.
Vuillermin P, Saffery R, Allen KJ, Carlin JB, Tang ML, Ranganathan S, et al. Cohort profile: the Barwon infant study. Int J Epidemiol. 2015;44(4):1148–60.
Lumley T. Complex surveys: a guide to analysis using R. Hoboken, NJ: Wiley; 2010.
Cologne J, Preston DL, Imai K, Misumi M, Yoshida K, Hayashi T, et al. Conventional casecohort design and analysis for studies of interaction. Int J Epidemiol. 2012;41(4).
Rubin D. Inference and missing data. Biometrika. 1976;63(3):581–92.
Rubin DB. Multiple imputation for nonresponse in surveys: John Wiley & Sons; 1987.
Schafer JL. Analysis of incomplete multivariate data: chapman and hall/CRC; 1997.
van Buuren S. Multiple imputation of discrete and continuous data by fully conditional specification. Stat Methods Med Res. 2007;16(3):219–42.
Bartlett J, Seaman SR, White IR, Carpenter JR. Multiple imputation of covariates by fully conditional specification: accommodating the substantive model. Stat Methods Med Res. 2015;24(4):462–87.
Meng XL. Multipleimputation inferences with uncongenial sources of input. Stat Sci. 1994;9(4):538–58.
Lee K, Simpson J. Introduction to multiple imputation for dealing with missing data. Respirology. 2014;19(2):162–7.
Carpenter J, Kenward M. Multiple imputation and its application: John Wiley & Sons; 2012.
Kim JK, Brick JM, Fuller WA, Kalton G. On the Bias of the multipleimputation variance estimator in survey sampling. J Royal Statistical Society Series B (Statistical Methodology). 2006;68(3):509–21.
Marti H, Chavance M. Multiple imputation analysis of casecohort studies. Stat Med. 2011;30:1595–607.
Breslow N, Lumley T, Ballantyne CM, Chambless LE, Kulich M. Using the whole cohort in the anlaysis of casecohort data. Am J Epidemiol. 2009;169(11):1398–405.
Keogh RH, Seaman SR, Bartlett J, Wood AM. Multiple imputation of missing data in nested casecontrol and casecohort studies. Biometrics. 2018;74:1438–49.
Keogh RH, White IR. Using fullcohort data in nested casecontrol and casecohort studies by multiple imputation. Stat Med. 2013;32:4021–43.
Sato T. Risk ratio estimation in casecohort studies. Environ Health Perspect. 1994;102(Suppl 8):53–6.
Molloy J, Koplin JJ, Allen KJ, Tang MLK, Collier F, Carlin JB, et al. Vitamin D insufficiency in the first 6 months of infancy and challengeproven IgEmediated food allergy at 1 year of age: a casecohort study. Allergy. 2017;72:1222–31.
Zhou G. A modified poisson regression approach to prospective studies with binary data. Am J Epidemiol. 2004;159(7):702–6.
Borgan O, Langholz B, Samuelsen SO, Goldstein L, Pogoda J. Exposure stratified casecohort designs. Lifetime Data Anal. 2000;6:39–58.
von Hippel PT. How to impute interactions, squares, and other transformed variables. Sociological Methodology. 2009;39:1.
StataCorp. Stata Statistical Software: Release 15. In: StataCorp, editor. College Station, TX: StataCorp LLC; 2017.
White IR. Simsum: Analyses of simulation studies including monte carlo error. The Stata Journal. 2010;10(3).
Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med. 2019:1–29.
Sullivan TR, Lee K, J, Ryan P, Salter AB. Multiple imputation for handling missing outcome data when estimating the relative risk. BMC Medical Research Metholody. 2017;17:134.
MorenoBetancur M, Lee KJ, Leacy FB, White IR, Simpson JA, Carlin J. Canonical causal diagrams to guide the treatment of missing data in epidemiologic studies. Am J Epidemology. 2018;187:12.
Acknowledgements
The authors would like to thank the Melbourne Missing Data group and members of the Victorian Centre for Biostatistics for providing feedback in designing and interpreting the simulation study. We would also like to thank the BIS investigator group for providing access to the casestudy data for illustrative purposes in this work.
Funding
This work was supported by the Australian National Health and Medical Research Council (Postgraduate Scholarship 1190921 to MM, career development fellowship 1127984 to KJL, and project grant 1166023). MMB is the recipient of an Australian Research Council Discovery Early Career Researcher Award (project number DE190101326) funded by the Australian Government. MM is funded by an Australian Government Research Training Program Scholarship. Research at the Murdoch Children’s Research Institute is supported by the Victorian Government’s Operational Infrastructure Support Program. The funding bodies do not have any role in the collection, analysis, interpretation or writing of the study.
Author information
Authors and Affiliations
Contributions
MM, CN, MMB, JBC and KJL conceived the project and designed the study. MM designed the simulation study and conducted the analysis, with input from coauthors, and drafted the manuscript. KJL, CN, MMB and JBC provided critical input to the manuscript. All of the coauthors read and approved the final version of this paper.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
The case study used data from the Barwon Infant Study, which has ethics approval from the Barwon Health Human Research and Ethics Committee (HREC 10/24). Participating parents provided informed consent and research methods followed national and international guidelines.
Consent for publication
Not applicable
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
12874_2021_1495_MOESM1_ESM.docx
Additional file 1.
12874_2021_1495_MOESM2_ESM.docx
Additional file 2.
12874_2021_1495_MOESM3_ESM.docx
Additional file 3.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Middleton, M., Nguyen, C., MorenoBetancur, M. et al. Evaluation of multiple imputation approaches for handling missing covariate information in a casecohort study with a binary outcome. BMC Med Res Methodol 22, 87 (2022). https://doi.org/10.1186/s12874021014954
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12874021014954
Keywords
 Multiple imputation
 Casecohort study
 Simulation study
 Missing data
 Unequal sampling probability
 Inverse probability weighting