Multiple imputation for estimating hazard ratios and predictive abilities in casecohort surveys
 Helena Marti^{1}Email author,
 Laure Carcaillon^{2} and
 Michel Chavance^{1}
DOI: 10.1186/147122881224
© Marti et al; licensee BioMed Central Ltd. 2012
Received: 29 June 2011
Accepted: 9 March 2012
Published: 9 March 2012
Abstract
Background
The weighted estimators generally used for analyzing casecohort studies are not fully efficient and naive estimates of the predictive ability of a model from casecohort data depend on the subcohort size. However, casecohort studies represent a special type of incomplete data, and methods for analyzing incomplete data should be appropriate, in particular multiple imputation (MI).
Methods
We performed simulations to validate the MI approach for estimating hazard ratios and the predictive ability of a model or of an additional variable in casecohort surveys. As an illustration, we analyzed a casecohort survey from the ThreeCity study to estimate the predictive ability of Ddimer plasma concentration on coronary heart disease (CHD) and on vascular dementia (VaD) risks.
Results
When the imputation model of the phase2 variable was correctly specified, MI estimates of hazard ratios and predictive abilities were similar to those obtained with full data. When the imputation model was misspecified, MI could provide biased estimates of hazard ratios and predictive abilities. In the ThreeCity casecohort study, elevated Ddimer levels increased the risk of VaD (hazard ratio for two consecutive tertiles = 1.69, 95%CI: 1.631.74). However, Ddimer levels did not improve the predictive ability of the model.
Conclusions
MI is a simple approach for analyzing casecohort data and provides an easy evaluation of the predictive ability of a model or of an additional variable.
Background
Casecohort surveys produce incomplete data by design. A subcohort is selected by simple or stratified random sampling, all subjects are followed up and the events of interest are recorded. The phase1 variables are observed for the entire cohort, whilethe phase2 variables are only known for the casecohort sample, i.e., subjects belonging to the subcohort and all those presenting the event of interest [1]. Thus, in casecohort studies, the noncases who do not belong to the subcohort are incompletely observed by design, enabling cost reduction with a small loss of efficiency.
Various approaches have been described to estimate the proportional hazard model in casecohort surveys: Weighted estimators [2–6] are classically used in these surveys, with analysis restricted to the completely observed subsample, so the information collected for incompletely observed noncases is ignored and inefficient estimators for the effect of phase1 variables are obtained. One of the most popular is the Borgan II estimator [4]. Scheike and Martinussen [7] proposed a maximum likelihood estimator based on proportional hazards assumption, using the EM algorithm [8], therebyincreasing efficiency as compared to weighted estimators when the relative risk and disease incidence are high. However, in general, the studied disease incidence in casecohort surveys is low. Breslow et al. [9] suggested calibrating or estimating the weights a posteriori, using all the phase1 information, to improve precision with respect to classical weighted estimators. Marti and Chavance [10] showed that multiple imputation (MI) is a good alternative to classical weighted methods for the analysis of casecohort data. When the imputation model was correct, the MI approach provided unbiased estimators of the log hazard ratios and correctly estimated the variance of its estimators. As expected, the MI approach was more precise than the usual weighted estimators for the parameters associated with phase1 variables. The former was also slightly more precise than the latter for the phase2 variable. In Marti and Chavance [10] the imputations were performed according to a correctly specified imputation model. However, in practise, the distribution of the phase2 variable is unknown and onemay wonder how MI compares to weighted estimators when the imputation model is misspecified.
No standard method exists for quantifying the usefulness or predictive ability of a model or an additional variable in the framework of casecohort surveys. The predictive ability can be measured in terms of calibration, which refers to the ability of a model to match predicted and observed values, when we are interested in individual predictions; or in terms of discrimination, which refers to the ability of a model to distinguish between subjects with or without a binary event, when we are interested in identifying a group of highrisk subjects. In the present work, we focus on discrimination.
As shown below, a naive measurement of predictive ability from casecohort data often leads to a biased estimate of the predictive ability because it varies with the censoring rate and thus depends on the subcohort size. Alternatively, because MI reconstitutes whole cohorts, any tool developed to estimate the predictive ability in the framework of cohort surveys can be applied to casecohort data, so we propose using the MI approach to estimate the predictive ability ofa model or of an additional variable and their standard errors.
The objectives of this study were 1) to evaluate MI for estimating hazard ratios when the distribution of the phase2 variable is misspecified; and 2) to present an adequate methodology for estimating the predictive ability of a model or of an additional variable in casecohort surveys. We performed a simulation study to validate the MI approach for estimating the predictive ability of a model or of an additional variable and to assess its potential limits. As an illustration, we analyzed casecohort data from the ThreeCity study [11] to estimate the predictive ability of the Ddimer plasma concentration, a marker of coagulation and fibrinolysis, on coronary heart disease (CHD) and on vascular dementia (VaD) risks.
Methods
Incomplete observations and multiple imputation
where the factor 1 + M ^{1} is an adjustment for using a finite number of imputations [13].
MI requires a model correctly reflecting the relationship between the incomplete variable and the outcome of interest. In casecohort surveys, we need to impute phase2 variable values for the noncases who do not belong to the subcohort. Under the rare disease assumption, we have shown that a simple generalized linear model, using all the complete data (cases and noncases) and including the case indicator among the explanatory variables, has to be considered [10]. Practically, in addition to the case indicator and the stratification variables, when the subcohort was selected by stratified sampling, it is necessary to include in the imputation model all the variables appearing in the proportional hazard model. Because imputations are based on asymptotic distributions, caution is necessary, since if too few subjects present the event of interest, the distribution of the estimators can differ from the asymptotic one. As a consequence, the maximum likelihood estimator of the imputation model could be biased or not normally distributed.
Predictive ability of a model and of a supplementary variable
where π _{ c } is the probability of concordance for a pair (i,j) and π _{ d } is the probability of discordance. We assume continuous survival times and continuous predicted survival probabilities, so P(X _{ i } = X _{ j } ) = P(Y _{ i } = Y _{ j } ) = 0, thus π _{ c } + π _{ d } = 1. C is estimated by the proportion of concordant pairs among the usable pairs. The estimated variance was given by Kremers [16].
In practice, we are often interested in estimating the predictive ability of an additional phase2 variable. Let M _{1} be a proportional hazard model including only phase1 variables, and C _{1} and SE _{C1} respectively the C indexof M _{1} and its standard error. Let M _{2} be a proportional hazard model adding the phase2 variable to M _{1}, and C _{2} and SE _{ C2}, respectively, the C index of M _{2} and its standard error. Harrell's predictive ability ofthe added phase2 variable is Δ = C _{2}  C _{1} . Complementary measures of predictive ability of a new variable, such as the net reclassification improvement (NRI) and the integrated discrimination index (IDI), were proposed by Pencina [17]. NRI needs some a priori meaningful risk categories. It quantifies the correct reclassification introduced by using a model with the added variable as compared to the classification obtained without this variable. The IDI can be viewed as a continuous version of the NRI with probabilities used instead of categories. It can be defined as the discriminationslope difference between the models with and without a quantitative variable. To estimate the predictive ability of a model or of an additional variable, we reconstructed plausible whole cohorts using MI. For each reconstructed whole cohort, we could then directly obtain C _{1}, SE _{ C1}, C _{2}, SE _{ C2}, Δ, NRI, IDI and their respective variances. Using equations (1) and (2), we obtained the MI estimates of these quantities. Concerning the variance of Δ, the betweenimputation component is estimated by the empirical variance of the M estimates of Δ provided by the M completed data sets. However, for the withinimputation component, the asymptotic variance of the estimator provided by a complete data set, does not have an analytical form. With a fully observed cohort, bootstrapping is a way to estimate the variance of the corresponding Δ. Therefore, each whole cohort reconstructed by MI has to be resampled. In the simulations as in the real data analysis, we used 100 bootstrap samples.
Simulation study
Two phase1 variables were simulated: a binary variable, Z _{1}, and a Gaussian variable, Z _{3}, observed for the entire cohort. For the phase2 variable, Z _{2}, we considered three different distributions: normal, lognormal and uniform, all of them with unit variance, independent of Z _{1}, but having a correlation coefficient of 0.2 with Z _{3}. The survival time had an exponential distribution, with λ = exp (β _{1} Z _{1} + β _{2} Z _{2} + β _{3} Z _{3} ). β _{1}, β _{2} and β _{3} were fixed at the same value and set at 0 or log(2). The censoring time followed a uniform distribution over the interval [0,τ], where τ was chosen so that the probability of an event was approximately 0.03 (τ= 0.025). The cohort size was 10,000. We also simulated a phase1 variable predictive of Z _{2}, ${\stackrel{\u0303}{Z}}_{2}\equiv {Z}_{2}+\epsilon $with ε ~ N (0, σ ^{2}) independent of Z _{2}. The variance σ^{2} was fixed at 1 which corresponds to a correlation between ${Z}_{2}\phantom{\rule{0.3em}{0ex}}\mathsf{\text{and}}\phantom{\rule{0.3em}{0ex}}{\stackrel{\u0303}{Z}}_{2}$ of approximately 0.7. We wanted to estimate the effect of Z _{2} on survival time and its predictive ability. The cohort was divided into 9 strata based on the tertiles of ${\stackrel{\u0303}{Z}}_{2}\phantom{\rule{0.3em}{0ex}}\mathsf{\text{and}}\phantom{\rule{0.3em}{0ex}}{Z}_{3}$, and the noncases were chosen by stratified sampling. Casecohort sampling was simulated with 1,000 subjects in each subcohort. The phase2 variable was not available for noncases not included in the subcohort, so MI was used to complete the data set. Thus, we built the same linear prediction model for each Z _{2} based on the stratification phase1 variable and the case indicator. Z _{3} was not directly included in the imputation model to predict Z _{2}, because it was a stratification variable included in the model and because of the weak correlation between Z _{2} and Z _{3}. The imputation model was: Z _{2} = α _{0} + α _{1} I _{ case } + α _{2} Strata + ε , where α _{0} and α _{1} are scalar, α _{2} is a vector coefficient, I _{ case } is the case indicator, Strata is the vector of stratum indicators and є is the vector of errors independently and identically distributed ~ N (0, σ). Thus, the imputation model was correctly specified for Z _{2} normally distributed but misspecified for Z _{2} lognormally or uniformly distributed. One thousand cohorts were simulated for each scenario.
Five imputations were performed and 5 complete data sets were generated for each cohort. We estimated the log hazard ratios using MI and the "Borgan II" weighted estimator [4]. We used MI to estimate the predictive ability of models with and without the phase2 variable, and the predictive ability of the phase2 variable, NRI and IDI. We also studied the consistency of the naive estimator of Harrell's C index in casecohort surveys by varying the subcohort size. Using the above simulation conditions and, exceptionally, a scenario with β _{ 1 } = β _{ 3 } = log(2) and β _{ 2 } = log(1.5), we simulated casecohort samples with the subcohort size set at 300 or 1,000 subjects. We estimated the predictive ability in the casecohort samples and in the multiply imputed data sets.
Casecohort survey from ThreeCity study
Briefly, the 3CStudy was designed to examine the relationship between vascular diseases and dementia in a community housing 9,294 persons aged 65 years and over between 1999 and 2001 in three French cities. The detailed methodology has been previously described [11]. A casecohort substudy was conducted [18], to investigate the relationship between biomarkers, such as plasma levels of Ddimer (a marker of coagulation and fibrinolysis) and the 4year incidence of coronary heart disease (CHD), stroke and all subtypes of dementia, including vascular dementia (VaD), in an elderly population. The phase1 variables provided information on sociodemographic characteristics, education, medical history, diet, alcohol and tobacco use. Blood pressure, height and weight were also available. A subcohort of size n = 1,254, (13.5% of the full cohort) was randomly selected, stratifying on age, sex and recruitment center. Observed cumulated incidences of CHD and VaD were approximately 2% and 0.6%, respectively. Plasma Ddimer levels were only available for phase2 subjects. Carcaillon et al. [18] treated quintiles of Ddimer level both qualitatively and linearly. They reported a linear increase in the risk of VaD according to Ddimer quintiles.
We reassessed the relationship between plasma Ddimer levels and the risk of CHD and VaD, using MI and weighted estimators, and evaluated the predictive ability of Ddimer levels on both risks. We included the same explanatory variables as Carcaillon et al. [18] although we used tertiles of Ddimer rather than quintiles, to estimate CHD and VaD risks, due to the small number of events. Therefore, to estimate the risk of CHD, the proportional hazard model includedthe phase1 variables: age, sex, center, body mass index, hypertension, hypercholesterolemia, diabetes, tobacco use, diabetes drugs, and as phase2 variables, indicators of Ddimer tertiles. To estimate the risk of VaD, the proportional hazard model included the phase1 variables: age, sex, centre, educational level, body mass index, the presence or absence of an apolipoprotein є4 allele and indicators of Ddimer tertiles.
For each outcome (CHD or VaD), it was necessary to reproduce the relationships among the incomplete variable, the outcomes and the confounder variables. For each outcome, we built an imputation model of tertiles of Ddimer levels, including the variables used in the proportional hazard model and the caseindicator. We estimated the predictive ability of proportional hazard models, without (C _{1}) and with (C _{2}) Ddimer levels, Δ = C2  C1, and IDI for CHD and VaD risks. The NRI requires that some a priori meaningful risk categories be known. Based on the Third Adult Treatment Panel [ATP III] [19] risk classification for the 10year risk of CHD, we adapted the cutoffs to 4year risk. For VaD, we do not know a priori meaningful risk categories and did not compute NRI.
Results
Simulation study
Mean of the log hazard ratio estimates (Est), mean of the standard error estimates $\stackrel{\wedge}{SE}$, standard error of the estimates (SE) and mean of the mean square error (MSE). Results of 1,000 simulations.
Full cohort  Multiple imputation ^{ a }  Weighted estimator  

Est  $\stackrel{\wedge}{SE}$  SE  MSE  Est  $\stackrel{\wedge}{SE}$  SE  MSE  Est  $\stackrel{\wedge}{SE}$  SE  MSE  
Z_{2} normally distributed  
β _{1} = β _{2} = β _{3} = 0  
β _{1}  0.003  0.107  0.100  0.010  0.003  0.107  0.110  0.010  0.001  0.133  0.128  0.016 
β _{2}  0.001  0.054  0.058  0.003  0.001  0.060  0.062  0.004  0.001  0.065  0.068  0.005 
β3  0.004  0.053  0.056  0.003  0.004  0.054  0.057  0.003  0.003  0.058  0.060  0.004 
β _{1} = β _{2} = β _{3} = log(2)  
β _{1}  0.689  0.118  0.113  0.013  0.676  0.119  0.112  0.013  0.696  0.168  0.165  0.027 
β _{2}  0.687  0.058  0.057  0.003  0.679  0.070  0.068  0.005  0.701  0.088  0.097  0.009 
β _{3}  0.683  0.057  0.057  0.003  0.679  0.058  0.058  0.004  0.689  0.080  0.090  0.007 
Z_{2} log normally distributed  
β _{1} = β _{2} = β _{3} = 0  
β _{1}  0.003  0.107  0.100  0.010  0.003  0.107  0.100  0.010  0.004  0.133  0.128  0.016 
β2  0.001  0.027  0.034  0.001  0.015  0.031  0.032  0.001  0.002  0.034  0.038  0.001 
β _{3}  0.004  0.053  0.056  0.003  0.004  0.054  0.058  0.004  0.005  0.059  0.062  0.004 
β _{1} = β _{2} = β _{3} = log(2)  
β _{1}  0.686  0.058  0.056  0.003  0.621  0.061  0.055  0.008  0.686  0.112  0.117  0.014 
β _{2}  0.692  0.013  0.015  2e0^{4 }  0.602  0.015  0.014  0.008  0.695  0.020  0.023  0.001 
β _{3}  0.685  0.029  0.031  0.001  0.686  0.032  0.031  0.001  0.687  0.049  0.053  0.003 
Z_{2} uniformly distributed  
β _{1} = β _{2} = β _{3} = 0  
β _{1}  0.007  0.181  0.175  0.031  0.007  0.181  0.175  0.031  0.007  0.197  0.188  0.035 
β _{2}  0.001  0.092  0.087  0.008  0.004  0.094  0.088  0.008  0.002  0.098  0.095  0.009 
β _{3}  0.003  0.090  0.090  0.008  0.002  0.090  0.090  0.008  0.004  0.093  0.093  0.009 
β _{1} = β _{2} = β _{3} = log(2)  
β _{1}  0.690  0.120  0.116  0.013  0.680  0.121  0.115  0.013  0.694  0.166  0.169  0.028 
β _{2}  0.695  0.069  0.063  0.004  0.656  0.075  0.066  0.006  0.698  0.087  0.082  0.007 
β _{3}  0.690  0.058  0.054  0.003  0.689  0.059  0.055  0.003  0.698  0.081  0.081  0.007 
Mean of the predictive ability estimates (Est), mean of the standard error estimates $\stackrel{\wedge}{SE}$ and standard error of the estimates (SE).
β _{ 1 }= β _{ 2 }= β _{ 3 } =0  β _{ 1 }= β _{ 3 }= log(2), β _{ 2 }= log(1.5)  β1 = β2 = β3 = log(2)  

Est  $\stackrel{\wedge}{SE}$  SE  % H _{0} rejected  Est  $\stackrel{\wedge}{SE}$  SE  % H _{0} rejected  Est  $\stackrel{\wedge}{SE}$  SE  % H _{0} rejected  
Full Cohort  
C _{1}  0.518  0.033  0.012  0.727  0.032  0.015  0.733  0.029  0.014  
C _{2}  0.524  0.033  0.013  0.747  0.031  0.015  0.733  0.029  0.014  
$\underset{}{\Delta}$  0.006  0.010  0.009  3.7  0.020  0.007  0.007  91.6  0.049  0.010  0.010  100 
NRI  0.007  0.017  0.019  4.8  0.071  0.030  0.033  52.5  0.167  0.034  0.035  99.9 
IDI  2e^{4}  2e^{4}  3e^{4}  6.0  0.014  0.003  0.005  99.9  0.048  0.006  0.009  99.9 
MI1000  
C _{1}  0.518  0.033  0.012  0.724  0.032  0.016  0.733  0.029  0.014  
C _{2}  0.526  0.033  0.013  0.745  0.031  0.016  0.783  0.027  0.014  
Δ  0.008  0.012  0.010  3.4  0.021  0.008  0.008  90.6  0.049  0.010  0.011  100 
NRI  0.009  0.019  0.017  1.5  0.076  0.033  0.033  64.8  0.172  0.037  0.036  100 
IDI  3e^{4}  3e^{4}  4e^{4}  3.5  0.014  0.004  0.005  99.0  0.045  0.008  0.010  100 
MI300  
C _{1}  0.518  0.033  0.012  0.724  0.032  0.016  0.733  0.029  0.014  
C _{2}  0.528  0.033  0.012  0.745  0.031  0.017  0.783  0.027  0.015  
Δ  0.010  0.014  0.011  3.0  0.021  0.008  0.009  84.6  0.050  0.011  0.012  100 
NRI  0.013  0.023  0.018  1.3  0.076  0.035  0.035  57.0  0.172  0.039  0.039  99.7 
IDI  4e^{4}  4e^{4}  5e^{4}  1.8  0.014  0.005  0.006  87.5  0.046  0.010  0.012  100 
CC1000  
C _{1}  0.528  0.032  0.013  0.667  0.033  0.015  0.670  0.031  0.014  
C _{2}  0.534  0.033  0.015  0.709  0.032  0.022  0.737  0.029  0.014  
Δ  0.006  0.010  0.010  4.7  0.043  0.011  0.017  100  0.067  0.012  0.012  100 
NRI  0.017  0.031  0.033  6.7  0.147  0.039  0.043  96.7  0.261  0.041  0.043  100 
IDI  0.002  0.001  0.003  15.2  0.058  0.009  0.014  100  0.114  0.011  0.017  100 
CC300  
C _{1}  0.523  0.034  0.013  0.620  0.037  0.016  0.620  0.034  0.015  
C _{2}  0.529  0.034  0.015  0.647  0.036  0.016  0.668  0.032  0.015  
Δ  0.006  0.010  0.009  3.6  0.027  0.011  0.011  83.3  0.048  0.013  0.013  99.8 
NRI  0.019  0.039  0.043  6.2  0.154  0.043  0.050  94.4  0.257  0.046  0.051  99.9 
IDI  0.002  0.001  0.003  13.9  0.040  0.008  0.014  99.8  0.078  0.010  0.017  100 
Predictive ability of the two models and of the phase2 variable.
Full cohort  Multiple imputation  

Est  $\stackrel{\wedge}{SE}$  SE  % H _{0} rejected  Est  $\stackrel{\wedge}{SE}$  SE  % H _{0} rejected  
Z_{2} normally distributed  
β _{1} = β _{2} = β _{3} = 0  
C _{1}  0.518  0.033  0.012  0.518  0.033  0.012  
C _{2}  0.524  0.033  0.013  0.526  0.033  0.013  
Δ  0.006  0.010  0.010  3.7  0.008  0.012  0.010  3.4 
β _{1} = β _{2} = β _{3} = log(2)  
C _{1}  0.733  0.029  0.014  0.733  0.029  0.014  
C _{2}  0.783  0.027  0.013  0.783  0.027  0.014  
Δ  0.049  0.010  0.010  100  0.049  0.010  0.011  100 
Z_{2} normally distributed  
β _{1} = β _{2} = β _{3} = 0  
C _{1}  0.518  0.033  0.012  0.518  0.033  0.012  
C _{2}  0.524  0.033  0.013  0.520  0.031  0.016  
Δ  0.006  0.010  0.009  5.5  0.002  0.013  0.012  4.2 
β _{1} = β _{2} = β _{3} = log(2)  
C _{1}  0.784  0.013  0.006  0.784  0.013  0.006  
C _{2}  0.881  0.011  0.006  0.866  0.011  0.006  
Δ  0.097  0.005  0.005  100  0.082  0.005  0.004  100 
Z_{2} uniformly distributed  
β _{1} = β _{2} = β _{3} = 0  
C _{1}  0.532  0.055  0.019  0.532  0.055  0.019  
C _{2}  0.540  0.055  0.019  0.541  0.055  0.020  
Δ  0.008  0.015  0.013  2.2  0.009  0.017  0.013  4.0 
β _{1} = β _{2} = β _{3} = log(2)  
C _{1}  0.733  0.029  0.014  0.733  0.029  0.014  
C _{2}  0.781  0.027  0.012  0.785  0.027  0.012  
Δ  0.048  0.009  0.009  100  0.052  0.010  0.010  100 
Mean of the predictive ability estimates (Est), mean of the standard error estimates $\stackrel{\wedge}{SE}$ and standard error of the estimates (SE), with a correctly specified normal imputation model (Z _{2} normally distributed), and with two misspecified normal imputation models (Z _{2} lognormally and uniformly distributed)
Application to the ThreeCity study
Estimates of hazard ratios (HR) and 95% confidence interval (CI) associated with Ddimer tertiles.
Multiple imputation estimates  Weighted estimates  

HR (95% CI)  HR (95% CI)  
Risk of CHD and DDimer^{a}  
T1  1.00 (reference)  1.00 (reference) 
T2  1.42 (0.992.04)  1.40 (0.972.04) 
T3  1.32 (0.891.97)  1.30 (0.841.99) 
Linear trend  1.14 (0.941.38)  1.13 (0.921.38) 
Risk of VaD and DDimer^{b}  
T1  1.00 (reference)  1.00 (reference) 
T2  1.57 (0.633.93)  1.60 (0.634.09) 
T3  2.77 (1.176.57)  2.93 (1.227.06) 
Linear trend  1.69 (1.132.53)  1.74 (1.132.67) 
Predictive ability and 95% confidence interval (CI) of DDimer tertiles on cardiovascular heart disease (CHD) and vascular dementia (VaD) risks.
CHD  VaD  

Estimate  95% CI  Estimate  95% CI  
C _{1}  0.693  (0.6220.764)  0.865  (0.7870.943) 
C _{2}  0.694  (0.6210.767)  0.874  (0.7980.950) 
Δ  0.002  (0.0040.008)  0.009  (0.0110.029) 
NRI  0.009  (0.0490.066)     
IDI  0.001  (0.0010.003)  0.0004  (0.00020.0010) 
Discussion
Use of a consistent estimator does not guarantee the absence of any bias for finite sample. We only showed that MI analysis of casecohort data provides unbiased estimates of the loghazard ratio when the imputation model and the proportional hazard model are correctly specified. The misspecification of the imputation model can originate from an erroneous choice of the distribution, or from wrongly assuming that the estimator of the imputation model is consistent and normal, or from the omission of some important explanatory variable. Imputations carried out using a misspecified distribution in the imputation model can provide biased estimates of hazard ratios, especially, if the specified distribution of the phase2 variable differs from the true one in terms of symmetry (lognormal versus normal distribution). The negative bias on a log hazard ratio of 0.69 was noticeable but not large when a lognormal variable was imputed according to a normal distribution (0.09 or 13%), but it is clearly a type of misspecification easily identified with diagnostic tools [20]. One can then transform the incomplete variable in order to obtain a symmetrical distribution, impute transformed values and apply the inverse transformation to the imputed values. Note that although a normal and a uniform distribution are quite different, both are symmetrical and the observed bias was quite smaller (only 5%). In the 3C study of the relationship between VaD and Ddimer, we observed slightly different estimates of the log hazard ratio when comparing the third to the first tertile (2.77 versus 2.93, i.e. a relative difference of 8% between the MI and the weighted estimates). This is probably because of the qualitative imputation of Ddimer, and thus, the use of a multinomial imputation model, which implied estimation of parameters in separate strata defined by Ddimer concentration tertiles, some of which had a small number of events. Due to these small numbers (only 51 VaD in total), asymptotic conditions might not have been fulfilled in at least some strata, and the estimated coefficients of the imputation model could have been biased and notnormally distributed. We give below some recommendations regarding the choice of explanatory variables in the imputation model. Since the potential bias of MI estimates can be detected by comparing them to weighted estimates, we suggest building the proportional hazard model by using only the casecohort data and weighted estimators. MI can eventually be used to reanalyze the data with the selected model to improve the precision of the results, while verifying that no bias was introduced.
In simulated data, for the phase1 variables, the precision of MI and full cohort estimates was similar and smaller than with the weighted estimator. For the phase2 variable, MI estimates were slightly more precise than weighted estimates. Globally, the mean squared errors were smaller with MI than with the weighted estimator, with one exception implying a normal imputation model for a lognormally distributed phase2 variable, an error which should easily be avoided.
There is no standard method for estimating the predictive ability of a model in the framework of casecohort surveys. We showed that the naive application of the C index to casecohort surveys yielded an underestimation of the predictive ability of the model that depended on the subcohort size when the phase2 variable had an effect on the risk. Similarly, the naive estimates of the predictive ability of an added phase2 variable differed notably from the full cohort values when the effect of the phase2 variable was not null. Harrell's C index could theoretically be estimated with a weighted approach, but this can be computationally difficult because it requires weighting each pair by the pairwise sampling probabilities, i.e., using a square matrix of size N'(N'1), where N' is the size of the casecohort sample. Computing the variance of this HorvitzThompson estimator requires either weighting each quadruplet by the quadruplewise sampling probabilities, i.e., working with a matrix of size N'(N'1)(N'2)(N'3), or bootstrapping the casecohort data. By contrast, MI easily allows estimation of the predictive ability of a model or of an additional phase2 variable and their variances in the context of casecohort data, only requiring bootstrapping to estimate the variance of the predictive ability of the phase2 variable. MI provided estimates of Harrell C, NRI and IDI indexes similar to those obtained with the full cohort analysis. Note, however, that the predictive abilities were always overestimated because the same data were used to estimate the model and its predictive ability.
Analysis of the ThreeCity casecohort study was in agreement with our previous work [10]. The weighted and the MI approaches yielded similar estimates of the hazard ratios and MI was slightly more precise, particularly for phase1 variables. The relative differences between both estimates was always below 2% for the hazard ratios related to CHD and Ddimer, but as early discussed, they could be slightly higher (8%) for a hazard ratio related to VaD and Ddimer. The precision was similar for both analyses.
The imputation model must reflect the association between the incomplete variable, the outcome and the other explanatory variables. Therefore, variables included in the proportional hazard model as well as the stratification variables must be included in the imputation model. If a surrogate of the phase2 variable is available, it should also be included in the imputation model. On the other hand, multiple imputation approach can provide unbiased and more efficient estimates than weighted analysis even when no strong predictor of the phase2 variable is available [10]. The inclusion of additional variables other than strongly predictive variables can lead to an increased interimputation variance. This prompted the use of different imputation models for Ddimer levels in the CHD and VaD analyses. However, we verified that adding the variables only used in the CHD analysis to the model used for VaD, did not modify the results observed in the former (data not shown).
The number of requested imputations depends on the proportion of missing information which, in casecohort studies, is considerably smaller than the percentage of incompletely observed subjects. Rubin showed that with as much as 40 per cent information missing, M = 5 imputations provides an asymptotic relative efficiency was 0.97, and, with 50 per cent missing information, M = 10 provides an asymptotic relative efficiency of 0.98. Thus, a small number of imputations, 510, should suffice [21]. In our analyses, we used 5 imputations to limit the computer time of the simulations, a reasonable choice since the proportion of missing information was always smaller than 30 per cent. However, a slightly larger number of imputations (e.g. 10) could have been performed on the 3C study data at a reasonable time cost; it would have provided a more precise estimate of the between imputation variance and of the percentage of missing information.
The VaD risk increased with Ddimer tertiles. However, Ddimer inclusion did not significantly improve the predictive ability of the model for VaD risk. Computations of the C and IDI index yielded the same conclusion. To our knowledge, no other results concerning the predictive ability of Ddimer on the risk of VaD have been published to date. The risk of CHD did not vary with Ddimer, so, not surprisingly, the predictive ability of this variable was negligible, regardless of the index used. Wang et al. [22] and Tzoulaki [23] reported that the use of 10 and 4 biomarkers respectively added only moderately to the overall risk prediction based on conventional cardiovascular risk factors.
Conclusions
MI is a simple alternative approach to weighted analysis for analyzing casecohort surveys, obtaining correct estimates of the log hazard ratios and their standard errors, improving precision for the phase1 variable estimates, and providing at least the same precision as weighted estimators for phase2 variable estimates. It allows an easy evaluation of the predictive ability of the model and, more generally, any tool proposed in the framework of cohort studies can be applied to casecohort data using MI.
Abbreviations
 MI:

Multiple imputation
 CHD:

Coronary heart disease
 VaD:

Vascular dementia
 NRI:

Net reclassification index
 IDI:

Integrated discrimination index.
Declarations
Acknowledgements
This study was supported by a grant from the Région ÎledeFrance. It used data from the ThreeCity study which is conducted under an agreement between the Institut National de la Santé et de la Recherche Médicale and the Université Victor SegalenBordeaux 2. This manuscript was not prepared in collaboration with the 3C study Steering Committee and does not necessarily reflect its opinions or views.
Authors’ Affiliations
References
 Prentice R: A casecohort design for epidemiologic cohort studies and disease prevention trials. Biometrika. 1986, 73: 111. 10.1093/biomet/73.1.1.View ArticleGoogle Scholar
 Chen K, Lo SH: Casecohort and casecontrol analysis with cox's model. Biometrika. 1999, 86 (4): 755764. 10.1093/biomet/86.4.755.View ArticleGoogle Scholar
 Therneau TM, Li H: Computing the cox model for case cohort designs. Lifetime Data Anal. 1999, 5 (2): 99112. 10.1023/A:1009691327335.View ArticlePubMedGoogle Scholar
 Borgan O, Langholz B, Samuelsen SO, Goldstein L, Pogoda J: Exposure stratified casecohort designs. Lifetime Data Anal. 2000, 6: 3958. 10.1023/A:1009661900674.View ArticlePubMedGoogle Scholar
 Kulich M, Lin D: Improving the efficiency of relativerisk estimation in caseCohort studies. J Am Stat Assoc. 2004, 99: 832844. 10.1198/016214504000000584.View ArticleGoogle Scholar
 Langholz B, Jiao J: Computational methods for casecohort studies. Comput Stat Data Anal. 2007, 51 (8): 37373748. 10.1016/j.csda.2006.12.028.View ArticleGoogle Scholar
 Scheike TH, Martinussen T: Maximum likelihood estimation for Cox's regression model under caseCohort sampling. Scand Stat Theory Appl. 2004, 31 (2): 283293.View ArticleGoogle Scholar
 Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Series B Stat Methodol. 1977, 39: 138.Google Scholar
 Breslow N, Lumley BCCLT, Kulich M: Using the whole cohort in the analysis of casecohort data. Am J Epidemiol. 2009, 169 (11): 13981405. 10.1093/aje/kwp055. [http://dx.doi.org/10.1093/aje/kwp055]View ArticlePubMedPubMed CentralGoogle Scholar
 Marti H, Chavance M: Multiple imputation analysis of casecohort studies. Stat Med. 2011, 30 (13): 15951607. 10.1002/sim.4130.View ArticlePubMedGoogle Scholar
 Alperovitch A, 3C Study Grp: Vascular factors and risk of dementia: Design of the threecity study and baseline characteristics of the study population. Neuroepidemiology. 2003, 22 (6): 316325.View ArticleGoogle Scholar
 Little R, Rubin D: Statistical analysis with missing data. 1987, New York: WileyGoogle Scholar
 Rubin DB, Schenker N: Multiple imputation in healthcare databases: an overview and some applications. Stat Med. 1991, 10 (4): 585598. 10.1002/sim.4780100410.View ArticlePubMedGoogle Scholar
 Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA: Evaluating the yield of medical tests. J Am Med Assoc. 1982, 247 (18): 25432546. 10.1001/jama.1982.03320430047030.View ArticleGoogle Scholar
 Harrell F, Lee K, Mark D: Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med. 1996, 15 (4): 361387. 10.1002/(SICI)10970258(19960229)15:4<361::AIDSIM168>3.0.CO;24.View ArticlePubMedGoogle Scholar
 Kremers WK: Concordance for survival time data: fixed and timedependent covariates and possible ties in predictor and time. Tech rep Mayo Fundation. 2007Google Scholar
 Pencina M, D'Agostino R, D'Agostino R, Vasan R: Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Stat Med. 2008, 27 (2): 157172. 10.1002/sim.2929.View ArticlePubMedGoogle Scholar
 Carcaillon L, Gaussem P, Ducimetiere P, Giroud M, Ritchie K, Dartigues JF, Scarabin PY: Elevated plasma fibrin Ddimer as a risk factor for vascular dementia: the threecity cohort study. J Thromb Haemost. 2009, 7 (12): 19721978. 10.1111/j.15387836.2009.03603.x.View ArticlePubMedGoogle Scholar
 Executive Summary of The Third Report of The National Cholesterol Education Program (NCEP) Expert Panel on Detection, Evaluation, And Treatment of High Blood Cholesterol In Adults (Adult Treatment Panel III). J Am Med Assoc. 2001, 285 (19): 24862497. 10.1001/jama.285.19.2486.
 Goodnessoffit techniques. Edited by: D'Agostino RB, Stephens MA. 1986, New York: Marcel Dekker; IncGoogle Scholar
 Rubin DB: Multiple imputation for nonresponse in surveys. 1987, New York: WileyView ArticleGoogle Scholar
 Wang TJ, Gona P, Larson MG, Tofler GH, Levy D, NewtonCheh C, Jacques PF, Rifai N, Selhub J, Robins SJ, Benjamin EJ, D'Agostino RB, Vasan RS: Multiple biomarkers for the prediction of first major cardiovascular events and death. N Engl J Med. 2006, 355 (25): 26312639. 10.1056/NEJMoa055373.View ArticlePubMedGoogle Scholar
 Tzoulaki I, Murray GD, Lee AJ, Rumley A, Lowe GDO, Fowkes FGR: Relative value of inflammatory, hemostatic, and rheological factors for incident myocardial infarction and stroke  The Edinburgh artery study. Circulation. 2007, 115 (16): 21192127. 10.1161/CIRCULATIONAHA.106.635029.View ArticlePubMedGoogle Scholar
 The prepublication history for this paper can be accessed here:http://www.biomedcentral.com/14712288/12/24/prepub
Prepublication history
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.