### Incomplete observations and multiple imputation

Case-cohort surveys are a particular type of incomplete observations, in which data are missing at random [12] by design, as the probability of being completely observed depends only on the case status, with simple random sampling, and on some phase-1 variables with stratified sampling. MI is a simple and efficient method for analyzing incomplete observations, while taking into account all the levels of uncertainty regarding missing values. This provides an approximation of the maximumlikelihood estimator and thus enables the potential selection bias to be corrected. This method relies on the generation of several plausibly completed data sets (*M* ≥ 2), accounting for all levels of uncertainty concerning the missing values. A prediction model must be built, taking into consideration the relationships between the incomplete variable and the other variables, as observed in the complete part of the data. The missing data are not replaced by their expectation but by a value drawnfrom the distribution posited by the model. To take into account the uncertainty concerning the parameters of the imputation model, several imputations are performed with parameters drawn from the asymptotic distribution of their estimator. An estimate of the parameter of interest, {\widehat{\theta}}_{m},m=\left\{1,\dots ,M\right\}, and an estimate of the variance of the estimator, \widehat{V}\left({\widehat{\theta}}_{m}\right), are obtained from each completed data set. If the imputation model is correct, these estimators are not biased. The MI estimate, also unbiased, is the mean of the *M* estimates:

{\widehat{\theta}}_{MI}=\frac{1}{M}\sum _{m=1}^{M}{\widehat{\theta}}_{m}

(1)

where *M* is the number of completed data sets and {\widehat{\theta}}_{m}\mathsf{\text{,}}\phantom{\rule{0.3em}{0ex}}m\mathsf{\text{={1,}}\dots \mathsf{\text{,}}M\mathsf{\text{}}} is the estimate of the parameter of interest provided by the *m*
^{th} completed data set. The multiplicity of imputations enables correct estimation of the variance of this single estimator, which is the sum of 2 components: the within-imputations component, *W*
_{
MI
} , and the between-imputations component, *B*
_{
MI
} :

\begin{array}{cc}\hfill \widehat{V}\left({\widehat{\theta}}_{MI}\right)& ={\u0174}_{MI}+{\widehat{B}}_{MI}\hfill \\ =\frac{1}{M}\sum _{m=1}^{M}\widehat{V}\left({\widehat{\theta}}_{m}\right)+\left(1+{M}^{-1}\right)\frac{{\sum}_{m=1}^{M}\left({\widehat{\theta}}_{m}-{\widehat{\theta}}_{MI}\right){\left({\widehat{\theta}}_{m}-{\widehat{\theta}}_{MI}\right)}^{\prime}}{M-1}\hfill \end{array}

(2)

where the factor 1 + *M*
^{-1} is an adjustment for using a finite number of imputations [13].

MI requires a model correctly reflecting the relationship between the incomplete variable and the outcome of interest. In case-cohort surveys, we need to impute phase-2 variable values for the non-cases who do not belong to the subcohort. Under the rare disease assumption, we have shown that a simple generalized linear model, using all the complete data (cases and non-cases) and including the case indicator among the explanatory variables, has to be considered [10]. Practically, in addition to the case indicator and the stratification variables, when the subcohort was selected by stratified sampling, it is necessary to include in the imputation model all the variables appearing in the proportional hazard model. Because imputations are based on asymptotic distributions, caution is necessary, since if too few subjects present the event of interest, the distribution of the estimators can differ from the asymptotic one. As a consequence, the maximum likelihood estimator of the imputation model could be biased or not normally distributed.

### Predictive ability of a model and of a supplementary variable

Harrell *et al.* [14] proposed the *C* index to measure the predictive ability of a model in cohort studies as the agreement between the order of the predicted and observed survival times in any pair of subjects (the event of interest is assumed to be death, leading to the use of survival terminology). That is, the concordance probability using all pairs of subjects in the population. However, with censored data, it is not possible to consider all the pairs of subjects because survival time is not observed for censored subjects. Let *T*
_{
i
} be the survival time for subject *i*, *i* = 1,...,*N*, where *N* is the cohort size, and *C*
_{
i
} the censoring time for subject *i*. We observe *X*
_{
i
} = *min*(*T*
_{
i
} , *C*
_{
i
} ). Usable pairs are those for which the order of the predicted survival times can be compared to the order of the true survival times, i.e., pairs formed by 2 uncensored subjects or an uncensored subject and a subject censored after the uncensored subject's death. A pair of censored subjects carries no information about its agreement with the expected survival provided by the model since the order of the survival times is not known. Similarly a pair formed by a subject whose survival time is observed and a subject censored before this survival time provides no information on this agreement since the unknown survival time could be anterior or posterior to the observed one. Harrell *et al.* [15] showed that, in the common models used for survival analysis, such as the proportional hazard model, the predicted survival times and the predicted survival probabilities at a fixed time *t* can be interchanged for the comparison. The Harrell's *C* index is defined as:

C=\frac{{\pi}_{c}}{{\pi}_{c}+{\pi}_{d}}

(3)

where π _{
c
} is the probability of concordance for a pair *(i,j)* and π _{
d
} is the probability of discordance. We assume continuous survival times and continuous predicted survival probabilities, so *P*(*X*
_{
i
} = *X*
_{
j
} ) = *P*(*Y*
_{
i
} = *Y*
_{
j
} ) = 0, thus *π*
_{
c
} + *π*
_{
d
} = 1. *C* is estimated by the proportion of concordant pairs among the usable pairs. The estimated variance was given by Kremers [16].

In practice, we are often interested in estimating the predictive ability of an additional phase-2 variable. Let *M*
_{1} be a proportional hazard model including only phase-1 variables, and *C*
_{1} and *SE*
_{C1} respectively the *C* indexof *M*
_{1} and its standard error. Let *M*
_{2} be a proportional hazard model adding the phase-2 variable to *M*
_{1}, and *C*
_{2} and *SE*
_{
C2}, respectively, the *C* index of *M*
_{2} and its standard error. Harrell's predictive ability ofthe added phase-2 variable is Δ = *C*
_{2} - *C*
_{1} . Complementary measures of predictive ability of a new variable, such as the net reclassification improvement (NRI) and the integrated discrimination index (IDI), were proposed by Pencina [17]. NRI needs some a priori meaningful risk categories. It quantifies the correct reclassification introduced by using a model with the added variable as compared to the classification obtained without this variable. The IDI can be viewed as a continuous version of the NRI with probabilities used instead of categories. It can be defined as the discrimination-slope difference between the models with and without a quantitative variable. To estimate the predictive ability of a model or of an additional variable, we reconstructed plausible whole cohorts using MI. For each reconstructed whole cohort, we could then directly obtain *C*
_{1}, *SE*
_{
C1}, *C*
_{2}, *SE*
_{
C2}, Δ, NRI, IDI and their respective variances. Using equations (1) and (2), we obtained the MI estimates of these quantities. Concerning the variance of Δ, the between-imputation component is estimated by the empirical variance of the *M* estimates of Δ provided by the *M* completed data sets. However, for the within-imputation component, the asymptotic variance of the estimator provided by a complete data set, does not have an analytical form. With a fully observed cohort, bootstrapping is a way to estimate the variance of the corresponding Δ. Therefore, each whole cohort reconstructed by MI has to be resampled. In the simulations as in the real data analysis, we used 100 bootstrap samples.

### Simulation study

Two phase-1 variables were simulated: a binary variable, *Z*
_{1}, and a Gaussian variable, *Z*
_{3}, observed for the entire cohort. For the phase-2 variable, *Z*
_{2}, we considered three different distributions: normal, log-normal and uniform, all of them with unit variance, independent of *Z*
_{1}, but having a correlation coefficient of 0.2 with *Z*
_{3}. The survival time had an exponential distribution, with *λ* = *exp* (*β*
_{1}
*Z*
_{1} + *β*
_{2}
*Z*
_{2} + *β*
_{3}
*Z*
_{3} ). *β*
_{1}, *β*
_{2} and *β*
_{3} were fixed at the same value and set at 0 or log(2). The censoring time followed a uniform distribution over the interval [0,τ], where *τ* was chosen so that the probability of an event was approximately 0.03 (*τ*= 0.025). The cohort size was 10,000. We also simulated a phase-1 variable predictive of *Z*
_{2}, {\stackrel{\u0303}{Z}}_{2}\equiv {Z}_{2}+\epsilonwith *ε ~ N* (0, *σ*
^{2}) independent of *Z*
_{2}. The variance σ^{2} was fixed at 1 which corresponds to a correlation between {Z}_{2}\phantom{\rule{0.3em}{0ex}}\mathsf{\text{and}}\phantom{\rule{0.3em}{0ex}}{\stackrel{\u0303}{Z}}_{2} of approximately 0.7. We wanted to estimate the effect of *Z*
_{2} on survival time and its predictive ability. The cohort was divided into 9 strata based on the tertiles of {\stackrel{\u0303}{Z}}_{2}\phantom{\rule{0.3em}{0ex}}\mathsf{\text{and}}\phantom{\rule{0.3em}{0ex}}{Z}_{3}, and the non-cases were chosen by stratified sampling. Case-cohort sampling was simulated with 1,000 subjects in each subcohort. The phase-2 variable was not available for non-cases not included in the subcohort, so MI was used to complete the data set. Thus, we built the same linear prediction model for each *Z*
_{2} based on the stratification phase-1 variable and the case indicator. *Z*
_{3} was not directly included in the imputation model to predict *Z*
_{2}, because it was a stratification variable included in the model and because of the weak correlation between *Z*
_{2} and *Z*
_{3}. The imputation model was: *Z*
_{2} = *α*
_{0} + *α*
_{1} *I*
_{
case
} + *α*
_{2} *Strata* + *ε* , where *α*
_{0} and *α*
_{1} are scalar, *α*
_{2} is a vector coefficient, *I*
_{
case
} is the case indicator, *Strata* is the vector of stratum indicators and *є* is the vector of errors independently and identically distributed ~ *N* (0, *σ*). Thus, the imputation model was correctly specified for *Z*
_{2} normally distributed but misspecified for *Z*
_{2} log-normally or uniformly distributed. One thousand cohorts were simulated for each scenario.

Five imputations were performed and 5 complete data sets were generated for each cohort. We estimated the log hazard ratios using MI and the "Borgan II" weighted estimator [4]. We used MI to estimate the predictive ability of models with and without the phase-2 variable, and the predictive ability of the phase-2 variable, NRI and IDI. We also studied the consistency of the naive estimator of Harrell's C index in case-cohort surveys by varying the subcohort size. Using the above simulation conditions and, exceptionally, a scenario with *β*
_{
1
} = *β*
_{
3
} = *log*(2) and *β*
_{
2
} = *log*(1.5), we simulated case-cohort samples with the subcohort size set at 300 or 1,000 subjects. We estimated the predictive ability in the case-cohort samples and in the multiply imputed data sets.

### Case-cohort survey from Three-City study

Briefly, the 3C-Study was designed to examine the relationship between vascular diseases and dementia in a community housing 9,294 persons aged 65 years and over between 1999 and 2001 in three French cities. The detailed methodology has been previously described [11]. A case-cohort substudy was conducted [18], to investigate the relationship between biomarkers, such as plasma levels of D-dimer (a marker of coagulation and fibrinolysis) and the 4-year incidence of coronary heart disease (CHD), stroke and all subtypes of dementia, including vascular dementia (VaD), in an elderly population. The phase-1 variables provided information on socio-demographic characteristics, education, medical history, diet, alcohol and tobacco use. Blood pressure, height and weight were also available. A subcohort of size n = 1,254, (13.5% of the full cohort) was randomly selected, stratifying on age, sex and recruitment center. Observed cumulated incidences of CHD and VaD were approximately 2% and 0.6%, respectively. Plasma D-dimer levels were only available for phase-2 subjects. Carcaillon *et al.* [18] treated quintiles of D-dimer level both qualitatively and linearly. They reported a linear increase in the risk of VaD according to D-dimer quintiles.

We re-assessed the relationship between plasma D-dimer levels and the risk of CHD and VaD, using MI and weighted estimators, and evaluated the predictive ability of D-dimer levels on both risks. We included the same explanatory variables as Carcaillon *et al.* [18] although we used tertiles of D-dimer rather than quintiles, to estimate CHD and VaD risks, due to the small number of events. Therefore, to estimate the risk of CHD, the proportional hazard model includedthe phase-1 variables: age, sex, center, body mass index, hypertension, hypercholesterolemia, diabetes, tobacco use, diabetes drugs, and as phase-2 variables, indicators of D-dimer tertiles. To estimate the risk of VaD, the proportional hazard model included the phase-1 variables: age, sex, centre, educational level, body mass index, the presence or absence of an apolipoprotein є4 allele and indicators of D-dimer tertiles.

For each outcome (CHD or VaD), it was necessary to reproduce the relationships among the incomplete variable, the outcomes and the confounder variables. For each outcome, we built an imputation model of tertiles of D-dimer levels, including the variables used in the proportional hazard model and the case-indicator. We estimated the predictive ability of proportional hazard models, without (*C*
_{1}) and with (*C*
_{2}) D-dimer levels, Δ = *C*2 - *C*1, and IDI for CHD and VaD risks. The NRI requires that some a priori meaningful risk categories be known. Based on the Third Adult Treatment Panel [ATP III] [19] risk classification for the 10-year risk of CHD, we adapted the cut-offs to 4-year risk. For VaD, we do not know a priori meaningful risk categories and did not compute NRI.