Skip to main content
  • Research article
  • Open access
  • Published:

Multiple imputation for estimating hazard ratios and predictive abilities in case-cohort surveys

Abstract

Background

The weighted estimators generally used for analyzing case-cohort studies are not fully efficient and naive estimates of the predictive ability of a model from case-cohort data depend on the subcohort size. However, case-cohort studies represent a special type of incomplete data, and methods for analyzing incomplete data should be appropriate, in particular multiple imputation (MI).

Methods

We performed simulations to validate the MI approach for estimating hazard ratios and the predictive ability of a model or of an additional variable in case-cohort surveys. As an illustration, we analyzed a case-cohort survey from the Three-City study to estimate the predictive ability of D-dimer plasma concentration on coronary heart disease (CHD) and on vascular dementia (VaD) risks.

Results

When the imputation model of the phase-2 variable was correctly specified, MI estimates of hazard ratios and predictive abilities were similar to those obtained with full data. When the imputation model was misspecified, MI could provide biased estimates of hazard ratios and predictive abilities. In the Three-City case-cohort study, elevated D-dimer levels increased the risk of VaD (hazard ratio for two consecutive tertiles = 1.69, 95%CI: 1.63-1.74). However, D-dimer levels did not improve the predictive ability of the model.

Conclusions

MI is a simple approach for analyzing case-cohort data and provides an easy evaluation of the predictive ability of a model or of an additional variable.

Peer Review reports

Background

Case-cohort surveys produce incomplete data by design. A subcohort is selected by simple or stratified random sampling, all subjects are followed up and the events of interest are recorded. The phase-1 variables are observed for the entire cohort, whilethe phase-2 variables are only known for the case-cohort sample, i.e., subjects belonging to the subcohort and all those presenting the event of interest [1]. Thus, in case-cohort studies, the non-cases who do not belong to the subcohort are incompletely observed by design, enabling cost reduction with a small loss of efficiency.

Various approaches have been described to estimate the proportional hazard model in case-cohort surveys: Weighted estimators [2–6] are classically used in these surveys, with analysis restricted to the completely observed subsample, so the information collected for incompletely observed non-cases is ignored and inefficient estimators for the effect of phase-1 variables are obtained. One of the most popular is the Borgan II estimator [4]. Scheike and Martinussen [7] proposed a maximum likelihood estimator based on proportional hazards assumption, using the EM algorithm [8], therebyincreasing efficiency as compared to weighted estimators when the relative risk and disease incidence are high. However, in general, the studied disease incidence in case-cohort surveys is low. Breslow et al. [9] suggested calibrating or estimating the weights a posteriori, using all the phase-1 information, to improve precision with respect to classical weighted estimators. Marti and Chavance [10] showed that multiple imputation (MI) is a good alternative to classical weighted methods for the analysis of case-cohort data. When the imputation model was correct, the MI approach provided unbiased estimators of the log hazard ratios and correctly estimated the variance of its estimators. As expected, the MI approach was more precise than the usual weighted estimators for the parameters associated with phase-1 variables. The former was also slightly more precise than the latter for the phase-2 variable. In Marti and Chavance [10] the imputations were performed according to a correctly specified imputation model. However, in practise, the distribution of the phase-2 variable is unknown and onemay wonder how MI compares to weighted estimators when the imputation model is misspecified.

No standard method exists for quantifying the usefulness or predictive ability of a model or an additional variable in the framework of case-cohort surveys. The predictive ability can be measured in terms of calibration, which refers to the ability of a model to match predicted and observed values, when we are interested in individual predictions; or in terms of discrimination, which refers to the ability of a model to distinguish between subjects with or without a binary event, when we are interested in identifying a group of high-risk subjects. In the present work, we focus on discrimination.

As shown below, a naive measurement of predictive ability from case-cohort data often leads to a biased estimate of the predictive ability because it varies with the censoring rate and thus depends on the subcohort size. Alternatively, because MI reconstitutes whole cohorts, any tool developed to estimate the predictive ability in the framework of cohort surveys can be applied to case-cohort data, so we propose using the MI approach to estimate the predictive ability ofa model or of an additional variable and their standard errors.

The objectives of this study were 1) to evaluate MI for estimating hazard ratios when the distribution of the phase-2 variable is misspecified; and 2) to present an adequate methodology for estimating the predictive ability of a model or of an additional variable in case-cohort surveys. We performed a simulation study to validate the MI approach for estimating the predictive ability of a model or of an additional variable and to assess its potential limits. As an illustration, we analyzed case-cohort data from the Three-City study [11] to estimate the predictive ability of the D-dimer plasma concentration, a marker of coagulation and fibrinolysis, on coronary heart disease (CHD) and on vascular dementia (VaD) risks.

Methods

Incomplete observations and multiple imputation

Case-cohort surveys are a particular type of incomplete observations, in which data are missing at random [12] by design, as the probability of being completely observed depends only on the case status, with simple random sampling, and on some phase-1 variables with stratified sampling. MI is a simple and efficient method for analyzing incomplete observations, while taking into account all the levels of uncertainty regarding missing values. This provides an approximation of the maximumlikelihood estimator and thus enables the potential selection bias to be corrected. This method relies on the generation of several plausibly completed data sets (M ≥ 2), accounting for all levels of uncertainty concerning the missing values. A prediction model must be built, taking into consideration the relationships between the incomplete variable and the other variables, as observed in the complete part of the data. The missing data are not replaced by their expectation but by a value drawnfrom the distribution posited by the model. To take into account the uncertainty concerning the parameters of the imputation model, several imputations are performed with parameters drawn from the asymptotic distribution of their estimator. An estimate of the parameter of interest, θ ^ m ,m= { 1 , … , M } , and an estimate of the variance of the estimator, V ^ ( θ ^ m ), are obtained from each completed data set. If the imputation model is correct, these estimators are not biased. The MI estimate, also unbiased, is the mean of the M estimates:

θ ^ M I = 1 M ∑ m = 1 M θ ^ m
(1)

where M is the number of completed data sets and θ ^ m , m  = { 1, … , M } is the estimate of the parameter of interest provided by the m th completed data set. The multiplicity of imputations enables correct estimation of the variance of this single estimator, which is the sum of 2 components: the within-imputations component, W MI , and the between-imputations component, B MI :

V ^ ( θ ^ M I ) = Ŵ M I + B ^ M I = 1 M ∑ m = 1 M V ^ ( θ ^ m ) + ( 1 + M - 1 ) ∑ m = 1 M ( θ ^ m - θ ^ M I ) ( θ ^ m - θ ^ M I ) ′ M - 1
(2)

where the factor 1 + M -1 is an adjustment for using a finite number of imputations [13].

MI requires a model correctly reflecting the relationship between the incomplete variable and the outcome of interest. In case-cohort surveys, we need to impute phase-2 variable values for the non-cases who do not belong to the subcohort. Under the rare disease assumption, we have shown that a simple generalized linear model, using all the complete data (cases and non-cases) and including the case indicator among the explanatory variables, has to be considered [10]. Practically, in addition to the case indicator and the stratification variables, when the subcohort was selected by stratified sampling, it is necessary to include in the imputation model all the variables appearing in the proportional hazard model. Because imputations are based on asymptotic distributions, caution is necessary, since if too few subjects present the event of interest, the distribution of the estimators can differ from the asymptotic one. As a consequence, the maximum likelihood estimator of the imputation model could be biased or not normally distributed.

Predictive ability of a model and of a supplementary variable

Harrell et al. [14] proposed the C index to measure the predictive ability of a model in cohort studies as the agreement between the order of the predicted and observed survival times in any pair of subjects (the event of interest is assumed to be death, leading to the use of survival terminology). That is, the concordance probability using all pairs of subjects in the population. However, with censored data, it is not possible to consider all the pairs of subjects because survival time is not observed for censored subjects. Let T i be the survival time for subject i, i = 1,...,N, where N is the cohort size, and C i the censoring time for subject i. We observe X i = min(T i , C i ). Usable pairs are those for which the order of the predicted survival times can be compared to the order of the true survival times, i.e., pairs formed by 2 uncensored subjects or an uncensored subject and a subject censored after the uncensored subject's death. A pair of censored subjects carries no information about its agreement with the expected survival provided by the model since the order of the survival times is not known. Similarly a pair formed by a subject whose survival time is observed and a subject censored before this survival time provides no information on this agreement since the unknown survival time could be anterior or posterior to the observed one. Harrell et al. [15] showed that, in the common models used for survival analysis, such as the proportional hazard model, the predicted survival times and the predicted survival probabilities at a fixed time t can be interchanged for the comparison. The Harrell's C index is defined as:

C = π c π c + π d
(3)

where π c is the probability of concordance for a pair (i,j) and π d is the probability of discordance. We assume continuous survival times and continuous predicted survival probabilities, so P(X i = X j ) = P(Y i = Y j ) = 0, thus π c + π d = 1. C is estimated by the proportion of concordant pairs among the usable pairs. The estimated variance was given by Kremers [16].

In practice, we are often interested in estimating the predictive ability of an additional phase-2 variable. Let M 1 be a proportional hazard model including only phase-1 variables, and C 1 and SE C1 respectively the C indexof M 1 and its standard error. Let M 2 be a proportional hazard model adding the phase-2 variable to M 1, and C 2 and SE C2, respectively, the C index of M 2 and its standard error. Harrell's predictive ability ofthe added phase-2 variable is Δ = C 2 - C 1 . Complementary measures of predictive ability of a new variable, such as the net reclassification improvement (NRI) and the integrated discrimination index (IDI), were proposed by Pencina [17]. NRI needs some a priori meaningful risk categories. It quantifies the correct reclassification introduced by using a model with the added variable as compared to the classification obtained without this variable. The IDI can be viewed as a continuous version of the NRI with probabilities used instead of categories. It can be defined as the discrimination-slope difference between the models with and without a quantitative variable. To estimate the predictive ability of a model or of an additional variable, we reconstructed plausible whole cohorts using MI. For each reconstructed whole cohort, we could then directly obtain C 1, SE C1, C 2, SE C2, Δ, NRI, IDI and their respective variances. Using equations (1) and (2), we obtained the MI estimates of these quantities. Concerning the variance of Δ, the between-imputation component is estimated by the empirical variance of the M estimates of Δ provided by the M completed data sets. However, for the within-imputation component, the asymptotic variance of the estimator provided by a complete data set, does not have an analytical form. With a fully observed cohort, bootstrapping is a way to estimate the variance of the corresponding Δ. Therefore, each whole cohort reconstructed by MI has to be resampled. In the simulations as in the real data analysis, we used 100 bootstrap samples.

Simulation study

Two phase-1 variables were simulated: a binary variable, Z 1, and a Gaussian variable, Z 3, observed for the entire cohort. For the phase-2 variable, Z 2, we considered three different distributions: normal, log-normal and uniform, all of them with unit variance, independent of Z 1, but having a correlation coefficient of 0.2 with Z 3. The survival time had an exponential distribution, with λ = exp (β 1 Z 1 + β 2 Z 2 + β 3 Z 3 ). β 1, β 2 and β 3 were fixed at the same value and set at 0 or log(2). The censoring time followed a uniform distribution over the interval [0,τ], where τ was chosen so that the probability of an event was approximately 0.03 (τ= 0.025). The cohort size was 10,000. We also simulated a phase-1 variable predictive of Z 2, Z ̃ 2 ≡ Z 2 + ε with ε ~ N (0, σ 2) independent of Z 2. The variance σ2 was fixed at 1 which corresponds to a correlation between Z 2 and Z ̃ 2 of approximately 0.7. We wanted to estimate the effect of Z 2 on survival time and its predictive ability. The cohort was divided into 9 strata based on the tertiles of Z ̃ 2 and Z 3 , and the non-cases were chosen by stratified sampling. Case-cohort sampling was simulated with 1,000 subjects in each subcohort. The phase-2 variable was not available for non-cases not included in the subcohort, so MI was used to complete the data set. Thus, we built the same linear prediction model for each Z 2 based on the stratification phase-1 variable and the case indicator. Z 3 was not directly included in the imputation model to predict Z 2, because it was a stratification variable included in the model and because of the weak correlation between Z 2 and Z 3. The imputation model was: Z 2 = α 0 + α 1 I case + α 2 Strata + ε , where α 0 and α 1 are scalar, α 2 is a vector coefficient, I case is the case indicator, Strata is the vector of stratum indicators and є is the vector of errors independently and identically distributed ~ N (0, σ). Thus, the imputation model was correctly specified for Z 2 normally distributed but misspecified for Z 2 log-normally or uniformly distributed. One thousand cohorts were simulated for each scenario.

Five imputations were performed and 5 complete data sets were generated for each cohort. We estimated the log hazard ratios using MI and the "Borgan II" weighted estimator [4]. We used MI to estimate the predictive ability of models with and without the phase-2 variable, and the predictive ability of the phase-2 variable, NRI and IDI. We also studied the consistency of the naive estimator of Harrell's C index in case-cohort surveys by varying the subcohort size. Using the above simulation conditions and, exceptionally, a scenario with β 1 = β 3 = log(2) and β 2 = log(1.5), we simulated case-cohort samples with the subcohort size set at 300 or 1,000 subjects. We estimated the predictive ability in the case-cohort samples and in the multiply imputed data sets.

Case-cohort survey from Three-City study

Briefly, the 3C-Study was designed to examine the relationship between vascular diseases and dementia in a community housing 9,294 persons aged 65 years and over between 1999 and 2001 in three French cities. The detailed methodology has been previously described [11]. A case-cohort substudy was conducted [18], to investigate the relationship between biomarkers, such as plasma levels of D-dimer (a marker of coagulation and fibrinolysis) and the 4-year incidence of coronary heart disease (CHD), stroke and all subtypes of dementia, including vascular dementia (VaD), in an elderly population. The phase-1 variables provided information on socio-demographic characteristics, education, medical history, diet, alcohol and tobacco use. Blood pressure, height and weight were also available. A subcohort of size n = 1,254, (13.5% of the full cohort) was randomly selected, stratifying on age, sex and recruitment center. Observed cumulated incidences of CHD and VaD were approximately 2% and 0.6%, respectively. Plasma D-dimer levels were only available for phase-2 subjects. Carcaillon et al. [18] treated quintiles of D-dimer level both qualitatively and linearly. They reported a linear increase in the risk of VaD according to D-dimer quintiles.

We re-assessed the relationship between plasma D-dimer levels and the risk of CHD and VaD, using MI and weighted estimators, and evaluated the predictive ability of D-dimer levels on both risks. We included the same explanatory variables as Carcaillon et al. [18] although we used tertiles of D-dimer rather than quintiles, to estimate CHD and VaD risks, due to the small number of events. Therefore, to estimate the risk of CHD, the proportional hazard model includedthe phase-1 variables: age, sex, center, body mass index, hypertension, hypercholesterolemia, diabetes, tobacco use, diabetes drugs, and as phase-2 variables, indicators of D-dimer tertiles. To estimate the risk of VaD, the proportional hazard model included the phase-1 variables: age, sex, centre, educational level, body mass index, the presence or absence of an apolipoprotein є4 allele and indicators of D-dimer tertiles.

For each outcome (CHD or VaD), it was necessary to reproduce the relationships among the incomplete variable, the outcomes and the confounder variables. For each outcome, we built an imputation model of tertiles of D-dimer levels, including the variables used in the proportional hazard model and the case-indicator. We estimated the predictive ability of proportional hazard models, without (C 1) and with (C 2) D-dimer levels, Δ = C2 - C1, and IDI for CHD and VaD risks. The NRI requires that some a priori meaningful risk categories be known. Based on the Third Adult Treatment Panel [ATP III] [19] risk classification for the 10-year risk of CHD, we adapted the cut-offs to 4-year risk. For VaD, we do not know a priori meaningful risk categories and did not compute NRI.

Results

Simulation study

The mean fraction of missing information about the effect of Z 2 ranged from 5 to 14 percent when β 2 = 0 and from 23 to 30 per cent when β 2 = log(2) (data not shown). For each estimator (full cohort, case-cohort with MI and case-cohort with weights), we give the mean of the estimated coefficients, the mean of their standard error estimates, the observed standard error of the estimated coefficients and the mean squared errors of 1000 simulations (Table 1). Not surprisingly, the full cohort estimates and the case-cohort weighted estimates of the log hazard ratios were unbiased. Similarly, with a correctly specified normal imputation model, all MI estimates were unbiased. With a misspecified normal imputation model, MI estimate of the effect β 2 = log(2) of Z 2 was biased (-13%) when Z 2 was log-normally distributed. When Z 2 was uniformly distributed, MI estimate of the effect of Z 2 was slightly biased (-5%). With a misspecified normal imputation model and β 2= 0, no bias was observed. TheMI variance and the weighted estimator variance agreed with the observed dispersions of the estimates. The observed dispersion was always smaller with MI than with the weighted estimator. For the phase-1 variables, this dispersion was similar for the entire cohort and with MI, whatever the distribution of the phase-2 variable. For the estimated effect of the phase-2 variable, the observed standard deviations were smaller with MI than with the Borgan II weighted estimator but, as expected, slightly larger with MI than in the full cohort analyses. Altogether, the mean squared errors were smaller with MI than with the weighted estimator, except for the effect of the phase-2 variable with β 2 = log(2) and Z 2 was log-normally distributed.

Table 1 Mean of the log hazard ratio estimates (Est), mean of the standard error estimates SE ∧ , standard error of the estimates (SE) and mean of the mean square error (MSE). Results of 1,000 simulations.

The results concerning the consistency of the naive estimator of Harrell's C index are reported in Table 2. In the scenario β 1 = β 2 = β 3 = 0, the mean C index was nearly 0.5 for both models, without and with Z 2, whatever the analysis performed. In the scenarios β 1 = β 3 = log(2) and β 2 = log(1.5) or β 2 = log(2), the naive computation of C with the case-cohort data led to lower predictive ability than with the full cohort, especially for the smaller subcohort. Bycontrast, the Harrell's C indexes estimated by MI were similar to those computed for the full cohort and did not depend on the subcohort size. The estimated dispersion of the C index was slightly greater than the observed dispersion of the estimates. The rejection percentage of the null hypothesis Δ = 0 was always similar to the full cohort analysis and to MI. As a consequence of the standard error overestimation, the observed first type error rate was slightly lower than 5%. Nevertheless, in the considered scenarios, the observed power was very high. As expected, the loss of power when comparing case-cohort with MI to full cohort analysis was small: with β 2 = log(1.5), the observed power was 84.6% with a subsample size of 300, and 90.6% with a subsample size of 1000 versus 91.6% with the full cohort. MI estimates of NRI and IDI indexes were close to those obtained with the full cohort analysis and did not depend on the subcohort size. As compared to the full cohort results, the rejection percentage of the null hypothesis NRI = 0 was smaller with MI analysis when β 2 = 0, larger when β 2 = log(1.5) and similar when β 2 = log(2). When the effect of the phase-2 variable was not null, the rejection percentage of the null hypothesis IDI = 0 was similar with MI and with full cohort analysis. By contrast, whatever the effect of the phase-2 variable, the estimation of NRI and IDI in the case-cohort sample provided larger measures of these indexes than the full cohort analysis.

Table 2 Mean of the predictive ability estimates (Est), mean of the standard error estimates SE ∧ and standard error of the estimates (SE).

Table 3 gives the results of the estimated predictive abilities for the correctly specified and the two misspecified normal imputation models. Full cohort analysis and MI provided similar predictive abilities estimates when the imputation model was correctly specified or when the phase-2 variable had no effect on the studied risk. In the scenario β 1 = β 2 = β 3 = log(2), when Z 2 was uniformly distributed, MI and full cohort analysis still provided similar estimates. However, when Z 2 was log-normally distributed, the MI estimate was slightly smaller than the full cohort estimate -15%).

Table 3 Predictive ability of the two models and of the phase-2 variable.

Mean of the predictive ability estimates (Est), mean of the standard error estimates SE ∧ and standard error of the estimates (SE), with a correctly specified normal imputation model (Z 2 normally distributed), and with two misspecified normal imputation models (Z 2 log-normally and uniformly distributed)

Application to the Three-City study

The mean fraction of missing information about the effect of D-dimer was 4.9 and 3.7 per cent for CHD and VaD risks, respectively. Table 4 gives the estimated hazard ratios associated with D-dimer tertiles. The MI and the weighted approaches yielded similar estimates and precision. The CI of the hazard ratio associated with the linear effect of a one-tertile difference were respectively (0.94-1.38) versus (0.92-1.38) for CHD and (1.13-2.53) versus (1.13-2.67) for VaD. For phase-1 variables, both estimators provided similar results, but MI was always the more precise (data not shown).

Table 4 Estimates of hazard ratios (HR) and 95% confidence interval (CI) associated with D-dimer tertiles.

Harrell's C for the models including only phase-1 variables were above 0.69 for CHD risk and above 0.86 for VaD risk (Table 5). Hence, CHD and VaD risks were largely explained by standard risk factors, and the inclusion of plasma D-dimer levels did not significantly improve the predictive ability of the model, despite the fact that elevated D-dimer levels significantly increased the VaD risk. For CHD as for VaD, the index did not significantly differ from 0.

Table 5 Predictive ability and 95% confidence interval (CI) of D-Dimer tertiles on cardiovascular heart disease (CHD) and vascular dementia (VaD) risks.

Discussion

Use of a consistent estimator does not guarantee the absence of any bias for finite sample. We only showed that MI analysis of case-cohort data provides unbiased estimates of the log-hazard ratio when the imputation model and the proportional hazard model are correctly specified. The misspecification of the imputation model can originate from an erroneous choice of the distribution, or from wrongly assuming that the estimator of the imputation model is consistent and normal, or from the omission of some important explanatory variable. Imputations carried out using a misspecified distribution in the imputation model can provide biased estimates of hazard ratios, especially, if the specified distribution of the phase-2 variable differs from the true one in terms of symmetry (log-normal versus normal distribution). The negative bias on a log hazard ratio of 0.69 was noticeable but not large when a log-normal variable was imputed according to a normal distribution (-0.09 or -13%), but it is clearly a type of misspecification easily identified with diagnostic tools [20]. One can then transform the incomplete variable in order to obtain a symmetrical distribution, impute transformed values and apply the inverse transformation to the imputed values. Note that although a normal and a uniform distribution are quite different, both are symmetrical and the observed bias was quite smaller (only 5%). In the 3C study of the relationship between VaD and D-dimer, we observed slightly different estimates of the log hazard ratio when comparing the third to the first tertile (2.77 versus 2.93, i.e. a relative difference of 8% between the MI and the weighted estimates). This is probably because of the qualitative imputation of D-dimer, and thus, the use of a multinomial imputation model, which implied estimation of parameters in separate strata defined by D-dimer concentration tertiles, some of which had a small number of events. Due to these small numbers (only 51 VaD in total), asymptotic conditions might not have been fulfilled in at least some strata, and the estimated coefficients of the imputation model could have been biased and notnormally distributed. We give below some recommendations regarding the choice of explanatory variables in the imputation model. Since the potential bias of MI estimates can be detected by comparing them to weighted estimates, we suggest building the proportional hazard model by using only the case-cohort data and weighted estimators. MI can eventually be used to reanalyze the data with the selected model to improve the precision of the results, while verifying that no bias was introduced.

In simulated data, for the phase-1 variables, the precision of MI and full cohort estimates was similar and smaller than with the weighted estimator. For the phase-2 variable, MI estimates were slightly more precise than weighted estimates. Globally, the mean squared errors were smaller with MI than with the weighted estimator, with one exception implying a normal imputation model for a log-normally distributed phase-2 variable, an error which should easily be avoided.

There is no standard method for estimating the predictive ability of a model in the framework of case-cohort surveys. We showed that the naive application of the C index to case-cohort surveys yielded an underestimation of the predictive ability of the model that depended on the subcohort size when the phase-2 variable had an effect on the risk. Similarly, the naive estimates of the predictive ability of an added phase-2 variable differed notably from the full cohort values when the effect of the phase-2 variable was not null. Harrell's C index could theoretically be estimated with a weighted approach, but this can be computationally difficult because it requires weighting each pair by the pairwise sampling probabilities, i.e., using a square matrix of size N'(N'-1), where N' is the size of the case-cohort sample. Computing the variance of this Horvitz-Thompson estimator requires either weighting each quadruplet by the quadruple-wise sampling probabilities, i.e., working with a matrix of size N'(N'-1)(N'-2)(N'-3), or bootstrapping the case-cohort data. By contrast, MI easily allows estimation of the predictive ability of a model or of an additional phase-2 variable and their variances in the context of case-cohort data, only requiring bootstrapping to estimate the variance of the predictive ability of the phase-2 variable. MI provided estimates of Harrell C, NRI and IDI indexes similar to those obtained with the full cohort analysis. Note, however, that the predictive abilities were always overestimated because the same data were used to estimate the model and its predictive ability.

Analysis of the Three-City case-cohort study was in agreement with our previous work [10]. The weighted and the MI approaches yielded similar estimates of the hazard ratios and MI was slightly more precise, particularly for phase-1 variables. The relative differences between both estimates was always below 2% for the hazard ratios related to CHD and D-dimer, but as early discussed, they could be slightly higher (8%) for a hazard ratio related to VaD and D-dimer. The precision was similar for both analyses.

The imputation model must reflect the association between the incomplete variable, the outcome and the other explanatory variables. Therefore, variables included in the proportional hazard model as well as the stratification variables must be included in the imputation model. If a surrogate of the phase-2 variable is available, it should also be included in the imputation model. On the other hand, multiple imputation approach can provide unbiased and more efficient estimates than weighted analysis even when no strong predictor of the phase-2 variable is available [10]. The inclusion of additional variables other than strongly predictive variables can lead to an increased inter-imputation variance. This prompted the use of different imputation models for D-dimer levels in the CHD and VaD analyses. However, we verified that adding the variables only used in the CHD analysis to the model used for VaD, did not modify the results observed in the former (data not shown).

The number of requested imputations depends on the proportion of missing information which, in case-cohort studies, is considerably smaller than the percentage of incompletely observed subjects. Rubin showed that with as much as 40 per cent information missing, M = 5 imputations provides an asymptotic relative efficiency was 0.97, and, with 50 per cent missing information, M = 10 provides an asymptotic relative efficiency of 0.98. Thus, a small number of imputations, 5-10, should suffice [21]. In our analyses, we used 5 imputations to limit the computer time of the simulations, a reasonable choice since the proportion of missing information was always smaller than 30 per cent. However, a slightly larger number of imputations (e.g. 10) could have been performed on the 3C study data at a reasonable time cost; it would have provided a more precise estimate of the between imputation variance and of the percentage of missing information.

The VaD risk increased with D-dimer tertiles. However, D-dimer inclusion did not significantly improve the predictive ability of the model for VaD risk. Computations of the C and IDI index yielded the same conclusion. To our knowledge, no other results concerning the predictive ability of D-dimer on the risk of VaD have been published to date. The risk of CHD did not vary with D-dimer, so, not surprisingly, the predictive ability of this variable was negligible, regardless of the index used. Wang et al. [22] and Tzoulaki [23] reported that the use of 10 and 4 biomarkers respectively added only moderately to the overall risk prediction based on conventional cardiovascular risk factors.

Conclusions

MI is a simple alternative approach to weighted analysis for analyzing case-cohort surveys, obtaining correct estimates of the log hazard ratios and their standard errors, improving precision for the phase-1 variable estimates, and providing at least the same precision as weighted estimators for phase-2 variable estimates. It allows an easy evaluation of the predictive ability of the model and, more generally, any tool proposed in the framework of cohort studies can be applied to case-cohort data using MI.

Abbreviations

MI:

Multiple imputation

CHD:

Coronary heart disease

VaD:

Vascular dementia

NRI:

Net reclassification index

IDI:

Integrated discrimination index.

References

  1. Prentice R: A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika. 1986, 73: 1-11. 10.1093/biomet/73.1.1.

    Article  Google Scholar 

  2. Chen K, Lo SH: Case-cohort and case-control analysis with cox's model. Biometrika. 1999, 86 (4): 755-764. 10.1093/biomet/86.4.755.

    Article  Google Scholar 

  3. Therneau TM, Li H: Computing the cox model for case cohort designs. Lifetime Data Anal. 1999, 5 (2): 99-112. 10.1023/A:1009691327335.

    Article  CAS  PubMed  Google Scholar 

  4. Borgan O, Langholz B, Samuelsen SO, Goldstein L, Pogoda J: Exposure stratified case-cohort designs. Lifetime Data Anal. 2000, 6: 39-58. 10.1023/A:1009661900674.

    Article  CAS  PubMed  Google Scholar 

  5. Kulich M, Lin D: Improving the efficiency of relative-risk estimation in case-Cohort studies. J Am Stat Assoc. 2004, 99: 832-844. 10.1198/016214504000000584.

    Article  Google Scholar 

  6. Langholz B, Jiao J: Computational methods for case-cohort studies. Comput Stat Data Anal. 2007, 51 (8): 3737-3748. 10.1016/j.csda.2006.12.028.

    Article  Google Scholar 

  7. Scheike TH, Martinussen T: Maximum likelihood estimation for Cox's regression model under case-Cohort sampling. Scand Stat Theory Appl. 2004, 31 (2): 283-293.

    Article  Google Scholar 

  8. Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Series B Stat Methodol. 1977, 39: 1-38.

    Google Scholar 

  9. Breslow N, Lumley BCCLT, Kulich M: Using the whole cohort in the analysis of case-cohort data. Am J Epidemiol. 2009, 169 (11): 1398-1405. 10.1093/aje/kwp055. [http://dx.doi.org/10.1093/aje/kwp055]

    Article  PubMed  PubMed Central  Google Scholar 

  10. Marti H, Chavance M: Multiple imputation analysis of case-cohort studies. Stat Med. 2011, 30 (13): 1595-1607. 10.1002/sim.4130.

    Article  PubMed  Google Scholar 

  11. Alperovitch A, 3C Study Grp: Vascular factors and risk of dementia: Design of the three-city study and baseline characteristics of the study population. Neuroepidemiology. 2003, 22 (6): 316-325.

    Article  Google Scholar 

  12. Little R, Rubin D: Statistical analysis with missing data. 1987, New York: Wiley

    Google Scholar 

  13. Rubin DB, Schenker N: Multiple imputation in health-care databases: an overview and some applications. Stat Med. 1991, 10 (4): 585-598. 10.1002/sim.4780100410.

    Article  CAS  PubMed  Google Scholar 

  14. Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA: Evaluating the yield of medical tests. J Am Med Assoc. 1982, 247 (18): 2543-2546. 10.1001/jama.1982.03320430047030.

    Article  Google Scholar 

  15. Harrell F, Lee K, Mark D: Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med. 1996, 15 (4): 361-387. 10.1002/(SICI)1097-0258(19960229)15:4<361::AID-SIM168>3.0.CO;2-4.

    Article  PubMed  Google Scholar 

  16. Kremers WK: Concordance for survival time data: fixed and time-dependent covariates and possible ties in predictor and time. Tech rep Mayo Fundation. 2007

    Google Scholar 

  17. Pencina M, D'Agostino R, D'Agostino R, Vasan R: Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Stat Med. 2008, 27 (2): 157-172. 10.1002/sim.2929.

    Article  PubMed  Google Scholar 

  18. Carcaillon L, Gaussem P, Ducimetiere P, Giroud M, Ritchie K, Dartigues JF, Scarabin PY: Elevated plasma fibrin D-dimer as a risk factor for vascular dementia: the three-city cohort study. J Thromb Haemost. 2009, 7 (12): 1972-1978. 10.1111/j.1538-7836.2009.03603.x.

    Article  CAS  PubMed  Google Scholar 

  19. Executive Summary of The Third Report of The National Cholesterol Education Program (NCEP) Expert Panel on Detection, Evaluation, And Treatment of High Blood Cholesterol In Adults (Adult Treatment Panel III). J Am Med Assoc. 2001, 285 (19): 2486-2497. 10.1001/jama.285.19.2486.

  20. Goodness-of-fit techniques. Edited by: D'Agostino RB, Stephens MA. 1986, New York: Marcel Dekker; Inc

    Google Scholar 

  21. Rubin DB: Multiple imputation for nonresponse in surveys. 1987, New York: Wiley

    Book  Google Scholar 

  22. Wang TJ, Gona P, Larson MG, Tofler GH, Levy D, Newton-Cheh C, Jacques PF, Rifai N, Selhub J, Robins SJ, Benjamin EJ, D'Agostino RB, Vasan RS: Multiple biomarkers for the prediction of first major cardiovascular events and death. N Engl J Med. 2006, 355 (25): 2631-2639. 10.1056/NEJMoa055373.

    Article  CAS  PubMed  Google Scholar 

  23. Tzoulaki I, Murray GD, Lee AJ, Rumley A, Lowe GDO, Fowkes FGR: Relative value of inflammatory, hemostatic, and rheological factors for incident myocardial infarction and stroke - The Edinburgh artery study. Circulation. 2007, 115 (16): 2119-2127. 10.1161/CIRCULATIONAHA.106.635029.

    Article  PubMed  Google Scholar 

Pre-publication history

Download references

Acknowledgements

This study was supported by a grant from the Région Île-de-France. It used data from the Three-City study which is conducted under an agreement between the Institut National de la Santé et de la Recherche Médicale and the Université Victor Segalen-Bordeaux 2. This manuscript was not prepared in collaboration with the 3C study Steering Committee and does not necessarily reflect its opinions or views.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Helena Marti.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

HM conducted the literature review, simulations, data analyses and wrote the manuscript. LC conducted the analysis of the relationship between D-dimer levels and CHD and VaD risks and supervised the epidemiological aspects of the application to the Three-City study. MC conducted and supervised the writing of the manuscript. All authors have read the manuscript, are in agreement that the work is ready for submission to the journal, and accept responsibility for the manuscript's contents.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Marti, H., Carcaillon, L. & Chavance, M. Multiple imputation for estimating hazard ratios and predictive abilities in case-cohort surveys. BMC Med Res Methodol 12, 24 (2012). https://doi.org/10.1186/1471-2288-12-24

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/1471-2288-12-24

Keywords