 Research article
 Open Access
 Published:
Doubly robust estimator of risk in the presence of censoring dependent on timevarying covariates: application to a primary prevention trial for coronary events with pravastatin
BMC Medical Research Methodology volume 20, Article number: 204 (2020)
Abstract
Background
In the presence of dependent censoring even after stratification of baseline covariates, the Kaplan–Meier estimator provides an inconsistent estimate of risk. To account for dependent censoring, timevarying covariates can be used along with two statistical methods: the inverse probability of censoring weighted (IPCW) Kaplan–Meier estimator and the parametric gformula estimator. The consistency of the IPCW Kaplan–Meier estimator depends on the correctness of the model specification of censoring hazard, whereas that of the parametric gformula estimator depends on the correctness of the models for event hazard and timevarying covariates.
Methods
We combined the IPCW Kaplan–Meier estimator and the parametric gformula estimator into a doubly robust estimator that can adjust for dependent censoring. The estimator is theoretically more robust to model misspecification than the IPCW Kaplan–Meier estimator and the parametric gformula estimator. We conducted simulation studies with a timevarying covariate that affected both timetoevent and censoring under correct and incorrect models for censoring, event, and timevarying covariates. We applied our proposed estimator to a large clinical trial data with censoring before the end of followup.
Results
Simulation studies demonstrated that our proposed estimator is doubly robust, namely it is consistent if either the model for the IPCW Kaplan–Meier estimator or the models for the parametric gformula estimator, but not necessarily both, is correctly specified. Simulation studies and data application demonstrated that our estimator can be more efficient than the IPCW Kaplan–Meier estimator.
Conclusions
The proposed estimator is useful for estimation of risk if censoring is affected by timevarying risk factors.
Background
Establishment of the longterm effectiveness of primary prevention treatments often requires large randomized controlled trials (RCTs) over a long time period. In such RCTs, survival functions and risks between randomized groups are compared using the Kaplan–Meier estimator because censoring before the end of the followup cannot be avoided. This approach assumes independent censoring, such that censoring occurs randomly in each treatment group. The standardization approach can provide a consistent estimate of risks in each group even if censoring is not unconditionally independent, but the conditionally independence of potential survival time after stratification of treatment groups and baseline covariates [1,2,3,4]. In this paper, we call this type of censoring as baselineconditional independent censoring.
Even a baselineconditional independent censoring assumption can be dubious. Our motivating study is the Management of Elevated Cholesterol in the Primary Prevention Group of Adult Japanese (MEGA) study, which is a large primary prevention RCT for coronary heart disease (CHD) using pravastatin, where censoring before the end of followup occurred in about 10% of patients [5]. Patients enrolled in the MEGA study had hypercholesterolemia (total cholesterol (TC) level: 220–270 mg/dl), were 40–70 years old, and received daily clinical care during the followup period. When a patient with hypercholesterolemia received a medical checkup and found that their plasma lipids were worsening (e.g., increasing TC), they may have required other drugs that were not allowed in the study protocol. Patients who observed worsening of their symptoms might go to see a doctor other than their primary care doctor. These cases may have led to censoring dependent on midcourse clinical characteristics, and the censoring was correlated with future CHD events. If censoring is dependent on potential survival time even after stratification of treatment groups and baseline covariates, the Kaplan–Meier estimator provides inconsistent estimates of survival function [6]. In such a situation, one possibility to mitigate the dependency is to use timevarying covariates measured during the followup period.
The inverse probability of censoring weighted (IPCW) Kaplan–Meier estimator is a semiparametric method for estimation of risk that adjusts for censoring that may depend on the observed past [7]. It requires fitting a model for the probability of censoring at each time conditional on past covariates. Calculation of the IPCW Kaplan–Meier estimate needs to update censoring probability at each time and to weight each subject in the risk set. The weight depends on the timevarying covariates, but not on the future prognosis. The drawback of the IPCW estimator is that it can be statistically inefficient [8].
An alternative to IPCW methods is a gformulabased estimator, which can be estimated using two different principles. First representation of the gformula is an iterated conditional expectation, and targeted maximum likelihood estimation can be applied, which was first introduced by Bang and Robins [9]. Their method uses the weight of the IPCW method and regression models for the outcome process. It can produce doubly robust estimates, meaning that the estimator is consistent if either the regression model for the hazard of censoring or a regression model for the outcome process is correctly specified, but necessarily both [10,11,12]. However, only a few researchers have applied this method. One of the reasons may be that they are unintuitive because it requires recursive regression models for an iterated conditional expectation; first, we regressed the outcomes measured at t = K on the covariates measured up to t = K  1, second, we regressed the predicted outcome on the covariates measured up to t = K  2, and we continue these procedures until t = 1. The second representation of the gformula is the generalized version of standardization [1, 2], and the parametric gformula estimator (gcomputation algorithm formula) can be applied. The parametric gformula estimator requires models for the outcome and covariate process [13]. It can be regarded as a sequential, nonrecursive imputationbased methodology [14, 15], so it is intuitive for applied researchers. It is flexible because it can easily compare dynamic treatment regimens [16]. However, it requires a specification of fullmodel likelihood, and robustness regarding model correctness can be a concern. A doubly robust estimator for the parametric gformula estimator, involving the timevarying covariates, has not been proposed.
In this paper, we propose an extension of the parametric gformula estimator that is more robust at modeling misspecification. The key idea is to combine the IPCW estimator and the parametric gformula estimator into doubly robust estimators [9, 17,18,19] while incorporating timevarying covariates to adjust for dependent censoring.
The paper is organized as follows. In the next section, we briefly describe the MEGA study and introduce notations and assumptions. We also describe our proposed estimator, and we give settings and the results of simulation studies. Finally, the proposed estimator is applied to the MEGA study data.
Methods
Data, notations, and assumptions
The MEGA study is a prospective, randomized, openlabel, blindedendpointdesigned controlled trial conducted in Japan to evaluate the primary preventive effect of pravastatin against CHD in daily clinical practice. A total of 7832 men and postmenopausal women aged 40–70 years with hypercholesterolemia and no history of CHD or stroke were randomized to dietary therapy only (diet group) or dietary therapy plus 10–20 mg daily pravastatin (diet plus pravastatin group) between February 1994 and March 1999.
After randomization, laboratory tests were conducted at months 1, 3, and 6, and annually thereafter. The followup period was initially scheduled for 5 years. Table 1 shows the types and number of events within 5 years. Although there were three reasons for censoring during the study period (refusal of followup, death by causes other than CHD, and loss to followup), we collectively treated them as censoring before the end of the followup period.
Let t = 1, …, T denote month of followup where T + 1 = 60 months is the followup of interest. There were 7832 patients at baseline, and observations of patients were assumed to be independently identically distributed. R denotes the treatment assigned (R = 1 for assignment to the diet plus pravastatin group, and R = 0 for assignment to the diet group). C_{t} and Y_{t} denote the indicator of censoring and occurrence of a CHD event by time t, respectively, with C_{0} = Y_{0} = 0 by definition. L_{t} denotes timevarying covariates measured at time t, and V denotes baseline covariates that are timeindependent (e.g. sex, current smoker). We assumed that baseline covariates V and L_{0} are always observed. We denoted the history of a variable using overbars. For example, \( {\overline{L}}_t=\left({L}_0,\dots, {L}_t\right) \) is the covariate history through time t. We assumed the order (C_{t}, Y_{t}, L_{t}) within each interval (t  1, t); therefore, Y_{t} and following variables are missing if C_{t} = 1. We defined C_{T + 1} = 1 if C_{T} = Y_{T} = 0 (followup completed).
We wanted to estimate the marginal eventfree survival in each treatment group if any censoring was absent in the study population. However, observed data contains censoring, as in the MEGA study. The usual the Kaplan–Meier estimator assumes independent censoring, that is, the hazard of Y_{t} among subjects at risk is the marginal hazard of Y_{t} given the treatment group. The standardization approach, or a gformula that adjusts for baseline covariates, assumes baselineconditional independent censoring, that is, the hazard of Y_{t} among subjects at risk is the conditional hazard of Y_{t} given the treatment group and baseline covariates [1,2,3,4].
Even when these two assumptions are attainable, estimators discussed in the next section provide a consistent estimate of the marginal survival in each treatment group if any censoring was absent in the study population. These estimators assume positivity (Eq. 1) and are conditionally independent of censoring (Eq. 2);
The conditional independence of censoring assumption, Eq. (2), states that for t = 1, 2, … T, the variables (Y_{t}, …, Y_{T}) is independent of C_{t}, in other words, the distribution of (Y_{t}, …, Y_{T}) is the same between C_{t} = 1 and C_{t} = 0 among subjects who had a similar history of the covariates. The conditionally independence of censoring assumption is also referred to as no unmeasured confounders for the censoring assumption [7], which states that conditional on the treatment groups, baseline covariates, and the timevarying covariates measured until time t  1, the hazard of censoring at time t does not further depend on unmeasured confounders for censoring and unobserved CHD. In the next section, we describe the existing estimators and our proposed estimator for the hazard of Y_{t}.
Existing estimators and proposed estimator
Due to randomization, baseline factors are balanced between treatment groups. In this section, we focus on the diet plus pravastatin group (R = 1) and suppress R for notational simplicity. A similar argument holds for the diet group (R = 0).
Estimators of hazard at t = 1
At time t = 1, the observed data is n copies of (V, L_{0}, C_{1}, (1  C_{1})Y_{1}). We show three types of estimators for Pr(Y_{1} = 1); the IPCW estimator, the parametric gformula estimator, and the doubly robust estimator.
To obtain the IPCW estimate, we need to fit a model for C_{1} such as the logistic model Pr(C_{1} = 0V, L_{0}; α) = e(V, L_{0}; α) = {1 + exp(−α_{0} – α_{1}V – α_{2}L_{0})}^{− 1}. After fitting the model, the IPCW estimator for Pr(Y_{1} = 1) is expressed as \( {n}^{1}\sum \limits_i\left(1{C}_{i1}\right){Y}_{i1}/e\left({V}_i,{L}_{i0};\hat{\alpha}\right) \). The consistency of the IPCW estimator relies on the correct specification of e(V, L_{0}; α). If no censoring is observed at t = 1, we set \( e\left({V}_i,{L}_{i0};\hat{\alpha}\right)=1 \); therefore, the IPCW estimator equals the empirical risk.
To obtain the parametric gformula estimate, we need to fit a model for Y_{1} such as the logistic model Pr(Y_{1} = 1C_{1} = 0, V, L_{0}; β) = p(V, L_{0}; β) = {1 + exp(−β_{0} – β_{1}V – β_{2}L_{0})}^{− 1}. After fitting the model for the subjects not censored at t = 1 (subjects with C_{1} = 0), the parametric gformula estimator for Pr(Y_{1} = 1) is expressed as \( {n}^{1}\sum \limits_ip\left({V}_i,{L}_{i0};\hat{\beta}\right) \). The consistency of the parametric gformula estimator relies on the correct specification of p(V, L_{0}; β).
To obtain the doubly robust estimate, we need to fit a model for C_{1} and Y_{1} similarly as conducted for the IPCW estimator and the parametric gformula estimator, respectively. After fitting the models e(V, L_{0}; α) and p(V, L_{0}; β), the doubly robust estimator for Pr(Y_{1} = 1) is expressed as
The contributions of censored patients or patients with an event are different; for censored patients, their contribution is \( p\left({V}_i,{L}_{i0};\hat{\beta}\right) \) like the gformula estimator, and for patients with an event, their contribution is \( {Y}_{i1}/e\left({V}_i,{L}_{i0};\hat{\alpha}\right)\left\{1e\left({V}_i,{L}_{i0};\hat{\alpha}\right)\right\}p\left({V}_i,{L}_{i0};\hat{\beta}\right)/e\left({V}_i,{L}_{i0};\hat{\alpha}\right) \). The doubly robust estimator is consistent if either the model e(V, L_{0}; α) or p(V, L_{0}; β) is correctly specified [9, 17,18,19]. Intuitively, when the model for censoring is correctly specified, the term \( \left(1{C}_{i1}\right)e\left({V}_i,{L}_{i1};\hat{\alpha}\right) \) should be zero, so (3) reduces to the IPCW estimator and is, therefore, consistent. Inside the summation can be expressed as \( \left(1{C}_{i1}\right)\left\{{Y}_{i1}p\left({V}_i,{L}_{i0};\hat{\beta}\right)\right\}/e\left({V}_i,{L}_{i0};\hat{\alpha}\right)+p\left({V}_i,{L}_{i0};\hat{\beta}\right) \), and when the model for an event is correctly specified, the term \( {Y}_{i1}p\left({V}_i,{L}_{i0};\hat{\beta}\right) \) should be zero, so (3) reduces to the gformula estimator and is, therefore, consistent. Our proposed estimator utilizes this doubly robust estimator for the hazard of Y_{1}. In the next subsection, we show how to extend it to estimate the hazard of Y_{t} (t > 1) incorporating timevarying covariates.
We noted that with one categorical baseline covariate and no parametric model needed for outcomes and censoring, it can be shown that the IPCW estimator, the gformula estimator, and the doubly robust estimator are equivalent. Specifically, given n subjects, all of whom may be stratified into j levels of a baseline covariate, such that a_{j}, m_{j}, and n_{j} are the number observed (i.e. not censored), number of events, and overall number at level j of the covariate, respectively. The IPCW estimator can be written as (1/n)Σ_{j}m_{j}/ (a_{j}/ n_{j}) = (1/n) Σ_{j}n_{j}m_{j}/ a_{j}, because Pr(C_{1} = 0 level j) = a_{j}/ n_{j}. The gformula estimator can be written as (1/n) Σ_{j}n_{j} (m_{j}/ a_{j}), because Pr(Y_{1} = 1 level j) = m_{j}/ a_{j}. Finally, the doubly robust estimator can be written as (1/n) Σ_{j} [m_{j}/ (a_{j}/ n_{j}) – {(n_{j}  a_{j}) (0  a_{j}/ n_{j}) / (a_{j}/ n_{j}) + a_{j} (1  a_{j}/ n_{j}) / (a_{j}/ n_{j})} (m_{j}/ a_{j})] = (1/n) Σ_{j}n_{j}m_{j}/ a_{j}, which is exactly a common form as the IPCW estimator and the gformula estimator.
Estimators of hazard at t > 1
In this subsection, we show the estimators of the hazard of Y_{t} (t > 1), which are extended versions of the IPCW estimator and the parametric gformula estimators for Pr(Y_{1} = 1). Finally, we propose a doubly robust estimator that extends Eq. (3).
To obtain the IPCW Kaplan–Meier estimate, we need to fit a model for C_{t} such as the pooled logistic model,
In the model, it is possible to include L_{0}, …, L_{t – 2}, but in some cases, it may cause multicollinearity due to the correlation between L_{0}, …, L_{t  1}. After fitting the model using the maximum likelihood estimation, the IPCW Kaplan–Meier estimator for the hazard of Y_{t} is expressed as \( \hat{\Pr}\left({Y}_t=1{\overline{Y}}_{t1}=0\right)={\sum}_i{Y}_{it}{\pi}_{it}\left(\hat{\alpha}\right)/{\sum}_i{X}_{it}{\pi}_{it}\left(\hat{\alpha}\right) \), where \( {\pi}_t\left(\hat{\alpha}\right) \) is obtained as
and X_{t} is the atrisk indicator, which is 1 if the patient is atrisk at time t and is 0 otherwise. Finally, the risk at t can be obtained as \( 1\prod \limits_{j=1}^t\left\{1\hat{\Pr}\left({Y}_j=1{\overline{Y}}_{j1}=0\right)\right\} \). The consistency of the IPCW Kaplan–Meier estimator relies on the correct specification of the model for C_{t} (Eq. 4) [7]. Note that the IPCW Kaplan–Meier estimator reduces to the usual Kaplan–Meier estimator when α_{1} and α_{2} of Eq. (4) are 0, that is, the independent censoring assumption is true [20].
To obtain the parametric gformula estimate, we need to fit a model for Y_{t}. Unlike baseline covariates, timevarying covariates will not be measured for patients who were censored before time t. Thus, we need to specify the fullmodel likelihood (likelihood for conditional event probability and timevarying covariates) by fitting models for Y_{t} and L_{t} such as
and
After fitting the models using the maximum likelihood estimation, we sequentially imputed the conditional probability of CHD event and timevarying covariates from t = 1 to T. The parametric gformula estimator for the risk at t can be obtained as \( {n}^{1}{\sum}_{j=1}^t{\sum}_i{m}_{\mathrm{ij}}\left(\hat{\beta},\hat{\gamma}\right) \), where \( {m}_t\left(\hat{\beta},\hat{\gamma}\right) \) is obtained as \( {m}_t\left(\hat{\beta},\hat{\gamma}\right)=\Pr \left({Y}_t=1{\overline{Y}}_{t1}=0,V,{\overline{L}}_{t1};\hat{\beta},\hat{\gamma}\right){\prod}_{j=1}^{t1}\left\{1\Pr \left({Y}_j=1{\overline{Y}}_{j1}=0,V,{\overline{L}}_{j1};\hat{\beta},\hat{\gamma}\right)\right\} \). The consistency of the parametric gformula estimator relies on the correct specification of the model for Y_{t} (Eq. 5) and the model for L_{t} (Eq. 6) [16, 21, 22].
We propose an estimator of the hazard of Y_{t} that extends the doubly robust estimator (Eq. 3). To obtain the estimate, we need to fit models for C_{t}, Y_{t}, and L_{t} as conducted for the IPCW Kaplan–Meier estimator (Eq. 4) and the parametric gformula estimator (Eqs. 5 and 6). After fitting these models, the proposed doubly robust cf Y_{t} is expressed as,
where Z_{t} is the atrisk or censored indicator, which is 1 if the patient is atrisk at time t or censored by t and is 0 otherwise. The contributions of patients censored by t or patients with an event at t are different; for censored patients, their contribution is \( \Pr \left({Y}_t=1{\overline{Y}}_{t1}=0,V,{\overline{L}}_{t1};\hat{\beta},\hat{\gamma}\right) \). For patients with an event, their contribution of an event is weighted by the inverse probability of uncensored until t. Finally, the risk at t is obtained as \( 1{\prod}_{j=1}^t\left\{1\hat{\Pr}\left({Y}_j=1{\overline{Y}}_{j1}=0\right)\right\} \). The weights and predicted event probabilities are similar to the ones used in the IPCW Kaplan–Meier estimator and the parametric gformula estimator, but we need to calculate the function (7) and risk at t. As demonstrated in the Additional file 1 (Appendix A), this estimator is consistent if either the model for C_{t} (Eq. 4) or models for Y_{t} and L_{t} (Eq. 5 and 6) is correctly specified. In the IPCW Kaplan–Meier estimator, patients with C_{t} = 1 were out of the risk set; therefore, they do not contribute to the estimation of the hazard of Y_{t}. On the other hand, patients with C_{t} = 1 contribute to the estimation of the hazard of Y_{t} in Eq. (7), which might lead to statistical efficiency. The variance estimate of the proposed estimator can be obtained through a nonparametric bootstrap [23]. We have provided a SAS code for the proposed estimator in an Additional file 1 (Appendix B).
Comparison with existing doubly robust estimators
In this subsection, we briefly compare our proposed estimator (7) with existing doubly robust estimators [9, 24, 25]. Zhang et al. [24] and Bai et al. [25] proposed doubly robust estimators for survival functions, which can be summarized as follows:

Confounding between treatment groups: present due to the observational study setting

Censoring mechanism: baselineconditional independent censoring (censoring may depend only on the baseline covariates)
On the other hand, we proposed an estimator for survival functions,

Confounding between treatment groups: absent due to randomization

Censoring mechanism: conditional independent censoring (censoring may depend on timevarying covariates)
In RCT settings considered here, where no baseline confounding occurs between the treatment groups, the proposed estimator that specifies an empty set as L_{t} (thus models are unnecessary for the joint density of L_{t}) essentially results in the existing doubly robust estimators provided in [24, 25]. In other words, these existing estimators assume a baselineconditional independent censoring mechanism, although they also attempt to adjust for baselineconfounding between the groups in observationalstudy settings.
Bang et al. [9] proposed a doubly robust estimator for the gformula represented by an iterated conditional expectation. The estimator needs recursive fitting of the iterative conditional expectation. However, as Bang et al. [9] noted, the parametric models can be incompatible with each other, so it is difficult to specify all the models correctly.
Simulation study
To evaluate the performance of the proposed estimator, we carried out simulation studies with dependent censoring due to a timevarying covariate. We simulated data from two treatment groups, coded as R = 0 (control treatment) and R = 1 (test treatment). The simulations were based on 1000 replications. We considered the situation where baseline covariates were measured at time t = 0, and timevarying covariate and censoring were investigated at time t = 1, … 4, on the other hand, event time was measured from time t = 0 to t = 5 on a continuous time scale. We were interested in the treatment groupspecific risks and the risk ratio at t = 3 and t = 5.
For each patient i (= 1, …, 8000), a baseline covariate V was generated from the Bernoulli distribution of success probability 0.5. Independently, the timevarying covariate at t (= 0, …, 4) was generated from the following mixed effect model,
Random variables (b_{i0}, b_{i1}) were generated from a bivariate normal distribution with means of 0 and variance of 1.0 and 0.5, respectively, with a covariance of 0.5. The random error ϵ_{it} was generated from the standard normal distribution. Distributions of L_{t} were the same in both treatment groups at t = 0 but declined more steeply in the test treatment group such that L_{t} mimicked TC in the MEGA study.
First, we generated a time to event T_{i} from the piecewise exponential model, whose hazard function was, for t > 0 and k ≤ t < k + 1 (k = 0, …, 4),
where U_{t} = 1 if L_{t} < 0, otherwise U_{t} = 0. Therefore, potential event time was shorter in the control treatment group through the effect of group and timevarying covariate.
Next, we generated censoring C_{t} at t (= 1, …, 4) from the Bernoulli distribution, whose probability was generated using the following logistic model,
\( {\overline{C}}_4 \) = 0 and T_{i} > 5 indicates that the followup was completed. The direct dependence between the event and the censoring time is shown in Additional file 1 (Appendix C).
We considered three scenarios for α_{0} and α_{R.}: censoring probabilities in the control and test treatment groups are both 30% (scenario 1), both 20% (scenario 2), and 9 and 12%, respectively (scenario 3). The probabilities in scenario 3 were derived from Table 1.
We created 20,000,000 simulated patients without censoring to calculate the true value of survival probability using their empirical distribution. To understand the performance of estimators, we considered eight situations: all combinations of correct or incorrect censoring models, event models, and covariate models. We defined correct models for censoring, event, and covariate as a model that specified the same covariates with the datagenerating model. We defined incorrect models for censoring and event as a model that specified by replacing U_{t} by exp(L_{t}) without incorporating V. An incorrect covariate model was specified without incorporating the interaction term of b_{i1} and t.
Simulations were evaluated in terms of the bias (mean difference between estimated and true parameter value) and relative efficiency (the ratio of the Monte Carlo standard deviation of the IPCW Kaplan–Meier estimator to that of the estimator) of the estimated survival probabilities at time t = 3 and t = 5.
Results
Simulation results
We present our simulation results in Table 2. In Table 2, if the bias exceeded half of the standard error of the estimates, the printed bias was is shown in bold. In scenario 1, the bias for each group at t = 5 was seen for the IPCW Kaplan–Meier estimator when the censoring model is incorrect, for the parametric gformula estimator when one of the event model or covariate model is incorrect. However, our proposed estimator is unbiased when at least one of the censoring model or event model is correctly specified. This result reflected the double robustness of our proposed estimator; when the censoring model or set of event and covariate models are correct, the estimate is unbiased. Unexpectedly, our proposed estimator is less biased than the parametric gformula estimator, even when the covariate model was incorrect. We consider that this property is only in this simulation because if the covariate model is incorrect, the estimated event probability is also incorrect for true probability. At t = 3, the parametric gformula estimator showed less bias for the test treatment group even when the event model is incorrect. Regarding the bias, similar results can be seen in the other two scenarios.
Regarding the relative efficiency using the IPCW Kaplan–Meier estimator as the reference, both the parametric gformula estimator and our proposed estimator were more efficient at t = 3 than the reference in scenarios 1 and 2. The parametric gformula estimator was more efficient than the reference even at t = 5. However, our proposed estimator had a similar standard error to the reference. In scenario 3, where the censoring probability was the lowest among the scenarios, our proposed estimator had a similar standard error as the reference at both t = 3 and 5. The coverage probability of the proposed estimator using the bootstrap method with the correctly specified models was close to the nominal level of 95%. In summary, the efficiency recovery of our proposed estimator may be affected by the censoring probabilities (comparing between the scenarios) and the number of time points (comparing t = 3 and t = 5). When the censoring probability is high but the number of time points is less than five, our proposed estimator might be more efficient than the IPCW Kaplan–Meier estimator.
Data applications
Our proposed estimator was applied to the MEGA study data to estimate treatment groupspecific risks at 5 years after randomization. As baseline covariates, we included age (years), gender, body mass index, history of hypertension and diabetes, hypercholesterolemia medication history, current smoking, current alcohol drinking, triglyceride, highdensity lipoprotein cholesterol, and lowdensity lipoprotein cholesterol. As the timevarying covariate, we used recent TC.
After transforming our data into one record per persontime, we estimated the survival curve using our proposed estimator. First, we fitted the models for censoring C_{t}, event Y_{t}, and covariate L_{t}. We fitted pooled logistic models for C_{t} and Y_{t}, where the timevarying intercept was included as a restricted cubic spline with 4 knots at 1–4 years after randomization. We fitted a linear model for L_{t}. By fitting the pooled logistic model for Y_{t}, classical risk factors for CHD (age, male, hypertension, and diabetes) were found to be the prognostic factors (Additional file 1, Appendix D). By fitting the model for C_{t}, those without hypertension, diabetes, or no history of medication for hyperlipidemia, tended to be censored before the end of the followup period (Additional file 1, Appendix E). Unexpectedly, timevarying TC hardly affected the event or censoring after adjusted for those important baseline covariates; therefore, baselineconditional independence assumption rather than conditional independence assumption might be plausible in the MEGA study.
We estimated the risk of CHD incidence at 5 years from randomization using the Kaplan–Meier estimator, IPCW Kaplan–Meier estimator, the parametric gformula estimator, and our proposed estimator. The results are shown in Table 3. In the MEGA study dataset, the risk of CHD estimated using the usual Kaplan–Meier estimator and the risk estimated by other estimators were very similar. This may be due to the small impact of dependent censoring in the MEGA study and correctness of model specification for censoring and events. Because the ordinal Kaplan–Meier estimator showed similar results as the other three estimators that adjust for the possible dependent censoring, the impact of dependent censoring must be very mild. If the censoring model or event model was misspecified, the results from the other three estimators might be more different. Therefore, the results from the three estimators may indicate that the postulated models were nearly correctly specified. The estimated confidence interval of the parametric gformula estimator was narrower than the other estimators. The estimated confidence interval for the risk of diet + pravastatin group of our proposed estimator was narrower than the IPCW Kaplan–Meier estimator.
Discussion
In this paper, we proposed a doubly robust estimator of risk that adjusts for dependent censoring due to timevarying covariates in RCT settings. The novelty of our proposed estimator is as an extension of the existing estimator [9, 19] for more complex data with t > 1 and timevarying covariates. The IPCW Kaplan–Meier estimator is routinely used in the analysis of RCTs for the purpose of adjusting for dependent censoring with timevarying covariates measured throughout the followup period. We have also provided SAS codes (Additional file 1, Appendix B) and an example of simulation data (Additional file 2). The important property of our proposed estimator is the double protection against model misspecification. Because risk factors for the endpoints are often identified before the beginning of the RCT, by measuring them longitudinally at as many time points as possible and by using them when constructing the models, we are in a better position to approximate the true regression function.
The second property of our proposed estimator is the efficiency recovery over the IPCW Kaplan–Meier estimator, as shown in the simulation study. The degree of efficiency recovery could depend on either the censoring probability, event probability, the dependency of variables, or all of these factors combined. As studied previously [8], we considered that the censoring probability is an important factor. Further studies will involve understanding the factors that affect the degree of efficiency recovery using further simulations. In the simulation study and analysis of the MEGA study, the parametric gformula estimator outperformed regarding efficiency. This phenomenon was expected because the asymptotic variance of the classical doubly robust estimator is no smaller than that of the gformula estimator [26].
Our estimator relies on the assumption that censoring and event time are independent conditional on observed covariates including timevarying ones. However, in a situation that censoring and event time are not independent even if we condition on timevarying covariates, our proposed estimator and other existing estimators cannot correct for selection bias. We also need the assumption of correct model specification. We need to incorporate the covariates that affect both event and censoring probabilities, and moreover, we need to specify the model form that approximates the true regression function.
In this study, we considered the estimation of a survival function in a specific group. If we compare two or more survival functions that may be observed with different interventions, we also need an additional exchangeability assumption (or the nounmeasured confounders assumption) between the intervention groups [27]. In the simulations and data analysis, the exchangeability assumption is satisfied at baseline owing to the randomized design. In a future study, it will be interesting to extend our estimator into the observational study setting [24, 25].
All the estimators in this study can be applied to rightcensored data. We consider that our proposed estimator cannot be applied to the data with interval or leftcensored data in its current form. With those censoring, we know that an event has occurred only before a specific time. In this situation, how to predict event probability and how to weight uncensored subjects are not obvious. Note that the MEGA study corrects exact event time, so we consider that interval censoring or left censoring is absent in the real data.
There are several reasons for censoring in the MEGA study, as shown in Table 1. We treated refusal of followup, death by causes other than CHD, and loss to followup as reasons for censoring in the censoring model. Three estimators, including our proposed estimator, assessed the hypothetical survival function when there was no censoring. It may be meaningful to consider whether a survival function can be obtained if refusal to followup and loss to followup did not occur. When we separately accounted for the two reasons for dropouts, the survival curve was similar to the one using the Kaplan–Meier method [28]. However, death by causes other than CHD needs additional consideration, because it is difficult to cease such competing risks for CHD without lowering the risk of CHD. Therefore, if there was no death by causes other than CHD, the survival function would be slightly lower than we estimated. Because in the MEGA study the proportion of censoring due to death by causes other than CHD was less than 1.5%, we believe the estimated survival functions are close to the true survival function, which would be obtained if these censorings had not occurred.
There are two limitations in this study. First, we were not able to verify the assumptions with the measured data. The positivity assumption will be satisfied unless the conditional probabilities of censoring are zero for all patients at t = 1,…, T. In the analysis of the MEGA study data, there were no patients who had an estimated probability of censoring near 1 (data not shown); therefore, we considered that the positivity assumption is acceptable. Conditional independence assumption implies that the treatment group, measured baseline, and timevarying covariates can completely explain censoring. However, given a rich collection of measured prognostic factors, the conditional independence assumption can be approximated. Several clinically important prognostic factors were measured in the MEGA study, and we used all of the baseline covariates and a timevarying covariate, TC. We considered timevarying TC was important for event and censoring probability, but the hazard ratio was close to 1; therefore, the impact of dependent censoring was very mild. In the future, we need to apply our estimator to data with censoring dependent on timevarying factors. The second limitation was the range of the simulation study. Because we were interested in the statistical properties of the estimators with fitted correct/incorrect models, the behavior of the estimators when other assumptions, such as positivity, were violated is unknown. We need further simulation studies to understand the performance of the estimators.
Conclusions
The proposed estimator is useful for the estimation of risk if censoring affected by timevarying risk factors occurred because of the doubly robust property and statistical efficiency over the IPCW Kaplan–Meier method.
Abbreviations
 CHD:

Coronary heart disease
 IPCW:

Inverse probabilityofcensoring weighted
 MEGA:

Management of Elevated Cholesterol in the Primary Prevention Group of Adult Japanese
 RCT:

Randomized controlled trial;
 TC:

Total cholesterol
References
 1.
Cole SR, Hernán MA. Adjusted survival curves with inverse probability weights. Comput Methods Prog Biomed. 2004;75:45–9.
 2.
Xie J, Liu C. Adjusted KaplanMeier estimator and logrank test with inverse probability of treatment weighting for survival data. Stat Med. 2005;24:3089–110.
 3.
Shinozaki T, Matsuyama Y. Doubly robust estimation of standardized risk difference and ratio in the exposed population. Epidemiology. 2015;26:873–7.
 4.
Komukai S, Hattori S. Doubly robust estimator for net survival rate in analyses of cancer registry data. Biometrics. 2017;73:124–33.
 5.
Nakamura H, Arakawa K, Itakura H, Kitabatake A, Goto Y, Toyota T, et al. Primary prevention of cardiovascular disease with pravastatin in Japan (MEGA study): a prospective randomised controlled trial. Lancet. 2006;368:1155–63.
 6.
Kleinbaum DG, Klein M. Survival analysis—a self learning text. 2nd ed. New York: Springer; 2005.
 7.
Robins JM, Finkelstein DM. Correcting for noncompliance and dependent censoring in an AIDS clinical trial with inverse probability of censoring weighted (IPCW) logrank tests. Biometrics. 2000;56:779–88.
 8.
Stitelman OM, De Gruttola V, van der Laan MJ. A general implementation of TMLE for longitudinal data applied to causal inference in survival analysis. Int J Biostat. 2012;8:1–37.
 9.
Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics. 2005;61:962–72.
 10.
Schnitzer ME, van der Laan MJ, Moodie EEM, Platt RW. Effect of breastfeeding on gastrointestinal infection in infants : a targeted maximum likelihood. Ann Appl Stat. 2014;8:703–25.
 11.
Petersen M, Schwab J, Gruber S, Blaser N, Schomaker M, van der Laan MJ. Targeted maximum likelihood estimation for dynamic and static longitudinal marginal structural working models. J Causal Inference. 2014;2:147–85.
 12.
Schnitzer ME, Lok JJ, Bosch RJ. Double robust and efficient estimation of a prognostic model for events in the presence of dependent censoring. Biostatistics. 2016;17:165–77.
 13.
Robins J. The control of confounding by intermediate variables. Stat Med. 1989;8:679–701.
 14.
Westreich D, Edwards JK, Cole SR, Platt RW, Mumford SL, Schisterman EF. Imputation approaches for potential outcomes in causal inference. Int J Epidemiol. 2015;44:1731–7.
 15.
Wang A, Nianogo RA, Arah OA. Gcomputation of average treatment effects on the treated and the untreated. BMC Med Res Methodol. 2017;17:3.
 16.
Taubman SL, Robins JM, Mittleman MA, Hernán MA. Intervening on risk factors for coronary heart disease : an application of the parametric gformula. Int J Epidemiol. 2009;38:1599–611.
 17.
Lunceford JK, Davidian M. Stratifcation and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Stat Med. 2004;23:2937–60.
 18.
Tsiatis AA. Semiparametric theory and missing data. New York: Springer; 2006.
 19.
Kang JDY, Schafer JL. Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Stat Sci. 2007;22:523–39.
 20.
Satten GA, Datta S. The KaplanMeier estimator as an inverseprobabilityofcensoring weighted average. Am Stat. 2001;55:207–10.
 21.
Westreich D, Cole SR, Young JG, Palella F, Tien PC, Kingsley L, et al. The parametric gformula to estimate the effect of highly active antiretroviral therapy on incident AIDS or death. Stat Med. 2012;31:2000–9.
 22.
Young JG, Cain LE, Robins JM, O’Reilly EJ, Hernán MA. Comparative effectiveness of dynamic treatment regimes: an application of the parametric gformula. Stat Biosci. 2011;3:119–43.
 23.
Efron B, Tibshirani RJ. An introduction to the bootstrap. New York: Chapman and Hall; 1993.
 24.
Zhang M, Schaubel DE. Contrasting treatmentspecific survival using doublerobust estimators. Stat Med. 2012;31:4255–68.
 25.
Bai X, Tsiatis AA, O’Brien SM. Doublyrobust estimators of treatmentspecific survival distributions in observational studies with stratified sampling. Biometrics. 2013;69:830–9.
 26.
Tan Z. Comment: understanding OR. PS DR Stat Sci. 2007;22:560–8.
 27.
Hernán MA, Robins JM. Causal inference: what if. Chapman & Hall/CRC: Boca Raton; 2020.
 28.
Yoshida M, Matsuyama Y, Ohashi Y, For the MEGA study group. Estimation of treatment effect adjusting for dependent censoring using the IPCW method: an application to a large primary prevention study for coronary events (MEGA study). Clin Trials. 2007;4:318–28.
Acknowledgments
This work was presented at the Joint Statistical Meeting (JSM) 2015 (https://ww2.amstat.org/meetings/jsm/2015/onlineprogram/AbstractDetails.cfm?abstractid=315584). The authors thank Daiichi Sankyo Co. Ltd. for providing invaluable MEGA study data and Dr. Koji Oba for reading an earlier draft of this article. Finally, we are grateful to the MEGA study group.
Funding
This work was supported by the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Number JP18K17314 in writing the manuscript.
Author information
Affiliations
Contributions
TK, TS, and YM designed the concept of this research. TK conducted the simulation study and analyzed the MEGA study data. TK and TS drafted the manuscript. YM supervised this study and critically reviewed the manuscript. All the authors have read and approved the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable, because this paper focuses on the development of statistical methods.
Consent for publication
Not applicable.
Competing interests
None declared.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Kawahara, T., Shinozaki, T. & Matsuyama, Y. Doubly robust estimator of risk in the presence of censoring dependent on timevarying covariates: application to a primary prevention trial for coronary events with pravastatin. BMC Med Res Methodol 20, 204 (2020). https://doi.org/10.1186/s12874020010878
Received:
Accepted:
Published:
Keywords
 Double robustness
 Dependent censoring
 Prediction
 Timevarying covariate