Comparing survival functions with interval-censored data in the presence of an intermediate clinical event

Background In the presence of an intermediate clinical event, the analysis of time-to-event survival data by conventional approaches, such as the log-rank test, can result in biased results due to the length-biased characteristics. Methods In the present study, we extend the studies of Finkelstein and Nam & Zelen to propose new methods for handling interval-censored data with an intermediate clinical event using multiple imputation. The proposed methods consider two types of weights in multiple imputation: 1) uniform weight and 2) the weighted weight methods. Results Extensive simulation studies were performed to compare the proposed tests with existing methods regarding type I error and power. Our simulation results demonstrate that for all scenarios, our proposed methods exhibit a superior performance compared with the stratified log-rank and the log-rank tests. Data from a randomized clinical study to test the efficacy of sorafenib/sunitinib vs. sunitinib/sorafenib to treat metastatic renal cell carcinoma were analyzed under the proposed methods to illustrate their performance on real data. Conclusions In the absence of intensive iterations, our proposed methods show a superior performance compared with the stratified log-rank and the log-rank test regarding type I error and power.


Background
In clinical trials and longitudinal studies, a subject under study may experience an intermediate clinical event (IE) before the event of interest. The occurrence of the IE may induce changes in the survival distribution. An example of a length-biased problem due to the IE is the heart transplantation study [1]. It is necessary to know whether a heart transplant would be beneficial. The waiting time of subjects who eventually have a heart transplant must be long enough to receive treatment, whereas there is no requirement for not having a heart transplant.
To resolve length-biased problems due to the IE, the time-dependent Cox regression and landmark studies were conducted [1,2]. The score tests based on counterfactual variables were derived by Lefkopoulou and Zelen [3] and Nam and Zelen [4]. Moreover, when the primary previous study but used different covariance matrix estimator [11]. Kim et al. [12] studied another log-rank type test that did not use an iterative algorithm. A uniform weights algorithm was proposed where a subject contributed uniformly to each mass point s k ; point of the set, which consisted of all the distinct endpoints of the observed intervals.
A few methods have been suggested for left truncated and interval-censored (LTIC) data. Turnbull's characterization was corrected to accommodate both truncation and interval-censoring time points [13]. It was extended to the regression model under the proportional assumption [14]. Pan and Chappell noted that NPMLE is inconsistent for the early times with LTIC data, while conditional NPMLE is consistent [15]. The estimation of the parameters in the Cox model with LTIC data and a rank-based test of survival function in LTIC were studied [16,17]. However, the length-biased problem was not considered in those methods.
Most existing methods for interval-censored data use intensively iterative computation. To avoid this, an imputation method was considered in this study. We can obtain complete or (left-truncated and) right-censored data after imputation of the (left-truncated and) interval-censored data. Subsequently, standard statistical methods can be applied to the imputed data. For right-censored data, a semiparametric algorithm was proposed [18], motivated by the data augmentation algorithm [19]. Pan proposed a MI using Cox regression for interval-censored data by adapting previous method [20]. They repeated the algorithm until the coefficient β h converged, where h denotes the number of iterations. A two-sample test with interval-censored data was studied via MI based on the approximate Bayesian bootstrap [21]. The MI for intervalcensored data with auxiliary variables was studied [22]. Zhao and Sun [10] and Kim et al. [12] used MI techniques for computing the variance of test statistics. A log-rank test via MI was proposed [11]. After estimating the NPMLE using Turnbull's algorithm, they imputed the exact time for all data points including right-censored data from the conditional probability of NPMLE. The methods of MI using Cox regression were extended to accommodate left-truncation [23,24].
The purpose of this paper is to suggest new methods for analyzing LTIC data using MI.
This study is organized as follows. First, we introduce the notations and framework for interval-censored survival data. In the theoretical model and study hypotheses section, we explain a statistical procedure to compare two survival functions in the presence of the IE. Then, we propose our method with extensive simulation studies. The simulations are conducted to evaluate the properties of multiple imputation. An analysis of the Randomized Phase III SWITCH study was undertaken in the real example section, and we conclude the study with a short discussion.

Notation and framework
The survival time of a subject who experienced the IE implied that the survival time should exceed the waiting time for the IE. This reflects the length bias phenomenon; namely, a subject has to live long enough to experience the IE. We assume that the IE is binary and that only two treatment groups exist. Let W and T be positive real-valued random variables representing the waiting time until the occurrence of the IE and the time to an event of interest, respectively. We assume the independent of the event time T and waiting time W. Define a binary random variable Z to be Z = I{W ≤ T}. The random variables T 0 and T 1 are defined as the times to the event of interest conditional on Z = 0 and 1, respectively, namely, The density probability functions of W, T 0 , and T 1 are defined as g(w), q 0 (t), and q 1 (t), respectively; moreover, the corresponding survival distribution , respectively. The model with Z = 1 implied that the waiting time was observed before the failure time T. Therefore, T 1 was left truncated at the waiting time W. {B i , 1 ≤ i ≤ N} were considered as the truncation sets, specifically, B i = (W i , ∞), where N is number of total subjects. We further assume that the time to the event of interest T is interval-censored. Therefore, for the ith subject, we did not observe T exactly but observed T ∈ A i , where A i = (L i , R i ] is the interval in which the event of interest occured. If R i = ∞, we call it a right-censored observation. If L i = R i , we call it an exact observation. Let δ i = 1, if the ith subject has experienced the event of interest; otherwise, it was considered 0. We consider the set of N independent pairs {A i , B i }. We assume A i ⊆ B i . We now characterize the following union setC k with all observed points including left-truncated points, which may have a positive mass as mentioned by Frydman [13], where k = 0, 1. For the survival distribution of T 0 , L i and R i of a subject who does not experience the IE is included in the setC 0 . When the IE occurs (Z = 1), the waiting time W is a change point of distribution for survival. Thus, the information of the event exceeding W can no longer be observed. Therefore, the waiting time W for the IE is included inC 0 for T 0 as the right-censoring time, but the event time exceeding W is not included in setC 0 .
For the survival distribution of T 1 , L i and R i of a subject who experienced the IE and the waiting time W as a lefttruncated time are included in the setC 1 . The subject who does not experience the IE is not included in setC 1 .

Theoretical model and study hypotheses
Nam and Zelen [4] studied a length-biased problem with right-censored data in the presence of the IE. A subject who does not experience the IE means that the waiting time W for the IE has been right-censored; namely, A subject experiences the IE at W, the survival distribution is changed at w and the event occurs at t; namely, versus the general alternative, which is the complement of H 0 , could be considered, where A, B are two populations. Notably, the hypotheses were independent of the waiting time distribution.
They derived the score test using a proportional hazards model for comparing two sample survival functions. The score test could be written using the counting process , where x = 1 if the observations were from A; otherwise, it was 0. The statisticsŜ 1 can be written aŝ and under the null hypothesis has mean zero and variance The statisticsŜ 0 can be written aŝ Hence, an appropriate chi-square statistic with 2 degrees of freedom for testing H 0 is given by

Proposed methods
Multiple imputation converts interval-censored data to right-censored data so that standard methods can be applied. This method can simplify complicated situations. We propose two methods: 1) uniform weight method and 2) weighted weight method. The uniform method closely follows the method of Kim et al. [12] and the weighted method closely followed that of Huang et al. [11] to accommodate for left truncation. After imputation, the score statistics χ 2 2 were used [4].

Uniform weight method
Kim et al. [12] assumed that the true failure time of a subject may be uniformly distributed over {s j , L i < s j ≤ R i , for j = 1, ..., m}. They calculated a pseudo-risk and failure set based on uniform weights. They used the MI techniques to estimate the variance matrix. In this study, we used the MI techniques for deriving the test statistics and their variance-covariance matrix including the imputation of a true failure time under the same assumption. We used a moderate imputation number (M = 10) [20].
Step 0: Set r = 1, where r denotes an imputation number.
Step 1. Characterize the setC k i for each of T k for k = 0, 1. The distinct endpoints set C k Step 2: If the ith observation is interval-censored, a value randomly sampled from a set C k i is generated. Notably, after imputing the exact time, T (r) 0 is the right-censored data, while T 1 , we only used the data with Z i = 1.
Step 5: Compute the sum of the average withinimputation covariance associated with S k and the between-imputation variance of S k .
In the present study, we applied two types of variances. The first is as described above: adding withinand between variances. The second is the subtraction of the two variances, which works well when the rate of follow-up loss is high [11]. The second term is formed as Thus, we can test H 0 based on where the distribution follows a chi-square with 2 degrees of freedom.

Weighted weight method based on NPMLE
We propose another weighted weight method based on NPMLE. We estimated the NPMLE from the original data set by Turnbull's algorithm and used the NPMLE as weights for the imputation. The data were LTIC when having the IE; therefore, we characterized the set that may have a positive mass including truncated points, same as the above method.
Step 1. Estimate the NPMLE from the original data set.
Step 2. Using the NPMLE as weight, impute the data conditional on sample from the distribution NPMLE using the NPMLE as weight Steps 3-5. Same as the part of the uniform weight method. Based on the rth imputed (left-truncated) right-censored data, we can calculate the average Nam and Zelen statistics and variance using the weighted weight method.

Data generation
We generated the true failure time T 0 and waiting time W from the survival distribution below: Note that the probability of experiencing the IE is δ that takes values 0 or 1 and follows a Bernoulli distribution with a censoring probability c p . c p is set as 0 or 0.3. We could obtain the data set as To generate interval-censored data, we first generated (T i , δ i ) as above, where T i and δ i are independent. We assumed that each subject was scheduled to be examined at p different visits. The first scheduled visit time E is generated from U(0, ψ). For a subject having the IE, the first scheduled visit time E is equal to or greater than the waiting time W (E ∼ U(W , W + ψ)). The length of the time interval between two follow-up visits was assumed as a constant, ψ = 0.5. The survival time T i is observed in one of intervals (0, Let E k denote the kth scheduled visit. At each of these time points, it was assumed that a subject could miss the scheduled visit. In such cases, L i is defined as the largest follow-up visit E k among scheduled visit points less the T i . Also, R i is defined as the smallest follow-up visit E i among scheduled visit points greater than T i . If δ i = 0, the observation on T i is right-censored. If δ i = 1, the observation on T i is observed on (L i , R i ]. For right-censored data (δ i = 0), we set L i as it is, but R i is set to infinity.
In the present study, we did not restrict the number of follow-up visits because a subject having the IE should survive during the waiting time and have more chance to follow up for longer. We assumed that every subject visits at the first visit time point, E. After that, there is a probability that a subject might not comply with the follow-up visits. We assume that a subject might miss any of the follow-up visits and is more likely to miss later visits (such as 0.1 for the first year, and 0.2 thereafter, using the Bernoulli distribution).
For comparison, we included the log-rank test and the stratified log-rank test (the stratum is experiencing the IE or not) along with our proposed tests. For the log-rank and stratified log-rank test, the true failure times were used rather than the interval-censored ones. We used two variance forms, which were formed by (1) adding and (2) subtracting within and between variance. The sample sizes were selected as 50, 100 and 200 for each group. The results reported are based on 1000 replications for each scenario.

Simulation results
The results of the simulations are summarized from Tables 1, 2 and 3. Tables 1 and 2 show the estimate of the upper 5% of each of the five tests under the null hypothesis, whereas Table 3 shows the power under the alternative hypothesis for each scenario. The proposed methods show the appropriate 5% significant level under all scenarios. For the variance with adding form (1), the methods marginally overestimate the variance; thus, the effect sizes are less than 0.05 for most of scenarios. For the variance with subtracting form (2), the methods slightly underestimate the variance. The stratified log-rank test was unsatisfactory if the proportion of experiencing the IE was different between the two groups (such as θ A is not equal to θ B .). The log-rank test satisfied the nominal significance level if the survival functions were not changed after experiencing the IE regardless of the proportion. The change in survival distribution after experiencing the IE (such as, m 0A was not equal to m 1A .) in addition to the difference in the proportion of the IE, which caused the log-rank test to be inappropriate. The comparison of uniform and weighted weights multiple imputation methods did not show significant differences.
When θ A = θ B = 0.5, the simulation results confirmed that all tests gave the correct 5% significance level. Hence, the power calculations were restricted to this case. The value of the other parameters was m 0A = m 0B = 1, m 1A = 2. Only the mean time to failure was changed for m 2B . The increase in sample size or a decrease in the value of the censoring fraction c p caused increase in the difference of mean time to failure, thus indicating that the power of the tests could be improved. In all cases, the proposed methods have superior power by taking advantage of the knowledge of the IE.

Real data example
In this section, we illustrate the proposed method using real data from a randomized clinical trial evaluating the efficacy of tyrosine kinase inhibitors sorafenib and sunitinib in the treatment of patients with metastatic renal cell carcinoma. The primary endpoint was total progression-free survival (PFS), which was defined as the interval between the randomization (the start date of first-line therapy) to disease progression or death during second-line therapy. For subjects who did not switch to per-protocol second-line therapy, the first-line events were used. Subjects without tumor progression or death during second-line therapy were censored. The details of the study have been published [25].
We chose this study to illustrate our methods because it presented interesting aspects of IE. The proportion that was administered a second-line therapy was higher Table 2 Empirical 5%-level tests by varying θ B , m 1A , and m 1B with θ A = 0.5 when censoring fraction is 0.3, and there are some missed visits with a probability of 0.1 for the first year and then of 0.2 thereafter in sorafenib-sunitinib (So-Su) compared with sunitinibsorafenib (Su-So) (57% vs 42%, P value <0.01). The total PFS and PFS of first-line treatment did not show a significant difference (So-Su vs. Su-So: 12.5 mo vs. 14.9 mo (P value = 0.5), 5.9 mo vs. 8.5 mo (P value = 0.9), respectively), whereas the PFS of second-line therapy showed a shorter duration in Su-So (5.4 mo vs. 2.8 mo, P value <0.001). Receiving the second-line therapy might be considered as experiencing the IE to compare the difference in survival functions by utilizing the knowledge of the proportion of having second-line therapy and the duration of first-and second-line therapy with different hazards assumption.
Since it is difficult to obtain raw data in this study, we extracted numerical data from the Kaplan-Meier (KM) graph on the total, first-line, and second-line PFS [25] by using WebPlotDigitizer v.3.9 (http://arohatgi.info/ WebPlotDigitizer/). With the obtained proportion and numbers at risk tables, we can obtain the observed data as [26]. Similar KM graphs were obtained with the regenerated data. The interval of radiological assessment follow-up was 12 weeks. As in simulations, we assumed several scheduled visits and loss rates of radiological assessment to make interval-censored data of (L i , R i ]. The proposed methods show a significant difference between the two arms (P value <0.01) unlike the log rank test and the stratified log rank test (P value >0.5). We also applied the methods based on the Cox model and obtained similar results [23,24].
The hypothesis on (β 0 , β 1 ) is separable as noted [4]. Therefore, we can test differences in the distributions for each parameter, namely, H 0 : β 1 = 0 versus H 1 : β 1 = 0. One degree of freedom is used in a chi-square test χ 2 1 = S 2 1 /V Ŝ 1 of this hypothesis. In this case, we do not reject the null hypothesis of β 0 = 0 (P value = 0.6) but reject the null hypothesis of β 1 = 0 (P value <0.001), which is similar to a previous study [25].

Discussion
We propose a general method of comparing two interval-censored samples in the presence of the IE. The occurrence of IE occurs may change the survival distribution. The focus of the current study is to compare two survival functions incorporating the information of the IE.
In the present study, we propose non-iterative multiple imputation methods for the analysis of left-truncated and interval-censored survival data. In the uniform weight method, the true failure time of a subject is assumed uniformly distributed over {s j , L i < s j ≤ R i , for j = 1, ..., m} [12]. We used an MI technique for the derivation of test statistics and its variance-covariance matrix including imputing a true failure time, while Kim et al. used a MI technique to estimate variance matrix. Uniform weight assumption in the characterized set is convenient to implement in practice. We also propose a weighted weight method based on NPMLE. After characterizing the set that may have a positive mass including truncated points [13], Turnbull's algorithm was used to estimate the NPMLE. The performance of imputation procedures highly depends on the performance of the NPMLE. In the case of left-truncated and interval-censored data, NPMLE is not consistent, whereas conditional NPMLE is still consistent [15]. However, the problem is limited to the early time point. In the present study, we did not use any special correction because our purpose was not to obtain the exact NPMLE. The simulation did not show considerable differences compared with the uniform weight methods.
We applied the methods based on the Cox model to the real example, and the results were similar to the proposed methods [23,24]. We applied two forms of variance that were formed by addition and subtraction. Both variance methods were efficient, but the first one was marginally overestimated, and the second one was slightly underestimated. This phenomenon is the same as described by Huang et al. [11] since the follow-up loss rate in each visit was not high.
We assumed that the IE was exactly as observed. Further studies are needed if the IE is considered as intervalcensored.

Conclusions
To avoid the length-biased problem, we recommend incorporating the information of the IE in the analysis. In the absence of intensive iterations, our proposed method exhibits a superior performance compared with the stratified log-rank and the log-rank test regarding the type I error and power.