The use of restricted mean time lost under competing risks data

Background: Under competing risks, the commonly used sub-distribution hazard ratio (SHR) is not easy to interpret clinically and is valid only under the proportional sub-distribution hazard (SDH) assumption. This paper introduces an alternative statistical measure: the restricted mean time lost (RMTL). Methods: First, the definition and estimation methods of the measures are introduced. Second, based on the difference in RMTLs, a basic difference test (Diff) and a supremum difference test (sDiff) are constructed. Then, the corresponding sample size estimation method is proposed. The statistical properties of the methods and the estimated sample size are evaluated using Monte Carlo simulations, and these methods are also applied to two real examples. Results: The simulation results show that sDiff performs well and has relatively high test efficiency in most situations. Regarding sample size calculation, sDiff exhibits good performance in various situations. The methods are illustrated using two examples. Conclusions: The RMTL can meaningfully summarize treatment effects for clinical decision making and can be reported alongside the SHR for competing risks data. The proposed sDiff test and the two proposed sample size formulas have wide applicability and can be considered in real data analysis and trial design.


Background
Competing risks arise frequently in medical studies. In a competing risks setting, patients may fail due to multiple causes. The most commonly researched endpoint is recorded as the event of interest; other endpoints, whose occurrence may preclude the occurrence of the event of interest, are recorded as competing events [1]. When competing risks exist, the Kaplan-Meier estimator tends to overestimate the cumulative incidence function, which may cause large errors and lead to incorrect conclusions [2,3]. The measures commonly used in competing risks data analysis are the cumulative incidence function (CIF), sub-distribution hazard (SDH), and cause-specific hazard (CSH) [4,5]. CIF curves are used to describe or explore patients' survival trends under competing risks, and the measures of treatment effect corresponding to the SDH and CSH are the sub-distribution hazard ratio (SHR) and cause-specific hazard ratio (CHR), respectively. Lau et al. [3] pointed out that the CHR regards competing events as right censored and is more suitable for epidemiologic studies, while the SHR is good at estimating risk factors and treatment effects, which makes it more applicable in clinical studies. Thus, the SHR is the commonly used descriptive index for the comparison of CIFs between groups. However, the SHR also has limitations in some applications: i) the most commonly used method, the Gray test [6], needs to satisfy the proportional SDH assumption [7]; ii) the descriptive statistic normally used for competing risks data is the CIF curve, whereas the statistical inference is the Gray test based on the SDH, so the statistical description and statistical inference do not match exactly; iii) when the SHR is used to summarize the treatment effect, the test framework contains only an SHR of the treatment group vs. the control group instead of the SDH for each group, and without baseline (control group) information, the SHR may be a relatively abstract concept for patients [8,9]; iv) the estimation of the SDH is based on conditional probability, so the SHR does not reflect the risk ratio of two groups, which complicates the interpretation of survival outcomes [10].
Given these limitations of the SHR, one alternative is the median time, which reflects the effect on survival but only at a single time point and therefore does not meaningfully summarize the effect on patients over a time period. Calkins et al. [11] referred to the concept of the restricted mean survival time under competing risks (RMSTc), which is based on the restricted mean survival time (RMST) [12,13] spent in a state free of composite events. However, the simple use of composite endpoints may not have clinical meaning [14]. In addition, the RMSTc loses accuracy regarding the event of interest, because it reduces to the RMST for a single endpoint once all events are treated as one composite event.
Andersen [15] defined the number of life years lost under competing risks settings and proposed a regression model based on pseudovalue observations. Zhao et al. [16] introduced the restricted mean time lost (RMTL), which corresponds to the area under the CIF curve of the event of interest and represents the mean time lost to the event of interest within a specific restricted period of time.
This paper develops statistical methods based on the RMTL that can avoid the limitations of the SHR and RMSTc. The paper is organized as follows. Section 2 presents the definition and estimation of measures based on the hazard, the RMTL, and the corresponding hypothesis tests and sample size formulas. In Section 3, we conduct simulation studies to assess the performance of the proposed tests. Two examples are used to illustrate the proposed methods in Section 4. Section 5 provides a discussion of our research.

Methods
Assume a randomized study with $n$ patients in two groups ($k = 1, 2$). The time-to-event and censoring times are denoted by $T = t_i$, $i = 1, 2, \ldots, n$, and $C$, respectively. For simplicity, we assume that $C$ is independent of $T$. $\tau$ is the truncation time point, also called the cut-off time point. Without loss of generality, two endpoints are assumed in this research: one event of interest ($j = 1$) and one competing event ($j = 2$). Let $I_1(t)$ and $I_2(t)$ be the CIFs for the event of interest and competing event, respectively. Based on the nonparametric maximum likelihood estimation of the CIF, the estimate of the CIF $I_j(t)$ is
$$\hat{I}_j(t) = \sum_{t_i \le t} \hat{S}(t_{i-1}) \frac{d_{ij}}{n_i},$$
where $d_{ij}$ is the number of events of type $j$ that occur at time $t_i$, the number of individuals at risk at $t_i$ is denoted by $n_i$, and $\hat{S}(t)$ is the Kaplan-Meier estimate when all events (both $j = 1$ and $j = 2$) are considered.
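The CIF estimate described in this paragraph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `cif_estimate` and the status coding (0 = censored, 1 = event of interest, 2 = competing event) are choices made for this example only.

```python
import numpy as np

def cif_estimate(time, status, cause=1):
    """Nonparametric CIF estimate for one cause.

    time   : observed times (event or censoring)
    status : 0 = censored, 1 = event of interest, 2 = competing event
             (this coding is an assumption of the sketch)
    Returns the distinct event times and the CIF value at each of them.
    """
    time = np.asarray(time, dtype=float)
    status = np.asarray(status, dtype=int)
    event_times = np.unique(time[status > 0])
    surv_prev = 1.0  # all-cause Kaplan-Meier value just before t_i
    cif = 0.0
    values = []
    for t in event_times:
        n_i = np.sum(time >= t)                         # at risk at t_i
        d_all = np.sum((time == t) & (status > 0))      # events of any type
        d_j = np.sum((time == t) & (status == cause))   # events of this cause
        cif += surv_prev * d_j / n_i
        surv_prev *= 1.0 - d_all / n_i
        values.append(cif)
    return event_times, np.array(values)

# Toy data: five patients, one of them censored at t = 3 and one at t = 5.
times, cif1 = cif_estimate([1, 2, 3, 4, 5], [1, 2, 0, 1, 0], cause=1)
print(times, cif1)  # CIF of the event of interest is about 0.2, 0.2, 0.5
```

Because the all-cause Kaplan-Meier estimate is updated with events of both types, the two estimated CIFs and the all-cause survival estimate sum to one at every event time.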

Descriptive analysis
Existing measure based on the CSH and SDH
Both the CSH and the SDH are hazard-related measures. Under competing risks, the CSH of an event of interest is defined as
$$\lambda_{CSH}(t) = \lim_{h \to 0} \frac{P(t \le T < t + h,\ J = 1 \mid T \ge t)}{h}, \quad (1)$$
with $J$ the event type, which indicates the patients' hazard of the event of interest at $t$ without any prior event. The corresponding descriptive measure CHR is the ratio of the CSHs of two groups.
The SDH of an event of interest is given by
$$\lambda_{SDH}(t) = \lim_{h \to 0} \frac{P(t \le T < t + h,\ J = 1 \mid T \ge t \cup (T < t \cap J \ne 1))}{h}, \quad (2)$$
with $J$ the event type, which describes the patients' hazard of the event of interest at $t$ given only that they have not previously experienced the event of interest. The corresponding descriptive measure SHR is the ratio of the SDHs of two groups. In formulas (1) and (2), the main difference between the CSH and SDH is the number of individuals at risk. For the CSH, the number at risk at $t$ includes only patients who have not experienced any type of event, while the number at risk for the SDH includes patients who have not experienced the event of interest, including those who have already experienced competing events.
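The difference between the two risk sets can be made concrete with a small count. The sketch below assumes complete follow-up (no censoring), so that the sub-distribution risk set needs no weighting; with censored data, a weighted risk set is required and this simple count no longer applies. The function name `risk_set_sizes` and the status coding are hypothetical.

```python
import numpy as np

def risk_set_sizes(time, status, t):
    """Number at risk at time t for the CSH and for the SDH.

    Assumes no censoring. status: 1 = event of interest, 2 = competing event.
    """
    time = np.asarray(time, dtype=float)
    status = np.asarray(status, dtype=int)
    event_free = time >= t                       # no event of any type before t
    had_competing = (time < t) & (status == 2)   # competing event before t
    n_csh = int(np.sum(event_free))
    n_sdh = int(np.sum(event_free | had_competing))
    return n_csh, n_sdh

# Five patients; two experience the competing event before t = 4.
print(risk_set_sizes([1, 2, 3, 4, 5], [1, 2, 2, 1, 1], t=4))  # (2, 4)
```

Patients who had the competing event at times 2 and 3 leave the cause-specific risk set but remain in the sub-distribution risk set, which is why the two hazards, and hence the CHR and SHR, can differ.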

Alternative measure based on the RMTL
The RMTL of the event of interest is defined as $\mathrm{RMTL} = \int_0^\tau I_1(t)\,dt$ [15,16]. Then, based on the CIF estimate of the event of interest, $\hat{I}_1(t)$, the estimate of the RMTL is given by $\widehat{\mathrm{RMTL}} = \int_0^\tau \hat{I}_1(t)\,dt$. The estimated variance of $\widehat{\mathrm{RMTL}}$ is written in terms of an estimate of $E(\widehat{\mathrm{RMTL}}^2)$ and of $f_1(t)$, the density function of $I_1(t)$.
The descriptive measure of the RMTL is the RMTL difference between two groups. From formula (3), the effect size of the difference in the RMTLs (RMTLd) is related to the difference between the two areas under the CIF curves.
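Since an estimated CIF is a step function, the RMTL estimate reduces to a sum of rectangle areas. The following sketch illustrates this under the assumption that the CIF is supplied as its jump times and the values taken from each jump onward; the function name `rmtl_from_cif` is hypothetical.

```python
import numpy as np

def rmtl_from_cif(event_times, cif_values, tau):
    """Area under a right-continuous step CIF on [0, tau]."""
    t = np.asarray(event_times, dtype=float)
    f = np.asarray(cif_values, dtype=float)
    keep = t < tau                                  # jumps at or after tau add no area
    knots = np.concatenate(([0.0], t[keep], [tau]))
    heights = np.concatenate(([0.0], f[keep]))      # the CIF is 0 before the first jump
    return float(np.sum(heights * np.diff(knots)))

# Two toy CIFs that jump at the same times.
times = [1.0, 2.0, 4.0]
rmtl_a = rmtl_from_cif(times, [0.2, 0.2, 0.5], tau=5.0)   # 0.2 + 0.4 + 0.5 = 1.1
rmtl_b = rmtl_from_cif(times, [0.1, 0.1, 0.3], tau=5.0)   # 0.1 + 0.2 + 0.3 = 0.6
print(rmtl_a - rmtl_b)  # RMTL difference, about 0.5 time units
```

The difference printed at the end is the area between the two CIF curves up to τ, which is the quantity the RMTLd summarizes.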

Hypothesis test procedures
Existing test procedures based on the CSH and SDH
The log-rank test can be directly used as the test corresponding to the CSH [17].
The most commonly used test for the SDH is the Gray test (Gray) [6]. Its test statistic is built from $\hat{\lambda}^{(k)}_{SDH}(t)$, the estimate of the SDH for group $k$, and the weight function $\varpi_k(t) = n_k(t)\,\frac{1 - \hat{I}_k(t-)}{\hat{S}_k(t-)}$, where $n_k(t)$ is the number of individuals at risk at time $t$ in group $k$, $\hat{I}_k(t-)$ is the left-hand limit of the CIF for the event of interest in group $k$, and $\hat{S}_k(t-)$ is the left-hand limit of the probability of being free of any event in group $k$, as estimated by the Kaplan-Meier method.

New tests based on the RMTLd
Basic difference test
Assuming that $\Delta$ is the RMTLd between two groups, the estimate of $\Delta$ is $\hat{\Delta} = \int_0^\tau [\hat{I}_{12}(t) - \hat{I}_{11}(t)]\,dt$, where $\hat{I}_{1k}(t)$ is the CIF estimate for the event of interest in group $k$. Thus, we present a basic test procedure based on the RMTLd. Under the null hypothesis $H_0: \Delta = 0$, the basic difference test (Diff) statistic is given by
$$Z = \frac{\hat{\Delta}}{\sqrt{\widehat{\mathrm{var}}(\hat{\Delta})}} \sim N(0, 1),$$
and the estimated variance $\widehat{\mathrm{var}}(\hat{\Delta})$ is derived by the delta method according to formula (3), with $n_k$ the sample size in the $k$th group.
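Once an RMTLd estimate and its variance are available, the Diff statistic is a standard normal test. The sketch below takes both quantities as inputs and does not reproduce the delta-method variance; the function name `diff_test` and the example numbers are hypothetical.

```python
import math

def diff_test(delta_hat, var_delta):
    """Z statistic and two-sided p-value for H0: RMTLd = 0.

    delta_hat : estimated RMTL difference between the two groups
    var_delta : estimated variance of delta_hat (must be supplied by the caller)
    """
    if var_delta <= 0:
        raise ValueError("variance must be positive")
    z = delta_hat / math.sqrt(var_delta)
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical inputs: a difference of 3.0 months with variance 2.25 gives Z = 2.
z, p = diff_test(3.0, 2.25)
print(z, p)
```

Using `math.erfc` avoids a dependency on a statistics package and stays accurate for large |Z|.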
Supremum difference test
We refer to the supremum difference test (sDiff) statistics [18] based on the RMTLd. The test statistic is given by $Q_S = \sup\{|\hat{\Delta}(t_r)|;\ t_r \le \tau\}/\sigma(\tau)$, where $\hat{\Delta}(t_r)$ is the estimated RMTLd evaluated at $t_r$. The standard error of $\hat{\Delta}(t_r)$ is based on Aalen's variance [19] of the CIF estimator and on $\rho$, the correlation coefficient between $\hat{I}_{12}(t_i) - \hat{I}_{11}(t_i)$ and $\hat{I}_{12}(t_{i'}) - \hat{I}_{11}(t_{i'})$, where $i \ne i'$. $\rho$ is difficult to estimate because it involves the assumption of an unknown underlying CIF distribution of the actual data. Lyu et al. [20] found that when $\rho = 0.50$, the test statistic does not inflate the type I error rate and maintains high power. Hence, we fixed $\rho$ at an acceptable value of 0.5 in this article. Under the null hypothesis, the distribution of $Q_S$ can be approximated by the distribution of $\sup\{|M(x)|,\ 0 \le x \le 1\}$, where $M$ is a standard Brownian motion process. According to Billingsley [21], the probability distribution of $\sup|M(t)|$ is given by
$$P[\sup|M(t)| > x] = 1 - \frac{4}{\pi}\sum_{a=0}^{\infty}\frac{(-1)^a}{2a+1}\exp\left[-\frac{\pi^2(2a+1)^2}{8x^2}\right]. \quad (7)$$
Assuming that formula (7) converges when $a \to m$ [22], $m$ is obtained by applying $\lceil\cdot\rceil$, which rounds its argument up to the smallest positive integer not below it, with $\varepsilon$ the permissible error for estimating $P[\sup|M(t)| > x]$.
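The tail probability of the supremum of the absolute value of a standard Brownian motion on [0, 1] has a well-known alternating-series form, which can be evaluated numerically as sketched below. The stopping rule used here (stop once a term falls below `eps`) is an assumption of this sketch and is not the truncation point m described in the text; the function name `sup_bm_pvalue` is hypothetical.

```python
import math

def sup_bm_pvalue(x, eps=1e-12, max_terms=100000):
    """P[sup_{0<=t<=1} |M(t)| > x] for a standard Brownian motion M."""
    if x <= 0:
        return 1.0
    total = 0.0
    for a in range(max_terms):
        k = 2 * a + 1
        term = math.exp(-(math.pi ** 2) * k * k / (8.0 * x * x)) / k
        total += term if a % 2 == 0 else -term
        if term < eps:   # terms decrease, so the remaining error is below eps
            break
    p = 1.0 - 4.0 / math.pi * total
    return min(1.0, max(0.0, p))

# An observed statistic of 2.2414 sits at the two-sided 5% level.
print(sup_bm_pvalue(2.2414))
```

Because the terms of the series alternate in sign and decrease in magnitude, the truncation error is bounded by the first omitted term, so a small `eps` gives a reliable p-value.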

The sample size formula based on the RMTLd
Under competing risks, the use of Gray is limited by the proportional SDH assumption. Hence, the corresponding sample size formula is not always available. In addition, the sample size estimated for Gray depends on the incidence of the event of interest; that is, a large deviation between the actual incidence and the assumed incidence results in a broad range of estimated sample sizes. This paper does not discuss the sample size formula for the RMSTc (which can be estimated by the method based on the single endpoint), as our focus is on the event of interest. The sample size formulas based on Diff and sDiff are proposed in the following section.

Method based on the basic difference test
According to Diff, as shown in Section 2.2.2, the following hypotheses are considered: $H_0: \Delta = 0$ versus $H_1: \Delta \ne 0$. Then, under the alternative hypothesis with a desired power of $1 - \beta$, the sample size of each group is obtained with the allocation ratio $n_2/n_1 = r$. Thus, the total required sample size is $n = n_1 + n_2 = (1 + r)\,n_1$.

Method based on the supremum difference test
In the sample size calculation of the supremum test, the main purpose is to obtain $\xi$ in the function $n = \xi \cdot \tilde{n}$, where $\tilde{n}$ is the calculated sample size based on Diff.
As with $H_0: \Delta = 0$; $H_1: \Delta \ne 0$, we assume that $\Delta = \eta \ne 0$ under the alternative hypothesis. Then, we write the expression … where $M(\cdot)$ is a standard Brownian motion process and $u(t)$ is a time function. Then, $U(t)$ is a standard Brownian motion process that deviates with a mean of $\eta$. Here, with a fixed effect size $\lambda$, … where $R = nR(\tau)$, and $R(t)$ is the probability that the event of interest happened before $t$. Then, the relation of $R$ and $\eta$ is given by … Assume that $V_{1-\alpha/2}$ is the critical value of the supremum value of the standard Brownian motion process, i.e., P{sup … By the symmetry of Brownian motion, both probabilities, P{sup … [23] obtained the following function after integration: … Thus, the sample size needed to achieve a desired power of $1 - \beta$ with a two-sided type I error of $\alpha$ can be obtained by … Under the alternative hypothesis, we obtain the limiting distribution of $U_n(\tau)$ as $M_\eta(1)$, which is a normal deviate with mean $\eta$ and variance 1. With a critical value $Z_{1-\alpha/2}$, we solve for $\eta$ in the following expression: … Because formulas (8) and (11) have the same effect size $\lambda$, the denominators cancel, and the ratio becomes … where only $\eta$ remains to be solved. First, we need to estimate the critical value $V$ in formula (9). From the cumulative probability distribution [21].
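A critical value for the supremum of the absolute value of a standard Brownian motion can be found numerically by inverting its tail probability. The sketch below does this by bisection; it assumes the critical value is defined by P[sup|M(t)| > v] = α on [0, 1], which may differ in detail from the definition of V used in formula (9). The function names are hypothetical.

```python
import math

def sup_bm_tail(x, eps=1e-12, max_terms=100000):
    """P[sup_{0<=t<=1} |M(t)| > x] via the alternating series."""
    total = 0.0
    for a in range(max_terms):
        k = 2 * a + 1
        term = math.exp(-(math.pi ** 2) * k * k / (8.0 * x * x)) / k
        total += term if a % 2 == 0 else -term
        if term < eps:
            break
    return min(1.0, max(0.0, 1.0 - 4.0 / math.pi * total))

def sup_bm_critical_value(alpha, lo=0.1, hi=10.0, iters=80):
    """Solve sup_bm_tail(v) = alpha for v by bisection (tail decreases in v)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sup_bm_tail(mid) > alpha:
            lo = mid   # tail still too large: move right
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(sup_bm_critical_value(0.05))  # close to 2.24
print(sup_bm_critical_value(0.01))
```

Bisection is used because the tail probability is monotone in v, so no derivative or starting guess is needed.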

Hypothesis test procedures
Simulation design
To compare the performance of the above tests, Monte Carlo simulations were carried out to study the type I error and the statistical power under a variety of situations. The following procedures were performed to test the hypotheses: Gray, Diff, and sDiff. The performance of these tests was evaluated using six scenarios (Fig. 1): (A) two groups with no difference (the comparison for type I error); (B) two groups with a proportional SDH difference; (C) two groups with a nonproportional SDH difference; (D) an early difference in the CIFs; (E) CIFs with a late difference; (F) two CIFs with a cross difference. Let $\tau_1$ and $\tau_2$ be the times of the last event of interest in the two groups. Here, we considered a commonly used option for $\tau$, the minimum of these two times, $\tau = \min(\tau_1, \tau_2)$. The event of interest and the competing event were generated from CIFs with piecewise Weibull distributions. The specific parameter settings are presented in Web Table A1. The distribution of events was based on the binomial distribution $B(N, p_1)$, where $N$ represents the sample size of each group and $p_1$ is the maximum cumulative incidence of events of interest, which is set as $p_1 = I_1(\infty) = 0.7$. The censoring times $C$ in the two groups were generated from uniform distributions. Then, each individual was assigned an observed time $t = \min(T, C)$ and an event indicator $\delta$, with $\delta = 0$ when $T > C$. By changing the distribution parameters of $C$, both groups were set to have the same censoring rates of approximately 0, 15, 30 and 45%. We also considered equal group sizes ($n_1 = n_2 = 50, 100, 150$) and unequal group sizes ($n_1 = 50, n_2 = 100$; $n_1 = 50, n_2 = 150$; $n_1 = 50, n_2 = 200$). All simulations were performed using 5000 iterations. The nominal significance level of each method was fixed at the conventional level of 0.05.
Fig. 1 Six scenarios in the simulation study: CIF curves for the event of interest
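The data-generating steps in the simulation design can be sketched as follows. This is a simplified illustration, not the authors' simulation code: it draws event times from a single Weibull distribution rather than the piecewise Weibull CIFs of Web Table A1, and the function name `simulate_group` and the shape, scale and censoring parameters are hypothetical.

```python
import numpy as np

def simulate_group(n, p1=0.7, shape=1.0, scale=1.0, cens_max=5.0, seed=0):
    """Generate competing risks data for one group.

    Returns observed times and status (0 = censored, 1 = event of interest,
    2 = competing event). All default parameter values are illustrative.
    """
    rng = np.random.default_rng(seed)
    # Each subject fails from cause 1 with probability p1, else from cause 2.
    cause = np.where(rng.random(n) < p1, 1, 2)
    event_time = scale * rng.weibull(shape, size=n)
    cens_time = rng.uniform(0.0, cens_max, size=n)    # uniform censoring
    obs_time = np.minimum(event_time, cens_time)      # t = min(T, C)
    status = np.where(event_time <= cens_time, cause, 0)
    return obs_time, status

obs_time, status = simulate_group(100, seed=42)
print("censoring rate:", np.mean(status == 0))
```

Lowering `cens_max` raises the censoring rate, which mirrors how the censoring distribution is tuned in the design to reach target rates.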

Simulation results
As Table 1 shows, the type I error rates for Diff are stable without censoring but gradually inflate with increasing censoring rates, making Diff the most liberal test. Because the type I error rates of Diff are inflated, this test is not included in the comparison of test power. Compared to Diff, Gray is steadier. The type I error rates of sDiff are relatively conservative under light censoring but increase with increasing censoring rates.
The power results are shown in Table 1. When two CIF curves have a proportional SDH (Fig. 1b), the powers of all the tests increase with increasing sample size but decline with increasing censoring rates. Gray demonstrates the optimal power in this situation, followed by sDiff. For the non-proportional SDH difference (Fig.  1c), sDiff is the most powerful test, while Gray has the lowest power in this situation. For the early difference in the CIF curves (Fig. 1d), with increasing censoring rates, sDiff becomes much more powerful, and Gray exhibits the lowest power in this situation. When considering the late difference in the CIF curves (Fig. 1e), the powers of all tests decline with increasing censoring rates. In this situation, Gray is more powerful, followed by sDiff. In the case of a cross difference in the CIF curves (Fig. 1f), Gray has the lowest power. With increasing censoring rates, sDiff is much more powerful than Gray.
Note that in situation C (non-proportional SDH), situation D (early difference), and situation F (cross difference), all tests exhibit gradually increasing power with increasing censoring rates. This result occurs because the two CIF curves are not convergent in the later period but diverge with the increased censoring, which makes the increased difference between the two CIF curves proportional.
To summarize the simulation results, we applied the analysis of variance (ANOVA) technique [24] to evaluate the type I error and power. For type I error, estimates that are small in absolute value and close to zero indicate that rates are close to 0.05. For power, good performance is indicated by larger estimates. Table 2 shows that sDiff corrects the inflated type I error of Diff when censoring occurs. In Table 3, sDiff is slightly lower than Gray when there is a proportional SDH (situation B), and when there is a late difference (situation E), the difference between Gray and sDiff is only approximately 2.242%, whereas the powers of sDiff are much higher than those of Gray in other situations. Considering all situations, combinations of sample sizes, and censoring rates, sDiff performs better in most situations.

Calculations of sample size
Simulation design
A simulation study was also performed to evaluate the proposed sample size formula. Two scenarios were considered (Fig. 1b-c): (B) two groups with a proportional SDH difference and (C) two groups with a nonproportional SDH difference. Both scenarios were examined under four settings, with either 0.05 or 0.01 for the two-sided type I error and with either 0.8 or 0.9 as the power. The follow-up time, τ, which is also the truncation time point, was set as the minimum of the last observed time of the pilot study for two groups. Assume two groups with an equal sample size, i.e., r = 1. Then, based on situation B and situation C, the necessary parameters were estimated by simulation, and finally, we obtained the calculated sample size with the given parameters. In addition, Monte Carlo simulations were used to examine the observed power. The simulations were performed using 1000 replications.

Simulation results
As shown in Table 4, the calculated sample size of all the tests increases with a decreasing type I error rate and with an increasing target power. When the CIFs satisfy the proportional SDH assumption (situation B, Fig. 1b), the calculated sample sizes of Gray, Diff and sDiff are close to each other, with Diff having the highest observed power. In this situation, Gray and sDiff have a similar observed power, which is close to the target power. When there is a non-proportional SDH (situation C, Fig. 1c), the calculated sample sizes of Gray are much higher than those of Diff and sDiff, and sDiff has a relatively high observed power. In addition, the observed powers of Gray do not reach the target power in this situation.
In addition, the comparison of power for Gray, Diff and sDiff with a fixed sample size (calculated by sDiff) is shown in Web Table A2. The results show that the power under situation B for Diff is larger than that for Gray and sDiff, but the three tests have similar power. However, in situation C, the power of Gray is much lower than that of Diff and sDiff. As the type I error rates of Diff are inflated with censoring, the corresponding sample size formula is considered unstable. In general, when the CIFs satisfy the proportional SDH assumption, both Gray and sDiff can be considered; when the SDH is non-proportional, sDiff is considered more adaptive.

Example 1: Bone marrow transplantation data
The data used to evaluate the effect of T-cell depletion on bone marrow transplantation [25] came from 408 patients divided into a T-cell depleted group (Yes) with 354 cases and a T-cell not depleted group (No) with 54 cases. The censoring rates for the two groups were approximately 41% and 28%, respectively. The study included two types of events: death from treatment-related causes, which was defined as the event of interest, and relapse, which was set as a competing event. At the end of follow-up, 161 patients (146 from the Yes group and 15 from the No group) experienced an event of interest, and 87 patients (70 from the Yes group and 17 from the No group) experienced competing events. A test of the proportional SDH assumption yielded a result of P = 0.264.
The descriptive statistics and the hypothesis test results for the examples are shown in Table 5. In the hazard-related measures, the CHR and SHR showed that the ratios of the Yes group vs. the No group were 0.86 (0.59, 1.25) and 0.60 (0.36, 1.00), respectively. However, the log-rank test, which is based on the CHR, showed no significant differences (P = 0.053), while Gray based on the SHR indicated that there were significant differences between the two groups (P = 0.049). In addition, we could not obtain the estimated CSH or SDH for either group, which led to a lack of descriptive information for either group; only a CHR or an SHR could be obtained. This outcome led to difficulty in clinical interpretation.
For the composite endpoint, the RMSTc showed that the mean survival time of the patients in the Yes group was 1.83 (−5.03, 8.69) months longer than that of the patients in the No group within the truncation time point of 41.8 months, and there were no significant differences (P = 0.601). Additionally, the RMSTc could not provide information regarding treatment-related death.
Let τ = 41.8 months; Table 5 shows that the RMTL of treatment-related death in the Yes group was 9.57 (5.18, 13.96) months, which corresponds to the area under the CIF curve, i.e., S2 in Fig. 2a. In the No group, the RMTL corresponds to the area under the CIF curve, i.e., S1 + S2 = 15.49 (13.53, 17.45) months. Hence, the difference in RMTL between the two groups has an area of S1, which means that the patients in the Yes group took 5.92 (1.11, 10.72) months longer to succumb to treatment-related death. According to Table 5, the RMTL-based tests (Diff and sDiff) showed significance at the conventional level of 0.05.
As shown in Fig. 2b, a selection of different τ values led to a difference in the calculated sample size for Diff and sDiff: the calculated sample size increased with increasing τ and became steady after 20 months. The calculated sample sizes at τ = 41.8 months were 280 and 298 for Diff and sDiff, respectively, both of which were close to the sample size calculated by Gray (n = 300).

Example 2: Lymphocytic leukemia data
A previous study compared the effects of radiotherapy in the treatment of patients with lymphocytic leukemia (LL). A total of 1400 patients were randomly extracted from the Surveillance, Epidemiology, and End Results (SEER) Program. Among these patients, two groups were included: the no radiotherapy group (No RT) consisted of 1318 cases, and the radiotherapy group (RT) consisted of 82 cases. The censoring rates in the two groups were … Regarding the hazard-related indexes, Table 5 shows that the No RT group had a lower hazard than the RT group (CHR = 1.14 (0.78, 1.65); SHR = 1.45 (0.98, 2.14)). However, in this example, the SHR varied with time (P = 0.006) instead of being constant. Therefore, the CHR and SHR may not be appropriate summaries for this example.
When considering the composite endpoint, the RMSTc showed that the mean survival time of the RT patients was 0.66 (−1.13, 2.28) years longer than that of the No RT patients within the truncation time point of 15.3 years (Table 5), which reflected the overall survival but could not reflect the survival rates of patients who died of LL.
Let τ = 15.3 years; Table 5 shows that the RMTL of LL-related death in the RT group was 4.68 (3.34, 6.03) years, which is equal to the area of S1 + S2 in Fig. 2c and corresponds to the area under the CIF curve. In the No RT group, the RMTL was S2 = 2.96 (2.69, 3.24) years. The difference in the RMTL between the two groups was S1 = 1.72 (0.35, 3.09) years, which is the delay time until the patients in the No RT group succumbed to LL-related death. As shown in Table 5, among all test procedures, only Diff and sDiff, which were based on the RMTL, showed significance at the conventional level of 0.05.
As Fig. 2d shows, with increasing τ, the calculated sample sizes first decreased and then increased, reaching the smallest estimated sample size at approximately 7 years, which was much less than the result found with Gray (n = 886). The calculated sample sizes at τ = 15.3 years were 344 and 364 for Diff and sDiff, respectively.

Discussion
When dealing with competing risks datasets, the SHR is often used as a typical descriptive method with the test procedures. However, because the SHR lacks baseline information (a control group) and does not directly reflect the risk ratio of the two groups, it may complicate the interpretation of the survival outcome and may be a relatively abstract concept for patients. The RMSTc can directly describe patient survival and does not depend on the proportional SDH assumption, but the simple use of composite endpoints does not always have clinical meaning and degrades the accuracy of patient information [14]. Conversely, the RMTL can avoid some limitations of the above methods. Moreover, the relationship between the RMTL and RMSTc can be derived as $\mathrm{RMTL}_1 + \mathrm{RMTL}_2 + \cdots + \mathrm{RMTL}_j + \mathrm{RMST}_c = \tau$, where $\mathrm{RMTL}_j$ is the area under the CIF curve for cause $j$. As $\mathrm{RMTL}_j$ is interpreted as the average survival time lost due to cause $j$ within τ, the RMTL can be read from the CIF curves directly, while the SHR cannot directly reflect the CIF curves. In addition, Gray, which corresponds to the SHR, needs to satisfy the proportional SDH assumption, while the RMTL-related tests do not. From the simulation results of the hypothesis testing procedures, sDiff, which is based on the RMTLd, corrects the type I error inflation of Diff under high censoring and has improved power under various scenarios compared to Gray. In general, sDiff maintains good performance compared to Gray and Diff. In addition, this paper also contains sample size formulas based on the RMTLd. When the proportional SDH assumption is satisfied, the calculated sample sizes of Diff and sDiff are close to that of Gray, while Diff still has the highest power. Because the type I error rates of Diff are inflated with censoring, we still suggest that Gray and sDiff be used in this situation; however, when the SDH is non-proportional, the sample sizes estimated by Gray are much larger than those estimated by Diff and sDiff, and Gray has the lowest observed power. Hence, in this situation, sDiff seems more adaptable for use.
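The decomposition of τ into the cause-specific RMTLs and the RMSTc can be checked numerically on a toy data set with two causes. The sketch below computes the all-cause Kaplan-Meier curve and both CIFs directly from the data and integrates each step function up to τ; the function names and the status coding (0 = censored, 1 and 2 = event types) are assumptions made for this example.

```python
import numpy as np

def step_area(times, values, start_value, tau):
    """Integrate a right-continuous step function over [0, tau]."""
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    keep = times < tau
    knots = np.concatenate(([0.0], times[keep], [tau]))
    heights = np.concatenate(([start_value], values[keep]))
    return float(np.sum(heights * np.diff(knots)))

def km_and_cifs(time, status):
    """All-cause Kaplan-Meier curve and the CIFs for causes 1 and 2."""
    time = np.asarray(time, dtype=float)
    status = np.asarray(status, dtype=int)
    grid = np.unique(time[status > 0])
    surv, cif1, cif2 = 1.0, 0.0, 0.0
    s_out, c1_out, c2_out = [], [], []
    for t in grid:
        n_i = np.sum(time >= t)
        d1 = np.sum((time == t) & (status == 1))
        d2 = np.sum((time == t) & (status == 2))
        cif1 += surv * d1 / n_i       # uses survival just before t
        cif2 += surv * d2 / n_i
        surv *= 1.0 - (d1 + d2) / n_i
        s_out.append(surv)
        c1_out.append(cif1)
        c2_out.append(cif2)
    return grid, np.array(s_out), np.array(c1_out), np.array(c2_out)

tau = 5.0
grid, surv, cif1, cif2 = km_and_cifs([1, 2, 3, 4, 5], [1, 2, 0, 1, 0])
rmtl1 = step_area(grid, cif1, 0.0, tau)    # time lost to cause 1
rmtl2 = step_area(grid, cif2, 0.0, tau)    # time lost to cause 2
rmst_c = step_area(grid, surv, 1.0, tau)   # event-free time
print(rmtl1, rmtl2, rmst_c, rmtl1 + rmtl2 + rmst_c)  # the last value is about tau
```

In this toy example the three areas are about 1.1, 0.6 and 3.3, which add up to τ = 5, illustrating that the RMTLs split the time not spent event-free by cause.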
The sample sizes calculated in the examples (Fig. 2b, d) suggest that different choices of τ may have a large influence on the calculation of the sample size. In example 1, the calculated sample size increases with increasing τ and is similar to the sample size estimated by Gray after 20 months (Fig. 2b), while in example 2, the calculated sample sizes first decrease and then increase (Fig. 2d). Hence, it is important to choose an appropriate τ when calculating the sample size for Diff and sDiff. In practical research, τ is always determined as the follow-up time in the study design. If all patients in one of the groups experience an endpoint during the follow-up period, then the study is stopped, and this time point is taken as the time of the final analysis of the study, i.e., τ; otherwise, if patients in either group do not have a completely observed endpoint by the end of the follow-up period, then the designed follow-up period is regarded as the truncation time point. In this paper, the calculated sample sizes in simulations are based on the minimum time of the last observation of the event of interest in the two groups as τ. The issue of how to define an appropriate τ in a specific research context will be considered in a future study.

Conclusions
The RMTL can meaningfully summarize treatment effects for clinical decision making and can be reported alongside the SHR for competing risks data. The proposed sDiff test is robust and can be considered for statistical inference in real data analysis; the two proposed sample size formulas have wide applicability and can also be applied to real trial designs.

Additional file
The additional file contains the theory of the analysis of variance (ANOVA) technique used to evaluate the type I error and power, the parameter settings of the two CIFs for the simulations, and the comparison of power with a fixed sample size.