Weighted composite time to event endpoints with recurrent events: comparison of three analytical approaches
BMC Medical Research Methodology volume 22, Article number: 38 (2022)
Abstract
Background
In clinical trials the study interest often lies in the comparison of a treatment to a control regarding a time-to-event endpoint. A composite endpoint allows several time-to-event endpoints to be considered at once. Usually, only the time to the first occurring event for a patient is analyzed. However, an individual may experience more than one nonfatal event. Including all observed events in the analysis can increase the power and provides a more complete picture of the disease. Thus, analytical methods for recurrent events are required. A challenge is that the different event types belonging to the composite are often of different clinical relevance. In this case, weighting the event types according to their clinical relevance is an option. Different weight-based methods for composite time-to-event endpoints have been proposed. So far, there exists no systematic comparison of these methods.
Methods
Within this work we provide a systematic comparison of three methods proposed for weighted composite endpoints in a recurrent event setting combining nonfatal and fatal events of different clinical relevance. We consider an extension of an approach proposed by Wei and Lachin, an approach by Rauch et al., and an approach by Bakal et al. The comparison is based on a simulation study and on a clinical study example.
Results
For all three approaches closed-form test statistics are available. The Wei-Lachin approach and the approach by Rauch et al. show similar results in terms of mean squared error. For the approach by Wei and Lachin confidence intervals are provided. The approach by Bakal et al. is not related to a quantifiable estimand. The relevance weights of the different approaches operate on different levels, i.e. either on cause-specific hazard ratios or on event counts.
Conclusion
The provided comparison and simulations can help guide applied researchers in choosing an adequate method for the analysis of composite endpoints combining (recurrent) events of different clinical relevance. The approaches by Wei and Lachin and by Rauch et al. can be recommended in scenarios where the composite effect is time-independent. The approach by Bakal et al. should be applied carefully.
Background
The focus of many cardiovascular or oncologic trials lies in the comparison of a treatment to a control intervention with regard to a time-to-event endpoint like time to myocardial infarction, time to stroke, time to relapse, or time to death. Including only one of these event types can result in a large number of patients that need to be observed to detect an effect with sufficient power. To overcome this issue and decrease the required sample size, composite endpoints can be considered as an alternative [1, 2]. Thereby, several events of interest can be combined and analyzed at once. Commonly, methods for analyzing the time to the first occurring event of an individual are applied, like the log-rank test or the Cox proportional hazards model [3]. Thus, it is neglected that an individual may experience more than one event, e.g. several myocardial infarctions or a myocardial infarction followed by death. Incorporating all events experienced by an individual increases the amount of information used for effect estimation and can further decrease the sample size due to the expected higher number of events. It also provides a more complete picture of the disease process. Cox proportional hazards-based models were introduced for the analysis of recurrent time-to-event data, like the Andersen-Gill model [4], the marginal model by Wei, Lin and Weissfeld [5], and conditional models by Prentice, Williams and Peterson [6]. In those models only one event type is considered; thus, when applied to a composite endpoint, it is implicitly assumed that a myocardial infarction has the same clinical relevance as death and that the treatment effect is the same for both endpoints [7]. An alternative modelling approach for the combination of a recurrent event process and a fatal event process are so-called joint frailty models [8, 9]. Thereby, a correlation between events can be modelled and two effects are estimated, one for each event type.
Although this seems to be an appealing approach, results are more difficult to interpret because they are conditioned on the so-called frailty parameter and a single all-cause effect is not provided. Such an all-cause effect could ease the interpretation if the components are events of different clinical relevance. Weighted effect measures were proposed to account for the clinical relevance of the combined event types [10–13]. The common idea of these approaches is that a relevance weight is assigned to each event type with the aim of making the comparison between different events fairer. However, most of these weighted approaches were only described for the time-to-first-event analysis.
Rauch et al. recently introduced the weighted all-cause hazard ratio, where predefined relevance weights are multiplied with the cause-specific hazards [14, 15]. A corresponding closed-form test statistic was also provided [15]. Although the method was described for a time-to-first-event analysis, it can easily be extended to the situation of a time-to-recurrent-event analysis, as shown in the present work. Other weighting approaches for the analysis of a composite endpoint combining a recurrent nonfatal event with other fatal or nonfatal events were proposed: Bakal et al. proposed a weighted nonparametric approach [16, 17], and Wei and Lachin described a multivariate approach [18, 19], which is extended to recurrent events in this work. So far, the performance of these three methods in different clinical data scenarios has not been analyzed and compared systematically. Such a comparison would help to better understand the properties of the different approaches and to derive recommendations for or against their application. Therefore, in this work we provide a systematic comparison of the three methods (the approach by Wei and Lachin, the approach by Rauch et al., and the approach by Bakal et al.) with the help of a Monte Carlo simulation study.
Methods
We consider a two-arm clinical study with an intervention (I) and a control (C), where the primary endpoint is a composite time-to-event endpoint combining two event types. Throughout this work, it is assumed that one event is the fatal event “death” (D) and the other is the nonfatal event “myocardial infarction” (M). The nonfatal event might occur more than once per individual. An individual might also experience no event in the observational period. We consider classical continuous time-to-event data which are right-censored. Although we illustrate the approaches based on only two different event types, they can easily be applied to scenarios with, e.g., more than one nonfatal event.
A total of n individuals are randomized in a 1:1 allocation to the two groups. We consider a one-sided test problem, where the null hypothesis states that the control is better than or equal to the intervention and the alternative states that the intervention group is superior. The test hypotheses are formulated in terms of the underlying estimand for the specific model, as specified below. Only for the approach by Bakal et al. there is no formal estimand, and therefore no formal null hypothesis can be formulated.
Formulation of the test problem and the estimand
In the following, the underlying test problems and the corresponding estimands are formulated for the three weighted approaches under comparison. The test hypotheses are similar across methods; however, it is important to highlight the differences in the underlying modelling approaches (see also Table 1).
Approach by Wei and Lachin
In the works by Wei and Lachin [19] and Lachin et al. [18] only the time to the first event is considered. However, the approach can easily be extended to recurrent events by defining the hazard functions as stratified hazards, where the strata j=1,...,J define the subgroups of all first, second, third events, and so on. The stratified hazards read as
where X is the group indicator and X=1 refers to the intervention group. This model implies that the cause-specific baseline hazards (λ_{D,j0}(t),λ_{M,j0}(t)) are strata-specific, i.e. hazards can change for subsequent events, but the cause-specific effects (exp(β_{D}),exp(β_{M})) remain the same over all strata. The model moreover assumes proportional hazards for both event types within the strata.
Wei and Lachin [19] then define a so-called “weighted hazard ratio” as
where the index L denotes the Wei-Lachin weighting approach and \(w^{L}_{D}\) and \(w^{L}_{M}\) are the prespecified relevance weights, which are described as reflecting the “relative importance or severity” [18]. The weights act on the logarithmized cause-specific hazard ratios, but not directly on the hazard functions. This implies that the influence of a weight is independent of the underlying number of events; as a consequence, a high weight has a large impact even if the corresponding cause-specific hazard ratio is estimated based on a low number of events. The corresponding hypotheses are then formulated as follows:
To test the above null hypothesis (4) the following test statistic was proposed [18]:
where the estimators for the cause-specific logarithmic effects \(\hat \beta _{D}\) and \(\hat \beta _{M}\) can be obtained by using a stratified Cox model for each cause. The corresponding variance estimators of β_{D} and β_{M} are denoted by \(\hat \sigma ^{2}_{D}\) and \(\hat \sigma ^{2}_{M}\), respectively, and the covariance estimator of β_{D} and β_{M} by \(\hat \sigma _{D,M}\). Lachin and Bebu [18] show in their supplement how \(\hat \sigma ^{2}_{D}, \hat \sigma ^{2}_{M}\), and \(\hat \sigma _{D,M}\) can be calculated. Furthermore, the function mmm in the R package multcomp also provides these values [20–22]. The test statistic T^{L} is asymptotically standard normally distributed under the null hypothesis. Thus, the null hypothesis is rejected if T^{L}≤−z_{1−α}, where z_{1−α} is the (1−α)-quantile of the standard normal distribution and α is the one-sided significance level.
By means of the estimators for the causespecific logarithmic effects and their variances, the estimated weighted hazard ratio is given as:
The corresponding (1−2·α)confidence interval is given as:
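Assuming estimates from stratified cause-specific Cox models are already available (in R these could be obtained via multcomp::mmm, as referenced above), the combination step of the Wei-Lachin approach can be sketched as follows. This is an illustrative sketch, not the authors' implementation; all numeric inputs below are hypothetical.

```python
from math import exp, sqrt, log
from statistics import NormalDist

def wei_lachin(beta_d, beta_m, var_d, var_m, cov_dm,
               w_d=0.5, w_m=0.5, alpha=0.025):
    """Combine cause-specific log-hazard ratios with relevance weights.

    beta_*  : estimated log cause-specific hazard ratios (stratified Cox)
    var_*   : their variance estimates; cov_dm their covariance estimate
    w_*     : prespecified relevance weights (chosen here to sum to 1)
    alpha   : one-sided significance level
    """
    theta = w_d * beta_d + w_m * beta_m              # weighted log-hazard ratio
    se = sqrt(w_d**2 * var_d + w_m**2 * var_m + 2 * w_d * w_m * cov_dm)
    t_stat = theta / se                              # compared to -z_{1-alpha}
    z = NormalDist().inv_cdf(1 - alpha)
    hr = exp(theta)                                  # estimated weighted hazard ratio
    ci = (exp(theta - z * se), exp(theta + z * se))  # (1 - 2*alpha) confidence interval
    reject = t_stat <= -z                            # one-sided rejection rule
    return hr, ci, t_stat, reject

# hypothetical inputs: HR_D = 0.8, HR_M = 0.7, equal relevance weights
hr, ci, t, rej = wei_lachin(log(0.8), log(0.7), 0.010, 0.010, 0.002)
```

Because the weights enter on the log-hazard-ratio scale, the cause-specific variances are combined in the same weighted fashion, which is why a heavily weighted but rarely observed event type can dominate the combined estimate.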
Approach by Rauch et al.
Rauch et al. [14] recently described the so-called “weighted all-cause hazard ratio” for a composite time-to-first-event endpoint, which we here extend to recurrent time-to-event analysis. A nonparametric estimator for this approach was already described [15] and is now extended within this work to allow multiple events per patient. As before for the Wei and Lachin approach, the stratified cause-specific hazards given in (1) and (2) are considered. It is thereby assumed that if, e.g., death occurs as a second event, this event belongs to the second stratum.
The newly adapted definition by Rauch et al. [14] for the “weighted allcause hazard ratio” is given as
where the index R denotes the weighting approach by Rauch et al. and \(w^{R}_{D}\) and \(w^{R}_{M}\) are the prespecified relevance weights. Note that, in contrast to the Wei and Lachin approach, the weights are not forced to sum up to 1, since they appear in both the numerator and the denominator. The weights act on the hazard functions and not on the hazard ratios. As the hazard function estimator depends on the number of observed events, a high weight can still have a low impact if the underlying event rate is small. This is a fundamental difference to the approach of Wei and Lachin. Ozga and Rauch [15] proposed guidance for the choice of weights, where a weight of 1 is assigned to the most clinically relevant event. All other event types are assigned a weight ≤ 1. The weighted all-cause hazard ratio can be interpreted as a weighted average of the cause-specific hazards/hazard ratios. In contrast, the weighted hazard ratio by Wei and Lachin does not directly transfer to the common all-cause hazard ratio.
The weighted all-cause hazard ratio defines a simple extension of the common all-cause hazard ratio, i.e. the common all-cause hazard ratio is recovered if all weights are equal to 1.
The corresponding hypotheses for the weighted all-cause hazard ratio can be formulated as follows:
To test the above null hypothesis (10), Ozga and Rauch [15] proposed a (stratified) weight-based log-rank test statistic T^{R}. The test statistic formula is given in the Additional File.
The test statistic T^{R} is approximately standard normally distributed. Thus, the null hypothesis is rejected if T^{R}≤−z_{1−α}, where z_{1−α} is the (1−α)-quantile of the standard normal distribution and α is the one-sided significance level.
Ozga and Rauch [15] described a nonparametric estimator for the weighted all-cause hazard ratio. The idea of the nonparametric estimator is to replace the hazard functions in (8) by the cumulative hazard functions, which results in the same estimator under the assumption of equal baseline hazards for the different event types:
where \(\hat \Lambda _{D,j}^{I}(t), \hat \Lambda _{M,j}^{I}(t), \hat \Lambda _{D,j}^{C}(t)\), and \(\hat \Lambda _{M,j}^{C}(t)\) are the cause-, group-, and strata-specific Nelson-Aalen estimators for the cumulative hazards at time t. This nonparametric estimator was recently shown to be robust under deviations from the equal baseline hazards assumption [15].
Because a variance estimator cannot be derived for the weighted all-cause hazard ratio, confidence intervals can only be obtained via bootstrap sampling.
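To make the plug-in idea concrete, the following sketch computes Nelson-Aalen cumulative hazards from simple event tables and plugs them into the weighted all-cause hazard ratio. This is not the authors' code; the event tables and weights are hypothetical, and the stratification over j is collapsed into a single stratum for brevity.

```python
def nelson_aalen(times, events, at_risk, t):
    """Nelson-Aalen cumulative hazard at time t: sum of d_k / n_k over
    distinct ordered event times s_k <= t, with event counts d_k and
    risk-set sizes n_k."""
    return sum(d / n for s, d, n in zip(times, events, at_risk) if s <= t and n > 0)

def weighted_allcause_hr(cum_d_i, cum_m_i, cum_d_c, cum_m_c, w_d=1.0, w_m=0.5):
    """Plug-in estimator: the weights act on the (cumulative) hazards, so a
    large weight on a rarely occurring event type still has little impact."""
    return (w_d * cum_d_i + w_m * cum_m_i) / (w_d * cum_d_c + w_m * cum_m_c)

# hypothetical event tables per group and cause: (times, event counts, at risk)
times = [1.0, 2.0, 3.0]
cum_d_i = nelson_aalen(times, [1, 2, 1], [100, 95, 90], 3.0)  # death, intervention
cum_m_i = nelson_aalen(times, [5, 4, 6], [100, 95, 90], 3.0)  # MI, intervention
cum_d_c = nelson_aalen(times, [2, 3, 2], [100, 94, 88], 3.0)  # death, control
cum_m_c = nelson_aalen(times, [8, 7, 9], [100, 94, 88], 3.0)  # MI, control
hr_w = weighted_allcause_hr(cum_d_i, cum_m_i, cum_d_c, cum_m_c)
```

As stated above, no closed-form variance exists for this estimator, so in practice a confidence interval around `hr_w` would be obtained by bootstrap resampling of individuals.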
Approach by Bakal et al.
The method described by Bakal et al. [16, 17] is a nonparametric weighted estimation approach for the survival probabilities, i.e. a weighted procedure for the Kaplan-Meier estimate. However, they do not define any underlying model, and as a consequence the estimand is unspecified. Consequently, there is also no effect estimator. The approach is based on so-called “weighted survival functions”; however, the weighting scheme is only described on the estimation level. Therefore, the formulation of formal test hypotheses is not possible.
The weights proposed by Bakal et al. [16, 17] are denoted by \(w^{B}_{M}\) and \(w^{B}_{D}\) ∈[0,1], where a weight of 1 is assigned to fatal events or the most relevant event and a weight < 1 is used for nonfatal events. The weights act recursively on the observed event counts, where the recursion is with respect to all previous events of an individual. The other event types are then set in relation to the most relevant event type. This choice of weights is similar to the approach of Rauch et al. [14].
The estimated weighted survival probabilities can be obtained in a two-stage process (an example can be found in the Additional File).
Thereby, for each individual i, i=1,...,n, a weight \(w_{i}^{B}(t_{k})\) corresponding to the observed individual event at time t_{k} is assigned, where t_{k} are the ordered (not strata-specific) distinct event times for k=0,...,K, with K the maximum number of events per individual and t_{0}=0. In our scenario \(w_{i}^{B}(\cdot)\) can either be \(w^{B}_{M}(\cdot)\) or \(w^{B}_{D}(\cdot)\). All observations per individual are included with the respective weight.
Using this, the first step is to assign an individual score to each patient at all event time points. This score is used for calculating the net impact with which the individual events are included in the estimation of the weighted survival probability. The weighted survival probability thereby depends on a weighted event count and on a weighted number at risk. The idea is that, instead of considering an event as either present or not, in the approach by Bakal et al. a patient can experience a partial event counting less than a full event, which, as a consequence, reduces the risk set by an amount lower than 1.
Each individual starts with a score of 1, i.e. the individual is fully at risk for an event. This score is subsequently reduced as follows: if the patient experiences a nonfatal event (weight smaller than 1), the patient remains partly at risk, and if a fatal event is observed (weight equal to 1), the patient is removed from the risk set. Formally, this reads as:
1. Assign an individual score s_{i}(·),i=1,...,n, for all observed event times t_{k},k=1,...,K:
2. As a second step, the weighted survival probabilities are calculated by replacing the event counts by the scores defined above.
For this we define the total number of weighted events at t_{k} as:
Further, the total number of individuals at risk at t_{k} is defined as:
Note that individuals can be partly at risk only as long as they are still under observation, i.e. they have had a nonfatal event but no fatal event and have not yet been censored.
Analogously, the groupspecific number of weighted events and number of individuals at risk can be defined, denoted by an additional upper index I or C.
Using this, the survival probabilities can be calculated (recursive formula for the Kaplan-Meier estimate):
For the group-wise calculation of these weighted survival probabilities only the corresponding individuals and weights within each group are used. As mentioned in the publication of Westerhout et al. [17], the common log-rank test can be used in a modified version to test the hypothesis that these weighted survival probabilities are the same for both groups.
The test statistic is given as follows:
The test statistic T^{B} is approximately standard normally distributed. Thus, the hypothesis of equal weighted survival probabilities between the groups is rejected if T^{B}≤−z_{1−α}, where z_{1−α} is the (1−α)-quantile of the standard normal distribution and α is the one-sided significance level.
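As an illustration only, the two-stage procedure can be sketched as follows. The score bookkeeping used here (each event reduces an individual's score by the event's weight, capped at the remaining score) is our reading of the description above, not code taken from Bakal et al.; the Kaplan-Meier recursion itself follows the standard product form with weighted event counts and weighted risk sets. All data are hypothetical.

```python
def weighted_km(event_stream, w):
    """Weighted Kaplan-Meier sketch under an assumed score bookkeeping:
    each individual starts with score 1; an event of type e reduces the
    score by w[e], so a fatal event (weight 1) removes the individual from
    the risk set while a nonfatal event leaves it partly at risk.

    event_stream: dict id -> sorted list of (time, event_type)
    w           : dict event_type -> relevance weight in (0, 1]
    returns     : list of (time, weighted survival probability)
    """
    scores = {i: 1.0 for i in event_stream}            # everyone fully at risk
    times = sorted({t for ev in event_stream.values() for t, _ in ev})
    surv, s = [], 1.0
    for t in times:
        n_w = sum(scores.values())                     # weighted risk set
        d_w = 0.0                                      # weighted event count
        for i, ev in event_stream.items():
            for tt, e in ev:
                if tt == t and scores[i] > 0:
                    step = min(w[e], scores[i])        # cannot lose more than remains
                    d_w += step
                    scores[i] -= step
        if n_w > 0:
            s *= (1.0 - d_w / n_w)                     # recursive KM step
        surv.append((t, s))
    return surv

# hypothetical data: patient 1 has an MI then dies, patient 2 is event-free,
# patient 3 has a single MI; MI is down-weighted to 0.5
events = {1: [(1.0, "M"), (2.0, "D")], 2: [], 3: [(1.5, "M")]}
weights = {"M": 0.5, "D": 1.0}
surv = weighted_km(events, weights)
```

Group-wise curves, as described above, would be obtained by running this computation separately on the individuals of each group.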
Simulation study
To provide a systematic comparison of the methods described in the previous section, we conducted a simulation study. As before, we consider a composite endpoint combining two event types: one fatal event given by death (D) and one nonfatal event given by myocardial infarction (M). For all scenarios, 200 individuals per data set were generated, with 100 in each treatment group. A follow-up of three years was assumed, i.e. administrative censoring of an individual's follow-up after three years. Hence, the maximum number of events is limited by this observational period and influenced by the underlying event distribution. The mean event count per scenario is given in Table 3. In the simulation, we additionally limited the maximal event count per individual to 100. Patients who do not have an event up to that time point remain in the analysis with a censored time point. The effect estimates and tests are evaluated at three years, i.e. at the end of the study period.
In Table 2 the simulation scenarios are listed. Columns 2 to 5 show the assumed underlying hazard functions. The hazards are displayed as products of the baseline hazards and the cause-specific effects to underline the assumption of equal baseline hazards. The cause-specific hazards are assumed to follow either an exponential or a Weibull distribution. The continuous event times are generated as described by Bender et al. [23] for the fatal event and as described by Jahn-Eimermacher et al. [24] for the nonfatal recurrent event. To gain first insights into the performance of the three methods, we consider scenarios where the baseline hazards and hazard ratios do not change depending on previous events, i.e. there are no strata-specific effects.
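The inverse-transform generation of such event times under proportional hazards can be sketched as follows. This is a simplification of the cited procedures: the recurrent event times are generated from gap times with the hazard restarting after each event, whereas Jahn-Eimermacher et al. describe a total-time formulation; all parameter values are hypothetical.

```python
import random
from math import exp, log

def event_time(u, lam, beta, x, nu=1.0):
    """Inverse-transform draw (Bender et al.-style) for a Weibull baseline
    hazard lam * nu * t**(nu-1) under proportional hazards with group
    indicator x; nu = 1 gives the exponential special case.
    Solves S(T) = exp(-lam * T**nu * exp(beta * x)) = u for T."""
    return (-log(u) / (lam * exp(beta * x))) ** (1.0 / nu)

def recurrent_times(rng, lam, beta, x, followup, nu=1.0):
    """Simplified gap-time generation of recurrent event times up to the
    administrative censoring time (hazard restarts after each event)."""
    t, out = 0.0, []
    while True:
        # 1 - random() lies in (0, 1], avoiding log(0)
        t += event_time(1.0 - rng.random(), lam, beta, x, nu)
        if t > followup:
            return out
        out.append(t)

rng = random.Random(42)
# hypothetical rates: baseline rate 0.8/year for MI, cause-specific HR 0.7
times_i = recurrent_times(rng, 0.8, log(0.7), 1, followup=3.0)
```

Administrative censoring at three years, as in the simulation design above, corresponds to discarding all generated event times beyond the follow-up horizon.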
The considered weights for the different weighting approaches are listed in columns 6 to 9. For the Wei-Lachin approach, the weights for the fatal and nonfatal event are chosen to sum up to 1 and such that the ratio between the weights is equal to the weight ratio of the other two approaches, \(\frac {w^{*}_{M}}{w^{*}_{D}}\), as given in column 9. For the fatal event the weight is set to 1 for the approaches by Rauch et al. and Bakal et al. The weights for the nonfatal event range between 0.1 and 0.9 for the approaches by Rauch et al. and Bakal et al. Scenarios a–e depict these weight changes.
For Scenario 1, equal time-independent baseline hazards for the event types are assumed, as well as equal cause-specific effects. In Scenarios 2 to 5, different cause-specific effects are assumed. In Scenarios 4 and 5 the cause-specific effects of the two event types point in opposite directions. In Scenarios 6 to 9 one baseline hazard is time-dependent, but the cause-specific effects and weights are as in Scenarios 2 and 3. For Scenarios 10 and 11, non-proportional cause-specific hazards are considered, resulting in a time-dependent effect estimand.
For each scenario, 2000 data sets were simulated and analyzed. In case of non-convergence of an approach, the data set was excluded.
We used the statistical software R (versions 3.6.1 and 4.0.3) [20] for the simulation study. R uses the Mersenne Twister [25] for generating random numbers.
Example data
To further illustrate the methods, we apply all three of them to an open-source clinical study data set named readmission, available within the R package frailtypack [26]. The data are taken from a study published by González et al. in 2005 [27]. They analyzed 403 patients with a new diagnosis of colorectal cancer who had surgery between January 1996 and December 1998. The patients were actively followed up until 2002. Time to rehospitalization and time to death after surgery were included in the data set. A total of 458 readmissions were observed, and 112 patients died within the study period. The maximal event count for a patient in the data set is 23 and the mean individual event count is 2.6 (± 2.8). The primary study aim is to compare the number of observed fatal and nonfatal events between patients who received chemotherapy (217 (53.8%)) and those who did not (186 (46.2%)). Since death as a fatal event is assumed to be more clinically relevant, a higher weight is assigned to death than to readmission. However, results of different weighting schemes are shown for illustration. In clinical practice and confirmatory trials the weighting scheme should be prespecified, and other weighting schemes as well as the unweighted case can be chosen as sensitivity analyses.
Results
Results of simulation study
In Table 3 the results of the simulation study are displayed.
We start by looking at the estimands, estimators, and corresponding root mean squared errors for the Wei-Lachin approach and the approach by Rauch et al., since the deviation from the true simulated values is of primary interest. Recall that for the approach by Bakal et al. there is no estimand and thus no estimator.
The true effects (estimands) for the Wei-Lachin approach and the approach by Rauch et al. are in most scenarios similar in magnitude and even equal in some cases (if cause-specific hazards and hazard ratios are equal between event types). With less influence of the recurrent event (i.e. a smaller weight; going from scenario a to e), the composite effect gets closer to the effect of the terminal event, i.e. the effect of the terminal event tends to suppress the effect of the recurrent nonfatal event. This effect is more or less pronounced depending on the underlying cause-specific hazards.
The estimators and corresponding standard deviations, and thus the mean squared errors, are also similar (or equal) for the two approaches within all scenarios. The estimators likewise show that with less influence of the recurrent event the composite effect gets closer to the effect of the terminal event.
For the approaches by Wei-Lachin and by Rauch et al. it can be seen that with a decreasing weight for the recurrent event the variability of the estimator increases (i.e. a higher mean squared error is observed when changing from Scenarios a to e). The mean squared error is highest (mostly due to higher variability in estimation) in scenarios with time-dependent hazards (Scenarios 6 to 11). The root mean squared error is best suited to compare the bias and variability of the estimators. Since these values are almost the same for both methods, the Wei-Lachin approach and the approach by Rauch et al. perform equally well in terms of mean squared error.
For Scenarios 10 and 11 the composite effect is time-dependent, but in our scenarios we only evaluate and test the effect at a given time point, i.e. three years. In this case the estimated effect might be closer to the true underlying effect at some time points, while at other time points estimation might result in major bias. In Scenario 5 a composite estimand greater than 1, i.e. an effect in favor of the control, is given. The estimators capture this. Since we consider a one-sided null hypothesis, the power observed within Scenario 5 is almost 0. In Scenario 4 the composite estimand is closer to 1 than in other scenarios (except Scenario 5). Hence, smaller power values are observed due to the one-sided study design.
The following observations are made for the power values: The power for the approach by Bakal et al. is the lowest in most scenarios. In some scenarios the power for the approach by Bakal et al. is similar to the power observed for the Wei-Lachin approach. For the approach by Rauch et al. the highest power is seen in most scenarios. For Scenarios 1a–e, where the estimand remains the same for all weighting schemes, it is seen that the power decreases with a decreasing weight for the nonfatal event (i.e. from Scenario 1a to 1e). In Scenarios 3 and 7 the power decreases although the estimands increase. In these scenarios a smaller effect for the recurrent event is assumed, and as the weight decreases its influence on the effect estimate decreases as well; hence the power is based on the less frequent fatal event, which leads to more variability. In scenarios where the composite effect approaches 1 with a smaller weight for the recurrent event (i.e. Scenarios 2 and 6), the power decreases sharply.
Results of application
Table 4 shows the results for the example data set. For different weighting schemes, the p-values of a one-sided test are given for all three approaches. For the methods by Wei and Lachin and by Rauch et al., the estimated weighted effect measure is also shown.
The estimated unweighted cause-specific hazard ratios comparing patients with chemotherapy to patients without chemotherapy are 0.77 for the event readmission and 1.44 for the event death. Note that they point in opposite directions, i.e. patients who received chemotherapy have a higher chance of dying compared to patients who did not receive chemotherapy. In contrast, patients who are treated with chemotherapy have a lower chance of experiencing readmission compared to those without chemotherapy. This can also be seen in the results of all three methods: with a lower weight for hospitalization, the difference between the patients with chemotherapy and those without increases, i.e. the results more and more depict the difference seen for the death event alone, as reflected in the estimator, which becomes larger. In the example, the difference between the estimated weighted effect measures for the approaches by Wei and Lachin and by Rauch et al. is more prominent than in the simulation study, which might be due to the higher event count for the nonfatal event. The p-value for the approach by Bakal et al. is always the highest and hence only indicates significance if readmission is essentially ignored in the analysis, i.e. has a very low weight.
Discussion
The analysis of composite endpoints combining events of different clinical relevance with potentially recurrent events is a challenging task in cardiovascular or oncologic trials. We are therefore the first to compare three methods proposed in the literature in order to give an overview of their properties in different clinical data situations. This should help applied researchers to choose an adequate method in future clinical trials. The proposed methods differ in their properties and assumptions. However, for all approaches the choice of the weighting scheme should be based on the clinical relevance of the event types.
Wei and Lachin proposed an approach where the prespecified relative weights act on the cause-specific log-hazard ratios. For this approach not only an estimand is given but also a closed-form variance and thus confidence intervals. In our simulation study, the power of this multivariate testing procedure mostly lay between that of the other two approaches, but was more similar to that of the approach by Bakal et al. This can be explained by the fact that the weights act on the cause-specific effects, which are thus estimated separately. The combined effect is then a weighted average of the individually estimated cause-specific effects. The estimation is thus based on a smaller event count, which results in a higher variability for each cause-specific effect, i.e. higher variances are combined in the multivariate procedure. Furthermore, because the weights act only on the cause-specific effects, the event count and the distribution of events are not considered. Thus, a high cause-specific effect that is based on a low event number still has a great impact on the weighted composite effect, which might be questionable, as an effect based on a small event count has a high standard error. On the other hand, an effect estimated with high uncertainty can still be relevant for clinical practice, so there are several views on this aspect.
Rauch et al. proposed an approach that extends the common all-cause hazard ratio and thereby naturally comes with an underlying estimand. Although an estimand is given, no closed-form variance and thus no confidence intervals could be derived. However, the corresponding weight-based log-rank test (which was extended to a stratified approach in the present study to account for recurrent events) showed the highest power in our simulation study, with properties (e.g. mean squared error) similar to the approach by Wei and Lachin. The prespecified relevance weights act on the cause-specific hazards and thus on the event counts. Hence, the weighted all-cause effect does not rely exclusively on the cause-specific effects. This is an advantage, because in a situation where a low event number goes along with an observed high cause-specific effect, the influence on the weighted composite effect is reduced, i.e. a more reliable effect estimate can be obtained.
Bakal et al. proposed a weighted estimate for survival probabilities in a Kaplan-Meier-type estimation approach. They did not provide an estimand, and thus no effect estimator can reasonably be reported. The predefined relevance weights within this approach act on the event counts as well as on the number of patients at risk. Although the principal concept of Bakal’s approach seems appealing, the method lacks a theoretical foundation, an underlying model, and a prespecified estimand. Our results moreover show the lowest power for this approach in most scenarios. We therefore cannot recommend the approach by Bakal et al.
For the approaches by Wei and Lachin and by Rauch et al., however, the results should be interpreted with care if the proportional hazards assumption is not met for the components. In this case the composite effect is time-dependent, which is not captured within these approaches, i.e. they assume constant effects. Hence the effect might be correctly estimated at some time points, but major bias might be observed at others. For the nonparametric approach by Bakal et al. there is no proportional hazards assumption, but since no theoretical model is stated, it is not possible to evaluate the performance in terms of bias.
The Wei-Lachin approach assumes a constant composite effect over time. Within the approach by Bakal et al., time dependence is also not considered. Although a time-dependent estimate can be obtained for the approach by Rauch et al., the stratified weight-based log-rank test does not incorporate this time dependence.
This means that both the approach by Wei and Lachin and the approach by Rauch et al. make strong assumptions: proportional hazards are needed for the different causes and on the strata level, which is usually not met in clinical practice. Rauch et al. developed their estimand based on the assumption of a specific underlying survival distribution (parametric model). To derive a nonparametric formulation, equal cause-specific baseline hazards are needed. However, it was shown that this nonparametric approach is robust against a misclassification [15].
Furthermore, a disadvantage of all three methods is that the dependence between the fatal event and the recurrent event process is not modeled; this could be addressed by joint frailty models [8, 9].
In future studies, the evaluation of the illustrated methods within a two-sided test problem might be of interest to confirm our results for the one-sided case (we do not expect any differences). Furthermore, the type I error should be evaluated in different scenarios, since this was only marginally captured within this work, i.e. only once, when the weighted composite estimand was 1 in the Wei-Lachin approach (Scenario 4a). It should be noted that several constellations yield a weighted estimand of 1. Robust standard errors should generally be applied within recurrent time to event analyses, which might also influence statistical significance and type I error; hence it should be evaluated how they can be incorporated within a log-rank-type test statistic, since the log-rank-type test statistics (Rauch et al., Bakal et al.) do not allow such an extension at the moment. More complex scenarios should also be evaluated, e.g. where a correlation between event types is simulated or where more than two event types are considered. We considered only the three methods for which it was originally described that the weighted components within a composite endpoint can be extended to multiple events per patient. However, it might still be useful to compare other methods for weighted composite endpoints, e.g. the one by Buyse [10], who described how to perform generalized pairwise comparisons between two groups of observations with prioritized outcomes. As this approach is not based on a time to event model, we did not consider it within this paper.
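For such simulation scenarios, a typical data-generating step draws competing exponential event times by inversion (T = -log(U)/lambda, cf. Bender et al. [22]) and observes recurrent nonfatal events until death or end of follow-up. The following sketch is a hypothetical minimal version of that step; all hazards and names are illustrative, not the settings used in this study.

```python
# Hypothetical sketch of a data-generating step for recurrent event
# simulations: competing exponential event types via inversion, with
# nonfatal events observed until death or censoring.
import math
import random

def next_gap(hazards, rng):
    """Gap time and type of the next event among competing types."""
    gaps = [-math.log(1.0 - rng.random()) / lam for lam in hazards]
    g = min(gaps)
    return g, gaps.index(g)

def simulate_patient(nonfatal_hazards, death_hazard, followup, rng):
    """Recurrent nonfatal events until death or censoring at followup."""
    death = -math.log(1.0 - rng.random()) / death_hazard
    end = min(death, followup)
    t, events = 0.0, []
    while True:
        gap, k = next_gap(nonfatal_hazards, rng)
        t += gap
        if t >= end:
            break
        events.append((t, k))  # (event time, event type)
    return events, (death if death < followup else None)

rng = random.Random(1)
events, death_time = simulate_patient([0.5, 0.2], 0.1, 10.0, rng)
```

With constant hazards, gap-time and total-time simulation coincide by memorylessness; for time-varying hazards a total-time-scale scheme as in Jahn-Eimermacher et al. [24] would be needed instead.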
We were only interested in the estimation of the composite effect, but in clinical studies the cause-specific effects should also be reported, as recommended by several guidelines [28-30]. It should also be noted that the events considered in the composite endpooint should all be harmful or all be favorable; a mixture of harmful and favorable events must be avoided.
Conclusion
In conclusion, for clinical studies where a two-group comparison with respect to a composite endpoint combining (recurrent) events of different clinical relevance is of interest, two approaches with different pros and cons can be recommended: the approach by Rauch et al. is recommended for its intuitive interpretation, although it provides only bootstrap confidence intervals for the effect estimate; the approach by Wei and Lachin might be preferred when all event types show a reasonable event count and when the derivation of confidence intervals is central. The approach by Bakal et al. in its current form should be applied with care, as a theoretical foundation is lacking.
Availability of data and materials
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
Abbreviations
B: Bakal
C: Control
D: Death
I: Intervention
M: Myocardial infarction
R: Rauch
L: Wei-Lachin
w: weight
References
Lubsen J, Kirwan BA. Combined endpoints: can we use them? Stat Med. 2002; 21(19):2959–70.
Rauch G, Beyersmann J. Planning and evaluating clinical trials with composite time-to-first-event endpoints in a competing risk framework. Stat Med. 2013; 32(21):3595–608.
Cox DR. Regression models and life-tables. J R Stat Soc Ser B Methodol. 1972; 34(2):187–220.
Andersen PK, Gill RD. Cox’s regression model for counting processes: A large sample study. Ann Stat. 1982; 10(4):1100–20.
Wei LJ, Lin DY, Weissfeld L. Regression analysis of multivariate incomplete failure time data by modeling marginal distributions. J Am Stat Assoc. 1989; 84(408):1065–73.
Prentice RL, Williams BJ, Peterson AV. On the regression analysis of multivariate failure time data. Biometrika. 1981; 68(2):373–79.
Ozga A-K, Kieser M, Rauch G. A systematic comparison of recurrent event models for application to composite endpoints. BMC Med Res Methodol. 2018; 18(2):1–12.
Mazroui Y, Mathoulin-Pelissier S, MacGrogan G, Brouste V, Rondeau V. Multivariate frailty models for two types of recurrent events with a dependent terminal event: application to breast cancer data. Biom J. 2013; 55(6):866–84.
Rondeau V, Mathoulin-Pelissier S, Jacqmin-Gadda H, Brouste V, Soubeyran P. Joint frailty models for recurring events and death using maximum penalized likelihood estimation: application on cancer events. Biostatistics. 2007; 8(4):708–21.
Buyse M. Generalized pairwise comparisons of prioritized outcomes in the twosample problem. Stat Med. 2010; 29(30):3245–57.
Bebu I, Lachin JM. Large sample inference for a win ratio analysis of a composite outcome based on prioritized components. Biostatistics. 2016; 17(1):178–87.
Péron J, Buyse M, Ozenne B, Roche L, Roy P. An extension of generalized pairwise comparisons for prioritized outcomes in the presence of censoring. Stat Methods Med Res. 2016; 27(4):1230–39.
Pocock S, Ariti C, Collier T, Wang D. The win ratio: a new approach to the analysis of composite endpoints in clinical trials based on clinical priorities. Eur Heart J. 2012; 33(2):176–82.
Rauch G, Kunzmann K, Kieser M, Wegscheider K, Koenig J, Eulenburg C. A weighted combined effect measure for the analysis of a composite time-to-first-event endpoint with components of different clinical relevance. Stat Med. 2018; 37(5):749–67.
Ozga A-K, Rauch G. Introducing a new estimator and test for the weighted all-cause hazard ratio. BMC Med Res Methodol. 2019; 19(118):1–16.
Bakal J, Westerhout C, Armstrong P. Impact of weighted composite compared to traditional composite endpoints for the design of randomized controlled trials. Stat Methods Med Res. 2015; 24(6):980–88.
Westerhout C, Bakal J. Novel approaches to composite endpoints in clinical trials. EuroIntervention. 2015; 11(1):122–24.
Lachin JM, Bebu I. Application of the Wei-Lachin multivariate one-directional test to multiple event-time outcomes. Clin Trials. 2015; 12(6):627–33.
Wei LJ, Lachin JM. Two-sample asymptotically distribution-free tests for incomplete multivariate observations. J Am Stat Assoc. 1984; 79(387):653–61.
R Core Team. R: A language and environment for statistical computing. 2018. https://www.r-project.org/. Accessed Aug 2021.
Hothorn T, Bretz F, Westfall P, Heiberger RM, Schuetzenmeister A, Scheibe S. Package ‘multcomp’: Simultaneous Inference in General Parametric Models. 2020. https://cran.r-project.org/web/packages/multcomp/multcomp.pdf. Accessed 2020.
Pipper CB, Ritz C, Bisgaard H. A versatile method for confirmatory evaluation of the effects of a covariate in multiple models. J R Stat Soc Ser C. 2012; 61(2):315–26.
Bender R, Augustin T, Blettner M. Generating survival times to simulate Cox proportional hazards models. Stat Med. 2005; 24(11):1713–23.
Jahn-Eimermacher A, Ingel K, Ozga A-K, Preussler S, Binder H. Simulating recurrent event data with hazard functions defined on a total time scale. BMC Med Res Methodol. 2015; 15(16):1–9.
Matsumoto M, Nishimura T. Mersenne twister: a 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Trans Model Comput Simul. 1998; 8(1):3–30.
Rondeau V, Mazroui Y, Gonzalez J. frailtypack: an R package for the analysis of correlated survival data with frailty models using penalized likelihood estimation or parametrical estimation. J Stat Softw. 2012; 47(4):1–28.
Gonzalez J, Fernandez E, Moreno V, Ribes J, Peris M, Navarro M, Cambray M, Borras J. Sex differences in hospital readmission among colorectal cancer patients. J Epidemiol Community Health. 2005; 59(6):506–11.
ICH Guideline. Statistical principles for clinical trials (E9). https://www.ema.europa.eu/en/documents/scientificguideline/iche9statisticalprinciplesclinicaltrialsstep5en.pdf. Accessed 11 Dec 2021.
Institut für Qualität und Wirtschaftlichkeit im Gesundheitswesen. General Methods - Version 5.0. https://www.iqwig.de/methoden/allgemeinemethodenversion50.pdf. Accessed 11 Dec 2021.
CPMP. Points to consider on multiplicity issues in clinical trials. https://www.ema.europa.eu/en/documents/scientificguideline/pointsconsidermultiplicityissuesclinicaltrialsen.pdf. Accessed 11 Dec 2021.
Acknowledgments
This work was supported by the Research Promotion Fund of the Faculty of Medicine (FFM), University Medical Center Hamburg-Eppendorf.
Funding
AO received funding for young scientists at the University Medical Center Hamburg-Eppendorf (Research Promotion Fund of the Faculty of Medicine, FFM). Open Access funding enabled and organized by Projekt DEAL.
Declarations
Contributions
AO implemented the simulations, produced the results and wrote the first draft of the manuscript. GR contributed to all parts of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1
Additional file for: ’Weighted composite time to event endpoints with recurrent events: comparison of three analytical approaches’.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Ozga, A.-K., Rauch, G. Weighted composite time to event endpoints with recurrent events: comparison of three analytical approaches. BMC Med Res Methodol 22, 38 (2022). https://doi.org/10.1186/s12874-022-01511-1
DOI: https://doi.org/10.1186/s12874-022-01511-1
Keywords
 Composite endpoint
 Time to event
 Recurrent events
 Relevance weights