Introducing a new estimator and test for the weighted all-cause hazard ratio

Background The rationale for the use of composite time-to-event endpoints is to increase the number of expected events and thereby the power by combining several event types of clinical interest. The all-cause hazard ratio is the standard effect measure for composite endpoints where the all-cause hazard function is given as the sum of the event-specific hazards. However, the effect of the individual components might differ, in magnitude or even in direction, which leads to interpretation difficulties. Moreover, the individual event types often are of different clinical relevance which further complicates interpretation. Our working group recently proposed a new weighted effect measure for composite endpoints called the ‘weighted all-cause hazard ratio’. By imposing relevance weights for the components, the interpretation of the composite effect becomes more ‘natural’. Although the weighted all-cause hazard ratio seems an elegant solution to overcome interpretation problems, the originally published approach has several shortcomings: First, the proposed point estimator requires pre-specification of a parametric survival model. Second, no closed formula for a corresponding test statistic was provided. Instead, a permutation test was proposed. Third, no clear guidance for the choice of the relevance weights was provided. In this work, we will overcome these problems. Methods Within this work a new non-parametric estimator and a related closed formula test statistic are presented. Performance of the new estimator and test is compared to the original ones by a Monte-Carlo simulation study. Results The original parametric estimator is sensible to miss-specifications of the survival model. The new non-parametric estimator turns out to be very robust even if the required assumptions are not met. The new test shows considerably better power properties than the permutation test, is computationally much less expensive but might not preserve type one error in all situations. A scheme for choosing the relevance weights in the planning stage is provided. Conclusion We recommend to use the non-parametric estimator along with the new test to assess the weighted all-cause hazard ratio. Concrete guidance for the choice of the relevance weights is now available. Thus, applying the weighted all-cause hazard ratio in clinical applications is both - feasible and recommended. Electronic supplementary material The online version of this article (10.1186/s12874-019-0765-1) contains supplementary material, which is available to authorized users.


Background
In many clinical trials, the aim is to compare two treatment groups with respect to a rarely occurring event like myocardial infarction or death. In this situation, a high number of patients has to be included and observed over a long period of time for a demonstration of a relevant treatment effect and to reach an acceptable power. Combining several events of interest within a so-called the effects of the individual components which can differ in magnitude or even in direction [5][6][7]. Second, the distinct event types could be of different clinical relevance. For example, the fatal event 'death' is more relevant than a non-fatal event like 'cardiovascular hospital admission' . Moreover, the less relevant event often contributes a higher number of events and therefore has a higher influence on the composite effect than the less relevant event.
Current guidelines on clinical trial methodology hence recommend to combine only events of the same clinical relevance [3,8]. However, this is rather unrealistic in clinical practice, as important components like 'death' cannot be excluded from the primary analysis if a fatal event is clearly more relevant than any other non-fatal event. Therefore, to address the problems that arise within the analysis of a composite endpoint other methods to ease the interpretation of results are needed. An intuitive approach could be to define a weighted composite effect measure with weights that reflect the different levels of clinical relevance of the components. Weighted effect measures have been proposed and compared by several authors [9][10][11][12][13][14]. Some of the main disadvantages of these approaches include the high dependence on the censoring mechanism and on competing risks [13,14]. Recently, Rauch et al. [15] proposed a new weighted effect measure called the 'weighted all-cause hazard ratio' . This new effect measure is defined as the ratio between the weighted average of the cause-specific hazards for two groups. Thereby, the predefined weights are assigned to the individual cause-specific hazards. With equal weights for the components the weighted all-cause hazard ratio corresponds to the common all-cause hazard ratio and thus defines a natural extension of the standard approach.
Although this new weighted effect measure seems an elegant solution to overcome interpretation problems, the originally published approach has several shortcomings: 1. The proposed original estimator for the weighted all-cause hazard ratio requires pre-specification of a parametric survival model to estimate the individual causespecific hazards. The form of the survival model, however, is usually not known in the planning stage of a trial. 2. No closed formula for a corresponding test statistic was introduced but a permutation test was used instead which comes along with a high computational effort. 3. No clear guidance for the choice of the relevance weighting factors was provided. In this work, we want to address these issues to make the weighted all-cause hazard ratio more appealing for practical application. In particular we will provide answers to the following questions: • How robust is the original estimator for the weighted all-cause hazard ratio against miss-specifications of the underlying parametric survival model?
• How robust is the new alternative non-parametric estimator for the weighted all-cause hazard ratio? • How can we derive a closed formula test statistic for testing the weighted all-cause hazard ratio? • How do the different estimators and tests behave in a direct performance comparison? • What are the required steps when choosing adequate weighting factors in the planning stage?
This paper is organized as follows: In the Methods Section, we start by introducing the standard unweighted approach for analysing a composite time-to-first event endpoint. In the same section, the weighted all-cause hazard ratio is introduced as well as the original parametric estimator and the permutation test as recently proposed by Rauch et al. [15]. A new non-parametric estimator for the weighted all-cause hazard ratio and a related closed formula test is introduced subsequently. Next, we provide a step-by-step guidance on the choice of the relevance weighting factors. In the Results Section, the different estimators and tests for the weighted all-cause hazard ratio are compared by means of a Monte-Carlo simulation study to evaluate their performance for various data scenarios, in particular those who meet and those who violate the underlying model assumptions. We discuss our methods and results and we finish the article with concluding remarks.

Methods
The standard all-cause hazard ratio The interest lies, throughout this work, in a two-arm clinical trial where an intervention I shall be compared to a control C with respect to a composite time-to-event endpoint. A total of n individuals are randomized in a 1 : 1 allocation to the two groups. The composite endpoint consists of k components EP j , j = 1, ..., k. It is assumed that a lower number of events corresponds to a more favourable result. The observational period is given by the interval [ 0, τ ]. The study aim is to demonstrate superiority of the new intervention and therefore a one-sided test problem is formulated.

Definitions and test problem
The all-cause hazard function for the composite endpoint is parametrized as where X i is the treatment indicator which equals 1 when the individual i belongs to the intervention group and 0 when it belongs to the control. Equivalently, the causespecific hazards for the components are given as Note that the hazard for the composite endpoint is the sum of the cause-specific hazards for the components [15] λ The all-cause hazard ratio for the composite is given as where the indices I and C denote the group allocation and proportional hazards are assumed so that θ CE is constant in time. Note that the proportional hazards assumption can only hold true for both the composite and for the components if equal cause-specific baseline hazards are assumed across all components. As motivated above, a one-sided test problem for the allcause hazard ratio is considered. The hypotheses thus read as (2)

Point estimator and test statistic
For estimating the all-cause hazard ratio, a semiparametric estimator for the all-cause hazard ratio θ CE can be obtained by means of partial maximum-likelihood estimator from the well-known Cox-model [1].
The most common statistical test to assess the null hypothesis stated in (2) is the log-rank test. Let t l , l = 1, ..., d, denote the distinct ordered event times for the pooled sample of both groups, where d is the total number of observed events irrespective of its type within the observational period [ 0, τ ]. Moreover, let d EP j ,l = d I EP j ,l + d C EP j ,l , l = 1, ..., d, j = 1, ..., k denote the observed amount of individuals that experience an event of type j at time t l in the pooled sample given as the sum of the specific group-wise number of events. Similarly, let d C EP j ,l , denote the observed amount of individuals that experience an event of any type until time t l in the pooled sample given as the sum of the group-wise number of events. The number of individuals at risk just before time t l is denoted as n l = n I l + n C l . The Nelson-Aalen estimators for the cumulative all-cause hazard functions over the entire observational period are given asˆ Under the null hypothesis stated in (2), the cumulative allcause hazards of both groups are equivalent. This means that the sum of the cause-specific hazards are assumed to be equivalent. This does not automatically imply that the cause-specific hazards are also equivalent. However, this more specific assumption is required to deduce the test statistic of the weight-based log-rank test. Under the null hypothesis (2) and the additional assumption that the cause-specific hazards are equivalent, the random variable D I l , l = 1, ..., d I , for randomly sampling d I l events from n I l patients where n I l is a subset of the pooled sample with n I l + n C l individuals including a total of d l events at a fixed time point t l is hypergeometrically distributed as Then the expectation of the additive D I l over all distinct n I l n C l (n I l + n C l − d l )d l (n I l + n C l ) 2 (n I l + n C l − 1) .
The corresponding log-rank test thus reads as [16] LR := t l ≤τ d I l − n I l d l n I l +n C l t l ≤τ The test statistic LR is approximately standard normally distributed under the null hypothesis given in (2). Negative values of the test statistic favour the intervention and therefore the null hypothesis is rejected if LR ≤ −z 1−α , where z 1−α is the corresponding (1 − α)-quantile of the standard normal distribution and α is the one-sided significance level.

The weighted all-cause hazard ratio Definitions and test problem
The idea of the weighted all-cause hazard ratio is to replace the standard all-cause hazard given in (1) by a weighted sum of the cause-specific hazards using predefined relevance weights for the individual components that refer to their clinical relevance. The weighted allcause hazard is then given as where the non-negative weights w EP j ≥ 0, j = 1, ..., k, are reflecting the clinical relevance of the components EP j , j = 1, ..., k. If the weights are all equally set to 1 (w EP 1 = w EP 2 = ... = w EP k = 1), then the weighted all-cause hazard corresponds to the standard all-cause hazard.
The 'weighted all-cause hazard ratio' as proposed by Rauch et al. [15] is then given as where the indices I and C denote the group allocation.
Note that the weighted all-cause hazard ratio is a timedependent effect measure except for the case of equal baseline hazards across the components [15] which refers to The weighted all-cause hazard ratio can also be integrated over the complete observational period In the remainder of the work, we will concentrate on the weighted all-cause hazard ratio at a predefined time-point for the sake of simplicity. Again, a one-sided test problem for the weighted all-cause hazard ratio is considered The hypotheses to be assessed in the confirmatory analysis are thus equivalent to the common unweighted approach.

Original point estimator and test statistic
In order to estimate the weighted all-cause hazard ratio Rauch et al. [15] proposed to identify and estimate the cause-specific hazards via a parametric survival model. Rauch et al. [15] thereby focused on the Weibull model. This approach is thus based on the assumption that the cause-specific hazards for each component are proportional. With the estimated cause-specific hazardŝ λ I EP j andλ C EP j derived from the Weibull model, a parametric estimator for the weighted all-cause hazard ratio is given byθ The pre-specification of a survival model to identify the cause-specific hazard must be seen as a considerable restriction as the shape of the survival distribution is usually not known in advance. Thus, it is of interest to evaluate how sensible the parametric estimator reacts when the survival model is miss-specified. Moreover, there is the general interest in deriving a less restrictive nonparametric estimator. A related variance estimator for (8) cannot easily be deduced and thus an asymptotic distribution of the parametric estimator given in (8) is not available. Therefore, Rauch et al. [15] considered a permutation test to test the null hypothesis specified above. For the permutation test the sampling distribution is built by resampling the observed data. Thereby, the originally assigned treatment groups are randomly assigned to the observation without replacement in several runs. Although this is an elegant option without the need to make further restrictive assumptions, the disadvantage is that such a permutation test is not available as a standard application in statistical software but requires implementation. Moreover, depending on the trial sample size and the computer capacities, this is a very time consuming approach.

New point estimator and closed formula test statistic
To derive the new point estimator, we will assume in the following that the baseline hazards for all individual components and for the composite are equivalent within each group, meaning that .., n, and thus (4) reads as This is a very restrictive assumption usually not met in practice.
The assumption is only required to formally derive the new non-parametric estimator. We do not generally focus on data situations were this assumption is fulfilled. The estimator is only relevant for practical use if deviations from this assumptions produce no relevant bias. This will be investigated in detail in the sections Simulation scenarios and Results.
Under this assumption, the baseline hazards in the representation of the weighted all-cause hazard ratio cancel out. By this, the weighted all-cause hazard ratio is no longer time-dependent. It is therefore possible to replace the cause-specific hazards by the cumulative cause-specific hazards: where EP j (t), j = 1, ..., k, refer to the corresponding cause-specific cumulative hazards over the period [ 0, t].
With X i equal to 1 if the individual i belongs to the intervention and 0 otherwise. This representation can be used to derive a non-parametric estimator for the weighted all-cause hazard ratio using the corresponding non-parametric Nelson-Aalen estimators given aŝ using the notations given in section Point estimator and test statistic. By this a non-parametric estimator for the weighted all-cause hazard ratio is given by In contrast to the parametric estimatorθ w CE (t) given in (8), the non-parametric estimator θ w CE (t) given in (10) does not require the pre-specification of a survival model. However, the correctness of the non-parametric estimator is still based on the assumption of equal cause-specific baseline hazards. In case the baseline hazards differ, θ w CE (t) can be calculated but represents a biased estimator for θ w CE (t). Therefore, it is of interest to evaluate how sensible the non-parametric estimator reacts when the equal baseline hazards assumption is violated.
An alternative testing procedure to the discussed permutation test can be formulated by a weight-based logrank test statistic derived from a modification of the common log-rank test statistic given in (3). We use the expression 'weight-based log-rank test' instead of 'weighted log-rank test' , as in the literature the weighted log-rank test refers to weights which are assigned to the different observation time points whereas we aim to weight the different event types of a composite endpoint.
Under the null hypothesis (7) and under the assumption that the weighted all-cause hazards are equal between groups, the random variable D I EP j ,l , j = 1, ..., k, for randomly sampling d I EP j ,l events of type EP j from n I l patients where n I l is a subset of the pooled sample with n I l +n C l individuals including a total of d l events at a fixed time point t l is hypergeometrically distributed as The expectation of the additive weighted D I EP j ,l over all distinct t l ≤ τ is given as and the variance is given as assuming that no events of different types occur at the same time point. Thus, the weight-based log-rank test for the proposed weighted effect measure can be defined analogous to (3) as Under the null hypothesis of equal weighted composite (cumulative) hazards the test statistic (11) is approximately standard normal distributed. Hence, the null hypothesis is rejected if LR w ≤ −z 1−α , where z 1−α is the corresponding (1 − α)-quantile of the standard normal distribution and α is the one-sided significance level.
Note that the common weighted log-rank test can be shown to be equivalent to the Cox score test [16] because the weights are working on the coefficient β and thus the partial likelihood and its logarithm can be easily deduced. The intention of the common weighted log-rank test is to weight the time points. However, in our weight-based log-rank test, the weights have another meaning and are working on the whole hazard not only on the coefficient. Thus, the log-likelihood translates to a form were the weights are additive and therefore the score test does not translate to the test statistic proposed in this work. This was also the reason why we called our test 'weight-based' and not 'weighted' log-rank test. Our test is valid but must be interpreted as a Wald-type test statistic.

Step-by-step guidance for the choice of weights
When using the weighted all-cause hazard ratio as the efficacy effect measure for a composite endpoint it is important to fix the weights in the planning stage of the study. This can be seen as a quite challenging task, as the choice of the weights importantly influences the final outcome and the interpretation of the results. Thus, it is important to choose the weights in a well-reflected way and not arbitrarily. To help researchers with this task, we provide detailed steps on how to choose appropriate weights for a specific clinical trial situation. When discussing the choice of weights, it must be kept in mind that by using the standard all-cause hazard, i.e. the unweighted scenario, this corresponds to equal weights for all components implying that event types with a higher event frequency are naturally up-weighted. Therefore, equal weights for all components can be considered at least as arbitrary as predefined weights according to relevance considerations. To define reasonable weights, we first recall the weighted all-cause hazard function as introduced in (4) The weighted all-cause hazard can also be interpreted as the standard all-cause hazard based on modified cause- Thus, by introducing the component weights we implicitly modify the event time distribution that is the corresponding survival function. When choosing a weight unequal to 1, the survival distribution changes its shape. For a weight larger than 1, the number of events artificially increases and as a consequence, the survival function decreases sooner. In contrast, for a weight smaller than 1 the survival distribution becomes more flat as the number of events is artificially decreased. Whereas the all-cause hazard ratio can be heavily masked by a large cause-specific hazard of a less relevant component, a more relevant component with a lower number of events can only have a meaningful influence on the composite effect measure, when it is up-weighted (or if the less relevant component is down-weighted accordingly). On the contrary, if a large cause-specific hazard is down-weighted this can result in a power loss. Therefore, weighting can improve interpretation but the effect on power can be positive or negative, depending on the data situation at hand. In order to preserve the comparability to the unweighted all-cause hazard ratio, we recommend to fix the weight of the most important component, which is often given by 'death' , to 1. All other weights should then be chosen smaller or equal to 1. When considering a weight for the most relevant component larger than 1, this results in endless possibilities and it becomes more difficult to set the weights for the other less relevant events in an adequate relation. The general recommendation of fixing all weights w EP j ≤ 1, j = 1, ..., k is moreover reasonable because choosing a set of weights which are both -smaller and greater than 1 -can cause a situation where the weighted all-cause hazard is equivalent to the standard all-cause hazard. This is problematic because in this case we cannot differentiate if the effect is due to the weighting scheme or due to the underlying cause-specific hazards. For illustration of the latter problem, consider two event types EP 1 and EP 2 with exponentially distributed event times, where EP 1 corresponds to the more relevant endpoint This leads to the standard all-cause hazard If the weights are chosen as w EP 1 = 1.3 and w EP 2 = 0.8 the weighted cause-specific hazards are given as and therefore, the weighted all-cause hazard is equivalently given by λ w CE (t) = 0.26 + 0.24 = 0.5. Choosing the weights w EP 1 = 1 and w EP 2 = 0.6 gives the weighted cause-specific hazards and therefore λ w CE (t) = 0.2 + 0.18 = 0.38, where the influence of the weights is now visible. Instead of interpreting the weighted hazards, for the applied researcher it might be easier to consider the corresponding weighted composite survival function S w CE (t) given as It can be seen that the weights are still working multiplicatively on the cumulative cause-specific hazards and the event time distributions for the different event types are also connected multiplicatively. By the introduction of the weights we still assume that an individual can only experience one event but (for weights smaller than 1) less individuals experience the event. This means that the expected number of events decreases with a weight smaller than 1. Therefore, the weighted survival function for the composite still corresponds to a time to first event setting but with a proportion of events which is lower compared to the unweighted approach.
Comparing the graphs of the weighted and unweighted event time distributions can be a helpful tool for choosing the weights as shown in Fig. 1 for the exemplary setting discussed above. It can be seen that both weighting schemes yield a larger difference between the event time distributions when comparing the intervention versus the control, however the second weighting scheme shows the larger difference.
In conclusion, we recommend to proceed as follows in order to choose the weights 1. Identify the clinically most relevant event type (e.g. 'death') and assign a weight of 1. 2. Choose the order of clinical relevance for the remaining event types. For each event type EP j you should answer the question "How many events of type EP j can be considered as equally harmful than observing one event (or any other amount of reference events) in the clinically most relevant endpoint?". For example, if in the example given above 5 events of type EP 2 are considered as equally harmful as one event of EP 1 , then the weighting scheme proposed in Scenario B might be preferred. If instead the researcher arguments that 5 events of type EP 2 are considered as equally harmful as 3 events of EP 1 , then the weighting scheme proposed in Scenario A should be preferred. The weights are thus mend to bring all events to the same severity scale. By assigning a weight of 1 to the most relevant event type, this event type acts as the reference event. Therefore, the weighted survival function and its summarizing measures (median survival, hazard ratio) can be interpreted as a standard survival function for the reference event. For example, if 'death' is the reference event, on a population and on an individual patient level, the weighted survival function then expresses the probability to be neither dead nor in a condition considered as equally harmful. The median weighted survival can be interpreted as the time when half of the population is either dead or in an equally harmful condition. 3. If there are some assumption about the form of the underlying event time distributions, then the functional form of the cause-specific hazards is known. The weighted cause-specific hazards are obtained by simple multiplication with the weighting factors. We recommend to choose several weighting scenarios and to plot the resulting weighted and unweighted event time distributions and to investigate graphically how different weights would affect the expected survival time and median survival per group. Moreover, the weighted and unweighted hazard ratio can be analytically deduced and compared. By this, the impact of the weighting scheme becomes more explicit.

Simulation scenarios
To provide a systematic comparison of the original parametric estimator θ w CE (t) to the new non-parametric estimator θ w CE (t) for the weighted all-cause hazard ratio and in order to analyse the performance of the weight-based log-rank test compared to the originally proposed permutation test we performed a simulation study with the software R Version 3.3.3 [17].
Within our simulation study, we investigate various data scenarios for a composite endpoint composed of two components EP 1 and EP 2 . We restrict our simulations to weights given by w EP 1 = 1, w EP 2 = 0.1 or w EP 1 = 0.1, w EP 2 = 1 where the two event types are thus considered to show a considerable difference in clinical relevance. The results for another less extreme weighting scheme are provided as Additional file 1 (i.e. weights 1 and 0.7). A total of 10 scenarios based on different underlying hazard functions were considered in order to mimic situations where the underlying assumptions of both approaches are fulfilled and those where they are (partly) violated. For the original parametric estimator, the cause-specific hazards were estimated by fitting Weibull models. A total of 1000 data sets each with n = 200 patients (100 patients per group) were simulated for each scenario. The amount of simulated data sets was limited to 1000 because of the time-consuming runtime of the permutation test which was based on 1000 runs. We used the pseudo-random generator Mersenne Twister [18]. For simulating the underlying event times, the approach described by Bender et al. [19] was used. The minimal follow-up was either fixed to τ = 1 or τ = 2 year(s). For each scenario the methods were compared on the same data sets. In case of non-convergence of a model, the data set was excluded. Table 1 lists the underlying hazard functions for the different simulation scenarios and summarizes briefly which assumptions are met. In Fig. 2 the corresponding weighted and unweighted event time distributions for the composite for the intervention and the control group are graphically displayed for all 10 scenarios. In addition, the related weighted and unweighted hazard ratios for the composite are visualized along with unweighted cause-specific effects.
For Scenario 1-7 the cause-specific hazards are Weibull, or exponentially, distributed with the hazard of the form [20] λ(t) = κ · ν · t ν−1 . Thereby, κ > 0 is the scale parameter and ν > 0 is the shape parameter. The investigated scenarios show to some extend the flexibility of the Weibull model. Situations with earlier occurring events for one event type (higher cause-specific hazard) and later occurring events for the other event type (lower cause-specific hazards) are capture as well as situations where the difference in hazards is smaller. In the scenarios 1-6 at least one causespecific hazard is time-dependent whereas in Scenario 7 the hazards are constant. The hazard for the composite increases over time for the Scenarios 1 and 3 and decreases for Scenario 2. For the Scenarios 4 and 5 the hazard first decreases and then increases after a while. For the Scenarios 1 and 3 the proportional hazards assumption is fulfilled for each of the event types simultaneously. Also note that in Scenario 3 and partly in the Scenarios 4 and 5 the effects for the event types point into opposite directions. Scenario 6 depicts a situation where no treatment effect for the individual components and the composite exists. In Scenario 7 there are opposite effects for the individual components which cancel out in the combined composite for one weighting scheme.
As we aim to quantify how robust the original parametric estimator for the weighted all-cause hazard ratio based on the Weibull model is when the event times for the components in fact do not follow a Weibull distribution, Scenarios 8 to 10 are based on a Gompertz distribution. Like the Weibull model the Gompertz model fulfils the proportional hazards assumptions where the hazard is parametrized as which is also referred to the Gompertz-Makeham distributed hazard [20,21]. Again, κ > 0 is a scaling parameter and ν > 0 a shape parameter. In addition, a more general term ≥ −κ defining the intercept is formulated. For all Scenarios with Gompertz distributed event times the hazard for the composite increases over time. For the situation where the shape parameters are equal across all event types the proportional hazards assumption does apply to the composite. This is the case for Scenario 8 but not for the Scenarios 9 and 10. The proportional hazards assumption also holds true for each event type separately for the Scenarios 8 and 9. In Scenario 10 the proportional hazards assumption is violated for all event types and for the composite. Table 2 displays the results of the simulation study for all Scenarios 1 to 10. Columns 2 and 3 present check boxes for the underlying model assumptions of the two estimators. Especially for those scenarios where some assumptions are violated, we are interested in the (standardized) bias, the relative efficiency, and the coverage of the corresponding confidence interval for the different estimators (see Table 3). Thereby. the bias is quantified by comparing the mean logarthmized (natural) estimators (Table 2 Columns 10 and 11) to the corresponding natural logarithm of the true effect (Table 2 Column 7) which is fixed by the simulation setting. For the standardized bias the bias is divided by the corresponding standard error of the estimated effects. The relative efficiency is the quotient of the mean square error of the original estimator divided by the mean square error of the non-parametric estimator. A relative efficiency smaller than one is in favour of the original estimator. The mean square error is the sum of the quadratic bias and the quadratic standard error for the logarithmized estimators. The coverage is the proportion of times the 95% confidence interval for each estimated effect includes the true effect. To determine the confidence intervals, the standard error for all estimated effects is required which we obtained by the permutation distribution. Thereby, again logarithmic scale is used so that the estimators' distribution is not skewed and thus their standard deviation and the performance measures in Table 3 can be interpreted. In Column 4 of Table 2 the time point τ at which the estimators are evaluated is shown and Columns 5 and 6 show the underlying component weights. Note that by switching the weights between the two components, we implicitly investigate the influence of all hazard combinations when the relevance of the components is reversed. Column 7 displays the logarithmized true weighted allcause hazard ratio at time τ which can be obtained from the underlying cause-specific hazard functions. Columns 8 and 9 show the mean amount of events averaged over all data sets per scenario for all event types separately and its standard deviation. Columns 10 and 11 show the mean of the logarithmized estimated weighted all-cause hazard and its standard deviation based on the original parametric estimator θ w CE (τ ) and based on the new nonparametric estimatorθ w CE (τ ). Columns 12 to 13 show the empirical power values for the originally proposed permutation test based on θ w CE (τ ) and for the new weight-based log-rank test based onθ w CE (τ ). Note that the reported power values correspond to one-sided tests based on a one-sided significance level of 0.025. Table 3 depicts the amount of simulations that converged (Columns 2 and 3), the bias (Columns 4 and 5), the standardized bias (Columns 6 and 7), the square root of the mean square error (Columns 8 and 9), the relative efficiency (Column 10), and the coverage (Columns 11 and 12) for the logarithmized original and new estimators. Scenarios 1 and 3 reflect situations where the proportional hazards assumption is fulfilled for each component but the Weibull distributed cause-specific hazards are unequal and thus the composite effect is time-dependent. Since in this scenarios the assumptions for the original estimator is fulfilled it is intuitive that the (standardized) bias is small for the parametric estimator. Although the assumptions for the non-parametric estimator are violated the bias is still rather small. This good performance is also captured in the coverage which is mostly near the anticipated 95%. It is furthermore intuitive that the original estimator shows most often a smaller mean square error in relation to the non-parametric estimator. Note that in Scenario 3 the unweighted effects point into different directions but the direction of the weighted effect depends on the weighting scheme. In the Scenarios 2,

Results
1a   4, and 5 the proportional hazards assumption is not fulfilled for neither the components nor for the composite but the cause-specific hazards still follow a Weibull distribution. For Scenario 2 it can be seen that the estimated weighted effects are the same for both estimators but do not approach the true effect as good as in the Scenarios 1 and 3. This is because both approaches need at least the assumption of proportional hazards in the components. A similar outcome would be expected for the Scenarios 4 and 5. However, in both scenarios the parametric estimator performs much worse than the nonparametric estimator. This is due to the higher variability in the estimations. For Scenario 6 where there is no effect for the unweighted composite both approaches perform quite well. For the original estimator this was expected since its assumptions are fulfilled. In Scenario 7 with the weights 1 for event type 1 and 0.1 for event type 2 the true combined treatment effect is 0. This is also captured quite well in both estimators. Note that only for this specific weighting scheme the composite effect is 0 but not for the other weighting schemes. However, the performance of the estimation approaches is also satisfying for the other weighting schemes. In Scenario 8, Gompertz-Makeham distributed cause-specific hazards are assumed. Thereby, the proportional hazards assumption is fulfilled for the components and the composite. Thus, it is intuitive that the new non-parametric estimator closely coincide with the true effect. However, the parametric estimator based on the Weibull model is relevantly biased independent of the weighting scheme and shows a higher variability. Scenario 9 still depicts Gompertz-Makeham distributed cause-specific hazards but the proportional hazards assumption is only fulfilled for the components and not for the composite. Although the cause-specific baseline hazards are thus unequal the non-parametric estimator performs better in this scenario whereas the parametric estimator shows substantial bias and variability which might be also due to convergence problems. Scenario 10 represents Gompertz distributed cause-specific hazards where the proportional hazards assumption is neither fulfilled for the components nor for the composite. Compared to the two previous scenarios the performance of the parametric estimator has increased and is not globally worse than that of the non-parametric estimator. The performance depends on the weighting scheme. In here, not all τ -weight-combinations are displayed. However, the performance of the missing combination scenarios is comparable to the corresponding scenarios displayed.
In conclusion, the original parametric estimator turns out to be sensible against model miss-specifications for estimating the underlying cause-specific hazards as expressed by most values of the (standardized) bias and the coverage of the confidence intervals for Scenarios 4,5,8,9,10. In these scenarios, the performance of the non-parametric estimator tends to be better because not only the (standardized) bias is smaller and the coverage probability is better but also the relative efficiency favours the non-parametric approach. Moreover, in Scenarios 4 and 5 the (standardized) bias of the parametric estimator is smaller and its variation is considerably higher which cannot only be explained by the smaller amount of converged simulations. The higher amount of non-converging models for the original approach is furthermore a disadvantage. In scenarios where the assumption for the parametric estimator is fulfilled (Scenarios 1 and 3) its performance tends to be better than for the non-parametric approach. Although in these scenarios the assumption of equal cause-specific baseline hazards is violated, the performance of the non-parametric estimator is however not considerably worse than for the parametric estimator.
Except for Scenario 1b, the power of the weight-based log-rank test is uniformly equal or larger than the power of the permutation test. This power advantage in particular occurs in situations where the two point estimators coincide (Scenarios 2a and 2b or 10a) or even when the non-parametric estimator suggests a less extreme effect (Scenarios 8 or 9). For Scenario 6 where there is no effect for the components nor for the composite the permutation test in the investigated scenarios performs better in terms of preserving the type one error. In Scenario 7 where the composite effect is 0 for one weighting scheme the type one error is preserved for the permutation test as well as the weight-based log-rank test in this scenario.
If the weights are chosen to be 1 and 0.7, the performance comparisons basically come to the same results (compare Additional file 1). Summarizing the results of our simulation, the new non-parametric estimator and the corresponding weight-based log-rank test outperform the original estimator and the permutation test.

Discussion
In this work, we investigated a new estimator and test for the weighted all-cause hazard ratio which was recently proposed by Rauch et al. [15] as an alternative effect measure to the standard all-cause hazard ratio to assess a composite time-to-event endpoint. The weighted all-cause hazard ratio as a weighted effect measure for composite endpoints is appealing because it is a natural extension of the all-cause hazard ratio. It allows to regulate the influence of event types with a greater clinical relevance and thereby eases the interpretation of the results. Generally it must be noted that the weighted all-cause hazard ratio was introduced to ease the interpretation of the effect in terms of clinical relevance. The aim of the weighted effect measure is not to decrease the sample size or increase the power. The power of the weighted all-cause hazard ratio can be larger but may also be smaller than the power of the unweighted standard approach.