 Research
 Open Access
 Published:
Using the geometric average hazard ratio in sample size calculation for timetoevent data with composite endpoints
BMC Medical Research Methodology volume 21, Article number: 99 (2021)
Abstract
Background
Sample size calculation is a key point in the design of a randomized controlled trial. With timetoevent outcomes, it’s often based on the logrank test. We provide a sample size calculation method for a composite endpoint (CE) based on the geometric average hazard ratio (gAHR) in case the proportional hazards assumption can be assumed to hold for the components, but not for the CE.
Methods
The required number of events, sample size and power formulae are based on the noncentrality parameter of the logrank test under the alternative hypothesis which is a function of the gAHR. We use the web platform, CompARE, for the sample size computations. A simulation study evaluates the empirical power of the logrank test for the CE based on the sample size in terms of the gAHR. We consider different values of the component hazard ratios, the probabilities of observing the events in the control group and the degrees of association between the components. We illustrate the sample size computations using two published randomized controlled trials. Their primary CEs are, respectively, progressionfree survival (time to progression of disease or death) and the composite of bacteriologically confirmed treatment failure or Staphylococcus aureus related death by 12 weeks.
Results
For a target power of 0.80, the simulation study provided mean (± SE) empirical powers equal to 0.799 (±0.004) and 0.798 (±0.004) in the exponential and nonexponential settings, respectively. The power was attained in more than 95% of the simulated scenarios and was always above 0.78, regardless of compliance with the proportionalhazard assumption.
Conclusions
The geometric average hazard ratio as an effect measure for a composite endpoint has a meaningful interpretation in the case of nonproportional hazards. Furthermore it is the natural effect measure when using the logrank test to compare the hazard rates of two groups and should be used instead of the standard hazard ratio.
Background
Composite endpoints (CEs), defined as the union of several outcomes, are extensively used as a primary endpoint when designing a clinical trial. In timetoevent studies, CE refers to the elapsed time from randomization to the earliest observation among its components. It is common in oncology trials to use progressionfree survival (PFS) as a primary endpoint: this outcome is defined as the time elapsed from randomization to tumor progression or death from any cause, whichever occurs first [1]. In cardiovascular trials, major adverse cardiac event (MACE) is generally defined as a composite endpoint of time to cardiovascular death, myocardial infarction, stroke and target vessel revascularization [2]. Composite endpoints are as well often used for infectious diseases. In the ARREST trial [3], the primary endpoint was the time to bacteriologically confirmed treatment failure, disease recurrence or death.
In randomized controlled trials (RCTs), to assess the efficacy of an intervention on a timetoevent endpoint, the hazard ratio (HR) is routinely used. Design and analysis in most RCTs are based on the proportional hazards model, even when this proportionality is not met. Royston et al. [4] explored 55 comparisons in 50 published RCT and found evidence of nonPH at the 0.10 level almost in 1 out of 3 comparisons (31%) that had assumed proportional hazards for the sample size calculations. This is often the case in RCT with a CE even if the proportionality assumption holds for each component endpoint.
The conventional formulae for the required number of events with regard to the defined primary endpoint in timetoevent studies depends, given the specified significance level and power, on a single treatment effect summarized by the hazard ratio anticipated under the alternative hypothesis. The number of patients that have to be recruited to observe the calculated number of events depends on, among others, the probability of observing the event during the followup. In the context of a trial with a CE, if these formulae are to be used, it is necessary to decide on a summary for the hazard ratio of the CE, HR_{∗}(t).
Different summaries for HR_{∗}(t) have been put forward such as the average hazard ratio (AHR) proposed by Kalbfleisch and Prentice [5], and the geometric average hazard ratio (gAHR). Schemper et al. [6] compares these average hazard ratios, and explores weighted versions of AHR and gAHR. While under proportional hazards all definitions lead to the same values, under nonproportional hazards both the unweighted and the weighted versions of AHR and gAHR are close to each other except when the hazards cross. We emphasize here the use of the gAHR as it nicely connects with the logrank test, as opposed to the AHR. Our paper has some analogy with Rauch et al.’s work [7], who provide guidance on the practical use of the average hazard ratio introduced by Kalbfleisch and Prentice [5].
This paper focuses on the sample size calculation for a clinical trial with a primary composite endpoint. We start by introducing the notation, some of the assumptions and definitions. Next, the logrank test and the noncentrality parameter is set forth and based on the connection between the two, the number of events and sample size formulae for a twosample problem based on a twocomponent composite endpoint (CE) are provided. Following this, software CompARE (https://cinna.upc.edu/CompARETimeToEvent/) is introduced and we show how to use it to design trials with composite endpoints in the setting in which each composite endpoint approximately satisfies the proportionality hazard assumption. Its application is illustrated by means of two real RCTs, ZODIAC and ARREST. Then, the results of a simulation study of the empirical power based on the sample size formula previously derived are shown. These simulations are run for several scenarios including different values of the component causespecific hazard ratios, a wide range of probabilities of observing the events in the control group and different degrees of association between the components. We conclude the paper with a discussion.
Methods
Geometric average hazard ratio
In such an RCT, individuals are followed until the event of interest (\({\mathcal {E}}_{1}\) or \({\mathcal {E}}_{2}\)), the end of the study or censoring due to random loss to followup. For each group g (g=0,1) we denote by \(T^{(g)}_{1}\) and \(T^{(g)}_{2}\) the times to \({\mathcal {E}}_{1}\) and \({\mathcal {E}}_{2}\), respectively, and by \(T^{(g)}_{*}\) the time to the occurrence of \({\mathcal {E}}_{*} \), i.e., the earlier occurrence of \({\mathcal {E}}_{1}\) or \({\mathcal {E}}_{2}\). We are in this case dealing with a competing risk situation for which several approaches are possible, one well known alternative being the Fine and Gray model. However, we have chosen to model the causespecific hazards because it can be accomplished with standard methods for single types of events by treating all competing events as right censored at the time the competing event occurs [8]. In addition, most often the causespecific hazard is reported from previous studies that provide the basis for the assumed effect size If we view the events \({\mathcal {E}}_{1}\) and \({\mathcal {E}}_{2}\) as the two causes of the composite event \({\mathcal {E}}_{*}\), then for g=0,1, the causespecific hazard rates, \(\lambda ^{(g)}_{C1}(t)\) and \(\lambda ^{(g)}_{C2}(t)\), the allcause hazard rate corresponding to the hazard of the composite endpoint \(T^{(g)}_{*}, \lambda _{*}^{(g)}(t)\), and the survival function of \(T^{(g)}_{*}, S^{(g)}_{*}(t)\), are expressed, respectively, as
Denote by HR_{k}(t) (k=1,2) the causespecific hazard ratios, that is, the hazard ratios of the individual components, and by HR_{∗}(t) the allcause hazard ratio of the composite endpoint \(T^{(g)}_{*}\).
Define the geometric average hazard ratio, gAHR, as the exponentiated mean of the logarithm of the hazard ratio, that is,
where the expectation is taken with respect to a given eventtime distribution, which in this case is the average distribution of \(T_{*}^{(0)}\) and \(T_{*}^{(1)}\) (see Eq. 2 below). Although HR_{∗}(t) is expected to change over time, gAHR is independent of time, keeps its interpretability under non proportional hazards and, as we will see in “Logrank test for the composite endpoint T_{∗}” section, is the natural effect measure when using the logrank test.
Because within a clinical trial there is a maximum followup time, say τ, only a restricted version of the geometric average hazard ratio can be consistently estimated. Define the truncated geometric average hazard ratio at time τ,gAHR(τ) as
where \(f^{(a)}_{*}(t)=(f_{*}^{(0)}(t)+f_{*}^{(1)}(t))/2\) is the average of the density functions of \(T_{*}^{(0)}\) and \(T_{*}^{(1)}, f^{(g)}_{*}(t)= \frac { \partial }{\partial t} S^{(g)}_{*}(t)\) is the density function of \(T^{(g)}_{*} \) (g=0,1) and \(p^{(a)}_{*}(\tau)=\int _{0}^{\tau } f^{(a)}_{*}(t) dt =(p^{(0)}_{*}(\tau)+p^{(1)}_{*}(\tau))/2\) is the average probability of experiencing the event \({\mathcal {E}}_{*}\) over both groups by time τ.
The geometric average hazard ratio and the allcause hazard ratios take identical values under proportionality of the allcause hazard rates, that is, if \(HR_{*}(t)=\lambda _{*}^{(1)}(t)/\lambda _{*}^{(0)}(t)=h\) for 0<t<τ, then gAHR(τ)=HR_{∗}(t)=h, for 0<t<τ.
Logrank test for the composite endpoint T _{∗}
The hypothesis of no treatment difference when we are using the composite endpoint T_{∗} is stated in terms of the allcause hazard rates for T_{∗}, that is, \(H_{0}^{*}\!: \lambda _{*}^{(0)}(\cdot)=\lambda _{*}^{(1)}(\cdot)\).
The standard approach to assess the above comparison is the logrank test. Assume that we have n patients, with n^{(0)}=(1−π)·n allocated to treatment group 0 and n^{(1)}=π·n to group 1, where π is the proportion of individuals allocated to group 1. Denoting by d_{∗} the total number of patients from both groups who have experienced event \({\mathcal {E}}_{*}\) (either \({\mathcal {E}}_{1}\) or \({\mathcal {E}}_{2}\)), say at times t_{i} (i=1,⋯,d_{∗}) and by R^{(g)}(t_{i}) the number of individuals at risk at time t_{i} from group g (g=0,1), the logrank test statistic Z_{∗} can be expressed as
and one rejects \(H_{0}^{*}\) when Z_{∗} is large.
The large sample behaviour of Z_{∗} is studied by Schoenfeld [9] who shows that Z_{∗}, under the null hypothesis of equality of the survival distributions in the two groups, is asymptotically normal with mean 0 and unit variance. Since for any fixed alternative to \(H_{0}^{*}\), the power of Z_{∗} will typically go to 1 as n→∞, the large sample behaviour of Z_{∗} when \(H_{0}^{*}\) does not hold is studied for a sequence of contiguous alternatives to \(H_{0}^{*}\) which approach \(H_{0}^{*}\) as n→∞. That is, we view \(\lambda _{*}^{(0)}(\cdot)\) as fixed, let \(\lambda _{*}^{(1)}(\cdot)\) vary with n and define the sequence of contiguous alternatives to \(H_{0}^{*}\) as \(H^{*}_{a,n}\!:\lambda ^{(1)}_{*,n}(t)=\lambda _{*}^{(0)}(t)e^{g(t)/\sqrt n},\) stating that for any finite n, the two groups have a log hazard ratio at time t equal to \(g(t)/\sqrt n\). Under these conditions Z_{∗} is also approximately unitvariance normal, but with a nonzero mean that depends on the survival and censoring distributions in the two groups, and the proportion of subjects that are in each group. The asymptotic theory behind these results is analogous to what is done when studying the largesample properties of likelihoodbased tests in more standard settings (where there is no censoring). The reader is referred to Section 3.3 in Lehmann [10] for additional technical details about the contiguous alternative hypotheses.
Under any form of g(t), even constant, Gómez and Lagakos [11] applied this result for a composite endpoint under the assumption of noninformative censoring. They showed that for a sufficiently large time, τ, the noncentrality parameter μ_{∗} is, approximately, as follows
where \( f^{(0)}_{*}(t)\) is the marginal density function for \(T^{(0)}_{*}\).
This allows evaluation of the behaviour of the logrank test under alternatives where the hazard functions for the two groups are nonproportional, as is the case in the composite endpoints situation that we are dealing with. Furthermore, if we replace \( f^{(0)}_{*}(t)\) by the average of the density functions of \(T_{*}^{(0)}\) and \(T_{*}^{(1)}, f^{(a)}_{*}(t)\), which trivially equals to \( f^{(0)}_{*}(t)\) under \(H_{0}^{*}\), the expression (4) for the noncentrality parameter, becomes
or equivalently, using the expression for gAHR(τ) in (2),
showing that it depends on the geometric average hazard ratio without relying either on the proportionality of the causespecific hazard rates or on the allcause hazard rates.
Sample size estimation
Assume that you are planning a RCT based on a composite endpoint as the primary endpoint, that you are basing the comparison between the two groups on the logrank test statistic Z_{∗} given in (3) and that the geometric average hazard ratio is used as a measure of the treatment effect. From now on, the focus is to claim superiority of the new therapy (g=1), hence, the logrank test statistic Z_{∗} given in (3) will be used and the null hypothesis will be rejected for a onesided α significance level whenever Z_{∗}<−z_{α} where z_{α} is the αquantile of the standard normal distribution, noting that negative values of Z_{∗} favor the new therapy.
The asymptotic results in previous subsection may be applied to a fixed sample size n and a fixed alternative, hence the expression (6) of μ_{∗}(τ) can be used to plan the size, power and the duration of a study. Using that Z_{∗}, follows a normal distribution with mean μ_{∗}(τ) and variance 1, the power 1−β is such that
it follows from (7) that −z_{α}−μ_{∗}(τ)=−z_{1−β}=z_{β}⇒μ_{∗}(τ)=−(z_{α}+z_{β}), and equating with (6) we have
The total sample size for both groups is therefore as follows:
and the expected number of CE events \(e_{*}=n\cdot p_{*}^{(a)}(\tau)\) is given by
In the special case of equal sample sizes (π=0.5), (9) and (10) become, respectively,
To obtain expression (12) from expression (11) (or vice versa), the same followup period (τ) had to be assumed for all study participants, regardless of how recruitment was carried out. Although we consider this to be a common approach in clinical trials, it is not the only one. Another option would be for recruitment to take place over a certain duration and for subsequent followup to be done up to a fixed point in time. These and other strategies are well explained in Chapter 8 of the book of Friedman et al. [12]. Nevertheless, the key point is in the estimation of the number required of events because that is where a difference may be compared to other methods. Once the number of events required is estimated, the effect of the recruitment rate or the presence of different followup periods on sample size calculation — i.e. the number of patients required — would be the same across different methods. There are several references that deal with this issue. The article of Lachin et al. [13] shows how to go from the number of events to the number of patients under situations where patients enter the trial in a nonuniform manner over time or patients may exit from the trial due to loss to followup. Besides, Bernstein et al. [14] provide a routine in Fortran to estimate the probabilities of observing the event from the parameters referring to recruitment and followup times.
Observe that, formula (12) corresponds to George and Desu’s formula [15] (known these days as Schoenfeld’s formula [9]) for the required number of events if the hazard ratio is substituted by the geometric average hazard ratio. The main difference lies, however, in that while in Schoenfeld’s formula you are assuming that the hazard rates are proportional when dealing with a composite endpoint the allcause hazard rates do not have to be proportional. However, to be able to compute the gAHR(τ) you need extra distributional assumptions that are described in the next section.
CompARE: a software application to design trials with composite endpoints
We introduce here the webbased application CompARE (https://cinna.upc.edu/CompARETimeToEvent/) that will be used for the sample size computations. CompARE is a website specifically created to design clinical trials with composite endpoints. From input parameters such as the causespecific hazard ratios, the probabilities of observing each event, the shape parameters of the distributions of the time to each component (assumed Weibull) and the correlation between marginal distributions, CompARE computes and plots, among others, the hazard ratio along time, HR_{∗}(t), summaries such as the geometric average hazard ratio, gAHR(τ), and the restricted mean survival time, RMST(τ), and calculates the sample size for a given significance level and power.
CompARE has been built for balanced designs (equal sample size in both groups) and depends on anticipated values provided by the users. In particular, in order to compute the required sample size by means of (11) we need to compute gAHR(τ) and \(p_{*}^{(a)}(\tau)\) and to do so we have to make distributional assumptions and provide the values of several parameters. Specifically we need, for each group (g=0,1), the distribution of the composite endpoint \(T^{(g)}_{*}=\min \{T^{(g)}_{1}, T^{(g)}_{2}\}\), which is derived from the joint distribution between \(T^{(g)}_{1}\) and \(T^{(g)}_{2}\). In what follows we itemize the elements implemented in CompARE for a full characterization of the joint distribution.

1
The joint distribution between \(T^{(g)}_{1}\) and \(T^{(g)}_{2}\) is modeled through a copula. The copula binds the marginal distributions of \(T^{(g)}_{1}\) and \(T^{(g)}_{2}\) through an association parameter. CompARE can use several copulas, but in this work we use Archimedean copulas by Frank, Clayton, and Gumbel (see [16]) as the more appropriate for modeling eventtime data, providing different dependence characteristics.The measure of association between \(T_{1}^{(g)}\) and \(T_{2}^{(g)}\) is given by Spearman’s rank correlation coefficient ρ or Kendall’s τ. We assume that these measures of association are the same in both groups. If C(u,v;θ) denotes the chosen copula and θ is the association parameter, τ and ρ are defined as follows:
$$\begin{array}{@{}rcl@{}} \rho&=&12\int_{0}^{1}\int_{0}^{1}[C(u,v;\theta)uv]dudv \\ \tau&=&4\int_{0}^{1}\int_{0}^{1}C(u,v; \theta)dC(u,v; \theta)1 \end{array} $$(13) 
2
The marginal laws for \(T^{(g)}_{k}\) (g=0,1;k=1,2) are from the Weibull family of distributions. The Weibull law depends on a scale and a shape parameter. It has been chosen because it is flexible enough to represent different lifetime data scenarios, allowing increasing, constant (exponential model) and decreasing hazard functions, although would not be valid for nonmonotonous hazard functions. Furthermore Weibull distributions for both groups result in proportional hazards if they share the same shape parameter. The exponential law, which is often the preferred choice for sample size calculations, is a special case when the shape parameter equals to 1. While the shape parameters (β_{1},β_{2}) of the Weibull distributions are given as inputs by the researcher, the scale parameters \(\left (b_{1}^{(0)}, b_{2}^{(0)},b_{1}^{(1)}, b_{2}^{(1)}\right)\) are determined via:

(a)
The probabilities \(p_{1}^{(0)}=p_{1}^{(0)}(\tau)\) and \(p_{2}^{(0)}=p_{2}^{(0)}(\tau)\) of observing endpoints \(T_{1}^{(0)}\) and \(T_{2}^{(0)}\). They are defined, taking into account the competing risk setting, as \(p^{(0)}_{1}=\text {Prob}\left \{T^{(0)}_{1}<\tau, T^{(0)}_{1}<T^{(0)}_{2}\right \}\) and \(p^{(0)}_{2}=\text {Prob}\left \{T^{(0)}_{2}<\tau, T^{(0)}_{2}<T^{(0)}_{1}\right \}\) based on the joint distribution of \(T^{(g)}_{1}\) and \(T^{(g)}_{2}\), which has been modeled through the copula C(u,v;θ) explained in item 1. In those cases when \({\mathcal {E}}_{2}\) (analogously for \({\mathcal {E}}_{1}\)) does not represent a fatal event, then we can observe all the occurrences of \({\mathcal {E}}_{1}\) and define \(p^{(0)}_{1}=\text {Prob}\{T^{(0)}_{1}<\tau \}\). The scale parameters in the reference group (g=0) are function of the joint density \(f_{(1,2)}^{(0)}(\cdot,\cdot ;\theta)\) and are computed as the solution of the following equations:
$$\begin{array}{@{}rcl@{}} p_{1}^{(0)}=\int_{0}^{\tau}\left(\int_{u}^{\tau}f_{(1,2)}^{(0)}(u, v;\theta) dv\right)du \\ p_{2}^{(0)}=\int_{0}^{\tau}\left(\int_{v}^{\tau}f_{(1,2)}^{(0)}(u, v;\theta) du\right)dv \end{array} $$(14) 
(b)
The causespecific hazard rates \(\lambda ^{(g)}_{C1}(t)\) and \(\lambda ^{(g)}_{C2}(t)\) (g=0,1) for \({\mathcal {E}}_{1}\) and \({\mathcal {E}}_{2} \), respectively. We assume that treatment groups have proportional causespecific hazard rates for each component and denote by HR_{1} and HR_{2} the respective causespecific hazard ratios, that is,
$$\begin{array}{@{}rcl@{}} {}\text{HR}_{1}&=&\lambda_{C1}^{(1)}(t)/\lambda_{C1}^{(0)}(t)\quad \text{for\ all} \quad t<\tau \\ {}\text{HR}_{2}&=&\lambda_{C2}^{(1)}(t)/\lambda_{C2}^{(0)}(t) \quad \text{for\ all} \quad t<\tau \end{array} $$(15)Without loss of generality assume that both events \({\mathcal {E}}_{1}\) and \({\mathcal {E}}_{2}\) are undesirable and that the new therapy is expected to reduce the risk of both events, that is, HR_{k}<1,k=1,2. The proportionality of the causespecific hazard ratios HR_{1} and HR_{2} allows us to compute the scale parameters in the new therapy group (g=1). For more details, the reader is refered to the supplementary material in Gómez and Lagakos [11] for the relationship between the scale parameters with \(\left (p_{1}^{(0)}, p_{2}^{(0)}\right)\) and (HR_{1},HR_{2}).

(a)
In many phase III RCTs the results from earlier trials provide the rationale for the further assessment of the intervention, and could be used for the sample size calculation. In other situations, similar clinical trials or observational studies could be used. However, as in most sample size settings, obtaining the required parameters is not straightforward and the resulting sample size may depend heavily on the appropriate choice of those. In the composite endpoint setting the choice of the copula and the association between the marginals adds complexity. Since CompARE allows the users to compute the sample size under different scenarios, the influence on the sample size of the different choices can be studied, providing a basis for a more informed decision.
Results
Case studies
The ZODIAC trial
We illustrate the technique for sample size determination using the ZODIAC trial [17]. The trial compared the efficacy of vandetanib plus docetaxel versus placebo plus docetaxel as secondline treatment in patients with advanced nonsmallcell lung cancer. The coprimary endpoints of this trial were overall survival (OS) and progression free survival (PFS), defined as the absence of death and disease progression (DP).
A total of 1,176 events (both groups) were determined to be necessary to detect at least a 20% reduction (HR<0.8) on PFS treatment effect using a twotailed logrank test with 0.90 power and 0.0242 level of significance (they adjusted for multiplicity to simultaneously assess both coprimary endpoints). They enrolled 1,391 patients which were followed between 3 to 24 months. The trial was conducted between May 2006 and April 2008 and, at the end of the study, the reported causespecific HRs for each component were 0.91 (OS) and 0.77 (time to progression, TTP) and the estimated HR of the composite endpoint, PFS, was 0.79. The probabilities of observing DP or deaths (accounting for those subsequent to the DP) were, respectively, 0.74 and 0.59. They concluded that vandetanib in combination with docetaxel significantly improves PFS compared with placebo plus docetaxel.
Assume that a future trial for advanced nonsmallcell lung cancer is to be conducted using PFS as the composite primary endpoint and aiming to prove the effect of an intervention on the PFS through the geometric hazard ratio gAHR. We can use the abovementioned reported values in the ZODIAC trial as our anticipated parameters, allow for different magnitudes for the association between death and TTP and for different patterns (constant, decreasing or increasing) for the causespecific hazard rates. A total of 5 scenarios have been considered. In all but the first, a moderate correlation has been assumed (Spearman’s ρ=0.5). The first two scenarios follow the classic assumption in sample size calculations of exponentiality (Weibull shape parameters β_{1}=β_{2}=1), the first assuming weak correlation between OS and TTP (Spearman’s ρ=0.1). Scenarios 3 and 4 assume exponentiality (β_{1}=1) for OS but increasing (β_{2}=2) and decreasing (β_{2}=0.5) hazard rates over time for TTP, respectively. Scenario 5 assumes increasing hazard rate (β_{1}=2) for OS and decreasing (β_{2}=0.5) for TTP. Scale parameters were worked out from the reported cause specific hazard ratios and the probabilities of observing DP (\(p_{2}^{(0)}=0.74\)) and death (\(p_{1}^{(0)}=0.59\), including those after DP) (see item 2a in previous section). Based on ZODIAC’s trial followup, we set a fixed τ=24 months.
Table 1 summarizes the results for the 5 scenarios. CompARE provided the geometric average hazard ratio evaluated at 24 months, gAHR(24), the required number of events to achieve 90% power using Eq. 12, the probability of observing the composite event in either group, \(p_{*}^{(a)}\) and the corresponding sample size using Eq. 11. The empirical power with this sample size was obtained via simulation with 10,000 runs.
The geometric average hazard ratio evaluated at 24 months, gAHR(24), slightly changes among scenarios, ranging from 0.803 to 0.825, and is close both to the reported point estimate of PFS hazard ratio (0.79) and to the PFS hazard ratio used in the sample size calculation (0.80). These small differences in the effect measure cause a reduction of 305 events when going from Scenario 5 to Scenario 3. This reduction together with a very large probability of observing the composite event in Scenario 3 implies almost 600 fewer patients in the total sample size for this scenario.
We can also observe that the monotonicity pattern of the causespecific hazard rates, determined by the shape parameters β_{j}, has an important influence on the probability of observing the composite event and, consequently, on the sample size. Finally, the degree of association has as well some impact on the required number of events. For instance, when comparing Scenarios 1 and 2, we need 102 extra events in the later due to a higher association between OS and TTP.
After running simulations with these scenarios, the obtained empirical powers were very close to the target power (0.9) and they do not seem to be influenced neither by different hazard behaviors nor by the association magnitude. We conducted simulations under other settings (not shown) with other possible combinations of β_{1} and β_{2} (β_{j}∈{0.5,1,2}) and for a wide spectrum of correlations (ρ∈{0.1,0.3,0.5,0.8}) and the minimum empirical power achieved was 0.893.
Finally, we explored the influence that different marginal hazard rate patterns could have on the behaviour of the PFS hazard ratio, HR_{∗}(t), and Fig. 1, reproduced with CompARE, depicts them. While in the first two scenarios, where both components are exponentially distributed, HR_{∗}(t) does not vary much over time and a summary measure such as the hazard ratio could capture well enough the effect of the intervention, in the remainder three scenarios HR_{∗}(t) could vary between 0.77 and 0.90. The first 3 scenarios would correspond to interventions with a decrease in hazard ratio over time, while in the last two, the efficacy would be greater at the beginning of the followup. These graphs show the relevance of the behavior of the hazard rates in the evolution of treatment efficacy over time.
The ARREST trial
The ARREST trial [3] was the first large randomized clinical trial of antibiotic therapy in patients with Staphylococcus aureus (SA) bloodstream infection. It tested the hypothesis that adjunctive rifampicin improves disease outcome. The study was designed with a composite primary outcome: bacteriologically confirmed treatment failure or disease recurrence or death by week 12.
Assuming 0.80 power and a twosided test with size α=0.05, 770 participants were needed to detect a 30% relative reduction (from 35% to 25%) in the composite primary endpoint. In addition, a decrement in the percentage of deaths from 16 to 9% was anticipated. A 10% and 4% losses to followup were assumed in each component, respectively. No differences were found in the primary composite endpoint, HR_{∗}=0.96 (95% CI, from 0.68 to 1.35). The HR for overall survival was HR_{OS}=1.10 (95% CI, from 0.76 to 1.60), leading to a point estimate of the overall survival treatment effect in the opposite direction than the one expected.
This new case study has interest on its own because of the following differences with the ZODIAC trial. First, the ARREST trial includes three outcomes and shows that having more than two outcomes of interest does not limit our methodology. In this case we can combine several outcomes into a single component, as long as we can anticipate the parameters required in the calculation of the sample size for such components and the new HR is reasonably constant. Second, there is a huge difference in the proportion of events; while in the ZODIAC trial, more than 50% of the patients suffered any event, in the ARREST trial, none of the involved events was present in more than 15% of the patients. Third, since we have had access to the ARREST raw data, this illustration shows how a previous study might help to set the input parameters for the sample size calculation.
Suppose we want to carry out a new RCT to show the efficacy of adjunctive rifampicin in reducing bacteriologically confirmed treatment failure, disease recurrence or death by week 12. These three outcomes will be considered and their composite will be chosen as the primary endpoint. Overall survival is our first component, while nonfatal events (bacteriologically confirmed treatment failure or disease recurrence) will conform the second component, which in fact is the most relevant because it more closely reflects the treatment effect. Only for the purpose of this illustration, we are assuming a weak beneficial treatment effect on the allcause mortality (HR_{1}=0.95) and a large effect on the bacteriologically confirmed treatment failure or disease recurrence (HR_{2}=0.35). The probabilities of observing each component event in the control group were 0.14 and 0.05 for fatal and nonfatal events, respectively. In the competingrisks setting, like this one, marginal timetoevent distributions are not directly estimable from the raw data [18]. However, at the design stage, we have to anticipate the shape parameters. On one hand, the OS shape parameter can be estimated because we have the information on all deaths (before and after the nonfatal events), obtaining \(\beta _{1}^{(0)}=0.7\). We are aware that when considering deaths after PD, we are not dealing with the marginal distribution in a context of competing risks, but we consider this a good approximation. For the nonfatal events, and since we have partial information from the trial, we have computed the shape parameter \(\beta _{2}^{(0)}\) for different potential scenarios, obtaining values that range from 0.9 to 2.7. These different values lead to sample sizes ranging from 3,136 to 3,350. Only for the purpose of this illustration, we have assumed that the nonfatal events are precisely those that have been observed and in this case, \(\beta _{2}^{(0)}=0.91\) was obtained.
We will assume that the correlation between the marginal distributions of fatal and nonfatal time to events is weak (ρ=0.1). Table 2 shows the assumed parameters, together with the number of events and sample size required, the probability of observing the composite event and the empirical power.
Some meaningful conclusions emerge from Table 2. First of all, the number of patients required taking into account an expected 10% followup losses would be 3,598 (=3,238/0.90) patients. This is a clearly higher number of patients than planned in the ARREST trial (n=770). This is mainly due to the fact that the ARREST trial protocol anticipated a much greater treatment effect on OS (going from a proportion of 16% to 9% is equivalent to a relative risk of 0.56) when in fact it resulted in a HR_{OS}=1.10, and this why we are considering HR_{1}=0.95 for OS. This almost negligible effect on OS together with the highest probability of observing OS causes the treatment effect on the other component to be blurred in the composite endpoint and results in a huge sample size. In this sense, a trial that deals with the most relevant component (bacteriologically confirmed treatment failure or disease recurrence) as primary endpoint would only require observing 29 events for a HR=0.35 and, assuming that mortality is independent of treatment failure or disease recurrence and that the event was observed in a proportion of patients equal to 0.05, it would involve 570 patients. Regarding the main objective of our work, the empirical power considerably trap the target power of 0.80.
CompARE can be used to plot the HR_{∗}(t) for the considered scenario. Figure 2 represents the allcause HR over time based on estimated parameters from the raw data. It is reasonably constant with a value slightly around to 0.80 for almost all the followup except for the earlier times. So, a summary measure such as gAHR could serve to describe the hazard ratio and its estimated value gAHR=0.788 would be used in the design stage of a new study to calculate the needed sample size.
Simulation studies
Simulation settings
The aim of the simulation study is to evaluate if the proposed method for calculating the sample size reaches the desired empirical power under different scenarios. Furthermore, we compare the proposed method to the naïve method resulting from averaging the HRs of the components. This measure would be very likely the choice of a trialist in the absence of further information. From now on, we will call this new measure the naive HR (nHR) and the method associated the naive method.
We have chosen scenarios that represent realistic situations when designing an RCT [11]. The probabilities \(p_{1}^{(0)}\) and \(p_{2}^{(0)}\) of observing each event in the control group have been taken between 0.1 and 0.5; the causespecific hazard ratios HR_{1} and HR_{2} of each component from 0.6 to 0.9; the times until the events for each component (\({\mathcal {E}}_{k}, \ k=1,2)\) have been modeled according to Weibull distributions with constant (β_{k}=1), decreasing (β_{k}=0.5) or increasing hazards (β_{k}=2); and the correlations between these times have been selected from low to moderatelyhigh (from ρ=0.1 to ρ=0.5). In addition, three different copulas (Frank, Clayton and Gumbel) were implemented to model the joint distribution. For simplicity, and without loss of generality, we have scaled the problem to a unit time of followup (τ=1).
We have considered two different settings. Setting 1 (405 scenarios) assumes that the times \(\left (T_{1}^{(0)}, T_{2}^{(0)}\right)\) are exponentially distributed, while Setting 2 (3,240 scenarios) considers \(\left (T_{1}^{(0)}, T_{2}^{(0)}\right)\) Weibull distributed. Scenarios with observed proportions \(p_{1}^{(0)}=p_{2}^{(0)}=0.5\) are not realistic since they represent scenarios without censoring and have not been considered (See Table 3). Frank’s copula has been chosen to bind \((T_{1}^{(0)}, T_{2}^{(0)})\) in all the scenarios of both settings to compute the empirical powers. In order to assess the relevance of the copula’s choice on the results, the gAHR has been calculated in all the scenarios of both settings and through 3 different copulas (Frank, Clayton and Gumbel).
Simulation procedure
We ran 10,000 iterations for each scenario described in Table 3.
The empirical power of onesided logrank test for the composite endpoint is computed using a statistical significance of α=0.025. Given Frank’s copula and a set of input parameters (\(\beta _{1}, \beta _{2}, p_{1}^{(0)}, p_{1}^{(1)}, HR_{1}, HR_{2}, \rho \)), the simulation was conducted following the next steps:

Parameter of the copula. The copula parameter, θ, which defines the association between both components is calculated from the Spearman’s correlation coefficient, ρ. The relationship between these parameters is onetoone and given in (13).

Scale parameters. Based on Eqs. 14 and (15), we can numerically obtain the scale parameters of the Weibull marginal distribution in groups 0 and 1, respectively using the multiroot function of the rootSolve R package.

Geometric average hazard ratio,gAHR(τ). gAHR(τ) is calculated following Eq. 2, which depends on \(HR_{*}(t)=\lambda _{*}^{(1)}(t)/\lambda _{*}^{(0)}(t)\). HR_{∗}(t) was numerically estimated for 1,000 equidistant points over the followup time (from 0 to τ).

Sample size. For a given power of 1−β=0.8 and onesided significance level α=0.025, together with the gAHR(τ), Eq. 11 is used to compute the sample size.

Generate data. For each of the 10,000 iterations, bivariate data with the sample size obtained in the previous step was generated via Frank’s copula and using the Mvdc function of the copula R package. The simulated data was censored at the end of followup and takes into account the competing risks.

Test. For each of the 10,000 iterations, the longrank test is conducted on the data and the statistic Z_{∗} for the composite endpoint T_{∗} is stored. This test is implemented in the survdiff function of the survival R package.

Empirical power. The empirical power is estimated as the proportion of Z_{∗} statistics falling into the rejection region, i.e, Z_{∗}<−1.96 along all the iterations.
All simulations were performed using the R version 3.6.1. We have not run the simulation for those scenarios with an associated sample size greater than 20,000 (both groups) due to their computational cost and because they do not represent realistic setups in the scope of RCTs. The R code for simulations, not supported by CompARE, is available at https://github.com/jordicortes40/sample_size_composite.
Simulation results for exponential case
From the 405 scenarios of this setting, we have excluded 9 (2.2%) and 3 scenarios (0.74%) with a sample size larger than 20,000 (both groups) when using the gAHR and nHR, respectively. These cases correspond to scenarios where both HRs are equal to 0.9 and the observed proportions of the events are equal or less than 0.10. We summarize in Fig. 3 and Table 4 the empirical powers for both measures (gAHR and nHR) corresponding to Setting 1 where \(T_{1}^{(0)}\) and \(T_{2}^{(0)}\) are exponentially distributed and bound via Frank’s copula.
In the 396 included scenarios using the gAHR, the required number of events (equation 12) ranges from 122 to 3,338 with a median equal to 644 [IQR: 2221,254] and the total sample size ranges from 176 to 17,402 with a median equal to 1,644 [IQR: 6004,157].
Figure 3 shows a violinboxplot comparing the empirical powers achieved with gAHR and nHR methods, respectively merging all scenarios. For the former, both mean (solid point) and median are equal to 0.799, very close to the target power 0.80 taking into account the simulation mean standard error of 0.004. Moreover, this violinboxplot shows an almost perfect symmetry with respect to the mean and the median, indicating a similar propensity to move away from the central tendency towards both higher and lower values. On the contrary, the naive method provides empirical powers with a mean/median lower than desired (0.771/0.739) and involves powers in a wide range from 0.22 to 0.99.
Table 4 presents summary statistics for the empirical power from both methods according to different categories for the input parameters: 1) Stratifying by treatment effect: i) one of the two components has a large treatment effect (HR_{k}=0.6,k=1 or 2); ii) both components have a small treatment effect (HR_{k}=0.9,k=1,2); iii) the remainder cases; 2) Stratifying by the probability of observing the event in the control group: i) at least one of the two components has a very small probability (\(p_{k}^{(0)}=0.05, k=1\ \text {or} \ 2\)); ii) both components have equal and large probabilities of being observed (\(p_{1}^{(0)}=p_{1}^{(0)}\geq 0.3\)); iii) the remainder cases; 3) Considering three different values for the correlation between both endpoints.
Overall, using the gAHR, 95.5% of scenarios had an empirical power between 0.79 and 0.81 in front of only 6.5% scenarios in the same interval using the nHR. The results based on the gAHR method were quite consistent among all strata. The first quartile of empirical power was, at least, 0.795 for any considered stratum, indicating that a power of no more than half a percentage point less than the target power will be achieved in 75% of the situations. There were 16 scenarios (4.0%) with empirical power below 0.790. This percentage was slightly higher when a large treatment effects (HR_{k}=0.6) was present in any of the components (6.2%) or if at least one of the observed proportions was equal to 0.05 (7.2%). Both scenarios had lower treatment effects (HR_{k}≥0.8) and observed proportions in the control group between 0.1 and 0.3 for both components. It is worth mentioning that no relevant differences were observed in the empirical power according to the correlation, but it should be borne in mind that for high correlations, a larger sample size was required to achieve the same power. The set of scenarios where the application of the naive method would lead to fairly good control of power would be those with very similar treatment effect on both components. For example, as can be seen in the table, when both HRs are equal to 0.9, there may be a decrease in the desired power – due to the competing events – but it would not go beyond around 7% in the worst situation.
Simulation results for nonexponential cases
From the 3,240 scenarios considered in the setting 2, we excluded, as in the exponential setting 1, 72 scenarios (2.2%) using the gAHR and 24 scenarios (0.74%) using the nAHR with a sample size larger than 20,000. Again, HRs equal to 0.9 and probabilities in the control arm equal or less than 0.10 provided these situations that required extreme sample sizes. We summarize in Fig. 4 and Table 5 the results of the remaining scenarios of setting 2 where \(T_{1}^{(0)}\) and \(T_{2}^{(0)}\) are bound via Frank’s copula and Weibull distributed with shape parameters β_{1} and β_{2} equal to 0.5,1,2, excluding the case in which both are exponential (β_{1}=β_{2}=1). Thus, we included extreme scenarios where the hazard trends over time of both components pointed out in opposite directions with increasing (β_{k}=2) and decreasing (β_{k}=0.5) hazard rates in one and the other component, respectively.
In the included 3,168 scenarios with calculations founded on the gAHR, the needed number of events ranges from 122 to 3,356 with a median equal to 642 and the total sample size ranges from 176 to 17,402 with a median equal to 1,616. Figure 4 shows the empirical power for the 10,000 simulations. Again, both the mean and the median (0.798) are quite close to 0.80 and the violinboxplot reveals a symmetry regarding to these statistics. Meanwhile, the naive method does not get a suitable power control and provides powers that on average (0.739) are lower than the target value, even with extreme scenarios giving probabilities to detect the treatment effect as low as 0.208 or as high as 0.997.
Table 5 presents summary descriptive statistics of the empirical power according to the same categories we defined for the exponential case. We have included a fourth strata splitting the empirical power according to equal and decreasing hazard rates (β_{1}=β_{2}=0.5); equal and increasing hazard rates (β_{1}=β_{2}=2) and the remaining scenarios including different hazard behaviour over time (β_{1}≠β_{2}).
Empirical power derived from the gAHR ranged from 0.782 to 0.813 and 95.7% of the scenarios were between 0.79 and 0.81, while only 6.9% scenarios remains within this range for the naive method. Regarding the results coming from the gAHR, the first quartile of empirical power was, at least, 0.794 in any strata and 0.795 overall, which guarantees a power that at most will be only half a percentage point lower than the target in 75% of scenarios. Almost all situations (99.7%) reflecting lower treatment effects (HR_{k}=0.9) provided an empirical power above 0.79. There were 135 (4.0%) scenarios with empirical power below 0.790. This percentage was higher when a marked treatment effect (HR_{k}=0.6) was present in any of the components (6.3%) or when at least one of the proportion of observed events in the control group was equal to 0.05 (5.8%). The 10 (0.3%) scenarios that presented an empirical power slightly higher than 0.81 did not share any common feature regarding to the input parameters and it could be explained due to the standard error (\(\sqrt {\frac {0.8 \times 0.2}{10,000}}=0.004\)) associated to the simulation procedure. Different levels of correlation between components did not provide relevant discrepancies. The conclusions for the naive method are the same in this nonexponential setting as in the exponential setting
Effect of copula on g A H R
The gAHR was calculated in all scenarios of settings 1 and 2 and for the 3 Archimedean copulas (Frank, Clayton and Gumbell) in order to assess the relevance of the copula binding \(T_{1}^{(0)}\) and \(T_{2}^{(0)}\). Table 6 provides the deciles of the gAHR values for each copula.
Overall, the deciles of the gAHR values obtained from either one of the 3 copulas are very similar. In particular, Frank’s and Gumbel’s Copula show identical values in gAHR percentiles from the 40% percentile on while Clayton’s copula slightly differs from both. The maximum absolute difference among any two copulas, found around the gAHR median, is 0.04, and, although small, it might have an important impact on the computation of the required number of events. For instance, going from a gAHR=0.80 to gAHR=0.76 implies a 34% decrease in the number of required events, as it happens when using Schoenfeld’s formula and the HR. Based on these simula tion findings and others not shown here, we recommend the use of Frank’s copula to bind the joint distribution of the components unless more information can be gathered on the real correlation and the joint behaviour between both components. Nevertheless, CompARE allows the use of these 3 copulas, among others, which can be useful to calculate the HR_{∗}(t) under different association patterns (e.g., stronger correlations at earlier or later times) between the component times.
Discussion
We have shown that the use of gAHR in conventional sample size formulas in timetoevent studies with composite timetoevent endpoints provides the desired power when two treatment groups are compared using the logrank test. This is true regardless of whether the proportional hazards assumption holds or not. In studies involving a composite endpoint, obtaining the theoretical value of this summary measure could be hard; however, CompARE has proven to be a useful tool for this purpose, removing tedious calculations. The gAHR method enhances the performance over the rule of thumb approach based on averaging the treatment effects on each component.
The use of the HR has been debated when hazards are not proportional. Also, the hazard ratio as estimated from the observed data doesn’t reflect the causal hazard ratio if there is heterogeneity in patient risk [19, 20]. Other measures such as the restricted mean survival time may be better to quantify treatment effects. But, actually, the HR is still widely used and most of the published RCTs report the HR as the main measure of the treatment effect. Furthermore, the estimand exp(β) itself still has an interpretation as the ratio of the logarithms of the survival functions log[P(T^{(1)}>t)]/ log[P(T^{(0)}>t)]. In this sense, several summary measures have been proposed to capture the treatment effect when the HR_{∗}(t) varies considerably over time. Probably the most popular is the AHR proposed by Kalbfleisch and Prentice [5], which is mentioned in several studies as a good measure of the effect of an intervention [6, 7]. The gAHR, in most of our simulation settings, presents values very close to the AHR as shown in the BlandAltman plot of concordance (Fig. 5). This point is interesting from the perspective that gAHR could be interpreted as a measure of proportional hazards just in the same way as AHR and, in addition, the former has the advantage of a direct relationship with the sample size calculation.
One question that may arise is how sensitive the sample size and power are to misspecification of some of the input parameters. Of course, a misspecified HR will always influence the needed number of events and consequently, the power in any sample size calculation. Some information regarding the proportion of events observed is also essential to deduce the number of patients from the number of events. Our methodology requires additional inputs, such as the shape parameters of the Weibull distributions; the degree of association among the times to event of both components; or the choice of the copula. They are mostly nuisance parameters, so ideally the sample size would be quite insensitive to the assumptions that is made. We have seen that they are not very sensitive to the choice of copula. But the shape parameters of the Weibull distribution or the correlation might influence the resulting sample size if we blunder in the choice. For the former, the users should draw on their experience to determine the direction of risk over time by assessing whether the event rate increases, decreases or remains constant during the followup. For example, an outcome variable about the infection after a surgery, the most critical stage is just after the intervention and subsequently, the risk decreases over time. Regarding the correlation, obviously, it cannot be determined with certainty, but a rule of thumb would be enough to approximate it. For instance, events that rarely occur in the same patient should be weakly associated or, conversely, if the events are usually dependent on the patient’s characteristics, then it can lead to a moderate/high correlations.
Our investigation has several limitations. First, we have only addressed the case of a composite endpoint of two components. In scenarios with more than two possible outcomes, we recommend to combine them first into two groups according to their relevance or their expected effectiveness as long as it is feasible to anticipate the parameters associated with these components based on previous literature or previous RCT phases. Second, we are aware that the latent failure time model that we are imposing has been criticised because the dependence structure between \(T_{1}^{(g)}\) and \(T_{2}^{(g)}\) is in general not identifiable since the latent times are not observable. Nevertheless, latent times are the predominant approach for simulating competing risks data as it is discussed by Allignol et al. [21] as long as they yield the right data structure and they are computationally correct. We remind as well here that we are at the design phase of the RCT and that we are not addressing the estimation of the treatment effect measures. Third, since we are in a competing risk situation, the hazard of having one of the event types is influenced by the other competing event, making for a complex interference. If, in addition, treatment effects are different on both events, the proper definition of the “at risk”sets is involved and the intuition for what “proportional causespecific hazards” means is not straightforward. However, this is the information that can be usually extracted from previous published studies and, therefore, the one that can be used to estimate the sample size for future trials. Fourth, we have only dealt with the situation of equalsized treatment groups. Although it is well known that the situation that maximizes the power is the one in which the events are balanced among groups, RCTs usually are designed to balance the number of patients among the different treatment arms.
Conclusions
The waste associated with biomedical research is paramount [22] and one of the main problems is poorly designed studies. Underestimating the optimal sample size may lead to failure to make effective interventions available to patients due to unsuccessful trials. On the other hand, trials with too many patients might unnecessarily subject people to ineffective interventions. One may use CompARE platform (https://cinna.upc.edu/CompARETimeToEvent/) to design randomized controlled trials with composite endpoints when the proportional hazards assumption does not hold.
Availability of data and materials
The only dataset analysed during the current study comes from the ARREST trial. The data are not publicly available because the authors are not the owners of the data and they do not have the permission to made them publicly available. Data are however available from the authors upon reasonable request and with permission of the owner of the data MRC Clinical Trials Unit of UCL (https://www.ctu.mrc.ac.uk/), or can be asked for directly from the owner.
Declarations
References
 1
Saad ED, Katz A. Progressionfree survival and time to progression as primary end points in advanced breast cancer: often used, sometimes loosely defined. Ann Oncol. 2009; 20(3):460–4. https://doi.org/10.1093/annonc/mdn670.
 2
Gomez G, GomezMateu M, Dafni U. Informed Choice of Composite End Points in Cardiovascular Trials. Circ Cardiovasc Qual Outcome. 2014; 7(1):170–8. https://doi.org/10.1161/CIRCOUTCOMES.113.000149.
 3
Thwaites GE, Scarborough M, Szubert A, Nsutebu E, Tilley R, Greig J. Adjunctive rifampicin for Staphylococcus aureus bacteraemia (ARREST): a multicentre, randomised, doubleblind, placebocontrolled trial. The Lancet. 2018; 391(10121):668–78. https://doi.org/10.1016/S01406736(17)32456X.
 4
Royston P, ChoodariOskooei B, Parmar MKB, Rogers JK. Combined test versus logrank/Cox test in 50 randomised trials. Trials. 2019; 20(1):172. https://doi.org/10.1186/s1306301932515.
 5
Kalbfleisch JD, Prentice RL. Estimation of the Average Hazard Ratio. Biometrika. 1981; 68(1):105. https://doi.org/10.2307/2335811.
 6
Schemper M, Wakounig S, Heinze G. The estimation of average hazard ratios by weighted Cox regression. Stat Med. 2009; 28(19):2473–89. https://doi.org/10.1002/sim.3623.
 7
Rauch G, Brannath W, Brückner M, Kieser M. The average hazard ratio  A good effect measure for timetoevent endpoints when the proportional hazard assumption is violated?Methods Inf Med. 2018; 57(3):89–100. https://doi.org/10.3414/ME17010058.
 8
Kalbfleisch JD, Prentice RL. The Statistical Analysis of Failure Time Data. Wiley Series in Probability and Statistics. Hoboken: Wiley; 2002. https://doi.org/10.1002/9781118032985.
 9
Schoendfeld D. The asymptotic properties of nonparametric tests for comparing survival distributions. Biometrika. 1981; 68(1):316–9. https://doi.org/10.1093/biomet/68.1.316.
 10
Lehmann EL, Vol. 42. Elements of LargeSample Theory. Springer Texts in Statistics. New York: Springer; 1999, p. 224. https://doi.org/10.1007/b98855.
 11
Gómez G, Lagakos SW. Statistical considerations when using a composite endpoint for comparing treatment groups. Stat Med. 2013; 32(5):719–38. https://doi.org/10.1002/sim.5547.
 12
Friedman LM, Furberg CD, DeMets DL, Reboussin DM, Granger CB. Fundamentals of Clinical Trials. In: Fundamentals of Clinical Trials, 5th edn. New York: Springer: 2015. p. 165–200. https://doi.org/10.1007/9783319185392.
 13
Lachin JM, Foulkes MA. Evaluation of Sample Size and Power for Analyses of Survival with Allowance for Nonuniform Patient Entry, Losses to FollowUp, Noncompliance, and Stratification. Biometrics. 1986; 42(3):507. https://doi.org/10.2307/2531201.
 14
Bernstein D, Lagakos SW. Sample size and power determination for stratified clinical trials. J Stat Comput Simul. 1978; 8(1):65–73. https://doi.org/10.1080/00949657808810249.
 15
George SL, Desu MM. Planning the size and duration of a clinical trial studying the time to some critical event. J Chronic Dis. 1974; 27(12):15–24. https://doi.org/10.1016/00219681(74)900046.
 16
Trivedi PK, Zimmer DM. Copula Modeling: An Introduction for Practitioners. Foundations and Trends® in Econometrics. 2006; 1(1):1–111. https://doi.org/10.1561/0800000005.
 17
Herbst RS, Sun Y, Eberhardt WEE, Germonpré P, Saijo N, Zhou C, Wang J, Li L, Kabbinavar F, Ichinose Y, Qin S, Zhang L, Biesma B, Heymach JV, Langmuir P, Kennedy SJ, Tada H, Johnson BE. Vandetanib plus docetaxel versus docetaxel as secondline treatment for patients with advanced nonsmallcell lung cancer (ZODIAC): a doubleblind, randomised, phase 3 trial. Lancet Oncol. 2010; 11(7):619–26. https://doi.org/10.1016/S14702045(10)701327.
 18
Tsiatis A. A nonidentifiability aspect of the problem of competing risks. Proc Natl Acad Sci U S A. 1975; 1(72):20–2. https://doi.org/10.1073/pnas.72.1.20.
 19
Hernán MA. The Hazards of Hazard Ratios. Epidemiology. 2010; 21(1):13–5. https://doi.org/10.1097/EDE.0b013e3181c1ea43.
 20
Aalen OO, Cook RJ, Røysland K. Does Cox analysis of a randomized survival study yield a causal treatment effect?,. Lifetime Data Anal. 2015; 21(4):579–93. https://doi.org/10.1007/s109850159335y.
 21
Allignol A, Schumacher M, Wanner C, Drechsler C, Beyersmann J. Understanding competing risks: a simulation point of view. BMC Med Res Method. 2011; 11(1):86. https://doi.org/10.1186/147122881186.
 22
Chalmers I, Glasziou P. Avoidable waste in the production and reporting of research evidence. The Lancet. 2009; 374(9683):86–9. https://doi.org/10.1016/S01406736(09)603299.
Acknowledgements
CompARE has been developed by GG and JC in collaboration with Marta Bofill and Moisès Gómez. We thank the trial management group and the MRC clinical trials unit staff for providing the data of the ARREST trial.
Funding
G. Gómez and J. Cortés were partially supported by the Ministerio de Economía y Competitividad (Spain) [MTM201564465C21R (MINECO/FEDER)], the Ministerio de Ciencia, innovación y Universidades [PID2019104830RBI00] and the Departament d’Economia i Coneixement de la Generalitat de Catalunya (Spain)[2017 SGR 622 (GRBIO)]. Ronald B. Geskus was supported by the Wellcome Trust (grant number 106680/Z/14/Z).
Author information
Affiliations
Contributions
Conceptualization, G.G.; Methodology, J.C, R.G, K.M.K and G.G.; formal analysis, J.C.; writing–original draft preparation, J.C., G.G. All authors have read and agreed to the published version of the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Cortés Martínez, J., Geskus, R.B., Kim, K. et al. Using the geometric average hazard ratio in sample size calculation for timetoevent data with composite endpoints. BMC Med Res Methodol 21, 99 (2021). https://doi.org/10.1186/s1287402101286x
Received:
Accepted:
Published:
Keywords
 Treatment Effect
 Composite endpoint
 Randomized Controlled Trial
 ProgressionFree Survival
 Simulation
 Copula
 NonProportional Hazards