Impact of a non-constant baseline hazard on detection of time-dependent treatment effects: a simulation study

Background
Non-proportional hazards are common with time-to-event data, but the majority of randomised clinical trials (RCTs) are designed and analysed using approaches which assume the treatment effect follows proportional hazards (PH). Recent advances in oncology treatments have identified two forms of non-PH of particular importance: a time lag until treatment becomes effective, and an early effect of treatment that ceases after a period of time. In sample size calculations for treatment effects on time-to-event outcomes, where information is based on the number of events rather than the number of participants, correct specification of the baseline hazard rate is crucially important, amongst other considerations. Under PH, the shape of the baseline hazard has no effect on the resultant power and magnitude of treatment effects using standard analytical approaches. However, in a non-PH context the appropriateness of analytical approaches can depend on the shape of the underlying hazard.

Methods
A simulation study was undertaken to assess the impact of clinically plausible non-constant baseline hazard rates on the power, magnitude and coverage of commonly utilized regression-based measures of treatment effect and tests of survival curve difference for these two forms of non-PH used in RCTs with time-to-event outcomes.

Results
In the presence of even mild departures from PH, the power, average treatment effect size and coverage were adversely affected. Depending on the nature of the non-proportionality, non-constant event rates could further exacerbate or somewhat ameliorate the losses in power, treatment effect magnitude and coverage observed. No single summary measure of treatment effect was able to adequately describe the full extent of a potentially time-limited treatment benefit whilst maintaining power at nominal levels.

Conclusions
Our results show the increased importance of considering plausible, potentially non-constant event rates when non-proportionality of treatment effects could be anticipated. In planning clinical trials with the potential for non-PH, even modest departures from an assumed constant baseline hazard could appreciably impact the power to detect treatment effects, depending on the nature of the non-PH. Comprehensive analysis plans may be required to accommodate the description of time-dependent treatment effects.

Supplementary Information
The online version contains supplementary material available at 10.1186/s12874-021-01372-0.


Type I error
When comparing performance measures such as power for the stipulated PH and non-PH scenarios with a known treatment effect, it is important to assess whether all analytical approaches control the Type I error at the same or a similar nominal level when there is truly no effect. We compared, by simulation, the empirical Type I error of the tests of the regression-based treatment effect estimates and of equal survival functions under the null treatment effect. In these simulations, there was no treatment effect (i.e. HR = 1) in either period specified by the data-generating models. Pooled replicates for each event rate type and all change point times are presented in Table S1. For the majority of the tests, the empirical Type I errors are within or close to the nominal two-sided 5% significance level. The Type I error of the RP(PH) combined test is conservative under both types of non-PH. A minor Type I error inflation is observed for the FH(0,1) test weighted for late effects (Type I error: 5.7% (95% CI 5.5%, 5.8%)), with even smaller increases in the Type I error above the nominal level also being observed for the versatile test and the regression coefficient estimates for the RP(PH) ∆RMST, the RP(TD) ∆RMST and the AFT TR. Similar minor increases in Type I error rate have been reported in other simulation studies comparing the power of tests for treatment effect under non-PH scenarios.
Supplementary Figures S2 and S3 further present the results of this empirical assessment of Type I error for the decreasing, constant and increasing event rate scenarios, using the change points from the lag to effect and early effect that ceases non-PH scenarios under investigation, for each of the analysis methods. Comparisons of the power of the analysis methods presented below need to be undertaken with the conservative Type I error for the RP(PH) combined test, and the minor inflation of the empirical Type I errors for the versatile test and the regression coefficient estimates for the RP(PH) ∆RMST, the RP(TD) ∆RMST and the AFT TR, in mind.
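To make the empirical Type I error calculation concrete, the following is a minimal sketch of how a rejection proportion and an approximate 95% interval could be obtained from the p-values of null simulations. The function name `empirical_type1_error` and the use of a Wald (normal approximation) interval are assumptions for illustration; the interval method actually used in the study is not stated in this section.

```python
import math


def empirical_type1_error(p_values, alpha=0.05, z=1.959964):
    """Rejection proportion under the null with an approximate 95% interval.

    Hypothetical helper: p_values holds one two-sided p-value per simulated
    trial generated with no treatment effect (HR = 1 throughout).
    """
    n_sim = len(p_values)
    rejections = sum(1 for p in p_values if p < alpha)
    rate = rejections / n_sim
    # Monte Carlo standard error of a proportion
    se = math.sqrt(rate * (1.0 - rate) / n_sim)
    # Wald interval (an assumption; other interval methods are possible)
    return rate, (rate - z * se, rate + z * se)
```

A test whose interval lies entirely above 0.05 would be read as mildly inflated, and one lying entirely below as conservative.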

Early effect that ceases
Power of the regression model approaches
Figure S4 presents the results for the non-PH scenario of an early effect that ceases. Seven different modelling approaches were compared. For the constant hazard event rate scenario, events during the effective treatment period made up, on average, 22%, 56% and 82% of the total number of events observed for the early effect times of three, ten and twenty months respectively. For the decreasing hazard event rate, the corresponding proportions were 28%, 63% and 85%, and for the increasing hazard event rate they were 18%, 52% and 80% of the total number of events observed for the early effect times of three, ten and twenty months respectively.
Supplementary Table S6 presents a summary of event numbers during the active and inactive phases of treatment effect for this early effect that ceases non-PH scenario.
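The dependence of these event shares on the shape of the baseline hazard can be illustrated with a small data-generating sketch. It assumes a Weibull baseline cumulative hazard H0(t) = scale × t^shape (shape below 1 gives a decreasing hazard, 1 a constant hazard, above 1 an increasing hazard) and a hazard ratio that applies only before t_early, and draws event times by inverting the treated-arm cumulative hazard. All function names and parameter values are hypothetical and are not the settings used in the study.

```python
import numpy as np


def weibull_cum_hazard(t, scale, shape):
    """Baseline cumulative hazard H0(t) = scale * t**shape (assumed form)."""
    return scale * np.power(t, shape)


def inverse_weibull_cum_hazard(h, scale, shape):
    """Time at which the baseline cumulative hazard equals h."""
    return np.power(h / scale, 1.0 / shape)


def early_effect_event_times(unit_exp, hr, t_early, scale, shape):
    """Event times for an arm whose hazard ratio is hr before t_early and 1 after.

    unit_exp are draws from a unit exponential distribution; each is matched
    to the time at which the arm's cumulative hazard reaches that value.
    """
    unit_exp = np.asarray(unit_exp, dtype=float)
    h0_early = weibull_cum_hazard(t_early, scale, shape)
    in_early = unit_exp <= hr * h0_early
    # Baseline cumulative hazard corresponding to each draw
    h_base = np.where(in_early, unit_exp / hr, unit_exp - hr * h0_early + h0_early)
    return inverse_weibull_cum_hazard(h_base, scale, shape)


def share_of_events_in_early_period(times, t_early, follow_up):
    """Proportion of observed events (time <= follow_up) occurring by t_early."""
    times = np.asarray(times, dtype=float)
    observed = times <= follow_up
    return np.mean(times[observed] <= t_early)


# Illustrative use with made-up parameters (administrative censoring only)
rng = np.random.default_rng(2021)
draws = rng.exponential(size=5000)
for shape in (0.8, 1.0, 1.2):
    t = early_effect_event_times(draws, hr=0.7, t_early=10.0, scale=0.03, shape=shape)
    share = share_of_events_in_early_period(t, t_early=10.0, follow_up=50.0)
```

Setting hr=1 recovers draws from the baseline distribution, which gives a simple check of the inversion.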

Supplementary Figure S4: The power (%), scaled treatment effect magnitude (%) and coverage (%) are presented relative to those anticipated at the design stage of the trial assuming PH. The early effect period lengths investigated were t early = 3, 10, 20 and 50 months, with the setting t early = 50 representing PH.
When the treatment was constantly effective throughout the follow up period (t early = 50), equivalent to a PH data generating model, we observed power at or very close to the design model value of 80% for all estimates of treatment effect except for the LM method. There was substantially decreased power for this period-specific estimate, partly due to fewer than half of the events being used in the estimation of the HR after the prespecified cutpoint of t LM = 10 was applied under all event rates, and partly due to the inclusion of more events from the no treatment effect period. For all methods, there was an appreciable loss of power in the early effect non-PH scenario. A decreasing event rate was able to offset the lower power seen as a result of fewer events occurring during the period when the treatment effect had ceased, relative to the number of events observed under a constant event rate. Conversely, the losses in power observed under an increasing event rate in the presence of an early effect that ceases were greater as a result of more events occurring during the period where the treatment had no effect. This pattern of relative power loss was observed for all three estimands.

Scaled Treatment Effect (STE) estimates of regression model approaches
The results comparing the magnitude of treatment effect estimates are presented in the middle panel of Figure S4. For the STE under the PH scenario (t early = 50), estimates close to the design model values are obtained for the HR and ∆RMST estimands. Non-constant event rates affect the magnitude of the TR estimated from an AFT model. A decreasing event rate resulted in STEs greater than were observed with a constant event rate, and an increasing event rate resulted in STEs lower than estimated under constant event rates.

Coverage of regression model approaches
Coverage of the estimators for the treatment effect used in the design model is presented in the bottom panel of Figure S4. Under PH, coverage at the design model value of 95% was observed when the treatment effect persisted throughout the analysis period (t early = 50). The presence of an early effect that ceases quickly causes a dramatic decrease in the observed coverage. In contrast, having a treatment that stops being effective later has far less impact and most of the nominal coverage is maintained. Non-constant event rates have minimal impact on coverage for the estimates of HR, but for an increasing event rate, estimates of ∆RMST were more affected. The effect of non-constant event rates was most noticeable for the coverage estimates for the TR from an AFT model, consistent with the observed effect of non-constant event rates on the STEs. The summary estimates for bias, coverage and power with the Monte Carlo standard errors (MCSEs) for simulations under this scenario for the decreasing, constant and increasing baseline hazards are presented in Supplementary Tables S7, S8 and S9 respectively.
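As a rough illustration of how bias, coverage and their Monte Carlo standard errors can be summarised across simulation replicates, the sketch below uses the standard formulas for a mean and for a proportion. The function name `bias_and_coverage` and the argument `target_value` (the design-model treatment effect against which coverage is judged) are assumptions for illustration, not the code used to produce Supplementary Tables S7 to S9.

```python
import numpy as np


def bias_and_coverage(estimates, lower, upper, target_value):
    """Summarise bias and coverage with Monte Carlo standard errors (MCSEs).

    estimates, lower and upper hold one point estimate and one confidence
    interval per simulated trial; target_value is the design-stage effect.
    """
    estimates = np.asarray(estimates, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    n_sim = estimates.size

    bias = estimates.mean() - target_value
    bias_mcse = estimates.std(ddof=1) / np.sqrt(n_sim)

    # Proportion of intervals that contain the target value
    covered = (lower <= target_value) & (target_value <= upper)
    coverage = covered.mean()
    coverage_mcse = np.sqrt(coverage * (1.0 - coverage) / n_sim)

    return {
        "bias": bias,
        "bias_mcse": bias_mcse,
        "coverage": coverage,
        "coverage_mcse": coverage_mcse,
    }
```

Power can be summarised in the same way as coverage, replacing the interval indicator with an indicator of rejection.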

Power of the tests of equal survival curves
Supplementary Figure S5 presents the results of investigating the effect of non-constant hazard rates in the presence of an early effect that ceases for seven tests of equal survival functions. In the scenario equivalent to PH, only the LR and Cox tests achieve the power values anticipated under the design model, with the versatile test and the combined tests showing a small decrease in power. Under PH, all three FH tests (using early, middle and late weightings) had lower power than the expected 80%. The FH early test, with weighting emphasising earlier events in the survival curve, obtained the highest power when the treatment was only effective for short initial periods of 3% and 10% of study duration. The versatile test obtained the next highest power in the presence of the shorter effective periods, but also had a power value closer to that observed for the LR and Cox tests when the treatment effect length was longer or persisted for the entire follow up. The RP(TD) combined test was closest to the versatile test; allowing for a time-dependent treatment effect improved the power values slightly at each of the times investigated, relative to the RP(PH) combined test.
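The way the FH weightings shift emphasis between early and late events can be seen in a minimal sketch of a Fleming-Harrington weighted log-rank statistic, in which each event time is weighted by S(t-)^rho × (1 - S(t-))^gamma using the pooled Kaplan-Meier estimate. FH(0,0) reduces to the standard log-rank test and FH(0,1) emphasises late differences; taking FH(1,0) for early and FH(1,1) for middle weighting is a conventional choice assumed here, since the exact parameters of those two tests are not given in this section. The function name is hypothetical, and the sketch does not guard against a zero variance (for example FH(0,1) with a single event time).

```python
import math

import numpy as np


def fleming_harrington_test(time, event, group, rho=0.0, gamma=0.0):
    """Two-sample FH(rho, gamma) weighted log-rank z statistic and p-value.

    time: follow-up times; event: 1 if the event was observed, 0 if censored;
    group: 1 for the treatment arm, 0 for control.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    group = np.asarray(group, dtype=int)

    surv_left = 1.0  # pooled Kaplan-Meier estimate just before each event time
    score, variance = 0.0, 0.0
    for t in np.unique(time[event]):
        at_risk = time >= t
        n = at_risk.sum()
        n1 = (at_risk & (group == 1)).sum()
        at_t = event & (time == t)
        d = at_t.sum()
        d1 = (at_t & (group == 1)).sum()

        weight = surv_left ** rho * (1.0 - surv_left) ** gamma
        # Observed minus expected events in the treatment arm
        score += weight * (d1 - d * n1 / n)
        if n > 1:
            # Hypergeometric variance of d1 given the risk sets
            variance += weight ** 2 * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        surv_left *= 1.0 - d / n

    z = score / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return z, p_value
```

With rho=1 and gamma=0 the weights decline as the pooled survival falls, which is consistent with the higher power reported for the early-weighted test when the treatment effect is confined to a short initial period.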
In general, the effect of non-constant event rates on the power of tests was consistent with what we observed for the modelling approaches. Decreases in the power loss were observed for a decreasing event rate compared to a constant event rate. An increasing event rate resulted in greater power losses than observed under a constant event rate. Whilst most changes in power attributable to a non-constant event rate were relatively modest for this simulation, differences in power of ±5% were observed, depending on the test and the length of the effective period under consideration.

Figure S5: Effect of non-constant event rates on the power of seven tests of equal survival function. The power of the z-test for the HR treatment effect from the Cox PH model is included in the panel as a comparator. The early effect period lengths investigated were t early = 3, 10, 20 and 50 months, with the setting t early = 50 representing PH.