Group sequential designs in pragmatic trials: feasibility and assessment of utility using data from a number of recent surgical RCTs

Background: Assessing the long-term effects of many surgical interventions tested in pragmatic RCTs may require extended periods of participant follow-up to assess effectiveness, and the use of patient-reported outcomes that require large sample sizes. Consequently, such RCTs are often perceived as being expensive and time-consuming, particularly if the results show that the test intervention is not effective. Adaptive, and particularly group sequential, designs have great potential to improve the efficiency and reduce the cost of testing new and existing surgical interventions. As a means to assess the potential utility of group sequential designs, we re-analyse data from a number of recent high-profile RCTs and assess whether using such a design would have caused the trial to stop early.

Methods: Many pragmatic RCTs monitor participants on a number of occasions (e.g. at 6, 12 and 24 months after surgery) during follow-up, as a means to assess recovery and to keep participants engaged with the trial process. Conventionally, one of the outcomes is selected as the primary (final) outcome, for clinical reasons, with the others designated as either early or late outcomes. In such settings, novel group sequential designs that use data not only from the final outcome but also from early outcomes at interim analyses can be used to inform stopping decisions. We describe data from seven recent surgical RCTs (WAT, DRAFFT, WOLLF, FASHION, CSAW, FIXDT, TOPKAT), and outline possible group sequential designs that could plausibly have been proposed at the design stage. We then simulate how these group sequential designs could have proceeded, by using the observed data and dates to replicate how information would have accumulated and decisions been made for each RCT.

Results: The simulated group sequential designs showed that, for two of the RCTs, it was highly likely that they would have stopped for futility at interim analyses, potentially saving considerable time (15 and 23 months) and costs, and avoiding patients being exposed to interventions that were either ineffective or no better than standard care. We discuss the characteristics of RCTs that are important for using the methodology we describe, particularly the value of early outcomes and the window of opportunity during which early stopping decisions can be made, and how this window is related to the lengths of the recruitment and follow-up periods.

Conclusions: The results for five of the RCTs tested showed that group sequential designs using early outcome data would have been feasible and would likely have provided designs at least as efficient as, and possibly more efficient than, the original fixed sample size designs. In general, the amount of information provided by the early outcomes was surprisingly large, due to the strength of their correlations with the primary outcome. This suggests that the methods described here are likely to provide benefits more generally across the range of surgical trials, and more widely in other application areas where trial designs, outcomes and follow-up patterns are structured and behave similarly.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12874-022-01734-2.


Longitudinal outcomes
In a two-arm controlled clinical trial where participants are followed up and outcomes reported at a series of t time-points, let y_{ijs} be the outcome for the i-th of N participants (i = 1, …, N), at time-point s (s = 1, …, t), recruited into intervention arm j (0 = control and 1 = treatment) of the trial. Let us assume that the total number of participants supplying (follow-up) data at time-point s is N_{0s} + N_{1s}, where N_{0s} is the number in the control arm and N_{1s} is the number in the treatment arm. Before recruitment into the trial is completed, we assume that the outcome data are structured such that N_{01} ≥ N_{02} ≥ ⋯ ≥ N_{0,t−1} ≥ N_{0t} and N_{11} ≥ N_{12} ≥ ⋯ ≥ N_{1,t−1} ≥ N_{1t}; i.e. there are always at least as many data available for the early as for the later time-points. At the completion of the study (i.e. the end of follow-up), we further assume that data are independent between participants and that the distribution of outcomes (y_{ij1}, …, y_{ijt}) is multivariate normal, with mean (µ_{j1}, …, µ_{jt}) and covariance matrix Σ with elements

[Σ]_{ss′} = σ_s σ_{s′} ρ_{ss′},   (1)

where σ_s is the standard deviation of the outcome at time s and ρ_{ss′} is the correlation between outcomes at time-points s and s′ (with ρ_{ss} = 1). The primary interest of the clinical trial is to estimate the effect of the treatment on the study outcome at time-point t (the primary study endpoint), which we call β_t. In order to undertake the study in the most efficient manner possible, we aim to assess the treatment effect at interim analyses (early looks) during follow-up, with the possibility of stopping the study for either futility or efficacy. A number of authors have investigated this problem [1,2,3] and, more generally, group-sequential analysis for longitudinal data [4,5].
The method of analysis proposed by Engel and Walstra [2] was formulated for a long-term and a single short-term endpoint in the setting of a fixed sample size experiment, rather than a clinical trial, and recognised that information on the long-term endpoint was available from the short-term endpoint, due to the correlation between data from individual study participants at the two time-points. The motivation for this approach, referred to as double-regression, was to use observations from the early outcome, which are generally less expensive, to inform inferences on the more expensive definitive (final) outcome. In the setting of a clinical trial, the cost of the longer-term outcome (relative to the shorter-term outcome) lies both in the resources required for longer-term follow-up and in the wider societal costs of potentially exposing patients to ineffective treatments, if the study ultimately provides little support for the efficacy of the intervention under test. A more general approach in the setting of a sequential trial, with a number of interim analyses and a long-term and potentially many short-term endpoints, involving comparison of an experimental arm with a control arm, was first suggested by Galbraith and Marschner [3].

Model
Formulating this as a linear longitudinal model with correlated errors, under the assumption of multivariate normality (MVN) for outcome vector y_i, we can write

y_i ∼ MVN(X_i β, Σ_i(σ, ρ)),   (2)

where Σ_i(σ, ρ) is the k × k covariance matrix of y_i, characterised by parameters σ = (σ_1, …, σ_k) and ρ = (ρ_{12}, …, ρ_{(k−1)k}), X_i is a k × 2t design matrix (where study participant i has k observed endpoints; 1 ≤ k ≤ t) and β is a 2t × 1 vector of unknown model parameters. For convenience, β can be structured such that β = (β_{10}, β_{20}, …, β_{t0}, β_1, β_2, …, β_t), where β_{s0} estimates the outcome mean in the control arm of the study and β_s estimates the effect of the treatment arm relative to the control arm at time-point s. Therefore, β_t is the effect of the treatment on the study outcome at time-point t (the primary endpoint).
The maximum likelihood estimator for β, under the multivariate normal assumption with known Σ_i, is the generalized least squares estimator [6],

β̂ = (Σ_{i=1}^{N} X_i′ Σ_i⁻¹ X_i)⁻¹ Σ_{i=1}^{N} X_i′ Σ_i⁻¹ y_i,   (3)

with variance given by

var(β̂) = (Σ_{i=1}^{N} X_i′ Σ_i⁻¹ X_i)⁻¹.   (4)

Estimates β̂ and var(β̂) follow naturally given covariance matrix Σ_i, which may be obtained from estimates of the correlations ρ and standard deviations σ. These could in principle be obtained in a number of ways, or could be fixed to known or expected values. Galbraith and Marschner [3] suggest using conventional mixed-effects models for the analysis of correlated data to estimate the covariance parameters. This can easily be implemented, for example, by fitting separate fixed effects for each follow-up time with an unstructured error covariance using the function lme in the R [7] package nlme. Alternatively, for the purposes of monitoring information accumulation during a trial, it is often beneficial to model the covariance structure amongst repeated outcomes directly, rather than including random effects to account for within-individual dependence. This can be achieved using the generalized least squares model (e.g. the gls function in nlme), which, unlike the mixed-effects model, provides explicit estimates of the covariance parameters [8]; see the example code in Section 6 for details of how models can be fitted to data in R. Either model formulation provides consistent and unbiased estimates of the model parameters (see the simulation study of Section 7), under an assumed multivariate normal distribution with a general covariance structure, common follow-up times for each individual and missing outcomes that are a consequence of the shortened follow-up duration alone.
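Expressions (3) and (4) can be implemented in a few lines. The sketch below, in Python with NumPy rather than the R used elsewhere in the paper (all variable names and parameter values are our own, chosen for illustration), simulates a two-arm trial with one early and one final outcome, a monotone missingness pattern and known covariance, and computes β̂ and var(β̂) by accumulating the per-participant quantities X_i′ Σ_i⁻¹ X_i and X_i′ Σ_i⁻¹ y_i.

```python
import numpy as np

rng = np.random.default_rng(1)

# Known covariance of (early, final) outcomes: sds 10 and 16, correlation 0.6
s1, s2, rho = 10.0, 16.0, 0.6
Sigma = np.array([[s1**2, rho * s1 * s2],
                  [rho * s1 * s2, s2**2]])

# True parameters: control means (50, 40); treatment effects (2, 5)
beta_true = np.array([50.0, 40.0, 2.0, 5.0])   # (b10, b20, b1, b2)

def design(j, k):
    """k x 4 design matrix X_i for a participant in arm j with k observed outcomes."""
    X = np.zeros((k, 4))
    for s in range(k):
        X[s, s] = 1.0       # control mean at time-point s + 1
        X[s, 2 + s] = j     # treatment effect at time-point s + 1
    return X

# Simulate, per arm: 2000 participants with both outcomes, 2000 with early only
data = []
for j in (0, 1):
    for k, n in ((2, 2000), (1, 2000)):
        X = design(j, k)
        y = rng.multivariate_normal(X @ beta_true, Sigma[:k, :k], size=n)
        data += [(X, Sigma[:k, :k], yi) for yi in y]

# Expressions (3) and (4): accumulate X_i' S_i^-1 X_i and X_i' S_i^-1 y_i
A = np.zeros((4, 4))
b = np.zeros(4)
for X, S, yi in data:
    Si = np.linalg.inv(S)
    A += X.T @ Si @ X
    b += X.T @ Si @ yi
V = np.linalg.inv(A)       # var(beta_hat), expression (4)
beta_hat = V @ b           # beta_hat, expression (3)

print(beta_hat)            # close to (50, 40, 2, 5)
print(V[3, 3])             # variance of the final-time treatment effect
```

Because expression (4) depends only on the design and covariance, not on the simulated outcomes, the element of var(β̂) for the final-time treatment effect here equals the closed-form double-regression variance exactly.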
Simple expressions for the estimators β̂_t and var(β̂_t) can be obtained directly for one (t = 2) and two (t = 3) early outcomes, after some algebraic manipulation of expressions (3) and (4), if we assume that the numbers in the control and intervention arms are equal (N_{0s} = N_{1s} = N_s), and by noting that N_1 ≥ N_2 ≥ ⋯ ≥ N_{t−1} ≥ N_t. For a single early outcome (t = 2), writing ȳ_{js} for the mean outcome at time s in arm j among the N_2 participants with data at both time-points, and x̄_{j1} for the mean early outcome among all N_1 participants in arm j, the estimator and its variance reduce to

β̂_2 = (ȳ_{12} − ȳ_{02}) + ρ_{12} (σ_2/σ_1) {(x̄_{11} − x̄_{01}) − (ȳ_{11} − ȳ_{01})},

var(β̂_2) = (2σ_2²/N_2) {1 − ρ_{12}² (1 − N_2/N_1)}.

The second term of the estimator adjusts the complete-case treatment difference using the early outcome data from participants who have not yet reached the final follow-up, and the variance expression shows the gain over the complete-case variance 2σ_2²/N_2: the reduction increases with the correlation ρ_{12} and with the proportion of participants who so far have only early data. The corresponding expressions for β̂_3 and var(β̂_3), with two early outcomes, have the same structure but additionally involve the determinants of the 2 × 2 and 3 × 3 correlation matrices.
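As a numerical check of the closed-form variance for a single early outcome, the sketch below (Python with NumPy; an illustration of our own, with arbitrary values σ_1² = σ_2² = 4, ρ_{12} = 0.5, and N_1 = 40 and N_2 = 20 per arm) builds the full information matrix of expression (4) participant-type by participant-type and compares the resulting var(β̂_2) with the closed form.

```python
import numpy as np

# Illustrative values: equal variances, moderate correlation, monotone follow-up
s1 = s2 = 2.0          # standard deviations (variance 4 at both time-points)
rho = 0.5
N1, N2 = 40, 20        # per-arm numbers with early / final outcome data

Sigma2 = np.array([[s1**2, rho * s1 * s2],
                   [rho * s1 * s2, s2**2]])
W2 = np.linalg.inv(Sigma2)          # weight matrix for complete participants
w1 = 1.0 / s1**2                    # weight for early-only participants

# Information matrix for beta = (b10, b20, b1, b2) from expression (4)
A = np.zeros((4, 4))
for j in (0, 1):                    # control and treatment arms
    Xc = np.zeros((2, 4)); Xc[0, 0] = Xc[1, 1] = 1.0; Xc[0, 2] = Xc[1, 3] = j
    Xe = np.zeros((1, 4)); Xe[0, 0] = 1.0; Xe[0, 2] = j
    A += N2 * (Xc.T @ W2 @ Xc)              # participants with both outcomes
    A += (N1 - N2) * w1 * (Xe.T @ Xe)       # participants with early outcome only

var_b2 = np.linalg.inv(A)[3, 3]
closed_form = (2 * s2**2 / N2) * (1 - rho**2 * (1 - N2 / N1))
print(var_b2, closed_form)   # both equal 7/20 = 0.35
```

The two routes agree: inverting the full information matrix and the closed-form expression both give var(β̂_2) = 7/20 for these inputs.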

Information
In order to understand how information on β_t accumulates as a trial proceeds, the covariance matrix can be written as Σ_k to indicate that, for any study participant with k observed outcomes, it is formed from the k × k sub-matrix consisting of rows (r) and columns (c) 1 to k of the complete covariance matrix Σ. The variance matrix (4) can then be expressed as the block matrix

var(β̂) = ( V_0    −V_0
           −V_0    V_0 + V_1 ),   (9)

where V_0 and V_1 are t × t matrices with elements given by

[V_j⁻¹]_{rc} = Σ_{k = max(r,c)}^{t} (N_{jk} − N_{j,k+1}) [Σ_k⁻¹]_{rc},   (10)

where j is the indicator for the treatment group allocation (0 = control and 1 = treatment), N_{0,t+1} = N_{1,t+1} = 0 and 1 ≤ r, c ≤ t.
The variance of the estimated treatment effect at the primary endpoint is then var(β̂_t) = [V_0 + V_1]_{tt}, with information on the treatment effect for the study outcome at time-point t given by I = 1/var(β̂_t). The information depends on estimates of the covariance parameters ρ = (ρ_{12}, …, ρ_{(t−1)t}) and σ = (σ_1, …, σ_t), through Σ_k, and on the numbers of participants (N_{0k} and N_{1k}) with data at each time-point k. Therefore, it is possible to monitor a study as it proceeds by estimating the covariance parameters and calculating the accumulated information at any point during recruitment. Pre-set information thresholds can be used to trigger interim analyses, with stopping decisions being made based on estimates of β_t and var(β_t) from expressions (3) and (4).
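The element-wise construction of V_0 and V_1 makes the accumulated information easy to track as counts accrue. A minimal sketch in Python with NumPy (the function name information and the example counts are ours; the covariance values, variance 4 at each of two time-points with correlation 0.5, match the illustrative gsBound example below):

```python
import numpy as np

def information(N0, N1, Sigma):
    """Information on the treatment effect at the final time-point, given
    per-arm counts N_jk with data at each time-point and covariance Sigma."""
    t = Sigma.shape[0]
    V = np.zeros((t, t))
    for N in (N0, N1):
        A = np.zeros((t, t))                   # A = V_j^{-1}, expression (10)
        counts = list(N) + [0]                 # convention N_{j,t+1} = 0
        for k in range(1, t + 1):
            Wk = np.linalg.inv(Sigma[:k, :k])  # Sigma_k^{-1}
            A[:k, :k] += (counts[k - 1] - counts[k]) * Wk
        V += np.linalg.inv(A)                  # V = V_0 + V_1
    return 1.0 / V[t - 1, t - 1]

# Two time-points, variance 4 at each, correlation 0.5
Sigma = np.array([[4.0, 2.0],
                  [2.0, 4.0]])
I1 = information([40, 20], [40, 20], Sigma)   # first interim look
I2 = information([60, 30], [60, 30], Sigma)   # second interim look
I3 = information([90, 90], [90, 90], Sigma)   # end of study
print(I1, I2, I3)   # 20/7, 30/7, 45/4
```

Calling the function repeatedly as the counts N_{jk} grow reproduces the monitoring process described above: information increases as participants accrue and complete follow-up, and the pre-set thresholds I_1 < I_2 < ⋯ trigger the interim analyses.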

Implementation for a two-arm trial
For a two-arm study, with outcomes observed at t time-points, study participants are randomized to either the control or active intervention arm, and data collection proceeds until the first interim analysis, when N_{t1}, …, N_{11} data are available; for notational convenience, hereafter we let N_{tw}, …, N_{1w} be the numbers of study participants with data available at interim analysis w, where N_{kw} = N_{0kw} + N_{1kw} and 1 ≤ k ≤ t. The planned number of interim analyses W, and the criteria for triggering each analysis, are stated explicitly at the start of the study. If information accrual is used as the criterion for triggering an interim analysis, then the thresholds are stated in advance and are such that I_1 < I_2 < ⋯ < I_W. Other criteria are perfectly possible, but for brevity are not discussed in detail here; for instance, one might require that the number of participants with outcomes at time t, N_t, exceeds some pre-set threshold. Such criteria should similarly be stated explicitly in advance. The first interim analysis is triggered when the estimated information on the treatment effect at time t, determined from current estimates of σ and ρ, is equal to or greater than I_1. Expressions (3) and (4) are then used to obtain estimates β̂_{t1} and var(β̂_{t1}), using only data available at the first look, from which the test statistic Z_1 = β̂_{t1}/sd(β̂_{t1}) is calculated. The observed test statistic Z_1 is then compared to pre-defined lower and upper stopping boundaries l_1 and u_1, which are determined by the pre-set expected information I_1 at the first look and the planned error spend (see Section 5), and either the trial is stopped, for futility or efficacy, or it continues to the next interim analysis.
At each subsequent interim analysis, the test statistic Z_w = β̂_{tw}/sd(β̂_{tw}) is calculated in an analogous manner to the first analysis, using all available data, and compared to stopping boundaries l_w and u_w that determine whether the study is stopped early (e.g. see Parsons et al. [1]). Bias-corrected point estimates of the intervention effect and associated confidence intervals can be calculated, if required, using appropriate statistical methods that adjust for the interim analyses (if stopping occurred after the first analysis) [9,10]. In general, if a study is stopped at an interim analysis, then data collection will continue for all participants recruited up to that point, and these final (complete follow-up) data can also be used for making inferences in an overrunning analysis [11].

Sequential stopping boundaries
In a sequential trial, with a series of W interim analyses (looks), we aim to compare the two study intervention arms in order to make inferences about the superiority of the active intervention arm over the control. Decisions about whether to stop the study or continue recruitment at each interim analysis w are made by comparing the test statistic Z_w to pre-defined lower and upper boundaries l_w and u_w. These boundaries are such that the type I error rate (false positive rate) is controlled across all W interim analyses. For a one-sided alternative at overall level α, with possible stopping for futility, the type I error rate spent is such that α*_{u1} < ⋯ < α*_{uW} = α and α*_{l1} < ⋯ < α*_{lW} = 1 − α, where α*_{uw} is the probability of stopping and rejecting H_0 in favour of β_t > 0 by look w (efficacy) and α*_{lw} is the probability of stopping without rejecting H_0 by look w (futility). The error rate spent is thus determined by α*_{lw} and α*_{uw}, which are specified in advance of the study beginning. For a two-arm study, standard group sequential methods and widely available software allow one to calculate the lower and upper stopping boundaries (l_w and u_w) at each look w using numerical integration [12]. For instance, the gsBound function in R package gsDesign [13] can calculate boundaries from the expected information and the values of α*_{lw} and α*_{uw}. As a simple illustrative example, consider a trial with two planned interim analyses (w = 1, 2; total number of interim analyses W = 2) for two outcomes, an early and a late (primary) outcome, with expected correlation ρ_{12} = 0.5 and expected variances σ_1² = σ_2² = 4.
If we plan the first interim analysis when there are primary outcome data from 20 participants and early outcome data from 40 participants in each group, then we expect that N_{0,11} = N_{1,11} = 40 and N_{0,21} = N_{1,21} = 20, assuming equal numbers in the intervention arms, and for notational convenience N_{0,31} = N_{1,31} = 0. Using expressions (9) to (10) and the appropriate covariance matrices, we can write var(β̂_t) = V[t, t] = 7/20, and the information at the first interim analysis is I_1 = 20/7. At the second interim analysis, we might expect that N_{0,12} = N_{1,12} = 60, N_{0,22} = N_{1,22} = 30 and N_{0,32} = N_{1,32} = 0, and thus the expected information is I_2 = 30/7. The information at the end of the study, which we denote as analysis W + 1 = 3, when N_{0,13} = N_{1,13} = N_{0,23} = N_{1,23} = 90, is I_3 = 45/4; more simply, in this example where the sample sizes in the groups are equal (i.e. at the study end we expect N_{0,k3} = N_{1,k3} = 90), I_3 = N_{k3}/(2σ_2²) = 45/4. Setting the lower and upper stopping probabilities to α*_l = (0.320, 0.640, 0.975) and α*_u = (0.001, 0.010, 0.025) respectively, and the corresponding information to I = (20/7, 30/7, 45/4), as inputs to the gsBound function, gives the lower and upper bounds l = (−0.47, 0.33, 2.06) and u = (3.09, 2.34, 2.06). Therefore, to undertake a group-sequential study, we recruit participants and monitor the current information from early and late outcomes until it first reaches I_1. We then calculate Z_1 and compare it to the lower and upper boundaries l_1 and u_1, stopping the study for futility if Z_1 ≤ l_1 or for efficacy if Z_1 ≥ u_1. Otherwise, the study continues to recruit until we reach I_2, when we stop the study for futility if Z_2 ≤ l_2 or for efficacy if Z_2 ≥ u_2. If the study continues to the end, we reject at the upper 2.5% level if Z_3 ≥ u_3.
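The gsBound calculation can be reproduced from first principles: under H_0 the test statistics (Z_1, Z_2, Z_3) are multivariate normal with corr(Z_w, Z_v) = √(I_w/I_v) for w ≤ v, and each boundary is chosen so that the incremental error spent at that look matches the plan. The sketch below, in Python with SciPy, is our own stand-in for gsDesign's numerical integration (not its implementation), and should recover boundaries close to those quoted above, up to rounding.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import brentq

# Expected information at the two interim looks and the final analysis
I = np.array([20 / 7, 30 / 7, 45 / 4])
al = [0.320, 0.640, 0.975]   # cumulative lower (futility) spend
au = [0.001, 0.010, 0.025]   # cumulative upper (efficacy) spend

# Canonical joint distribution: corr(Z_w, Z_v) = sqrt(I_w / I_v) for w <= v
R = np.array([[np.sqrt(min(I[w], I[v]) / max(I[w], I[v])) for v in range(3)]
              for w in range(3)])

# Look 1: marginal normal quantiles give the first boundaries
l1, u1 = norm.ppf(al[0]), norm.ppf(1 - au[0])

def box2(a, b, c):
    """P(a < Z1 < b, Z2 < c) under H0."""
    mvn = multivariate_normal(mean=[0, 0], cov=R[:2, :2])
    return mvn.cdf([b, c]) - mvn.cdf([a, c])

# Look 2: solve for u2 and l2 so the incremental spend is attained
cont1 = norm.cdf(u1) - norm.cdf(l1)          # P(continue past look 1)
u2 = brentq(lambda x: cont1 - box2(l1, u1, x) - (au[1] - au[0]), 1.0, 4.0)
l2 = brentq(lambda x: box2(l1, u1, x) - (al[1] - al[0]), -1.0, 2.0)

def box3(c):
    """P(l1 < Z1 < u1, l2 < Z2 < u2, Z3 < c) under H0, by inclusion-exclusion."""
    mvn = multivariate_normal(mean=[0, 0, 0], cov=R)
    p = 0.0
    for sa, a in ((1, u1), (-1, l1)):
        for sb, b in ((1, u2), (-1, l2)):
            p += sa * sb * mvn.cdf([a, b, c])
    return p

# Final analysis: l3 = u3, chosen to spend the remaining efficacy alpha
cont2 = box2(l1, u1, u2) - box2(l1, u1, l2)  # P(continue past look 2)
u3 = brentq(lambda x: cont2 - box3(x) - (au[2] - au[1]), 1.0, 4.0)

print(l1, l2, u3)   # compare with l = (-0.47, 0.33, 2.06)
print(u1, u2, u3)   # compare with u = (3.09, 2.34, 2.06)
```

In practice one would simply call gsBound, but the direct computation makes explicit how the boundaries follow from the information sequence and the planned cumulative spend.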
For studies that stop at an interim analysis, data collection will continue (until follow-up is complete) for all participants recruited up to the point the study is stopped. In such circumstances, final inferences can be made using the deletion method that effectively analyses the data as if the only interim analyses to have taken place were those prior to analysis w, when the study stopped, and the analysis at w + 1 (the overrunning analysis) [11,14]. Bias corrected estimates and confidence intervals can be readily calculated using previously described methods [9,10] and available R packages [15,13,16].

R code
As a means to demonstrate how the models described here can be fitted to longitudinal data from a clinical trial, code is shown below in R [7] to simulate data with t = 3 time-points for N individuals from two treatment groups (coded as 0, the control, and 1). Correlations are arbitrarily set, for illustrative purposes, to ρ_{12} = 0.6, ρ_{23} = 0.5 and ρ_{13} = 0.8, and standard deviations to σ_1 = 10, σ_2 = 20 and σ_3 = 16, with data set to missing for 50% and 25% of values at time-points t = 3 and t = 2 respectively, to simulate data at an interim analysis when follow-up is incomplete. Models are fitted using the functions gls and lme in package nlme [17]. The simulations in Section 7 use a much more general coding framework than that shown here, which allows different patterns of follow-up and varying treatment group sizes.
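The data-generation step of this listing can be mirrored outside R. The sketch below (Python with NumPy; an illustrative analogue of the R code, with our own variable names and arbitrary treatment effects) simulates the same covariance model and a monotone interim missingness pattern, then checks that the covariance parameters σ and ρ can be recovered from the partially observed data, as would be needed when monitoring information and model assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

# Covariance model from the text: sds 10, 20, 16; correlations 0.6, 0.5, 0.8
sd = np.array([10.0, 20.0, 16.0])
Rho = np.array([[1.0, 0.6, 0.8],
                [0.6, 1.0, 0.5],
                [0.8, 0.5, 1.0]])
Sigma = np.outer(sd, sd) * Rho

N = 4000
group = rng.integers(0, 2, N)                  # treatment group, coded 0 / 1
effects = np.array([0.0, 3.0, 6.0])            # arbitrary treatment effects
y = rng.multivariate_normal(np.zeros(3), Sigma, size=N) + np.outer(group, effects)

# Monotone missingness, as at an interim analysis: the most recently recruited
# half lack time-3 data, and half of those also lack time-2 data
y[N // 2:, 2] = np.nan
y[3 * N // 4:, 1] = np.nan

# Centre within treatment groups, then estimate sds and correlations from the
# available (pairwise-complete) data
resid = y.copy()
for j in (0, 1):
    resid[group == j] -= np.nanmean(y[group == j], axis=0)

sd_hat = np.nanstd(resid, axis=0, ddof=1)

def pair_corr(a, b):
    ok = ~np.isnan(resid[:, a]) & ~np.isnan(resid[:, b])
    return np.corrcoef(resid[ok, a], resid[ok, b])[0, 1]

rho_hat = [pair_corr(0, 1), pair_corr(1, 2), pair_corr(0, 2)]
print(sd_hat)    # close to (10, 20, 16)
print(rho_hat)   # close to (0.6, 0.5, 0.8)
```

In the paper's own workflow these parameters would instead come from fitting gls (or lme) in nlme; the point here is only that the monotone interim data retain enough pairwise-complete observations to estimate the full covariance model.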

Simulations
At an interim analysis during a clinical trial comparing a control and a test intervention, for an outcome reported at a series of t time-points, we wish to estimate the treatment effect at time t, β_t. In order to do this, we fit a longitudinal model (Section 2) to the totality of data available at the interim analysis. We can fit either a mixed-effects model or a generalized least squares model, using the functions lme and gls in R package nlme. We show, via a simulation study, that both methods provide identical (and consistent) estimates of β_t and var(β_t). However, the function gls provides explicit estimates of the covariance parameters (σ and ρ) that are useful for assessing model assumptions as a study progresses, and for this reason this method may often be preferable. Data are simulated from multivariate normal distributions with t outcomes (where t = 2, 3, 4), to give a time series of t outcomes for each of N participants in a clinical trial with two intervention arms, for four different settings (see Tables 1 to 4), comprising two treatment effect sizes (mean difference in outcomes between intervention arms) and two correlation models (moderate and strong), labelled (i) to (iv), and two sample sizes (labelled small and large N). The sample sizes were set such that the median values for the group sizes were approximately as follows in the small N study: N_1 = 160 and N_2 = 80 for t = 2; N_1 = 180, N_2 = 120 and N_3 = 60 for t = 3; and N_1 = 192, N_2 = 144, N_3 = 96 and N_4 = 48 for t = 4. Sample sizes for the large N study were ten times these values. The covariance model for each setting was determined by standard deviations (σ_s), which were (arbitrarily) selected to be even integers in the range 10 to 30, and correlations between outcomes (ρ_{ss′}), which were as per Tables 1 to 4 for each setting and (arbitrarily) set to represent either moderate or strong correlations between outcomes.
The mean model for each setting was such that the mean difference between intervention arms at time t was either 0 or 10, with differences at earlier time-points taking integer values in the range −10 to 10. Again, the values are arbitrary, and were selected only to demonstrate the consistency of estimation. In each simulation, data were generated in R using package mvtnorm [15] from the set covariance model for each setting, and treatment differences β_t and variances var(β_t) were estimated using the functions gls and lme in package nlme. In addition, for gls the covariance parameter estimates were obtained, and for lme the residual (within-group error), intercept and total standard deviation estimates were obtained (where σ²_total = σ²_intercept + σ²_residual). The results of 1000 simulations are shown in Tables 1 to 4 for each of the settings. Estimates of β_t and var(β_t) were, as expected, identical for gls and lme, as were estimates of σ_1 (from gls) and σ_total, which for lme is set by default to the standard deviation at the first time-point. The results show that treatment effect estimates β_t from the simulations were consistent with the selected mean models for all values of t, and confidence intervals were narrower for the large sample sizes than for the small. Estimates of the covariance parameters (σ and ρ) from gls were also consistent with the selected covariance models for all values of t. In summary, model fitting using either gls or lme gives exactly the same overall model fit. For applications where information accumulation is monitored and modelling assumptions for the covariance parameters are assessed at interim analyses during a clinical trial, fitting models using gls may be preferable, as it allows these tasks to be undertaken much more simply and directly.
Table 1: Estimates of means and confidence intervals of model parameters (based on 1000 simulations), β_2, var(β_2), ρ_{ss′}, σ_s and standard deviation components, for settings (i) to (iv), t = 2, for small and large N, using gls and lme