A gated group sequential design for seamless Phase II/III trial with subpopulation selection
BMC Medical Research Methodology volume 23, Article number: 2 (2023)
Abstract
Background
Due to the high cost and high failure rate of Phase III trials, where a classical group sequential design (GSD) is usually used, seamless Phase II/III designs are increasingly popular as a way to improve trial efficiency. A potential attraction of a Phase II/III design is that it allows a randomized proof-of-concept stage prior to committing to the full cost of a Phase III trial. Population selection during the trial allows a trial to adapt and focus investment where it is most likely to provide patient benefit. Previous methods have been developed for this problem when there is a single primary endpoint and two possible populations.
Methods
To find the population that potentially benefits with one or two primary endpoints (e.g., progression-free survival (PFS), overall survival (OS)), we propose a gated group sequential design for a seamless Phase II/III trial with adaptive population selection.
Results
The investigated design controls the familywise error rate and allows multiple interim analyses to enable early stopping for efficacy or futility. Simulations and an illustrative example suggest that the proposed gated group sequential design has more power and requires less time and resources compared to the group sequential design and adaptive design.
Conclusions
Combining the group sequential design and adaptive design, the gated group sequential design has more power and higher efficiency while controlling for the familywise error rate. It has the potential to save drug development cost and more quickly fulfill unmet medical needs.
Background
The high failure rate of Phase III trials combined with their substantial cost makes selecting an appropriate treatment and population for evaluation of paramount importance in drug development [1]. Seamless Phase II/III multi-arm clinical trials use the initial part of the trial (Phase II) to investigate all treatments and/or populations, and the second part (Phase III) for an in-depth evaluation of the promising one(s). Using data accumulated across both phases of a single Phase II/III trial for inference enables more efficient and effective development of a treatment for an appropriate indication than separate trials for Phases II and III.
Consider a second-line small cell lung cancer clinical trial in which a platinum-sensitive subgroup yields a much greater treatment benefit. Even if the treatment benefit in the platinum-resistant subgroup is less certain, from a marketing perspective the all-comer population, which includes the platinum-resistant subgroup, can give maximum patient benefit, and in turn market value, if the platinum-resistant subgroup also benefits from the experimental treatment. Under this circumstance, a direct Phase III trial with a broad population can be risky. A more efficient approach could be a seamless Phase II/III design with population selection in the Phase II portion of the trial, followed by a potentially targeted Phase III enrollment in a focused patient population to confirm the benefit. Benefit for either progression-free survival (PFS) or overall survival (OS) could justify a new treatment paradigm. This is an extension of the method for a single primary endpoint by Jenkins et al. [2].
In clinical trials, the clinical benefit of an intervention is often characterized by multiple outcomes. For multiple hypothesis testing problems, the familywise error rate (FWER), the probability of erroneously rejecting at least one true null hypothesis, needs to be bounded by a prespecified significance level α. A series of methods derived from weighted Bonferroni-based closed test procedures have been proposed to control the FWER for multiple testing. Examples of such methods include the Bonferroni-Holm procedure [3], gatekeeping procedures based on Bonferroni adjustments [4] and the graphical approach [5, 6]. As group sequential designs are widely used to facilitate early efficacy testing, the application of group sequential designs to multiple endpoints has become popular and has been widely studied recently [6,7,8,9,10,11,12,13,14,15,16].
Adaptive seamless Phase II/III designs allow a Phase II assessment of whether within-trial extension to Phase III is justified. Here we consider that the adaptation includes choosing a meaningful population for an effective investment with a high probability of success. A predefined, targeted subgroup and the full population are both studied in the first stage of the adaptive Phase II/III design. Investment in the second stage of the adaptive Phase II/III design is then focused on the population(s) most likely to provide patient benefit after the futility analysis at the end of Phase II. Because multiple sources can contribute to decision error in this type of design, FWER control should be studied carefully. The closed testing procedure [17] is usually applied to test multiple hypotheses in the setting of population selection. FWER control strategies using the multiple testing method [18, 19], the combination test method [20, 21], the marginal p-value combination approach [22], and a conditional error function approach [23] have been proposed. The application of adaptive Phase II/III designs to multiple endpoints has been investigated using different methods [2, 24,25,26].
To improve trial efficiency in the adaptive Phase II/III design, we propose a method that combines the group sequential design (GSD) with the adaptive design. With the implementation of GSD, the trial can stop early to save time and resources. However, the closed testing principle between the subgroup and the full population could dramatically decrease the power of an adaptive Phase II/III design when only one group has meaningful efficacy. To improve the power while controlling the FWER, we propose a gated group sequential design (gGSD) combining the group sequential design and the adaptive design. The endpoints in the subgroup and full population are tested in a prespecified order using hierarchical testing [9]. The Methods section details the proposed design. In the Results section, the performance of gGSD is evaluated by simulations, and an illustrative example demonstrates the design and its efficiency. The Summary section summarizes the proposed study design.
Methods
We consider a randomized, parallel group clinical trial with two treatment arms – experimental and control, and dual primary endpoints – arbitrarily OS and PFS as a prototypical example. There is an interest to investigate the efficacy of the experimental treatment in both the full population (F) and a targeted subgroup (S). Four null hypotheses below are of interest:

1)
\({H}_{0}^{\left\{F, OS\right\}}\): no difference in OS between arms in the full population;

2)
\({H}_{0}^{\left\{F, PFS\right\}}\): no difference in PFS between arms in the full population;

3)
\({H}_{0}^{\left\{S, OS\right\}}\): no difference in OS between arms in the targeted subgroup;

4)
\({H}_{0}^{\left\{S, PFS\right\}}\): no difference in PFS between arms in the targeted subgroup.
Let \({\alpha }_{1}\), \({\alpha }_{2}\), \({\alpha }_{3}\) and \({\alpha }_{4}\) be the initial significance levels for the hypotheses \({H}_{0}^{\left\{F, OS\right\}}\), \({H}_{0}^{\left\{F, PFS\right\}}\), \({H}_{0}^{\left\{S, OS\right\}}\) and \({H}_{0}^{\left\{S, PFS\right\}}\), respectively, and \(\mathrm{\alpha }\) be the overall significance level. Jenkins et al. [2] proposed a method for population selection in the seamless adaptive design framework with only one analysis in stage 2 after population selection in stage 1. In this paper, we extend their method for population selection to control the FWER for all four of the aforementioned hypotheses. We further add a group sequential design strategy in stage 2 for flexible early efficacy testing. The design consists of an initial learning stage (stage 1) analogous to a randomized Phase II trial and a second confirmatory stage (stage 2) analogous to a randomized Phase III trial. The selection between populations F and S is based on the PFS results at the end of stage 1. Based on those results, the trial can stop for futility, continue to stage 2 in both populations F and S, continue in the subgroup S only, or continue in the full population F only without analyzing the subgroup S in stage 2. Note that there is no hypothesis testing at the end of stage 1. In stage 2, we consider a group sequential setting with \(K-1\) interim analyses and one final analysis, where PFS and OS in populations F and/or S are tested by using group sequential approaches, with alpha allocation following the graphical approach [6]. Figure 1 shows the analysis flowchart for K = 3.
According to the FDA guidance for adaptive design [27], the design, conduct, and analysis of an adaptive clinical trial intended to provide substantial evidence of effectiveness should satisfy four key principles: 1) the chance of erroneous conclusions should be adequately controlled, 2) estimation of treatment effects should be sufficiently reliable, 3) details of the design should be completely prespecified, and 4) trial integrity should be appropriately maintained. There are three potential reasons for inflation of the Type I error: 1) early rejection of a null hypothesis at an interim analysis; 2) adaptation of design features and combination of information across trial stages; and 3) multiple hypothesis testing. To control the Type I error rate, the following strategies are proposed: group sequential plans for early rejection; the combination of p-values using methods such as the inverse-normal method for adaptation; and multiple testing methodologies such as closed testing procedures for multiple hypotheses. If needed, all three approaches can be combined to control the FWER.
For subjects recruited in stage 1, the nominal one-sided observed p-values of \(H_0^{\left\{F,\cdot\right\}}\) and \({H}_{0}^{\left\{S, \cdot \right\}}\) at the kth analysis (\(k=1,2,\dots ,K\)) will be denoted by \({p}_{1k}^{\left\{F, \cdot \right\}}\) and \({p}_{1k}^{\left\{S, \cdot \right\}}\), respectively. For subjects recruited in stage 2, the nominal one-sided observed p-values of \({H}_{0}^{\left\{F, \cdot \right\}}\) and \({H}_{0}^{\left\{S, \cdot \right\}}\) at the kth analysis (\(k=1,2,\dots ,K\)) will be denoted by \({p}_{2k}^{\left\{F, \cdot \right\}}\) and \({p}_{2k}^{\left\{S, \cdot \right\}}.\) The goal is to control the FWER (i.e., the probability of rejecting at least one of the true null hypotheses among \({H}_{0}^{\left\{F, OS\right\}}\), \({H}_{0}^{\left\{S, OS\right\}}\), \({H}_{0}^{\left\{F, PFS\right\}}\) and \({H}_{0}^{\left\{S, PFS\right\}}\)) at a nominal level α. We address all three potential sources of Type I error inflation: the closed testing principle is applied for multiple testing, the inverse-normal combination test is used to analyze the data from the two stages, and the graphical approach is applied for group sequential analyses with different hypotheses. Combining these strategies, the FWER of the proposed design is strictly controlled [2, 6].
At the end of stage 1, non-binding futility analyses for PFS in the subgroup S and the full population F are performed. These determine whether the trial continues to stage 2 with one or two populations, or stops at the end of stage 1. No testing for rejection is done at the end of stage 1. Only one futility analysis is conducted no matter how many interim analyses might follow in the second stage, although additional futility analyses could be added as they only decrease the Type I error. Let \({HR}^{F}\) and \({HR}^{S}\) be the estimated hazard ratios (HR) of the full population and the subgroup, and \({\theta }^{F}\) and \({\theta }^{S}\) be the prespecified hazard ratio thresholds for the full population and the subgroup, respectively. Table 1 provides the decision rule for population selection. We choose \({\theta }^{x}\) (\(x=F,S\)) to ensure that \(\mathrm{P}\left(\mathrm{HR}>{\theta }^{x}\mid \text{true HR}\right)={\gamma }^{x}\), where \({\gamma }^{x}\) is the prespecified probability that the trial does not pass the futility gate under the true alternative HR. Under equal randomization, log(HR) approximately follows a normal distribution with mean log(true HR) and variance 4/(number of events). This gives a way to calculate the aforementioned thresholds.
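The threshold calculation described above can be sketched in a few lines of Python. This is a minimal illustration under the stated normal approximation for log(HR) with equal randomization; the function names and the example inputs (true HR of 0.7, 100 events) are hypothetical and are not taken from Table 2.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def futility_threshold(true_hr: float, n_events: int, gamma: float) -> float:
    """Return theta such that P(estimated HR > theta | true HR) = gamma.

    Assumes log(estimated HR) ~ Normal(log(true_hr), 4 / n_events),
    i.e. the approximation for 1:1 randomization quoted in the text.
    """
    se = sqrt(4.0 / n_events)               # standard error of log(HR)
    z = NormalDist().inv_cdf(1.0 - gamma)   # upper-gamma normal quantile
    return exp(log(true_hr) + z * se)


def prob_fail_gate(theta: float, true_hr: float, n_events: int) -> float:
    """Probability that the estimated HR exceeds theta under the same model."""
    se = sqrt(4.0 / n_events)
    return 1.0 - NormalDist(mu=log(true_hr), sigma=se).cdf(log(theta))


# Hypothetical inputs, for illustration only.
theta = futility_threshold(true_hr=0.7, n_events=100, gamma=0.05)
print(round(theta, 3), round(prob_fail_gate(theta, 0.7, 100), 3))
```

Because the threshold depends on the number of stage 1 events, populations with fewer events receive a more lenient (larger) threshold for the same \(\gamma\).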
Stage 2
Once the futility boundary at the end of stage 1 is passed, the trial will continue to stage 2 with one or two populations. As described above, there are three possible scenarios in stage 2.

Scenario 1: continue to stage 2 in the subgroup S only with the planned sample size in S, allocating additional alpha to S; i.e., \({\alpha }_{1}={\alpha }_{2}=0\);

Scenario 2: continue to stage 2 in the full population F with the planned sample size in F without further analysis of S, allowing additional allocation of alpha to F; i.e., \({\alpha }_{3}={\alpha }_{4}=0\);

Scenario 3: continue to stage 2 in both populations F and S with the planned sample size, continuing testing in both populations.
The gated group sequential design (gGSD) incorporates the hierarchical testing strategy and the group sequential design. The hierarchical testing strategy was proposed by Glimm et al. [9] for the ordered testing of endpoints such as PFS and OS with FWER controlled. In our study design, we modify their strategy to accommodate multiple testing scenarios with FWER controlled between populations; i.e., the hierarchical testing strategy is used for the ordered testing of populations.
In scenario 1, only PFS and OS in the subgroup S will be tested according to the alpha allocated using the graphical approach. An arbitrary alternative graphical approach could also be used: e.g., \({H}_{0}^{\left\{S, PFS\right\}}\) is first tested with the full significance level \(\mathrm{\alpha }\), and the full \(\mathrm{\alpha }\) will be passed to test \({H}_{0}^{\left\{S, OS\right\}}\) if and when \({H}_{0}^{\left\{S, PFS\right\}}\) is rejected using the overall hierarchical method of Glimm, et al. [9]. Note that the patients for the F minus S population enrolled in stage 1 will be followed continuously since the information from those patients is used in the closed testing procedure.
In scenario 2, only PFS and OS in the full population F will be tested according to the alpha allocated using the graphical approach. Analogous to scenario 1, an alternative graphical approach could also be used: e.g., \({H}_{0}^{\left\{F, OS\right\}}\) will be tested at level \(\mathrm{\alpha }\) only if \({H}_{0}^{\left\{F, PFS\right\}}\) is rejected (a hierarchical approach).
In scenario 3, the subgroup S and the full population F are tested hierarchically; i.e., the hypotheses in F will not be tested until at least one hypothesis in S is rejected. For the hypotheses within the same population F or S, the graphical approach of Maurer and Bretz [6] is applied. More specifically, the hypotheses in the subgroup S are tested based on the graphical approach with \({\alpha }_{3}+{\alpha }_{4}=\mathrm{\alpha }\). Under the hierarchical rule, the hypotheses in the full population F will be tested by using the graphical approach with \({\alpha }_{1}+{\alpha }_{2}=\mathrm{\alpha }\) if at least one hypothesis in the subgroup S is rejected. The graphical approach ensures that \(\mathrm{\alpha }\) reallocation occurs only between PFS and OS within the same group, and does not occur between different groups (i.e., between F and S). Note that the sequential testing rules and the timing of analyses are independent between the subgroup and the full population. Figure 2 illustrates the gGSD testing procedures in stage 2 for the efficacy analyses with K = 3. The design is event-driven and will continue to the final analysis unless all the hypotheses are rejected.
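The gating logic of scenario 3 can be illustrated with a deliberately simplified sketch. It assumes a single analysis with fixed-sample p-values (no group sequential boundaries and no stage combination), and a two-hypothesis graph in which the alpha of a rejected hypothesis is passed in full to the other hypothesis in the same population. The function names and the alpha splits are hypothetical choices for illustration, not the design's actual allocation.

```python
from typing import Tuple

Pair = Tuple[float, float]  # (PFS value, OS value)


def graphical_two_hypotheses(p: Pair, alpha: Pair) -> Tuple[bool, bool]:
    """Bonferroni-based graph with two hypotheses (PFS, OS).

    Each hypothesis is first tested at its own alpha; if exactly one is
    rejected, its alpha is passed to the other, which is retested at the
    summed level. Simplification: one analysis, fixed-sample p-values.
    """
    total = alpha[0] + alpha[1]
    rej_pfs = p[0] <= alpha[0]
    rej_os = p[1] <= alpha[1]
    if rej_pfs and not rej_os:
        rej_os = p[1] <= total
    elif rej_os and not rej_pfs:
        rej_pfs = p[0] <= total
    return rej_pfs, rej_os


def gated_test(p_s: Pair, p_f: Pair, alpha_s: Pair, alpha_f: Pair):
    """Test S first; open the gate to F only if some S hypothesis is rejected."""
    rej_s = graphical_two_hypotheses(p_s, alpha_s)
    if any(rej_s):
        rej_f = graphical_two_hypotheses(p_f, alpha_f)
    else:
        rej_f = (False, False)  # F is never tested: no alpha crosses populations
    return {"S": rej_s, "F": rej_f}


# Hypothetical p-values and alpha splits (each population sums to 0.025).
print(gated_test(p_s=(0.010, 0.020), p_f=(0.200, 0.300),
                 alpha_s=(0.005, 0.020), alpha_f=(0.005, 0.020)))
```

In the actual design the comparisons would be made against group sequential boundaries derived from the allocated alpha at each analysis, rather than against the raw alpha levels used here.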
The inverse-normal combination test is applied to control the FWER regardless of the decision at the futility analysis at the end of stage 1. For the kth analysis in stage 2, weights \({w}_{1k}\) and \({w}_{2k}\) are prespecified to combine the p-values from stage 1 (\({p}_{1k}\)) and stage 2 (\({p}_{2k}\)), where \({w}_{1k}^{2}+{w}_{2k}^{2}=1\). The null hypothesis is rejected if \({w}_{1k}{\Phi }^{-1}\left(1-{p}_{1k}\right)+{w}_{2k}{\Phi }^{-1}\left(1-{p}_{2k}\right)\ge {c}_{k}\), where \({c}_{k}\) is the z-statistic boundary using the allocated alpha. It has been pointed out that the test statistics may not have the desired null distribution for time-to-event endpoints in a two-stage adaptive design [28, 29]. The violation of the independent increments assumption can lead to Type I error inflation. To ensure that the hypothesis is tested with proper protection of the familywise Type I error, we follow the method of a previous adaptive design study [2]. Specifically, the p-values are calculated separately for subjects recruited in stage 1 (i.e., \({p}_{1k}\)) and those recruited in stage 2 (i.e., \({p}_{2k}\)). The additional follow-up of stage 1 subjects during stage 2 contributes to the stage 1 p-value (\({p}_{1k}\)). The closed testing procedures are applied to control the FWER. The Hochberg correction [30] with equal weighting, \({p}_{i}^{FS}=\mathrm{min}\left[2\mathrm{min}\left\{{p}_{i}^{F},{p}_{i}^{S}\right\},\mathrm{max}\left\{{p}_{i}^{F},{p}_{i}^{S}\right\}\right]\), is used to compute the p-values of the intersection hypotheses between populations. The minimum z-statistic boundary of the hypotheses with allocated alpha in the intersection testing is used as \({c}_{k}\).
The weights and p-values to be used in the combination tests are provided below, where the PFS endpoint is used as an example; the OS endpoint can be handled in a similar manner. Note that the weights \({w}_{1k}\) and \({w}_{2k}\) need to be prespecified to control the FWER, and can be different for the PFS and OS endpoints.
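As a minimal sketch of the two ingredients just described, the following Python functions compute the inverse-normal combination statistic and the equally weighted Hochberg p-value for the intersection of the F and S hypotheses. The function names and the numerical inputs are illustrative assumptions, and the boundary \({c}_{k}\) must still be supplied from the group sequential design.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def inverse_normal_z(p1: float, p2: float, w1: float, w2: float) -> float:
    """Combine stage-wise one-sided p-values: w1*Phi^-1(1-p1) + w2*Phi^-1(1-p2).

    The weights must be prespecified and satisfy w1^2 + w2^2 = 1.
    p-values must lie strictly between 0 and 1 (inv_cdf rejects 0 and 1).
    """
    if abs(w1 * w1 + w2 * w2 - 1.0) > 1e-9:
        raise ValueError("weights must satisfy w1^2 + w2^2 = 1")
    return w1 * _STD_NORMAL.inv_cdf(1.0 - p1) + w2 * _STD_NORMAL.inv_cdf(1.0 - p2)


def hochberg_intersection_p(p_f: float, p_s: float) -> float:
    """Equal-weight Hochberg p-value: min[2*min(p_F, p_S), max(p_F, p_S)]."""
    return min(2.0 * min(p_f, p_s), max(p_f, p_s))


# Hypothetical stage-wise p-values, equal weights, illustrative boundary.
w = sqrt(0.5)
z = inverse_normal_z(p1=0.02, p2=0.01, w1=w, w2=w)
c_k = 2.5  # assumed z boundary for this analysis; not taken from the paper
print(z >= c_k, hochberg_intersection_p(0.01, 0.04))
```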

1.
S only scenario – when considering \({H}_{0}^{\left\{S, PFS\right\}}\) only.
Testing \({H}_{0}^{\left\{FS, PFS\right\}}\): \({w}_{1k}{\Phi }^{-1}\left(1-{p}_{1k}^{\left\{FS,PFS\right\}}\right)+{w}_{2k}{\Phi }^{-1}\left(1-{p}_{2k}^{\left\{S,PFS\right\}}\right)\);
Testing \({H}_{0}^{\left\{S, PFS\right\}}\): \({w}_{1k}{\Phi }^{-1}\left(1-{p}_{1k}^{\left\{S,PFS\right\}}\right)+{w}_{2k}{\Phi }^{-1}\left(1-{p}_{2k}^{\left\{S,PFS\right\}}\right)\).

2.
F only scenario – when considering \({H}_{0}^{\left\{F, PFS\right\}}\) only.
Testing \({H}_{0}^{\left\{FS, PFS\right\}}\): \({w}_{1k}{\Phi }^{-1}\left(1-{p}_{1k}^{\left\{FS,PFS\right\}}\right)+{w}_{2k}{\Phi }^{-1}\left(1-{p}_{2k}^{\left\{F,PFS\right\}}\right)\);
Testing \({H}_{0}^{\left\{F, PFS\right\}}\): \({w}_{1k}{\Phi }^{-1}\left(1-{p}_{1k}^{\left\{F,PFS\right\}}\right)+{w}_{2k}{\Phi }^{-1}\left(1-{p}_{2k}^{\left\{F,PFS\right\}}\right)\).

3.
F and S scenario – when considering both \({H}_{0}^{\left\{F, PFS\right\}}\) and \({H}_{0}^{\left\{S, PFS\right\}}\).
Testing \({H}_{0}^{\left\{FS, PFS\right\}}\): \({w}_{1k}{\Phi }^{-1}\left(1-{p}_{1k}^{\left\{FS,PFS\right\}}\right)+{w}_{2k}{\Phi }^{-1}\left(1-{p}_{2k}^{\left\{FS,PFS\right\}}\right)\);
Testing \({H}_{0}^{\left\{F, PFS\right\}}\): \({w}_{1k}{\Phi }^{-1}\left(1-{p}_{1k}^{\left\{F,PFS\right\}}\right)+{w}_{2k}{\Phi }^{-1}\left(1-{p}_{2k}^{\left\{F,PFS\right\}}\right)\);
Testing \({H}_{0}^{\left\{S, PFS\right\}}\): \({w}_{1k}{\Phi }^{-1}\left(1-{p}_{1k}^{\left\{S,PFS\right\}}\right)+{w}_{2k}{\Phi }^{-1}\left(1-{p}_{2k}^{\left\{S,PFS\right\}}\right)\).
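To make the closed testing step concrete, the sketch below applies it to the S only scenario for PFS: the elementary hypothesis in S is rejected only if both the intersection statistic (stage 1 Hochberg p-value combined with the stage 2 subgroup p-value) and the elementary statistic reach the boundary. For simplicity a single boundary `c_k` is used for both comparisons, which is an assumption of this sketch; all names and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_INV = NormalDist().inv_cdf  # standard normal quantile function


def reject_s_pfs_s_only(p1_fs: float, p1_s: float, p2_s: float,
                        w1: float, w2: float, c_k: float) -> bool:
    """Closed test of the PFS hypothesis in S when only S continues to stage 2.

    p1_fs : stage 1 Hochberg p-value for the F-and-S intersection hypothesis
    p1_s  : stage 1 p-value in the subgroup
    p2_s  : stage 2 p-value in the subgroup (the only stage 2 data available)
    c_k   : z boundary at analysis k (assumed common to both tests here)
    """
    z_fs = w1 * _INV(1.0 - p1_fs) + w2 * _INV(1.0 - p2_s)  # intersection
    z_s = w1 * _INV(1.0 - p1_s) + w2 * _INV(1.0 - p2_s)    # elementary
    return z_fs >= c_k and z_s >= c_k


# Hypothetical inputs with equal prespecified weights.
w = sqrt(0.5)
print(reject_s_pfs_s_only(p1_fs=0.010, p1_s=0.005, p2_s=0.001,
                          w1=w, w2=w, c_k=2.5))
```

The F only and F and S scenarios follow the same pattern, with the stage 2 p-value in the intersection statistic replaced as listed above.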
Results
Simulations
To illustrate the performance of the proposed design in terms of Type I error and power, we conduct simulations and compare its performance with two other well-established approaches:

Group sequential design (GSD): group sequential design for the 4 hypotheses of interest using the graphical approach of Maurer and Bretz [6] without any population or hypothesis adaptation.

Adaptive design (AD): subpopulation selection is performed in the futility analysis. The overall significance level is set to be \(\mathrm{\alpha }\) to test all 4 hypotheses rather than setting the overall significance level to be α to test only 2 hypotheses in each population (S and F) in gGSD. The same alpha reallocation strategy [6] is used to control the FWER.
The gGSD is a seamless Phase II/III trial integrating AD and GSD into one study design. Briefly, AD is implemented in the subgroup selection stage (futility analysis), followed by GSD in the second stage (i.e., two interim analyses and one final analysis). Three simulation settings are considered. Table 2 gives the detailed information for these three settings. In each setting, two interim analyses and one final analysis are planned in stage 2. Specifically, PFS testing is planned at IA1 and IA2 (which is also the final analysis for PFS), while OS testing is planned at IA1, IA2 and FA. Some parameters are set to be the same for all three settings: 1) for the control arm, the median PFS (OS) is assumed to be 4 (10.5) months in the subgroup and 3 (5.7) months in the complement of the subgroup; 2) the yearly dropout rates for PFS and OS are 10% and 1%, respectively. In settings 1 and 2, the hazard ratios (experimental/control) for PFS and OS are 0.7 for both the subgroup and the full population. In setting 3, the hazard ratios of PFS and OS are 0.7 for the subgroup, but 1 for the full population. For the full population, at the design stage, the information fraction for PFS is approximately 90% at IA1, with IA2 being the final PFS analysis; the information fractions for OS are approximately 69% at IA1 and 92% at IA2. For the subgroup, at the design stage, the information fraction for PFS is approximately 89% at IA1, with IA2 being the final PFS analysis; the information fractions for OS are approximately 66% at IA1 and 91% at IA2. Other parameters used in the simulations are provided in Table 2, where the sample size is calculated based on the group sequential design with a power of at least 85% for all four hypotheses. The alpha boundaries are computed using the Lan-DeMets spending function approximating O'Brien-Fleming bounds with a total one-sided \(\mathrm{\alpha }\)=0.025.
For each setting, the performance of GSD is provided as a reference for comparison. For AD and gGSD, the futility analyses for PFS are performed at the end of stage 1. These determine whether the trial continues to stage 2 with one or two populations, or stops. Let the futility threshold (\(\gamma\)), the probability of the trial not passing the futility gate under the alternative hypothesis, be 5%. This results in \({\theta }^{F}=0.85\) and \({\theta }^{S}=0.9\) for setting 1, and \({\theta }^{F}=0.83\) and \({\theta }^{S}=0.85\) for settings 2 and 3.
The time-to-event data were generated using the R package “simtrial” [31] with the settings specified in Table 2. The “simtrial” package generates independent time-to-event datasets according to a user-specified trial design. Information on the enrollment, dropout, and infection processes is prespecified in each treatment arm. A total of 10,000 replications were performed for each setting. For AD and gGSD, eight different sets of weights were evaluated for the inverse-normal combination tests. Ideally, weights \({w}_{1k}\) and \({w}_{2k}\) would be chosen to be proportional to the square root of the number of events in each stage for the kth analysis. As an example, set \(\left(w_{1k}^{PFS},w_{2k}^{PFS}\right)\) = \(\left(\sqrt{\frac{n_{1k,PFS}}{n_{1k,PFS}+n_{2k,PFS}}},\sqrt{\frac{n_{2k,PFS}}{n_{1k,PFS}+n_{2k,PFS}}}\right)\) for the PFS hypothesis, where \({n}_{ik, PFS}\) is the number of PFS events from stage i subjects (i = 1,2), and \(\left(w_{1k}^{OS},w_{2k}^{OS}\right)\) = \(\left(\sqrt{\frac{n_{1k,OS}}{n_{1k,OS}+n_{2k,OS}}},\sqrt{\frac{n_{2k,OS}}{n_{1k,OS}+n_{2k,OS}}}\right)\) for the OS hypothesis, where \({n}_{ik, OS}\) is the number of OS events from stage i subjects (i = 1,2). However, \({w}_{1k}\) and \({w}_{2k}\) need to be prespecified in order to control the Type I error rate. Since it is impossible to know in advance the decision at the futility analysis and the number of events from stages 1 and 2 for each efficacy analysis, we use prespecified weights to compute the p-values.
The proposed gGSD controls the FWER, and the simulations showed that it is conservative: the Type I error is less than the specified 0.025 level, as shown in Table 3. Table 4 shows the power of rejecting hypotheses in the subgroup (S), or in both the subgroup and the full population (S&F). The performance of the proposed gGSD depends on the choice of the weights \({w}_{1k}\) and \({w}_{2k}\). The first set of weights is computed using the number of PFS/OS events in the simulation and is used as a reference. When \({w}_{1k}<{w}_{2k}\), AD and gGSD have lower power to detect treatment efficacy compared with GSD. When \({w}_{1k}\ge {w}_{2k}\), gGSD has higher power than GSD and AD. Table 4 indicates that event-driven weights, or more weight on stage 1 data, lead to better gGSD performance. The performance of gGSD is robust to the choice of weights as long as more weight is assigned to stage 1 data. Thus, assigning more weight to data from stage 1 is recommended in order to utilize the information more efficiently. The simulation results for setting 3 (only the subgroup has a treatment benefit) demonstrate that the proposed gGSD reduces patients' exposure to a less effective treatment compared with GSD when the complementary subgroup has a smaller treatment effect, since gGSD does not enroll patients in the complementary subgroup in stage 2.
Another advantage of the proposed gGSD is that it can terminate early with high power. Figure 3 shows the stopping time of the three designs for the three weight sets with the highest power in Table 4. For GSD, the trial stops early if and only if all four hypotheses are rejected before the final analysis. For example, if three hypotheses are rejected at IA1 and the last hypothesis is rejected at IA2, then the trial terminates at IA2. For AD and gGSD, the trial stops early if neither the subgroup nor the full population passes the futility boundary, or if all the hypotheses tested are rejected before the final analysis. Detailed requirements for early stopping of the trial are listed in Table 5. As shown in Fig. 3, gGSD is more efficient (i.e., has a higher probability of rejecting all the hypotheses tested and stopping before the final analysis) with higher or comparable power compared to GSD and AD (Fig. 3 panels J–L). Therefore, gGSD requires less time and fewer resources to demonstrate the efficacy of a new treatment than GSD and AD, without sacrificing power for an important underlying benefit.
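The event-proportional weights given above are straightforward to compute once the stage-wise event counts are fixed at the design stage. The following Python sketch uses hypothetical event counts and a hypothetical function name; it only illustrates the formula and checks the constraint \({w}_{1k}^{2}+{w}_{2k}^{2}=1\).

```python
from math import sqrt
from typing import Tuple


def event_weights(n1: float, n2: float) -> Tuple[float, float]:
    """Return (w1, w2) = (sqrt(n1/(n1+n2)), sqrt(n2/(n1+n2))).

    n1, n2 are the planned numbers of events contributed by stage 1 and
    stage 2 subjects at a given analysis. By construction w1^2 + w2^2 = 1.
    """
    total = n1 + n2
    if n1 < 0 or n2 < 0 or total <= 0:
        raise ValueError("event counts must be non-negative with a positive sum")
    return sqrt(n1 / total), sqrt(n2 / total)


# Hypothetical planned event counts at one analysis (not from Table 2).
w1, w2 = event_weights(n1=150, n2=50)
print(round(w1, 4), round(w2, 4), round(w1 ** 2 + w2 ** 2, 10))
```

Since the weights must be fixed before the data are seen, the counts passed in would be planned values rather than the observed ones.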
An illustrative example
We use an example with specified p-values to illustrate the potential advantage of the proposed gGSD compared to GSD. Consider a group sequential design for a Phase III second-line small cell lung cancer trial with a 50% prevalence of the platinum-sensitive subgroup, where PFS and OS are the dual primary endpoints. This example contains a total of 924 patients, with the other parameters the same as in setting 2 listed in Table 2. The graphical approach [6] was used to control the FWER of the four hypotheses at a total level of 0.025. PFS and OS hypotheses are tested in the two interim analyses, and only OS hypotheses are tested in the final analysis.
This example illustrates that gGSD has more power to reject the null hypotheses compared to GSD. Table 6 contains the nominal p-values and the data-generated p-values at each interim analysis and the final analysis for GSD and gGSD. As shown in Table 6, none of the four hypotheses is rejected by using GSD. Using the gGSD and the gated rules in Table 1 with \({\theta }^{F}=0.83\) and \({\theta }^{S}=0.85\), stage 2 is continued for the full group only (i.e., scenario 2 in stage 2). Once \({H}_{0}^{\left\{F, PFS\right\}}\) is rejected, \({H}_{0}^{\left\{F, OS\right\}}\) will be tested at level \(\mathrm{\alpha }=0.025\). A fixed weight \({w}_{1k}\)=\({w}_{2k}\)=\(\sqrt{0.5}\) is used for all the p-value combination tests in gGSD. With a p-value of 0.0022 at IA1, the PFS hypothesis is rejected. A p-value of 0.0125 fails to reject the OS hypothesis at IA1, so the trial continues to IA2 for OS testing in the full group. With a p-value of 0.0019 at IA2, the OS hypothesis is rejected. Thus none of the four hypotheses is rejected in GSD, while gGSD rejects both full group hypotheses.
Summary
Seamless Phase II/III designs are getting more attention and being increasingly adopted as a cost effective and time saving drug development strategy. In this paper, we proposed a gated group sequential design for seamless Phase II/III trial with potential subgroup selection. Combining this with GSD, our proposed gGSD design enables population selection and multiple interim analyses to enable early stopping. In this paper, we extended Jenkins, et al. [2] method for population selection to control FWER for all four of the aforementioned hypotheses with dual primary endpoints. The hierarchical testing strategy proposed by Glimm et al. [9] was modified to accommodate our multiple testing scenarios with FWER controlled between populations. Within each population, the graphical approach combined with standard group sequential design was used for flexibility. The familywise error rate of proposed gGSD is strictly controlled. A prespecified subgroup and the full population are tested hierarchically to control the FWER. Simulation results and the illustrative example suggest that the gated group sequential design can reduce sample size compared to the other trial designs; e.g., the proposed gated group sequential design could achieve the same power with a smaller sample size compared to the commonly used GSD. Furthermore, the trial can terminate early with sufficient strong evidence from efficacy analyses and potentially moves efficacious products into market faster for unmet medical needs. A special note on the particular advantage of the gGSD over GSD in the simulation study occurs when the true benefit is in the subgroup, but not in the full group. The gGSD is designed to focus on the stage 2 selected population, increases power over a Phase III study of both populations and reduces the patient’s exposure to less effective treatment comparing to GSD if the complementary subgroup has less significant treatment effect.
The idea proposed in this paper can also be applied to conduct efficient trials and simultaneously investigate several vital questions for drug development, such as identifying the most beneficial subgroup for a new treatment or dose (treatment) selection problem. Moreover, the proposed gGSD is applicable to more than one subgroup where the subgroups are nested. In this paper, the subgroup was prespecified. However, this subgroup information may not be always accurately identified before the trial. Freidlin and Simon [32] proposed an adaptive signature design to find sensitive patients, without prespecified, into a formal Phase III trial. The subgroup size does not have any impact on the method proposed in this paper. However, practically speaking, generally the subgroup size should be at least 50% of the full population to be financially feasible and maybe ethical reasonable for using this type of seamless design. The proposed seamless design shares the same potential operational challenges discussed in the literature that the trial team may choose to hold the enrollment while the team decides the population selection at the end of stage 1. Different approaches could be used in setting up the criteria for moving into stage 2. One such example could be the predictive probability as used in the Belle 4 study [33]. In an adaptive timetoevent design, the number of events collected in stage 2 could be influenced by a subpopulation selection. These issues arise from the fact that patients who are recruited before an interim analysis and hence enter the interim analysis as censored observations at the time of the interim can have an event later and then enter the analysis again. The strategy discussed in Jenkins et al. [2] could be used to address the independent increments assumption.
In this paper, PFS, one of the dual primary endpoints, was used for the adaptation. Another surrogate “proof-of-concept” endpoint, such as objective response, could be used if more appropriate. The gGSD is a two-stage, two-arm trial design in which the second-stage data are analyzed within a classical group sequential framework. In this regard, the more commonly discussed multi-arm multi-stage (MAMS) design can be combined with the gGSD; this extension is under investigation. When the hazards are severely non-proportional, for example with a delayed treatment effect, the proposed gGSD in its current form may be less efficient because the futility analysis may perform poorly.
Availability of data and materials
All data used were simulated. The simulation programs can be accessed from the GitHub repository (https://github.com/populus0112/gatedgroupsequentialdesign_gGSD).
References
Pretorius S, Grignolo A. Phase III Trial Failures: Costly, But Preventable. Appl Clin Trials. 2016;25(8).
Jenkins M, Stone A, Jennison C. An adaptive seamless phase II/III design for oncology trials with subpopulation selection using correlated survival endpoints. Pharm Stat. 2011;10:347–56.
Holm S. A simple sequentially rejective multiple test procedure. Scand J Stat. 1979;6:65–70.
Chen X, Luo X, Caprizzi T. The application of enhanced gatekeeping strategies. Stat Med. 2005;24:1385–97.
Edwards D, Madsen J. Constructing multiple test procedures for partially ordered hypothesis sets. Stat Med. 2007;26:5116–24.
Maurer W, Bretz F. Multiple testing in group sequential trials using graphical approaches. Stat Biopharm Res. 2013;5:311–20.
Hung H, Wang S, O’Neill R. Statistical considerations for testing multiple endpoints in group sequential or adaptive clinical trials. J Biopharm Stat. 2007;17(6):1201–10.
Liu Q, Anderson K. On adaptive extensions of group sequential trials for clinical investigations. J Am Stat Assoc. 2008;103:1621–30.
Glimm E, Maurer W, Bretz F. Hierarchical testing of multiple endpoints in group-sequential trials. Stat Med. 2010;29:219–28.
Tamhane A, Mehta C, Liu L. Testing a primary and a secondary endpoint in a group sequential design. Biometrics. 2010;66:1174–84.
Tamhane A, Wu Y, Mehta C. Adaptive extensions of a two-stage group sequential procedure for testing primary and secondary endpoints (I): unknown correlation between the endpoints. Stat Med. 2012;31(19):2027–40.
Tamhane A, Wu Y, Mehta C. Adaptive extensions of a two-stage group sequential procedure for testing primary and secondary endpoints (II): sample size re-estimation. Stat Med. 2012;31(19):2041–54.
Asakura K, Hamasaki T, Sugimoto T, Hayashi K, Evans S, Sozu T. Sample size determination in group-sequential clinical trials with two co-primary endpoints. Stat Med. 2014;33(17):2897–913.
Hamasaki T, Asakura K, Evans S, Sugimoto T, Sozu T. Group-sequential strategies in clinical trials with multiple co-primary outcomes. Stat Biopharm Res. 2015;7(1):36–54.
Schüler S, Kieser M, Rauch G. Choice of futility boundaries for group sequential designs with two endpoints. BMC Med Res Methodol. 2017;17:119.
Xu T, Qin Q, Wang X. Defining information fractions in group sequential clinical trials with multiple endpoints. Contemp Clin Trials Commun. 2018;10:77–9.
Marcus R, Peritz E, Gabriel K. On closed testing procedures with special reference to ordered analysis of variance. Biometrika. 1976;63:655–60.
Simes R. An improved Bonferroni procedure for multiple tests of significance. Biometrika. 1986;73:751–4.
Spiessens B, Debois M. Adjusted significance levels for subgroup analysis in clinical trials. Contemp Clin Trials. 2010;31:647–56.
Bretz F, Schmidli H, König F, Racine A, Maurer W. Confirmatory seamless phase II/III clinical trials with hypothesis selection at interim: general concepts. Biom J. 2006;48:623–34.
Brannath W, Zuber E, Branson M, Bretz F, Gallo P, Posch M, Racine-Poon A. Confirmatory adaptive designs with Bayesian decision tools for a targeted therapy in oncology. Stat Med. 2009;28:1445–63.
Sugitani T, Bretz F, Maurer W. A simple and flexible graphical approach for adaptive group-sequential clinical trials. J Biopharm Stat. 2016;26(2):202–16. https://doi.org/10.1080/10543406.2014.972509.
Scala L, Glimm E. Time-to-event analysis with treatment arm selection at interim. Stat Med. 2011;30:3067–81.
Stallard N. A confirmatory seamless phase II/III clinical trial design incorporating short-term endpoint information. Stat Med. 2010;29:959–71.
Friede T, Parsons N, Stallard N, Todd S, Valdes-Marquez E, Chataway J, Nicholas R. Designing a seamless phase II/III clinical trial using early outcomes for treatment selection: an application in multiple sclerosis. Stat Med. 2011;30:1528–40.
Friede T, Parsons N, Stallard N. A conditional error function approach for subgroup selection in adaptive clinical trials. Stat Med. 2012;31(30):4309–20.
FDA guidance for industry: Adaptive Designs for Clinical Trials of Drugs and Biologics. Dec 2019
Magirr D, Jaki T, Koenig F, Posch M. Sample size reassessment and hypothesis testing in adaptive survival trials. PLoS ONE. 2016;11(2):e0146465. https://doi.org/10.1371/journal.pone.0146465.
Bauer P, Posch M. Modification of the sample size and the schedule of interim analyses in survival trials based on data inspections (Letter to the Editor). Stat Med. 2004;23:1333–5.
Hochberg Y. A sharper Bonferroni procedure for multiple tests of significance. Biometrika. 1988;75(4):800–2.
simTrial: Simulation of Multi-Arm Randomized Phase IIb/III Efficacy Trials with Time-to-Event Endpoints. https://www.rdocumentation.org/packages/seqDesign/versions/1.2/topics/simTrial
Freidlin B, Simon R. Adaptive signature design: an adaptive clinical trial design for generating and prospectively testing a gene expression signature for sensitive patients. Clin Cancer Res. 2005;11(21):7872–8. https://doi.org/10.1158/1078-0432.CCR-05-0605.
Martin M, et al. A randomized adaptive phase II/III study of buparlisib, a pan-class I PI3K inhibitor, combined with paclitaxel for the treatment of HER2– advanced breast cancer (BELLE-4). Ann Oncol. 2017;28:313–20.
Acknowledgements
We thank the editor and two reviewers for the valuable comments that improved the presentation of this paper.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Author information
Contributions
GM, JL, JY, and KA contributed to the materials in this paper. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable. There were no human participants, human material, or human data involved in this research.
Consent for publication
Not applicable.
Competing interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Miao, G., Liao, J.J.Z., Yang, J. et al. A gated group sequential design for seamless Phase II/III trial with subpopulation selection. BMC Med Res Methodol 23, 2 (2023). https://doi.org/10.1186/s12874-022-01825-0