What inference for two-stage phase II trials?
- Raphaël Porcher^{1, 2, 3} and
- Kristell Desseaux^{2, 3}
DOI: 10.1186/1471-2288-12-117
© Porcher and Desseaux; licensee BioMed Central Ltd. 2012
Received: 12 October 2011
Accepted: 25 June 2012
Published: 6 August 2012
Abstract
Background
Simon’s two-stage designs are widely used for cancer phase II trials. These methods rely on statistical testing and thus allow controlling the type I and II error rates, while accounting for the interim analysis. Estimation after such trials is however not straightforward, and several different approaches have been proposed.
Methods
Different approaches for point and confidence interval estimation, as well as computation of p-values, are reviewed and compared for a range of plausible trials. Cases where the actual number of patients recruited in the trial differs from the preplanned sample size are also considered.
Results
For point estimation, the uniformly minimum variance unbiased estimator (UMVUE) and the bias-corrected estimator performed better than the others when the actual sample size was as planned. For confidence intervals, using a mid-p approach yielded coverage probabilities closer to the nominal level than so-called ’exact’ confidence intervals. When the actual sample size differed from the preplanned sample size, the UMVUE did not perform worse than an estimator specifically developed for such a situation. Analysis conditional on having proceeded to the second stage required adapted analysis methods, and a uniformly minimum variance conditionally unbiased estimator (UMVCUE) can be used, which also performs well when the second stage sample size is slightly different from planned.
Conclusions
The use of the UMVUE may be recommended, as it exhibited good properties both when the actual number of patients recruited was equal to the preplanned value and when it differed from it. Restricting the analysis to cases where the trial did not stop early for futility may be valuable, and the UMVCUE may be recommended in that case.
Background
Phase II trials primarily aim at evaluating the activity of a new therapeutic regimen to decide if it warrants further evaluation in a larger-scale phase III trial, where it is usually compared to a standard treatment. The screening purpose of phase II trials implies that they are designed to reject a new therapeutic regimen showing low therapeutic activity. In cancer phase II trials, therapeutic activity is typically defined in terms of tumor shrinkage [1, 2], and a patient with tumor shrinkage is referred to as a responder. The endpoint of such phase II trials is thus a binary endpoint (responder / nonresponder), and a new anticancer agent with too low a response rate should be excluded from further consideration.
Cancer phase II trials are often designed as multistage trials (two stages being most common) allowing early trial termination in case of a low response rate, in order to avoid giving patients an ineffective treatment and wasting resources. The original idea of such a strategy with early termination was suggested by Gehan [3], and many designs were then proposed ([4–6], among others). Among all available multistage designs, Simon’s design [6] is probably the most commonly used in practice. Conversely, early termination for high efficacy is not as important in the phase II setting. Indeed, there is less ethical need to stop the trial early for an effective agent, and accumulating data on both therapeutic activity and safety is important before setting up a large-scale randomized phase III trial.
As phase II trials primarily lead to the decision to proceed to a next step in the evaluation of the therapeutic regimen or not, their design essentially relies on statistical testing. Cancer phase II trials are therefore designed to control the probabilities of continuing with an ineffective regimen or of abandoning an effective one (type I and II error rates, respectively). Further analysis, and in particular estimation, is nevertheless useful and usually conducted, especially if the new regimen is selected for further consideration [7, 8]. A point estimate of the response rate, a confidence interval and sometimes a p-value are then computed at the termination of the trial. In particular, the point and confidence interval estimates are useful to design the future phase III trial, as well as other phase II trials. Owing to the possibility of early termination, the sample response rate, i.e. the maximum likelihood estimator (MLE), is typically biased, which is known as the optional sampling effect. Many approaches have thus been proposed to reduce the bias or the mean squared error (MSE) of estimators in such a setting [7, 9–14].
One important point concerning inference in two-stage phase II trials has been somewhat overlooked in the literature. As estimation is most important when the therapeutic regimen has been considered as effective, inference may be more common when the phase II trial proceeded to the second stage as compared to cases where it was stopped for futility at the first stage. Inference may thus be conditional on proceeding to the second stage (as e.g. in [12, 13]), or unconditional, over all possible paths, as implicitly considered in most other works.
Another issue is the actual total sample size of the trial. Cancer phase II trials are generally of limited sample size, and methods are derived from the ’exact’ binomial distribution of data. However, the actual number of patients recruited in the trial may be different from the planned sample size [11, 15]. Inference in a Simon’s design where the sample size has been modified is however not straightforward, even in terms of hypothesis testing. A method has thus been proposed in the case where drop-outs are non-informative so that the interim analysis can always be performed after inclusion of the planned number of patients and the actual second stage sample size does not depend on results observed during the first stage [11]. Although designs where the second stage sample size can be adapted according to the first stage result exist [16, 17], this was not considered here.
In this paper, we compare the performance of the different approaches proposed in the literature for inference in a two-stage Simon’s phase II trial. In the next section, we present the different point estimators, confidence intervals and p-values proposed in the case where the actual sample size is as planned and in the case where the actual stage 2 sample size of the trial is different from the planned one. Then, results of a numerical study comparing the properties of the different methods in various settings are presented. We conclude with some discussion.
Methods
Simon’s design and notations
Let Π denote the true response rate when given some anticancer agent. The usual methodology of cancer phase II trials consists in testing the null hypothesis Π ≤ Π _{0} versus Π ≥ Π _{1} = Π _{0} + δ, where Π _{0} is the highest probability of response which would indicate that the agent is of no further interest, and Π _{1} the smallest probability of response indicating that the agent may be promising. Simon [6] considered two-stage designs where no stopping for efficacy is possible after the first stage. Briefly, n _{1} subjects are accrued during the first stage. If the number of responses observed in the first stage X _{1} is lower than or equal to a critical value r _{1}, the trial is stopped for futility. If X _{1} > r _{1}, the trial proceeds to a second stage where n _{2} additional patients are accrued. Let us denote X _{2} the number of responses observed in the n _{2} second stage patients, X _{ t } = X _{1} + X _{2} and r _{ t } the final critical value. Then if X _{ t } ≤ r _{ t } futility is concluded at the end of the trial, whereas efficacy is concluded if X _{ t } > r _{ t }. Given (Π _{0}, Π _{1}), many such two-stage designs may satisfy the prespecified type I and II error rates (α, β). Simon proposed two criteria to choose an appropriate design among such acceptable designs. The first one minimizes the expected sample size under the null hypothesis and is referred to as the ’optimal’ design. The second one minimizes the maximal sample size n _{ t } = n _{1} + n _{2} and is referred to as the ’minimax’ design. Jung et al. [18] further proposed a graphical method to search for alternative compromises between Simon’s optimal and minimax designs. For simplicity, we will however concentrate here on the two original Simon’s designs, although all following results may apply to any two-stage design where no early stopping for efficacy is possible.
for s = 0,…,r _{1} if m = 1 and s = r _{1} + 1,…,n _{ t } if m = 2, and where $a\wedge b=\min(a,b)$ and $a\vee b=\max(a,b)$.
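As an illustration of the sampling distribution underlying this design, the following Python sketch enumerates the probability of every outcome (m, s). It assumes that M denotes the stage at which the trial stops and S the total number of responses observed, and that the two stages are independent binomial samples; the function names are illustrative and not part of the original work. It then evaluates the operating characteristics of the design n _{1} = 24, n _{2} = 39, r _{1} = 8, r _{ t } = 24, which the text quotes as the optimal design for Π _{0} = 0.30 and Π _{1} = 0.50.

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability of k responses among n patients."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def outcome_pmf(n1, n2, r1, p):
    """Probability of each outcome (m, s) at response rate p.

    m is the stopping stage and s the total number of responses
    (assumed meaning of M and S in the text).
    """
    pmf = {}
    # Stopping at stage 1: S = X1 with X1 <= r1.
    for s in range(0, r1 + 1):
        pmf[(1, s)] = binom_pmf(s, n1, p)
    # Reaching stage 2: sum over first-stage counts x1 > r1 compatible with s.
    for s in range(r1 + 1, n1 + n2 + 1):
        first = max(r1 + 1, s - n2)  # (r1 + 1) v (s - n2)
        last = min(s, n1)            # s ^ n1
        pmf[(2, s)] = sum(
            binom_pmf(x1, n1, p) * binom_pmf(s - x1, n2, p)
            for x1 in range(first, last + 1)
        )
    return pmf


def rejection_probability(n1, n2, r1, rt, p):
    """Probability of concluding efficacy: stage 2 reached and S > rt."""
    pmf = outcome_pmf(n1, n2, r1, p)
    return sum(pr for (m, s), pr in pmf.items() if m == 2 and s > rt)


def early_stop_probability(n1, n2, r1, p):
    """Probability of stopping for futility at the first stage."""
    pmf = outcome_pmf(n1, n2, r1, p)
    return sum(pr for (m, _), pr in pmf.items() if m == 1)


design = dict(n1=24, n2=39, r1=8)
type_one_error = rejection_probability(rt=24, p=0.30, **design)
power = rejection_probability(rt=24, p=0.50, **design)
pet_null = early_stop_probability(p=0.30, **design)
print(f"type I error = {type_one_error:.4f}, power = {power:.4f}, "
      f"P(early stop | H0) = {pet_null:.4f}")
```

Because the outcome space is small, every quantity of interest (error rates, bias, coverage) can be obtained by summing over this dictionary rather than by simulation.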
Inference following a two-stage design
Point estimate
Although the primary goal of phase II trials is decision making rather than inference, obtaining an estimate of the true response rate is often of interest, particularly when the trial was deemed successful and the new drug accepted for further evaluation in phase III trials [7].
A median unbiased estimator may be considered as the value of Π such that the corresponding p-value would be 0.5 (see next section). It was used by Koyama and Chen [11] when n _{2} is different from its prespecified value, and will thus be denoted by ${\widehat{\Pi}}_{k}$, although they used ${\widehat{\Pi}}_{w}$ in their article when n _{2} was as planned.
Another approach was used by Tsai et al.[12], who restricted their analysis to cases where the trial proceeded to the second stage. In these cases, they derived a (conditional) maximum likelihood estimator of Π accounting for the truncated distribution of X _{1} (which must be at least r _{1} + 1). This conditional estimator will be denoted by ${\widehat{\Pi}}_{c}$. To compare all estimators on a fair basis, we assumed that when the trial stopped at the first stage, an unconditional MLE was used. A conditional distribution given X _{1}≤r _{1} may also be derived, but it makes little sense in cases where r _{1} is small, in particular when r _{1} is 0 or 1, which is the case for optimal and minimax designs for Π _{0} = 0.05 and Π _{1} = 0.2 or Π _{1} = 0.25 with α=0.05 and β=0.1, for instance. We thus preferred not to consider conditional inference for early trial termination.
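To make the idea of a conditional maximum likelihood estimator concrete, the following Python sketch maximizes the likelihood of (X _{1}, X _{2}) given X _{1} > r _{1}. It assumes that the conditional likelihood is the ordinary binomial likelihood divided by Pr(X _{1} > r _{1}); as this is an exponential family in S, the maximizer solves E[S | X _{1} > r _{1}] = s. Solving that equation by bisection, and the function names, are illustrative choices and are not taken from [12].

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability of k responses among n patients."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def cond_mean_total(n1, n2, r1, p):
    """E[X1 + X2 | X1 > r1] at response rate p."""
    tail = [(x, binom_pmf(x, n1, p)) for x in range(r1 + 1, n1 + 1)]
    prob_continue = sum(w for _, w in tail)
    mean_x1 = sum(x * w for x, w in tail) / prob_continue
    return mean_x1 + n2 * p


def conditional_mle(s, n1, n2, r1, tol=1e-10):
    """Conditional MLE of the response rate given that stage 2 was reached."""
    if s <= r1 + 1:
        return 0.0  # smallest total compatible with stage 2: boundary solution
    if s >= n1 + n2:
        return 1.0
    lo, hi = 1e-9, 1 - 1e-9
    # The conditional mean increases with p, so bisection applies.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cond_mean_total(n1, n2, r1, mid) < s:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical observation: 25 responses out of 63 after reaching stage 2.
estimate = conditional_mle(s=25, n1=24, n2=39, r1=8)
print(f"naive = {25 / 63:.4f}, conditional MLE = {estimate:.4f}")
```

Under these assumptions the conditional estimate always lies below the naive proportion s / n _{ t }, since conditioning on X _{1} > r _{1} inflates the expected number of first-stage responses.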
Relating to the work of Tsai et al.[12], Li recently proposed an MSE-reduced estimator of Π as a weighted mean of the naive estimator and ${\widehat{\Pi}}_{c}$ [14]. This estimator showed slightly higher bias than ${\widehat{\Pi}}_{c}$, with a slightly lower MSE, but no clear advantage. It was thus not further considered here.
Numerical studies in various settings showed that the bias-corrected estimators ${\widehat{\Pi}}_{w}$ and ${\widehat{\Pi}}_{g}$ often had similar performance in terms of bias and mean squared error (MSE), with much smaller bias and slightly higher MSE than the MLE. As compared to the UMVUE, the MLE and the bias-corrected estimators have been shown to have smaller MSE in many situations, but not always [7, 10]. Other estimators were not extensively compared to each other or to the previous ones, in particular in the setting of conditional inference or when the actual sample size differs from the preplanned one. Which estimator is preferable in which situation thus remains unclear.
P-value
The assumption on the distribution of S is true if m = 1, but obviously wrong if m = 2. This is exemplified in equation (7) by the summation over impossible sample paths where X _{1}<r _{1} and X _{2} = s−X _{1}.
It is therefore necessary to use the proper distribution of observed data to compute a p-value. The p-value is the probability under the null hypothesis of obtaining a result at least as extreme as the one observed. Owing to the multistage procedure, several orderings, i.e. several definitions of “at least as extreme”, may however be considered even if the proper distribution is used [20]. For instance, assume a design with n _{1}=24, n _{2}=39, r _{1}=8 and r _{ t }=24 (optimal design for Π _{0}=0.30, Π _{1}=0.50, α=0.05 and β=0.10). One may consider that obtaining 18 responders out of 63 patients after proceeding to the second stage is less extreme than obtaining 7 responders out of 24 patients and stopping at the first stage, because 18/63=0.286 is less than 7/24=0.292. This corresponds to MLE ordering [20, 21]. Conversely, one may also use stage-wise ordering, and consider that 18/63 is a more extreme result than 7/24 because it was observed after proceeding to the second stage instead of stopping at the first stage. Indeed, to proceed to the second stage the number of responders in the first stage was at least 9. This is the ordering recommended in Jennison and Turnbull in the general case of sequential clinical trials [20, chapter 18.4, p 180], and the one they use to compute exact confidence bounds for Π [22].
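The contrast between the two orderings can be checked numerically on the design of this example. The Python sketch below is an illustration under stated assumptions: an outcome (m, s) denotes the stopping stage and the total number of responses, MLE ordering ranks outcomes by the observed proportion of responders, and stage-wise ordering ranks any second-stage outcome above any first-stage outcome. The function names are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Binomial probability of k responses among n patients."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def outcome_pmf(n1, n2, r1, p):
    """Probability of each outcome (m, s): stopping stage and total responses."""
    pmf = {(1, s): binom_pmf(s, n1, p) for s in range(r1 + 1)}
    for s in range(r1 + 1, n1 + n2 + 1):
        first, last = max(r1 + 1, s - n2), min(s, n1)
        pmf[(2, s)] = sum(
            binom_pmf(x1, n1, p) * binom_pmf(s - x1, n2, p)
            for x1 in range(first, last + 1)
        )
    return pmf


def pvalue_mle_ordering(m, s, n1, n2, r1, p0):
    """Pr(observed proportion >= the one observed) under p0."""
    n_obs = n1 if m == 1 else n1 + n2
    total = 0.0
    for (m_alt, s_alt), pr in outcome_pmf(n1, n2, r1, p0).items():
        n_alt = n1 if m_alt == 1 else n1 + n2
        # Compare s_alt / n_alt >= s / n_obs using integers to avoid rounding.
        if s_alt * n_obs >= s * n_alt:
            total += pr
    return total


def pvalue_stagewise_ordering(m, s, n1, n2, r1, p0):
    """Stage 2 outcomes are more extreme than any stage 1 outcome."""
    pmf = outcome_pmf(n1, n2, r1, p0)
    if m == 1:
        # Equals Pr(X1 >= s): higher stage 1 counts plus every stage 2 outcome.
        return sum(pr for (m_alt, s_alt), pr in pmf.items()
                   if m_alt == 2 or s_alt >= s)
    return sum(pr for (m_alt, s_alt), pr in pmf.items()
               if m_alt == 2 and s_alt >= s)


design = dict(n1=24, n2=39, r1=8, p0=0.30)
for outcome in [(1, 7), (2, 18), (2, 25)]:
    p_mle = pvalue_mle_ordering(*outcome, **design)
    p_sw = pvalue_stagewise_ordering(*outcome, **design)
    print(outcome, f"MLE ordering: {p_mle:.4f}", f"stage-wise: {p_sw:.4f}")
```

With MLE ordering, 18/63 receives a larger p-value than 7/24, whereas stage-wise ordering reverses the ranking. For an outcome beyond the final critical value, such as 25/63, the stage-wise p-value is bounded by the type I error rate of the design.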
The bias-corrected estimators have the same ordering as the MLE [23]. They thus result in exactly the same p-value.
which is equivalent to the p-value given by Koyama-Chen for designs where attained n _{2} is as planned [11].
If the trial is stopped at the first stage, p _{ c } can simply be computed by $\underset{{\Pi}_{0}}{Pr}({X}_{1}\ge s)$ and is thus equal to p _{ s }.
Confidence interval
Besides point estimates, confidence intervals are often reported in phase II trials. Despite the one-sided nature of Simon’s design, it is not uncommon to report two-sided (1−2α) confidence intervals rather than left (1−α) one-sided confidence intervals. We will thus make this choice although both approaches are consistent with the one-sided test performed at level α. Note however that in many applications, two-sided 95% confidence intervals are reported, whatever the choice of the (one-sided) α level.
The existence of this interval relies on the stochastic ordering of the distribution of (M,S) with respect to Π[10]. It is the same as the confidence interval used in several other works [11, 22]. As it uses the UMVUE or stage-wise ordering, we refer to it as the exact stage-wise confidence interval. Using MLE ordering instead of stage-wise ordering does not result in the same property of stochastic ordering [10]. It was therefore not further considered.
Tsai et al. [12] considered several other intervals, both asymptotic and exact, but focusing on cases where the trial proceeds to the second stage, and using conditional inference as stated earlier. Asymptotic confidence intervals considered were the Wald and score intervals, with or without continuity correction, and based on the conditional MLE given the trial proceeds to a second stage (referred to as MLE in their article). Exact confidence intervals were Clopper–Pearson as explained above, but based on the conditional distribution of (M,S) given m = 2 (equation 10), and the Sterne exact interval, modified to obtain an interval when the original method produces disjoint intervals as a confidence region. They concluded by recommending score confidence intervals with continuity correction. Only the latter and Clopper–Pearson intervals will thus be considered here, and referred to as the conditional score and conditional exact confidence intervals. Moreover, we proposed a mid-p confidence interval using the conditional distribution of (M,S) given m = 2. It is referred to as the conditional mid-p confidence interval. Pepe et al. used parametric and nonparametric bootstrap confidence intervals for the UMVCUE in their article [13]. They showed that both methods yielded coverage probabilities reasonably close to the nominal level, but lower for the parametric bootstrap than for the nonparametric bootstrap. However, these methods do not provide correct confidence intervals in some situations, for instance when X _{2}=0 or s = n _{ t }. They were thus not considered here.
Extended or shortened trial
It is not uncommon for the actual sample size of a phase II trial to differ from the planned sample size [11, 15]. This may be due to differences between anticipated and actual accrual and drop-out rates, for instance. For a two-stage design, current practice often consists in ignoring the over- or underaccrual, or in recomputing the decision boundaries as if the attained sample size had been planned in a single-stage design, which leads to bias and possible inflation of the type I error rate. Koyama and Chen [11] recently proposed a method to calculate a new critical value for the second stage analysis assuming dropouts and overrun would be totally non-informative. In this case, the interim analysis can always be performed on the preplanned n _{1} subjects, and the difference in sample size only concerns the second stage sample size. They also proposed a method for inference at the end of the trial, thus providing a point estimate, a confidence interval and a p-value.
Assume ${n}_{2}^{\prime}$ = n _{2} + Δ n _{2} patients are accrued at the second stage instead of the preplanned n _{2}, and that ${X}_{2}^{\prime}$ successes are then observed, where ${X}_{2}^{\prime}$ follows a binomial distribution with parameters (${n}_{2}^{\prime}$, Π). Briefly, the proposed method consists in defining a new critical value for the second stage as the one leading to the same decision as when comparing the conditional p-value of the second stage $\underset{{\Pi}_{0}}{Pr}({X}_{2}^{\prime}\ge {x}_{2}^{\prime}|{X}_{1}={x}_{1})$ to the conditional type I error rate given X _{1} = x _{1} in the original design with n _{2} patients at the second stage. The new conditional type I error rate is thus lower than or equal to the original conditional type I error rate, which allows the unconditional type I error rate to be controlled.
where $A({x}_{1},{n}_{2},\Pi )=\sum _{{x}_{2}={r}_{t}-{x}_{1}+1}^{{n}_{2}}\left(\genfrac{}{}{0ex}{}{{n}_{2}}{{x}_{2}}\right){\Pi}^{{x}_{2}}{(1-\Pi )}^{({n}_{2}-{x}_{2})}$ is the conditional power function at the second stage, and ${\Pi}^{\ast}$ is the solution of $A({x}_{1},{n}_{2},{\Pi}^{\ast})=\underset{{\Pi}_{0}}{Pr}({X}_{2}^{\prime}\ge {x}_{2}^{\prime}|{X}_{1}={x}_{1})$. Solving for ${\Pi}^{\ast}$ allows extending the conditional power to all potential values of X _{1}, whereas only one particular value (x _{1}) was observed. The use of the conditional power function $A({x}_{1},{n}_{2},{\Pi}^{\ast})$ allows ordering different sample paths with different X _{1} and actual stage 2 sample size ${n}_{2}^{\prime}$ by comparing the ${\Pi}^{\ast}$, smaller ${\Pi}^{\ast}$ indicating stronger evidence against the null hypothesis. This ordering is coherent with the hypothesis testing strategy they proposed, based on a new critical value to control the conditional type I error. In that respect, the p-value p _{ k } is lower than α if and only if the null hypothesis is rejected.
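The step of solving for Π∗ can be sketched numerically. The Python code below is a minimal illustration that only implements the two quantities defined in this paragraph, the conditional power function A and the conditional p-value of the second stage, and finds Π∗ by bisection. It is not an implementation of the full Koyama–Chen p-value, and the function names, the bisection solver and the example counts are assumptions made for illustration.

```python
from math import comb


def binom_tail(k, n, p):
    """Pr(X >= k) for X following a binomial distribution (n, p)."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x)
               for x in range(max(k, 0), n + 1))


def conditional_power(x1, n2, rt, p):
    """A(x1, n2, p): Pr(X2 >= rt - x1 + 1) with the planned n2 patients."""
    return binom_tail(rt - x1 + 1, n2, p)


def solve_pi_star(x1, x2_new, n2_new, n2, rt, p0, tol=1e-12):
    """Value p with A(x1, n2, p) equal to the stage 2 conditional p-value."""
    needed = rt - x1 + 1
    if not 1 <= needed <= n2:
        # A is identically 0 or 1 here, so no unique solution exists.
        raise ValueError("conditional power is constant for this x1")
    target = binom_tail(x2_new, n2_new, p0)  # Pr_{p0}(X2' >= x2' | X1 = x1)
    lo, hi = 0.0, 1.0
    # A increases with p, so bisection converges to the solution.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if conditional_power(x1, n2, rt, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Design n1=24, n2=39, r1=8, rt=24 with 10 first-stage responses (hypothetical data).
on_boundary = solve_pi_star(x1=10, x2_new=15, n2_new=39, n2=39, rt=24, p0=0.30)
overrun = solve_pi_star(x1=10, x2_new=20, n2_new=41, n2=39, rt=24, p0=0.30)
print(f"Pi* on the planned boundary = {on_boundary:.4f}")
print(f"Pi* with 20/41 at stage 2   = {overrun:.4f}")
```

When the second stage is as planned and the observed count sits exactly on the rejection boundary, the solution coincides with Π _{0}; more responses give a smaller Π∗, in line with the ordering described above.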
Koyama and Chen proposed the estimator ${\widehat{\Pi}}_{k}$ as the value of Π _{0} yielding a p-value p _{ k }=0.5, and a two-sided Clopper–Pearson-like confidence interval based on p _{ k }. The definition of p _{ k } by equation 11 should allow control of the overall type I error rate, but the properties of the test, estimator and confidence interval have not been thoroughly studied.
Although Koyama and Chen used a bias-corrected estimator when the second stage sample size was as planned, we also denote by ${\widehat{\Pi}}_{k}$ the median estimator presented above in the case where n _{2} patients are accrued at the second stage.
Numerical study
To examine the properties of the different methods, numerical studies were conducted. Several design scenarios were considered, covering a range of possible phase II trials in oncology. To help determine these scenarios, a limited literature search of phase II cancer trials using Simon’s design over recent years was performed. As this search was informal and arbitrarily limited to some journals, no results are reported. Twelve design scenarios were thus considered, with response rates under the null hypothesis of 0.05, 0.1, 0.2, 0.3, 0.4 and 0.5. Trials with higher values of Π _{0} were considered rare, and were therefore not considered. For each value of Π _{0}, two differences in response rate between the null and alternative hypotheses were considered, namely 0.15 and 0.2. In all cases, the type I error rate α was set to 0.05 and the type II error rate β to 0.10 (90% power). Then, for each combination of design parameters, a choice between Simon’s optimal and minimax design was made on a case by case basis, according to the expected total sample size of the trial and the probability of early termination under H_{0} and H_{1}.
For each design scenario considered, the probability of all possible outcomes (M,S) was computed using equation (1) for a range of values of the response rate Π varying from Π _{0} to Π _{0} + 0.20 (thus Π _{1} when δ was 0.20 and slightly more than Π _{1} when δ was 0.15). For each possible outcome, the resulting estimators, p-values and confidence intervals were also computed. As the probabilities of the possible outcomes determine the probability distribution of the estimators, p-values and confidence intervals, the bias and root mean squared error (RMSE) of the estimators, the rejection probability of the tests based on the p-values and the coverage probability of the confidence intervals could be derived.
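The enumeration described here can be sketched in a few lines of Python. The example below computes the exact bias and RMSE of two estimators for the optimal design with Π _{0} = 0.05 and Π _{1} = 0.2 (n _{1} = 21, n _{2} = 20, r _{1} = 1) at Π = Π _{0}. It assumes that an outcome (m, s) denotes the stopping stage and the total number of responses. The UMVUE is written in its usual closed form as a ratio of sums of binomial coefficients, which is an assumption here since the formula is not reproduced in the text above. All function names are illustrative.

```python
from math import comb, sqrt


def binom_pmf(k, n, p):
    """Binomial probability of k responses among n patients."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def outcome_pmf(n1, n2, r1, p):
    """Probability of each outcome (m, s): stopping stage and total responses."""
    pmf = {(1, s): binom_pmf(s, n1, p) for s in range(r1 + 1)}
    for s in range(r1 + 1, n1 + n2 + 1):
        first, last = max(r1 + 1, s - n2), min(s, n1)
        pmf[(2, s)] = sum(
            binom_pmf(x1, n1, p) * binom_pmf(s - x1, n2, p)
            for x1 in range(first, last + 1)
        )
    return pmf


def mle(m, s, n1, n2, r1):
    """Sample proportion of responders at the stage where the trial stopped."""
    return s / n1 if m == 1 else s / (n1 + n2)


def umvue(m, s, n1, n2, r1):
    """Assumed closed form of the UMVUE for a futility-only two-stage design."""
    if m == 1:
        return s / n1
    x1_values = range(max(r1 + 1, s - n2), min(s, n1) + 1)
    numerator = sum(comb(n1 - 1, x1 - 1) * comb(n2, s - x1) for x1 in x1_values)
    denominator = sum(comb(n1, x1) * comb(n2, s - x1) for x1 in x1_values)
    return numerator / denominator


def bias_and_rmse(estimator, n1, n2, r1, p):
    """Exact bias and RMSE obtained by summing over all outcomes."""
    bias = mse = 0.0
    for (m, s), pr in outcome_pmf(n1, n2, r1, p).items():
        error = estimator(m, s, n1, n2, r1) - p
        bias += pr * error
        mse += pr * error**2
    return bias, sqrt(mse)


design = dict(n1=21, n2=20, r1=1)
bias_m, rmse_m = bias_and_rmse(mle, p=0.05, **design)
bias_u, rmse_u = bias_and_rmse(umvue, p=0.05, **design)
print(f"MLE:   bias = {bias_m:+.4f}, RMSE = {rmse_m:.4f}")
print(f"UMVUE: bias = {bias_u:+.4f}, RMSE = {rmse_u:.4f}")
```

No simulation is needed: the result is exact up to floating-point error, which is why unbiasedness of the UMVUE can be verified to machine precision.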
To investigate the impact of accrual of some more or some fewer patients at the second stage as compared to the planned n _{2} value, trials where the second stage sample size was decreased by 1 or 2 or increased by 1, 2 or 5 were considered. These settings were not symmetrical because it was felt that overaccrual would be more frequent, because of the time delay to close a trial and because investigators would more likely want to protect the trial from patient exclusions and thus easily accrue more patients. The main analysis was unconditional, i.e. performance of the different methods was averaged over all possible outcomes. As some methods were more specifically developed to correct the analysis of the second stage results only, analysis restricted to cases where the trial proceeded to a second stage was also performed, and referred to as conditional analysis.
To keep results simple and because the main findings were similar from one scenario to another, only the results of six of the twelve scenarios are presented in detail. Additionally, these detailed results are only presented for situations where the second stage sample size was as planned. For situations where the second stage sample size was different from planned, the tables present results averaged over the different scenarios and the different values of Δ n _{2} (simple arithmetic average without any weighting). However, the description of results encompasses the whole range of data obtained and not only the results presented in the tables. Particular cases where results were representative or different from the overall message were then isolated.
All computations were performed using R 2.13.2 statistical software [28].
Results
Trial accrual as planned
Coverage probabilities of the 90% confidence intervals are presented in the right sub-panel of Figure 2 for each design scenario. Overall, the properties of all methods but the mid-p approach were disappointing, in particular for small values of Π _{0} such as 0.05. The mid-p confidence interval had coverage probabilities closer to the nominal level than the other approaches in almost all situations. It was conservative under H_{0} for smaller values of Π _{0}, but the coverage probability fluctuated around 90% when Π _{0} was 0.20 or more, within a margin of −1% to +2% only. On the contrary, the exact (stage-wise ordering) confidence intervals always had a coverage probability above 90%, but often 2 to 3% above, and even between 7 and 8% above for smaller sample size trials. The conservative nature of the Clopper–Pearson approach has already been reported, and the performance observed here for such intervals was not clearly worse than that reported for so-called exact confidence intervals in a one-sample (one-stage) setting [25]. Note that the phenomenon of oscillations in coverage probability according to Π appearing on the graphs is known, and caused by the lattice structure of the binomial distribution [29]. The confidence intervals based on the conditional score with continuity correction, which exhibited better conditional performance in the work by Tsai et al. [12], and the conditional mid-p confidence interval had close performance, but for Π departing from Π _{0}, their coverage probabilities were lower than the nominal level in this unconditional setting. This occurred less frequently and less dramatically for the conditional exact confidence interval, which however had a coverage probability clearly above its nominal level for Π close to Π _{0}, especially for small values of Π _{0}.
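Coverage probabilities of this kind can be computed exactly by test inversion. The Python sketch below is an illustration under stated assumptions, not the authors' code: outcomes are ranked by stage-wise ordering, which for a futility-only design amounts to ranking by the total number of responses s; the exact interval is of Clopper–Pearson type with α = 0.05 in each tail; and the mid-p variant subtracts half the probability of the observed outcome from each tail. The function names and the bisection solver are hypothetical.

```python
from math import comb


def ordered_probs(n1, n2, r1, p):
    """Outcome probabilities indexed by s = 0..n1+n2 (stage-wise order)."""
    b1 = [comb(n1, x) * p**x * (1 - p) ** (n1 - x) for x in range(n1 + 1)]
    b2 = [comb(n2, x) * p**x * (1 - p) ** (n2 - x) for x in range(n2 + 1)]
    probs = b1[: r1 + 1]  # stage 1 stops: s = 0..r1
    for s in range(r1 + 1, n1 + n2 + 1):
        first, last = max(r1 + 1, s - n2), min(s, n1)
        probs.append(sum(b1[x1] * b2[s - x1] for x1 in range(first, last + 1)))
    return probs


def tail_probs(probs, s, mid_p=False):
    """(lower tail, upper tail) of outcome s, optionally with mid-p correction."""
    half = 0.5 * probs[s] if mid_p else 0.0
    return sum(probs[: s + 1]) - half, sum(probs[s:]) - half


def solve(f, target, increasing, tol=1e-8):
    """Bisection for f(p) = target on (0, 1), with f monotone in p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def confidence_interval(s, n1, n2, r1, alpha=0.05, mid_p=False):
    """Two-sided (1 - 2 alpha) interval for the outcome with s responses."""
    def upper_tail(p):
        return tail_probs(ordered_probs(n1, n2, r1, p), s, mid_p)[1]

    def lower_tail(p):
        return tail_probs(ordered_probs(n1, n2, r1, p), s, mid_p)[0]

    lower = 0.0 if s == 0 else solve(upper_tail, alpha, increasing=True)
    upper = 1.0 if s == n1 + n2 else solve(lower_tail, alpha, increasing=False)
    return lower, upper


def coverage(p, n1, n2, r1, alpha=0.05, mid_p=False):
    """Exact coverage at p: p is in the interval iff both tails are >= alpha."""
    probs = ordered_probs(n1, n2, r1, p)
    total = 0.0
    for s, pr in enumerate(probs):
        lower_tail, upper_tail = tail_probs(probs, s, mid_p)
        if lower_tail >= alpha and upper_tail >= alpha:
            total += pr
    return total


design = dict(n1=21, n2=20, r1=1)
for p in (0.05, 0.20):
    print(p, f"exact: {coverage(p, **design):.4f}",
          f"mid-p: {coverage(p, mid_p=True, **design):.4f}")
print("interval for s = 5:", confidence_interval(5, **design))
```

Evaluating `coverage` on a fine grid of Π values would reproduce the saw-tooth pattern attributed above to the lattice structure of the binomial distribution.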
Extended or shortened trial
Performance of the different methods when second stage sample size was different from planned: average over the different design scenarios and differences between the planned and attained second stage sample size
Property | Method | Π = Π _{0} | Π = Π _{0} + δ |
---|---|---|---|
Bias | ${\widehat{\Pi}}_{m}$ | −0.015 | −0.005 |
Bias | ${\widehat{\Pi}}_{g}$ | −0.004 | 0.001 |
Bias | ${\widehat{\Pi}}_{u}$ | 0.000 | 0.000 |
Bias | ${\widehat{\Pi}}_{c}$ | −0.029 | −0.012 |
Bias | ${\widehat{\Pi}}_{p}$ | −0.028 | −0.009 |
Bias | ${\widehat{\Pi}}_{k}$ | −0.009 | −0.012 |
RMSE | ${\widehat{\Pi}}_{m}$ | 0.060 | 0.071 |
RMSE | ${\widehat{\Pi}}_{g}$ | 0.063 | 0.067 |
RMSE | ${\widehat{\Pi}}_{u}$ | 0.071 | 0.067 |
RMSE | ${\widehat{\Pi}}_{c}$ | 0.061 | 0.076 |
RMSE | ${\widehat{\Pi}}_{p}$ | 0.062 | 0.064 |
RMSE | ${\widehat{\Pi}}_{k}$ | 0.062 | 0.070 |
Rejection probability | p _{ n } | 0.033 | 0.882 |
Rejection probability | p _{ m } | 0.036 | 0.887 |
Rejection probability | p _{ u } | 0.036 | 0.887 |
Rejection probability | p _{ c } | 0.012 | 0.800 |
Rejection probability | p _{ k } | 0.035 | 0.885 |
Coverage probability | Naive exact | 0.940 | 0.916 |
Coverage probability | Stage-wise | 0.937 | 0.933 |
Coverage probability | Mid-p | 0.916 | 0.895 |
Coverage probability | Conditional exact | 0.952 | 0.906 |
Coverage probability | Conditional score | 0.935 | 0.851 |
Coverage probability | Conditional mid-p | 0.936 | 0.860 |
Coverage probability | Koyama–Chen | 0.937 | 0.931 |
Performance of the estimators when second stage sample size is modified by Δ n _{2}: bias and root mean squared error in selected situations
Settings | Estimator | Bias (Δ n _{2} = −2) | RMSE (Δ n _{2} = −2) | Bias (Δ n _{2} = −1) | RMSE (Δ n _{2} = −1) | Bias (Δ n _{2} = +1) | RMSE (Δ n _{2} = +1) | Bias (Δ n _{2} = +2) | RMSE (Δ n _{2} = +2) | Bias (Δ n _{2} = +5) | RMSE (Δ n _{2} = +5) |
---|---|---|---|---|---|---|---|---|---|---|---|
Optimal design with Π _{0} = 0.05, Π _{1} = 0.2: n _{1} = 21, n _{2} = 20, r _{1} = 1, r _{ t } = 4 | | | | | | | | | | | |
Π = Π _{0} | ${\widehat{\Pi}}_{m}$ | -0.008 | 0.038 | -0.009 | 0.037 | -0.009 | 0.037 | -0.009 | 0.037 | -0.010 | 0.036 |
Π = Π _{0} | ${\widehat{\Pi}}_{g}$ | -0.002 | 0.041 | -0.003 | 0.041 | -0.003 | 0.040 | -0.003 | 0.040 | -0.003 | 0.040 |
Π = Π _{0} | ${\widehat{\Pi}}_{u}$ | 0.000 | 0.046 | 0.000 | 0.046 | 0.000 | 0.046 | 0.000 | 0.045 | 0.000 | 0.045 |
Π = Π _{0} | ${\widehat{\Pi}}_{c}$ | -0.018 | 0.036 | -0.018 | 0.036 | -0.018 | 0.036 | -0.018 | 0.036 | -0.018 | 0.035 |
Π = Π _{0} | ${\widehat{\Pi}}_{p}$ | -0.018 | 0.037 | -0.018 | 0.037 | -0.018 | 0.036 | -0.018 | 0.036 | -0.018 | 0.035 |
Π = Π _{0} | ${\widehat{\Pi}}_{k}$ | -0.006 | 0.039 | -0.006 | 0.039 | -0.006 | 0.038 | -0.006 | 0.038 | -0.006 | 0.038 |
Π = Π _{1} | ${\widehat{\Pi}}_{m}$ | -0.004 | 0.071 | -0.004 | 0.071 | -0.005 | 0.069 | -0.005 | 0.069 | -0.005 | 0.067 |
Π = Π _{1} | ${\widehat{\Pi}}_{g}$ | 0.001 | 0.068 | 0.001 | 0.068 | 0.001 | 0.066 | 0.001 | 0.066 | 0.001 | 0.064 |
Π = Π _{1} | ${\widehat{\Pi}}_{u}$ | 0.000 | 0.068 | 0.000 | 0.067 | 0.000 | 0.066 | 0.000 | 0.065 | 0.000 | 0.064 |
Π = Π _{1} | ${\widehat{\Pi}}_{c}$ | -0.012 | 0.077 | -0.012 | 0.076 | -0.011 | 0.074 | -0.011 | 0.073 | -0.011 | 0.071 |
Π = Π _{1} | ${\widehat{\Pi}}_{p}$ | -0.009 | 0.076 | -0.009 | 0.075 | -0.009 | 0.074 | -0.009 | 0.073 | -0.009 | 0.071 |
Π = Π _{1} | ${\widehat{\Pi}}_{k}$ | -0.012 | 0.071 | -0.013 | 0.070 | -0.013 | 0.069 | -0.013 | 0.068 | -0.013 | 0.067 |
Minimax design with Π _{0} = 0.4, Π _{1} = 0.6: n _{1} = 29, n _{2} = 25, r _{1} = 12, r _{ t } = 27 | | | | | | | | | | | |
Π = Π _{0} | ${\widehat{\Pi}}_{m}$ | -0.015 | 0.078 | -0.016 | 0.078 | -0.016 | 0.077 | -0.017 | 0.077 | -0.018 | 0.076 |
Π = Π _{0} | ${\widehat{\Pi}}_{g}$ | -0.004 | 0.080 | -0.004 | 0.080 | -0.004 | 0.080 | -0.004 | 0.079 | -0.004 | 0.079 |
Π = Π _{0} | ${\widehat{\Pi}}_{u}$ | 0.000 | 0.087 | 0.000 | 0.087 | 0.000 | 0.087 | 0.000 | 0.087 | 0.000 | 0.087 |
Π = Π _{0} | ${\widehat{\Pi}}_{c}$ | -0.037 | 0.082 | -0.037 | 0.082 | -0.036 | 0.081 | -0.036 | 0.080 | -0.036 | 0.079 |
Π = Π _{0} | ${\widehat{\Pi}}_{p}$ | -0.035 | 0.083 | -0.035 | 0.082 | -0.035 | 0.081 | -0.035 | 0.081 | -0.035 | 0.080 |
Π = Π _{0} | ${\widehat{\Pi}}_{k}$ | -0.010 | 0.079 | -0.010 | 0.078 | -0.010 | 0.078 | -0.010 | 0.078 | -0.010 | 0.078 |
Π = Π _{1} | ${\widehat{\Pi}}_{m}$ | -0.003 | 0.074 | -0.003 | 0.074 | -0.003 | 0.073 | -0.003 | 0.073 | -0.003 | 0.071 |
Π = Π _{1} | ${\widehat{\Pi}}_{g}$ | 0.001 | 0.070 | 0.001 | 0.070 | 0.002 | 0.069 | 0.002 | 0.068 | 0.002 | 0.067 |
Π = Π _{1} | ${\widehat{\Pi}}_{u}$ | 0.000 | 0.071 | 0.000 | 0.070 | 0.000 | 0.069 | 0.000 | 0.069 | 0.000 | 0.068 |
Π = Π _{1} | ${\widehat{\Pi}}_{c}$ | -0.011 | 0.082 | -0.011 | 0.081 | -0.010 | 0.080 | -0.010 | 0.079 | -0.010 | 0.077 |
Π = Π _{1} | ${\widehat{\Pi}}_{p}$ | -0.007 | 0.080 | -0.007 | 0.079 | -0.007 | 0.078 | -0.007 | 0.077 | -0.007 | 0.076 |
Π = Π _{1} | ${\widehat{\Pi}}_{k}$ | -0.012 | 0.073 | -0.012 | 0.072 | -0.011 | 0.071 | -0.011 | 0.071 | -0.011 | 0.070 |
In terms of hypothesis testing and p-values, all methods except the conditional test yielded very close results, with no increase in the type I error rate in the situations studied. Actually, the possible values of (M,S) where these methods disagreed in terms of rejection of the null hypothesis had very small probabilities in general, and thus almost no impact on test size or power. In several situations, there were no values of (M,S) at all for which the methods disagreed. On the contrary, the test based on the conditional p-value had a probability of rejection markedly smaller than other methods, with both a type I error rate and a power clearly under their nominal value.
The mid-p confidence intervals again had coverage probabilities closer to the nominal 90% level than the other methods, in particular closer than the Koyama–Chen method, which was corrected for sample size modifications. Over all 120 situations covered, the Koyama–Chen confidence intervals were rather conservative but always preserved the nominal confidence level, with coverage probabilities ranging from 90.0% to 98.5%, with an average of 93.4%. On the contrary, coverage probabilities ranged from 85.7% to 96.5% for the mid-p confidence intervals, with an average of 90.1%. Coverage probabilities under the nominal level were more frequent under H_{1} than under H_{0} and for higher values of the probability of response Π.
Analysis conditional on proceeding to stage 2
In terms of RMSE, the conditional estimators ${\widehat{\Pi}}_{c}$ and ${\widehat{\Pi}}_{p}$ had close performance, with negligible differences in favor of ${\widehat{\Pi}}_{c}$ under H_{0} and of ${\widehat{\Pi}}_{p}$ under H_{1}. Despite their bias, all unconditional estimators except the UMVUE had generally lower RMSE than the conditional estimators. With biases as high as 4% for a response rate of 5%, or 8% for a response rate of 20%, these estimators cannot however be recommended for conditional inference.
Performance of the different methods for conditional inference when second stage sample size was different from planned: average over the different scenarios
Property | Method | Π = Π _{0} | Π = Π _{0} + δ |
---|---|---|---|
Bias | ${\widehat{\Pi}}_{m}$ | 0.038 | 0.004 |
Bias | ${\widehat{\Pi}}_{g}$ | 0.053 | 0.010 |
Bias | ${\widehat{\Pi}}_{u}$ | 0.084 | 0.010 |
Bias | ${\widehat{\Pi}}_{c}$ | −0.003 | −0.002 |
Bias | ${\widehat{\Pi}}_{p}$ | 0.000 | 0.000 |
Bias | ${\widehat{\Pi}}_{k}$ | 0.057 | −0.003 |
RMSE | ${\widehat{\Pi}}_{m}$ | 0.057 | 0.059 |
RMSE | ${\widehat{\Pi}}_{g}$ | 0.068 | 0.056 |
RMSE | ${\widehat{\Pi}}_{u}$ | 0.086 | 0.054 |
RMSE | ${\widehat{\Pi}}_{c}$ | 0.060 | 0.065 |
RMSE | ${\widehat{\Pi}}_{p}$ | 0.061 | 0.064 |
RMSE | ${\widehat{\Pi}}_{k}$ | 0.062 | 0.057 |
Rejection probability | p _{ n } | 0.100 | 0.931 |
Rejection probability | p _{ m } | 0.110 | 0.936 |
Rejection probability | p _{ u } | 0.110 | 0.936 |
Rejection probability | p _{ c } | 0.035 | 0.844 |
Rejection probability | p _{ k } | 0.107 | 0.933 |
Coverage probability | Naive exact | 0.899 | 0.939 |
Coverage probability | Stage-wise | 0.890 | 0.957 |
Coverage probability | Mid-p | 0.852 | 0.941 |
Coverage probability | Conditional exact | 0.939 | 0.929 |
Coverage probability | Conditional score | 0.910 | 0.894 |
Coverage probability | Conditional mid-p | 0.913 | 0.903 |
Coverage probability | Koyama–Chen | 0.889 | 0.956 |
Discussion
In terms of estimation, ${\widehat{\Pi}}_{g}$ and ${\widehat{\Pi}}_{u}$ should be recommended as they perform better than the other estimators, in particular when the true response rate is higher than the one under H_{0}, i.e. in cases when estimation is the most important. Although our simulations did not encompass all possible ranges of response rates and treatment effects, they cover a wide range of plausible situations, in which no clear advantage of the bias corrected estimator ${\widehat{\Pi}}_{g}$ over the UMVUE ${\widehat{\Pi}}_{u}$ could be found.
The choice of a conditional or unconditional inference is clearly overlooked in practical applications. Conditional inference, and conditional bias in particular, has attracted some interest in the setting of group sequential phase III trials, with concerns rather directed at the conditional bias of the estimator of the treatment effect when trials were stopped early for efficacy [30, 31]. In the setting of Simon’s two-stage phase II trials, conditional inference would rather be favored when the trial did not stop at the first stage, especially if the trial was deemed successful at the end [13]. Such aspects of conditional inference have however been rarely discussed to our knowledge [13, 32]. Results show that unbiased or almost unbiased estimation can be performed using the UMVCUE [13] or the proper conditional distribution [12], respectively, both with very similar RMSE. In addition, both performed well even when the sample size at the second stage was slightly different from its planned value. To construct an estimator that would be both conditionally and unconditionally unbiased, one could also derive an estimator for trials stopping at the first stage that would use the conditional distribution given X_{1}≤r_{1}. In such a case, the estimator would be conditionally unbiased whether the trial was stopped at the first or the second stage, and thus would be unconditionally unbiased. Using a distribution of outcomes conditional on early stopping makes however little sense, if any, when r_{1} is small. For instance, if r_{1}=0, then the only potential outcome in case of early stopping is X_{1}=0, thus leading to a single possible value for the estimator of Π. It is therefore not possible to construct an unbiased estimator of any value of Π in this case. We therefore did not further develop this point in the paper. Another solution, however, would be to use a bias-corrected estimator such as Whitehead’s [19] or Guo’s [7] when the trial was stopped early. This has already been evoked by Pepe et al. [13], without further investigations.
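The argument about combining two conditionally unbiased estimators can be written out with the law of total expectation. The following is a sketch in which $\widehat{\Pi}$ denotes a generic estimator and $c$ a generic constant, notation introduced here for illustration only.

```latex
\begin{align*}
E_{\Pi}\bigl[\widehat{\Pi}\bigr]
  &= P_{\Pi}(X_1 \le r_1)\, E_{\Pi}\bigl[\widehat{\Pi} \mid X_1 \le r_1\bigr]
   + P_{\Pi}(X_1 > r_1)\, E_{\Pi}\bigl[\widehat{\Pi} \mid X_1 > r_1\bigr] \\
  &= P_{\Pi}(X_1 \le r_1)\,\Pi + P_{\Pi}(X_1 > r_1)\,\Pi \;=\; \Pi .
\end{align*}
% Case r_1 = 0: early stopping implies X_1 = 0, so the estimator takes a
% single value c on that event and
\[
E_{\Pi}\bigl[\widehat{\Pi} \mid X_1 \le 0\bigr] = c \quad \text{for every } \Pi ,
\]
% which cannot equal \Pi for more than one value of \Pi.
```

The second display makes explicit why conditional unbiasedness given early stopping is unattainable when $r_1 = 0$: a constant cannot match $\Pi$ over a range of values.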
In this study, we have concentrated on Simon’s design for phase II cancer trials. Other designs or adaptations however exist. In particular, Jovic and Whitehead have recently proposed point estimates, confidence intervals and p-values for a modified Simon’s design with early stopping for efficacy [33]. Other extensions of Simon’s design could also have been considered [5, 34]. In cases where early stopping for efficacy is possible, the results of the methods proposed by Jovic and Whitehead could have been used. Tsai et al. also applied their conditional method to Shuster’s design [34]. Nevertheless, a brief look at the cancer literature shows that a majority of cancer phase II trials still use Simon’s design.
In practical applications, it may occur that the actual number of patients recruited is slightly different from the preplanned value. For instance, some patients may be unevaluable for response or may withdraw their consent during the study. Conversely, some patients may be included in the study before recruitment is formally closed. For these cases, where the decrease or increase of second stage sample size may be considered as non informative, Koyama and Chen proposed inference procedures based on conditional power [11]. They clearly state in their article that the properties of their estimators, p-values and confidence intervals need to be further studied. In our numerical settings, it turned out that the UMVUE, which can still be used because it only makes use of boundary decisions at the second stage, performed better than the Koyama–Chen method. The behaviour of both estimators with modified sample size however deserves further investigation. Concerning confidence intervals, the mid-p intervals performed better than the so-called exact confidence intervals in most settings for both unconditional and conditional inference. Koyama and Chen however did not consider such an approach, and their confidence intervals rely on the Clopper–Pearson method. Using a mid-p approach with their modified p-value (equation 11) may also have improved the coverage probabilities of the confidence intervals.
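That the UMVUE remains usable when the attained second-stage sample size differs from the planned one can be checked numerically. This is a minimal sketch under stated assumptions: the change in sample size is treated as fixed and non informative, the design values (n1 = 10, r1 = 1, planned n2 = 19) are hypothetical, and the closed form below (the conditional expectation of the first patient's outcome given the stopping stage M and the total number of responses S) is an assumed rendering of the estimator of Jung and Kim [10]. Function names are hypothetical.

```python
from math import comb

def binom_pmf(x, n, p):
    """Binomial probability P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def umvue(m, s, n1, r1, n2):
    """Assumed UMVUE: E[first patient's outcome | M = m, S = s].

    m is the stage at which the trial ended (1 or 2), s the total number
    of responses, and n2 the second-stage sample size actually attained.
    """
    if m == 1:
        return s / n1
    lo, hi = max(r1 + 1, s - n2), min(n1, s)
    num = sum(comb(n1 - 1, x1 - 1) * comb(n2, s - x1)
              for x1 in range(lo, hi + 1))
    den = sum(comb(n1, x1) * comb(n2, s - x1)
              for x1 in range(lo, hi + 1))
    return num / den

def expected_umvue(p, n1, r1, n2):
    """Exact unconditional expectation of the UMVUE by enumeration."""
    total = 0.0
    for x1 in range(n1 + 1):
        w1 = binom_pmf(x1, n1, p)
        if x1 <= r1:                         # stopped for futility
            total += w1 * umvue(1, x1, n1, r1, n2)
        else:                                # proceeded to stage 2
            for x2 in range(n2 + 1):
                w = w1 * binom_pmf(x2, n2, p)
                total += w * umvue(2, x1 + x2, n1, r1, n2)
    return total

if __name__ == "__main__":
    # Planned n2 = 19; attained values of 17 and 22 are illustrative.
    for n2_actual in (17, 19, 22):
        print(n2_actual, expected_umvue(0.2, 10, 1, n2_actual) - 0.2)
```

The estimator depends on the design only through n1, r1 and the attained n2, not through the final rejection boundary, so under these assumptions it can be evaluated with whatever second-stage sample size was actually reached.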
Another interesting field of further research concerns inference in adaptive phase II trials, where the second stage sample size can be adapted according to the first stage results [16, 17]. In such cases, the decrease or increase in sample size cannot be considered as non informative anymore, and the method of Koyama and Chen does not apply. New developments are thus needed here.
Conclusions
For point estimation, the UMVUE ${\widehat{\Pi}}_{u}$ was unbiased both when the actual number of patients recruited was equal to or differed from the preplanned value. The bias corrected estimator ${\widehat{\Pi}}_{g}$ had negligible bias and slightly lower RMSE than the UMVUE only when the true response rate Π was close to its value under the null hypothesis. Both estimators performed better than the others and can thus be recommended. In terms of confidence intervals, mid-p confidence intervals performed best, as compared to the other exact confidence intervals, whether they ignore the group sequential nature of the trial or not.
When one is more particularly interested in inference conditional on having proceeded to the second stage, the UMVCUE ${\widehat{\Pi}}_{p}$, which is unbiased, may be recommended. Conditional score or conditional mid-p confidence intervals should then be used.
References
- Miller AB, Hoogstraten B, Staquet M, Winkler A: Reporting results of cancer treatment. Cancer. 1981, 47: 207-214. 10.1002/1097-0142(19810101)47:1<207::AID-CNCR2820470134>3.0.CO;2-6.
- Eisenhauer EA, Therasse P, Bogaerts J, Schwartz LH, Sargent D, Ford R, Dancey J, Arbuck S, Gwyther S, Mooney M, Rubinstein L, Shankar L, Dodd L, Kaplan R, Lacombe D, Verweij J: New response evaluation criteria in solid tumours: revised RECIST guideline (version 1.1). Eur J Cancer. 2009, 45: 228-247. 10.1016/j.ejca.2008.10.026.
- Gehan EA: The determination of the number of patients required in a preliminary and a follow-up trial of a new chemotherapeutic agent. J Chron Dis. 1961, 13 (4): 346-353. 10.1016/0021-9681(61)90060-1.
- Fleming TR: One-sample multiple testing procedure for phase II clinical trials. Biometrics. 1982, 38: 143-151. 10.2307/2530297.
- Chang MN, Therneau TM, Wieand HS, Cha SS: Designs for group sequential phase II clinical trials. Biometrics. 1987, 43: 865-874. 10.2307/2531540.
- Simon R: Optimal two-stage designs for phase II clinical trials. Control Clin Trials. 1989, 10: 1-10. 10.1016/0197-2456(89)90015-9.
- Guo HY, Liu A: A simple and efficient bias-reduced estimator of response probability following a group sequential phase II trial. J Biopharm Stat. 2005, 15 (5): 773-781. 10.1081/BIP-200067771.
- Liu A, Wu C, Yu KF, Gehan E: Supplementary analysis of probabilities at the termination of a group sequential phase II trial. Stat Med. 2005, 24 (7): 1009-1027. 10.1002/sim.1990.
- Chang M, Wieand H, Chang V: The bias of the sample proportion following a group sequential phase II clinical trial. Stat Med. 1989, 8 (5): 563-570. 10.1002/sim.4780080505.
- Jung SH, Kim KM: On the estimation of the binomial probability in multistage clinical trials. Stat Med. 2004, 23 (6): 881-896. 10.1002/sim.1653.
- Koyama T, Chen H: Proper inference from Simon’s two-stage designs. Stat Med. 2008, 27 (16): 3145-3154. 10.1002/sim.3123.
- Tsai W, Chi Y, Chen C: Interval estimation of binomial proportion in clinical trials with a two-stage design. Stat Med. 2008, 27: 15-35. 10.1002/sim.2930.
- Pepe MS, Feng Z, Longton G, Koopmeiners J: Conditional estimation of sensitivity and specificity from a phase 2 biomarker study allowing early termination for futility. Stat Med. 2009, 28 (5): 762-779. 10.1002/sim.3506.
- Li Q: An MSE-reduced estimator for the response proportion in a two-stage clinical trial. Pharm Stat. 2011, 10: 277-279. 10.1002/pst.414.
- Green SJ, Dahlberg S: Planned versus attained design in phase II clinical trials. Stat Med. 1992, 11 (7): 853-862. 10.1002/sim.4780110703.
- Banerjee A, Tsiatis AA: Adaptive two-stage designs in phase II clinical trials. Stat Med. 2006, 25 (19): 3382-3395. 10.1002/sim.2501.
- Englert S, Kieser M: Adaptive designs for single-arm phase II trials in oncology. Pharm Stat. 2012, 11 (3): 241-249. 10.1002/pst.541.
- Jung SH, Lee T, Kim KM, George SL: Admissible two-stage designs for phase II cancer clinical trials. Stat Med. 2004, 23 (4): 561-569. 10.1002/sim.1600.
- Whitehead J: On the bias of maximum likelihood estimation following a sequential test. Biometrika. 1986, 73 (3): 573-581. 10.1093/biomet/73.3.573.
- Jennison C, Turnbull BW: Group Sequential Methods with Applications to Clinical Trials. 2000, CRC Press, Boca Raton.
- Armitage P: Numerical studies in the sequential estimation of a binomial parameter. Biometrika. 1958, 45 (1-2): 1-15. 10.1093/biomet/45.1-2.1.
- Jennison C, Turnbull BW: Confidence intervals for a binomial parameter following a multistage test with application to MIL-STD 105D and medical trials. Technometrics. 1983, 25: 49-58. 10.1080/00401706.1983.10487819.
- Jung SH, Owzar K, George SL, Lee T: P-value calculation for multistage phase II cancer clinical trials. J Biopharm Stat. 2006, 16 (6): 765-775. 10.1080/10543400600825645.
- Clopper CJ, Pearson ES: The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika. 1934, 26 (4): 404-413. 10.1093/biomet/26.4.404.
- Newcombe RG: Two-sided confidence intervals for the single proportion: comparison of seven methods. Stat Med. 1998, 17: 857-872. 10.1002/(SICI)1097-0258(19980430)17:8<857::AID-SIM777>3.0.CO;2-E.
- Neyman J: On the problem of confidence intervals. Ann Math Statist. 1935, 6 (3): 111-116. 10.1214/aoms/1177732585.
- Mehta CR, Walsh SJ: Comparison of exact, mid-p, and Mantel-Haenszel confidence intervals for the common odds ratio across several 2×2 contingency tables. Am Statist. 1992, 46 (2): 146-150.
- R Development Core Team: R: A Language and Environment for Statistical Computing. 2011, R Foundation for Statistical Computing, Vienna, Austria, [ISBN 3-900051-07-0].
- Brown L, Cai T, DasGupta A: Interval estimation for a binomial proportion. Stat Sci. 2001, 16 (2): 101-117.
- Pocock SJ, Hughes MD: Practical problems in interim analyses, with particular regard to estimation. Control Clin Trials. 1989, 10 (4): 209-221. 10.1016/0197-2456(89)90059-7.
- Freidlin B, Korn EL: Stopping clinical trials early for benefit: impact on estimation. Clin Trials. 2009, 6 (2): 119-125. 10.1177/1740774509102310.
- Ohman Strickland PA, Casella G: Conditional Inference Following Group Sequential Testing. Biom J. 2003, 45 (5): 515-526. 10.1002/bimj.200390029.
- Jovic G, Whitehead J: An exact method for analysis following a two-stage phase II cancer clinical trial. Stat Med. 2010, 29 (30): 3118-3125. 10.1002/sim.3837.
- Shuster J: Optimal two-stage designs for single arm phase II cancer trials. J Biopharm Stat. 2002, 12: 39-51. 10.1081/BIP-120005739.
Pre-publication history
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/12/117/prepub
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.