 Research Article
 Open Access
Statistical inference for extended or shortened phase II studies based on Simon’s two-stage designs
BMC Medical Research Methodology volume 15, Article number: 48 (2015)
Abstract
Background
Simon’s two-stage designs are popular choices for conducting phase II clinical trials, especially in oncology, because they reduce the number of patients placed on ineffective experimental therapies. Koyama and Chen (2008) discussed how to conduct proper inference for such studies because they found that inference procedures used with Simon’s designs almost always ignore the actual sampling plan used. In particular, they proposed an inference method for studies in which the actual second stage sample size differs from the planned one.
Methods
We consider an alternative inference method based on the likelihood ratio. In particular, we order permissible sample paths under Simon’s two-stage designs using their corresponding conditional likelihood. In this way, we can calculate p-values using the common definition: the probability, under the null hypothesis, of obtaining a test statistic value at least as extreme as that observed.
Results
In addition to providing inference for a couple of scenarios where Koyama and Chen’s method can be difficult to apply, the estimate based on our method appears to have certain advantages in many numerical simulations. It generally led to smaller biases and narrower confidence intervals while maintaining similar coverage. We also illustrate the two methods in a real data setting.
Conclusions
Inference procedures used with Simon’s designs almost always ignore the actual sampling plan. Reported P-values, point estimates and confidence intervals for the response rate are not usually adjusted for the design’s adaptiveness. Proper statistical inference procedures should be used.
Background
Simon’s two-stage designs [1] are commonly used in phase II clinical trials, especially in cancer clinical trials. In a study with a Simon’s design, the null hypothesis concerns a response rate, H _{0}:π≤π _{0}. The power is calculated at some π _{1}>π _{0}. A Simon’s design is usually indexed by four numbers that represent the stage 1 sample size (n _{1}), stage 1 critical value (r _{1}), final sample size (n _{ t }) and final critical value (r _{ t }). In stage 1, a sample of size n _{1} is taken. If the number of successes X _{1} in stage 1 satisfies X _{1}≤r _{1}, the trial is stopped for futility; otherwise, an additional sample of size n _{2}=n _{ t }−n _{1} is taken. Let X _{2} be the number of successes in stage 2, and let X _{ t }=X _{1}+X _{2}. If X _{ t }≤r _{ t }, futility is concluded; otherwise efficacy is concluded by rejecting H _{0}. Software is available for calculating Simon’s two-stage designs, for example, from a website at the National Cancer Institute: http://linus.nci.nih.gov/brb/samplesize/otsd.html, from a website at the Department of Biostatistics of Vanderbilt University: http://biostat.mc.vanderbilt.edu/wiki/Main/TwoStageInference, and from the NCSS/PASS package: http://www.ncss.com/.
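As a rough illustration of the decision rule described above, the following Python sketch computes the probability of early termination and the probability of rejecting H _{0} for a given design (n _{1}, r _{1}, n _{ t }, r _{ t }). The function names are hypothetical and not taken from any of the software listed; the design values are the ones quoted in the real example later in this article.

```python
from math import comb

def binom_pmf(k, n, p):
    """P[X = k] for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_sf(k, n, p):
    """P[X > k] for X ~ Binomial(n, p); k may be negative or >= n."""
    return sum(binom_pmf(j, n, p) for j in range(max(k + 1, 0), n + 1))

def early_stop_prob(n1, r1, p):
    """Probability of stopping for futility after stage 1 (X1 <= r1)."""
    return sum(binom_pmf(x, n1, p) for x in range(0, r1 + 1))

def reject_prob(n1, r1, nt, rt, p):
    """P[reject H0] = sum over x1 > r1 of P[X1 = x1] * P[X2 > rt - x1]."""
    n2 = nt - n1
    return sum(binom_pmf(x1, n1, p) * binom_sf(rt - x1, n2, p)
               for x1 in range(r1 + 1, n1 + 1))

# Illustrative design (values from the real example): pi0 = 0.15, pi1 = 0.30
design = dict(n1=19, r1=3, nt=39, rt=8)
alpha_actual = reject_prob(p=0.15, **design)   # actual type I error
power_actual = reject_prob(p=0.30, **design)   # actual power
pet_null = early_stop_prob(19, 3, 0.15)        # early termination under H0
```

Evaluating `reject_prob` at π _{0} and π _{1} gives the operating characteristics of the planned design, which is a convenient baseline before any sample size deviation is considered.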
Koyama and Chen [2] (hereafter KC) pointed out that the inference procedures used with Simon’s designs almost always ignore the actual sampling plan. Reported P-values, point estimates and confidence intervals for the response rate are not usually adjusted for the design’s adaptiveness. They outlined proper statistical inference procedures for studies based on Simon’s two-stage designs.
Because the actual sample size of stage 2 may frequently differ from the planned one for various reasons, KC also proposed a way to conduct hypothesis testing when the stage 2 sample size is changed in a Simon’s design. They focused on the case of non-informative sample size change at the second stage. In other words, the actual stage 1 sample size always equals the planned stage 1 sample size, but the actual stage 2 sample size can differ from the planned stage 2 sample size. In addition, the decision to use a different sample size must be independent of the observed outcome data. Inference then needs to be made based on the actual data. This is in contrast to adaptive designs that can alter the sample size based on interim results. We restrict our attention to the same setting as KC, although we believe our method can be extended.
Non-informative sample size changes or protocol deviations can arise quite frequently in practice. Shortening of stage 2 can occur in cases of early termination of the study due to lack of funding, slow accrual, non-informative dropouts, accrual of ineligible subjects, etc. Such shortening of the stage 2 sample size can reasonably be assumed to be independent of the outcomes of the study. Extension of stage 2 can occur in cases of site coordination errors, overcompensation for unevaluable or dropout patients, or administrative reasons.
In applying KC’s method, we found some difficulties in calculation for certain scenarios due to the discrete nature of the binomial distribution. In particular, a difficulty arises when the number of responders x _{1} at the first stage exceeds the final boundary r _{ t }, as can happen with an (unexpectedly) efficacious treatment. Because Simon’s two-stage design does not stop for early efficacy [1], the study would continue to the second stage. In this case, KC’s method breaks down. Another possible problem arises when there are no responders at the second stage, that is, x _{2}=0. We give a detailed explanation after we review their method in the next section. We therefore introduce a different method for inference based on conditional likelihood. Besides allowing proper inference in the settings where KC’s method may be difficult to apply, our method is also seen to improve on statistical properties in many of the settings we have investigated.
Porcher and Desseaux [3] considered different approaches for point and confidence interval estimation, as well as computation of p-values, for the same setting as KC. In their methods, the rankings used for computing p-values were based on estimators instead of the likelihood. They recommended the uniformly minimum variance unbiased estimator (UMVUE) as it exhibited good properties. In particular, when the second stage sample size is unaltered, they pointed out that the method based on the UMVUE is equivalent to KC [3]. For this reason, our method should also improve on their methods.
In addition to [2, 3], other related works exist. Green and Dahlberg [4] were among the first to consider settings that accommodate a modified sample size in both stages, even though the proposed analysis method was ad hoc. Masaki et al. [5] considered designs for a range of possible deviations of the stage 1 and total sample sizes from the planned study. Li et al. [6] formulated a Bayesian approach with a modified sample size. Their method can have desirable frequentist properties under certain types of priors. Recently, Zeng et al. [7] considered computational improvements and proposed a normal approximation that is accurate even under small sample sizes.
Methods
Review of Koyama and Chen (2008)
The KC method centers mainly on the calculation of p-values. Throughout, we use P _{ π }(E) to represent the probability of the event E at a specific π. Denote by x _{1} and x _{2} the actual observed numbers of responders at stages 1 and 2 of a study based on Simon’s two-stage design.
If x _{1}≤r _{1}, the trial is stopped early at the first stage due to futility. In this case, the p-value is given by $P_{\pi _{0}}[X_{1} \ge x_{1} \mid n_{1}]$ , which can be easily computed from the binomial distribution with size n _{1} and success probability π _{0}.
If x _{1}>r _{1}, the trial continues to the second stage. In this case, the p-value calculation is based on observed sample paths, given by
$$ \sum_{x = r_{1}+1}^{n_{1}} P_{\pi_{0}}[X_{1}=x \mid n_{1}]\, P_{\pi_{0}}[X_{2}\geq x_{1}+x_{2}-x \mid n_{2}], \qquad (1) $$
where $P_{\pi _{0}}[X_{2}\geq x_{1}+x_{2}-x \mid n_{2}]$ represents the more ‘extreme’ sample paths than the observed one given that x>r _{1} responses are observed at stage 1. The actual type I error and power are evaluated through
$$ \sum_{x = r_{1}+1}^{n_{1}} P_{\pi}[X_{1}=x \mid n_{1}]\, P_{\pi}[X_{2} > r_{t}-x \mid n_{2}] $$
under H _{0} and H _{1}, respectively. Let A(x,n _{2},π)≡P _{ π }[X _{2}>r _{ t }−x | n _{2}] be the conditional rejection rate of H _{0} at the end of stage 2 given X _{1}=x. Then, the rejection rule at the end of stage 2, x _{1}+x _{2}>r _{ t }, is equivalent to
$$ P_{\pi_{0}}[X_{2}\geq x_{2} \mid n_{2}] \leq A(x_{1},n_{2},\pi_{0}), $$
where A(x _{1},n _{2},π _{0}) serves as a conditional critical value.
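The p-value for the planned design and the conditional rejection rate A can be sketched in a few lines of Python. This is only an illustration under the definitions above (a p-value summed over stage 1 outcomes x>r _{1}, and A(x,n _{2},π)=P _{ π }[X _{2}>r _{ t }−x | n _{2}]); the function names are hypothetical and the design values are those of the real example.

```python
from math import comb

def pmf(k, n, p):
    """P[X = k] for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def upper_tail(k, n, p):
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(pmf(j, n, p) for j in range(max(k, 0), n + 1))

def cond_reject(x, n2, rt, p):
    """A(x, n2, pi) = P_pi[X2 > rt - x | n2]."""
    return upper_tail(rt - x + 1, n2, p)

def kc_pvalue_planned(x1, x2, n1, r1, n2, p0):
    """p-value when the stage 2 sample size equals the planned n2."""
    if x1 <= r1:
        # Trial stopped at stage 1: plain binomial upper tail.
        return upper_tail(x1, n1, p0)
    # Trial continued: sum over all stage 1 outcomes that continue.
    return sum(pmf(x, n1, p0) * upper_tail(x1 + x2 - x, n2, p0)
               for x in range(r1 + 1, n1 + 1))

# Illustrative values: n1 = 19, r1 = 3, n2 = 20, rt = 8, pi0 = 0.15
p_example = kc_pvalue_planned(x1=5, x2=4, n1=19, r1=3, n2=20, p0=0.15)
```

Comparing `upper_tail(x2, n2, p0)` with `cond_reject(x1, n2, rt, p0)` reproduces the decision x _{1}+x _{2}>r _{ t } for every continuing path, which is the equivalence stated above.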
When the actual sample size of stage 2, denoted by $n_{2}^{*}$ , deviates from n _{2}, A(x _{1},n _{2},π _{0}) can still be used as a conditional criterion for decision making. That is, H _{0} is rejected when
$$ P_{\pi_{0}}[X_{2}\geq x_{2} \mid n_{2}^{*}] \leq A(x_{1},n_{2},\pi_{0}). $$
However, in the presence of a second stage sample size deviation, the p-value cannot be directly extended from (1) because the observed total number of responses x _{1}+x _{2} is no longer a good ranking determinant of ‘extremeness’. In particular, KC gave a concrete example in which two different sample paths (x _{1},x _{2}) and $(x_{1}^{*}, x_{2}^{*})$ with the same total number of responses ( $x_{1}^{*}+x_{2}^{*}=x_{1}+x_{2}$ ) and the same deviated sample size $n_{2}^{*}$ of stage 2 may lead to different conclusions about the hypothesis. Therefore, Koyama and Chen [2] proposed the following way of calculating the p-value.

(a) Find π ^{∗} such that $A(x_{1},n_{2},\pi ^{*})=P_{\pi _{0}}[X_{2}\geq x_{2} \mid n_{2}^{*}]$ .

(b) Compute the p-value by
$$\begin{array}{@{}rcl@{}} \sum_{x = r_{1}+1}^{n_{1}} P_{\pi_{0}}[X_{1}=x \mid n_{1}]\,A(x,n_{2},\pi^{*}). \end{array} $$
One difficulty with this calculation arises when x _{1}>r _{ t }. Although infrequent, this happens when the investigational treatment is unexpectedly efficacious. Because Simon’s two-stage designs do not stop for early efficacy [1], the study continues to the second stage. In this case, we have A(x _{1},n _{2},π)≡1 for any π. Therefore π ^{∗} cannot be determined from step (a) above and the algorithm breaks down.
Another possible problem arises when x _{2}=0. In this case, $P_{\pi _{0}}[X_{2}\geq x_{2} \mid n_{2}^{*}] \equiv 1$ for any $n_{2}^{*}$ . When x _{1}≤r _{ t }, this corresponds to the solution π ^{∗}=1. Therefore the corresponding p-value equals $\sum _{x = r_{1}+1}^{n_{1}} P_{\pi _{0}}[X_{1}=x] = P_{\pi _{0}}[X_{1}> r_{1}]$ . This may not be sensible, as the p-value depends neither on the observed number of responses x _{1} nor on the actual second stage sample size $n_{2}^{*}$ . We therefore introduce a different method for inference based on likelihood.
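Steps (a) and (b), together with the two problematic cases just described, can be sketched as follows. This is a minimal illustration, not KC’s own implementation: the bisection solver, the tolerance and the function names are assumptions made here, and the code assumes r _{ t }−x _{1}+1≤n _{2} whenever x _{1}≤r _{ t }, so that A increases from 0 to 1 in π.

```python
from math import comb

def pmf(k, n, p):
    """P[X = k] for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def upper_tail(k, n, p):
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(pmf(j, n, p) for j in range(max(k, 0), n + 1))

def cond_reject(x, n2, rt, p):
    """A(x, n2, pi) = P_pi[X2 > rt - x | n2]."""
    return upper_tail(rt - x + 1, n2, p)

def solve_pi_star(x1, n2, rt, target, tol=1e-12):
    """Step (a): bisection for pi* with A(x1, n2, pi*) = target."""
    if x1 > rt:
        return None  # A is identically 1, so pi* cannot be determined
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cond_reject(x1, n2, rt, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def kc_pvalue_deviated(x1, x2, n1, r1, n2, rt, n2_star, p0):
    """Step (b): p-value when the actual stage 2 size is n2_star."""
    target = upper_tail(x2, n2_star, p0)
    pi_star = solve_pi_star(x1, n2, rt, target)
    if pi_star is None:
        raise ValueError("x1 > r_t: step (a) has no unique solution")
    return sum(pmf(x, n1, p0) * cond_reject(x, n2, rt, pi_star)
               for x in range(r1 + 1, n1 + 1))
```

With x _{2}=0 the target probability is 1, the solver is pushed to π ^{∗}≈1, and the returned value is numerically P _{ π _{0}}[X _{1}>r _{1}] whatever `n2_star` is, which is the second issue noted above.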
Likelihood-based construction of confidence intervals
We extend the existing likelihood-based inference for two-stage and multiple-stage trials [8–12] to our setting for the construction of p-values and confidence intervals. In particular, we order permissible sample paths under Simon’s two-stage designs using their corresponding conditional likelihood. In this way, we can calculate p-values using the common definition: the probability of obtaining a test statistic value at least as extreme as that observed under H _{0}.
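Let M denote the stopping stage, and let S _{ M } denote the total number of responders accumulated up to the stopping stage. That is, S _{ M }=X _{1} when M=1 and S _{ M }=X _{1}+X _{2} when M=2. Similarly, let N _{ M } be the total sample size of the study. The probability mass function of the random vector (M,S _{ M }) is given by
$$ P_{\pi}[M=m, S_{M}=s] = \begin{cases} \binom{n_{1}}{s}\pi^{s}(1-\pi)^{n_{1}-s}, & m=1,\ 0\le s\le r_{1},\\ \left[\sum_{x_{1}=(r_{1}+1)\vee(s-n_{2})}^{s\wedge n_{1}}\binom{n_{1}}{x_{1}}\binom{n_{2}}{s-x_{1}}\right]\pi^{s}(1-\pi)^{n_{1}+n_{2}-s}, & m=2,\ r_{1}+1\le s\le n_{1}+n_{2}, \end{cases} \qquad (2) $$
where ∧ takes the minimum and ∨ takes the maximum of its arguments. Jung and Kim [8] showed that (M,S _{ M }) is complete and sufficient for π. The MLE of π is therefore $\hat {\pi }=S_{M}/N_{M}$ . However, the MLE is biased [11, 13]. Based on the fact that X _{1}/n _{1} is always an unbiased estimator of the true probability π, Jung and Kim [8] derived the UMVUE of π to be
$$ \tilde{\pi}(m,s) = \begin{cases} \dfrac{s}{n_{1}}, & m=1,\\[2ex] \dfrac{\sum_{x_{1}=(r_{1}+1)\vee(s-n_{2})}^{s\wedge n_{1}}\binom{n_{1}-1}{x_{1}-1}\binom{n_{2}}{s-x_{1}}}{\sum_{x_{1}=(r_{1}+1)\vee(s-n_{2})}^{s\wedge n_{1}}\binom{n_{1}}{x_{1}}\binom{n_{2}}{s-x_{1}}}, & m=2. \end{cases} \qquad (3) $$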
The existence of the UMVUE $\tilde {\pi }$ also facilitates the determination of confidence intervals. In particular, writing (m,s) for the observed value of (M,S _{ M }), an exact 100(1−α)% confidence interval (π _{ L },π _{ U }) for π is given by
$$ P_{\pi_{L}}[\tilde{\pi}(M,S_{M}) \geq \tilde{\pi}(m,s)] = \alpha/2 $$
and
$$ P_{\pi_{U}}[\tilde{\pi}(M,S_{M}) \leq \tilde{\pi}(m,s)] = \alpha/2. $$
Jung and Kim [8] showed that such ordering of the sample space by the UMVUE is the same as that by Jennison and Turnbull [14]. Chang and O’Brien [12] showed that the likelihood ratio based construction is more efficient and leads to a smaller average CI length.
When there is study extension or shortening, the second stage sample size n _{2} becomes a random variable. The likelihood can depend on the probability that n _{2} takes a specific value $n_{2}^{*}$ . However, when such a change of sample size is not related to π, the above likelihood can be viewed as the conditional likelihood given the observed value of $n_{2}^{*}$ and can therefore be used to make inference. The UMVUE takes the same form as in (3), except with $n_{2}^{*}$ in place of n _{2}.
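The probability mass function of (M,S _{ M }) and the UMVUE of Jung and Kim [8] can be sketched in Python as below, with the realized stage 2 size passed in as `n2` (that is, $n_{2}^{*}$ when the study is extended or shortened). The formulas coded here are the standard ones for a two-stage design with a futility-only stopping rule; the function names are hypothetical.

```python
from math import comb

def sample_space(n1, r1, n2):
    """All permissible paths (m, s): stop at stage 1 or finish stage 2."""
    return ([(1, s) for s in range(0, r1 + 1)] +
            [(2, s) for s in range(r1 + 1, n1 + n2 + 1)])

def _x1_range(s, n1, r1, n2):
    """Stage 1 counts compatible with a stage 2 total s."""
    return range(max(r1 + 1, s - n2), min(s, n1) + 1)

def path_pmf(m, s, n1, r1, n2, p):
    """P_pi[M = m, S_M = s]."""
    if m == 1:
        return comb(n1, s) * p**s * (1 - p)**(n1 - s)
    coef = sum(comb(n1, x1) * comb(n2, s - x1)
               for x1 in _x1_range(s, n1, r1, n2))
    return coef * p**s * (1 - p)**(n1 + n2 - s)

def umvue(m, s, n1, r1, n2):
    """UMVUE of pi: s/n1 at stage 1, a ratio of path counts at stage 2."""
    if m == 1:
        return s / n1
    xs = _x1_range(s, n1, r1, n2)
    num = sum(comb(n1 - 1, x1 - 1) * comb(n2, s - x1) for x1 in xs)
    den = sum(comb(n1, x1) * comb(n2, s - x1) for x1 in xs)
    return num / den

def expected_umvue(n1, r1, n2, p):
    """E_pi[UMVUE], which should equal pi by unbiasedness."""
    return sum(umvue(m, s, n1, r1, n2) * path_pmf(m, s, n1, r1, n2, p)
               for m, s in sample_space(n1, r1, n2))
```

A quick sanity check is that `expected_umvue(n1, r1, n2, p)` returns `p` up to rounding for any fixed `n2`, which is the unbiasedness property that motivates the estimator.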
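The likelihood ratio test of H _{0}:π=π _{0} vs. H _{1}:π≠π _{0} is based on
$$ LR(M,S_{M}) = \left(\frac{\hat{\pi}}{\pi_{0}}\right)^{S_{M}}\left(\frac{1-\hat{\pi}}{1-\pi_{0}}\right)^{N_{M}-S_{M}}, $$
where $\hat {\pi }=S_{M}/N_{M}$ . Under H _{0}, any path (m,s _{ m }) that has a larger likelihood ratio is considered to be more ‘extreme’ against H _{0}. Therefore, the probability of observing (M,S _{ M }) or more extreme paths is
$$ \sum_{(m,s_{m}):\,LR(m,s_{m})\geq LR(M,S_{M})} P_{\pi_{0}}[M=m, S_{M}=s_{m}]. $$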
After correcting for the discreteness of the binomial distribution by a fraction of the probability of (M,S _{ M }), the p-value is proposed to be
The acceptance region defined as $\{\pi _{0}: P_{\pi _{0}} \ge \alpha \}$ can be used to form the limits of a 100(1−α)% confidence interval for π. Note that it is possible that a region defined in this way may not be an interval. However, such cases are rare and have minimal impact on the confidence interval performance [12].
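The likelihood ratio ordering and the resulting acceptance region can be illustrated with the Python sketch below. It is a sketch under stated assumptions, not the authors’ code: the likelihood ratio is taken as the likelihood at $\hat {\pi }=S_{M}/N_{M}$ divided by the likelihood at π _{0}, the tail probability is left uncorrected (the discreteness correction is not reproduced here), and the confidence limits are read off a simple grid of π _{0} values. All function names and the grid size are hypothetical.

```python
from math import comb, log

def sample_space(n1, r1, n2):
    """All permissible paths (m, s)."""
    return ([(1, s) for s in range(0, r1 + 1)] +
            [(2, s) for s in range(r1 + 1, n1 + n2 + 1)])

def path_pmf(m, s, n1, r1, n2, p):
    """P_pi[M = m, S_M = s] for a futility-only two-stage design."""
    if m == 1:
        return comb(n1, s) * p**s * (1 - p)**(n1 - s)
    coef = sum(comb(n1, x1) * comb(n2, s - x1)
               for x1 in range(max(r1 + 1, s - n2), min(s, n1) + 1))
    return coef * p**s * (1 - p)**(n1 + n2 - s)

def log_lr(m, s, n1, n2, p0):
    """log of the likelihood ratio: MLE versus p0 (requires 0 < p0 < 1)."""
    n = n1 if m == 1 else n1 + n2
    def loglik(p):  # binomial kernel, with the convention 0 * log(0) = 0
        out = s * log(p) if s > 0 else 0.0
        return out + ((n - s) * log(1 - p) if n - s > 0 else 0.0)
    return loglik(s / n) - loglik(p0)

def lr_tail_prob(m, s, n1, r1, n2, p0):
    """Uncorrected P_p0[paths with likelihood ratio >= observed]."""
    obs = log_lr(m, s, n1, n2, p0)
    return sum(path_pmf(mm, ss, n1, r1, n2, p0)
               for mm, ss in sample_space(n1, r1, n2)
               if log_lr(mm, ss, n1, n2, p0) >= obs - 1e-12)

def lr_confidence_limits(m, s, n1, r1, n2, alpha=0.10, grid=1000):
    """Smallest and largest grid values of p0 in the acceptance region."""
    accepted = [k / grid for k in range(1, grid)
                if lr_tail_prob(m, s, n1, r1, n2, k / grid) >= alpha]
    return min(accepted), max(accepted)
```

Taking the smallest and largest accepted grid values sidesteps the rare case in which the acceptance region is not an interval, at the cost of a slightly conservative interval in that case.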
Results and discussion
Simulation study
We conduct simulation studies to evaluate the likelihood ratio test based CI construction and the conditional likelihood based UMVUE, and compare their performance with the approach of Koyama and Chen [2]. In particular, we selected the designs from Tables 1 and 2 in Simon’s paper [1] and simulated 5,000 data sets based on various values of π. If a simulated study continues to the second stage under the specified design, the actual sample size at the second stage of the study $n_{2}^{*}$ is generated via an equal-probability multinomial distribution that ranges from n _{2}/3 to 1.5n _{2}. We have also examined other possible ranges of $n_{2}^{*}$ and found similar results. We report only the 90 % CI widths and coverage as well as the actual power from the two methods in Tables 1, 2 and 3, and visualize the comparison of the corresponding CI widths, CI coverage, and bias in Figs. 1, 2, 3 and 4. Since the two methods yield the same CIs in the first stage, we present the CI width comparison only for simulated studies that proceed to the second stage. From the tables, we see that the average CI widths based on conditional likelihood are either similar to or smaller than those based on Koyama and Chen [2] in most cases. In some cases, the improvement can be quite substantial (Figs. 1, 2, 3 and 4).
We also compare CI coverage and bias based on all simulated studies, including those stopped after the first stage. We see that the CI coverage is similar between the two methods. The conditional likelihood UMVUE has uniformly smaller biases than the estimate based on Koyama and Chen [2], especially when the underlying true probability is large.
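The data-generating step of such a simulation can be sketched as follows. This is an assumption-laden illustration rather than the simulation code used for the tables: in particular, rounding n _{2}/3 up and 1.5n _{2} down to integers is a choice made here, and the function name and seed are hypothetical.

```python
import random
from math import ceil, floor

def simulate_trial(n1, r1, n2, p, rng):
    """Simulate one study; returns (stopping stage, x1, x2, actual stage 2 size)."""
    x1 = sum(rng.random() < p for _ in range(n1))
    if x1 <= r1:
        return 1, x1, 0, 0  # stopped for futility after stage 1
    # Non-informative deviation: n2* drawn uniformly, independent of outcomes.
    n2_star = rng.randint(ceil(n2 / 3), floor(1.5 * n2))
    x2 = sum(rng.random() < p for _ in range(n2_star))
    return 2, x1, x2, n2_star

rng = random.Random(2015)  # fixed seed so the sketch is reproducible
trials = [simulate_trial(19, 3, 20, 0.30, rng) for _ in range(5000)]
frac_stage2 = sum(t[0] == 2 for t in trials) / len(trials)
```

Each simulated tuple can then be passed to whichever estimator or interval procedure is being evaluated, and bias, width and coverage averaged over the 5,000 replicates.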
Real example
Advanced hepatobiliary cancers have a poor prognosis, in part complicated by underlying liver dysfunction. Although surgical resection and liver transplantation can be curative for select patients, those with advanced disease have few treatment options, with survival of 6–12 months. GI06-101 was a multi-institutional study conducted by the Hoosier Oncology Group that aimed to assess the efficacy of erlotinib (Tarceva, OSI-774; OSI Pharmaceuticals, Melville, NY) in combination with docetaxel in refractory hepatobiliary cancers [15]. Due to similarly poor outcomes and few existing treatment options for refractory disease at the time of this study’s design in 2006, both hepatocellular cancers and biliary tract cancers were included.
The primary end point of this trial was the rate of progression-free survival (PFS) at 16 weeks. PFS was defined as the time from the start of treatment until disease progression or death from any cause, whichever occurred first. A Simon optimal two-stage design tested the null hypothesis that the 16-week PFS rate is at most π _{0}=15 % (clinically inactive) against the alternative that it is at least π _{1}=30 % (warranting further study). The design used 0.10 as the level of significance and 80 % as the power. This led to n _{1}=19, r _{1}=3, n _{ t }=39, and r _{ t }=8.
Among the 19 patients of the first stage, 8 were progression free at 16 weeks. The study went on to the second stage and was terminated due to lack of funding after recruiting 6 patients. Among these 6 patients, 4 were progression free at 16 weeks. Therefore we have $n_{2}^{*}=6$ , x _{1}=8, and x _{2}=4. The resulting estimate of the 16-week PFS rate is 0.435 with 90 % confidence interval (0.271,0.605) based on Koyama and Chen’s method, compared with 0.48 with 90 % confidence interval (0.322,0.646) based on the conditional likelihood method. The conditional likelihood based estimate is larger and has a narrower CI.
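As a check on the arithmetic, and assuming the reported conditional likelihood estimate is the UMVUE of Jung and Kim [8] with $n_{2}^{*}=6$ in place of n _{2}, the value 0.48 can be reproduced by hand. With n _{1}=19, r _{1}=3 and s=x _{1}+x _{2}=12, the stage 1 counts compatible with the observed total run from (r _{1}+1)∨(s−$n_{2}^{*}$)=6 to s∧n _{1}=12, and $\binom{18}{x_{1}-1}=\frac{x_{1}}{19}\binom{19}{x_{1}}$ , so that

$$ \tilde{\pi} = \frac{\sum_{x_{1}=6}^{12}\binom{18}{x_{1}-1}\binom{6}{12-x_{1}}}{\sum_{x_{1}=6}^{12}\binom{19}{x_{1}}\binom{6}{12-x_{1}}} = \frac{1}{19}\cdot\frac{\sum_{x_{1}=6}^{12}x_{1}\binom{19}{x_{1}}\binom{6}{12-x_{1}}}{\sum_{x_{1}=6}^{12}\binom{19}{x_{1}}\binom{6}{12-x_{1}}} = \frac{1}{19}\cdot\frac{12\times 19}{25} = 0.48. $$

The middle ratio is the mean of a hypergeometric distribution (12 draws from 25 patients, 19 of whom belong to stage 1), whose full support is 6 to 12. Because the futility boundary does not truncate this support, the UMVUE here coincides with the naive proportion 12/25.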
Conclusions
Koyama and Chen [2] considered the statistical inference problem for phase II studies based on Simon’s two-stage designs when there are study deviations at the second stage. We propose an alternative method for this problem based on the likelihood principle. In addition to providing inference for a couple of scenarios where Koyama and Chen’s method breaks down, the resulting estimate appears to have certain advantages in terms of bias magnitude and confidence interval width in many cases.
Sample size change can also happen in the first stage [4, 16]. Our method of inference should be applicable if such a change is not related to the actual outcome. There is also recent research on adaptive Simon’s two-stage designs [17], where the second stage sample size is decided at the end of stage 1 based on observed responses. The decision can be to extend the study because there are fewer positive responses than expected or to shorten the study because there are more positive responses than expected. Our method should also be applicable in this case. However, the whole likelihood, incorporating the mechanism that determines the second stage sample size, needs to be used.
References
1. Simon R. Optimal two-stage designs for phase II clinical trials. Controlled Clin Trials. 1989; 10:1–10.
2. Koyama T, Chen H. Proper inference from Simon’s two-stage designs. Stat Med. 2008; 27:3145–154.
3. Porcher R, Desseaux K. What inference for two-stage phase II trials? BMC Med Res Methodol. 2012; 12:117.
4. Green S, Dahlberg S. Planned versus attained design in phase II clinical trials. Stat Med. 1992; 11:853–62.
5. Masaki N, Koyama T, Yoshimura I, Hamada C. Optimal two-stage designs allowing flexibility in number of subjects for phase II clinical trials. J Biopharm Stat. 2009; 19:721–31.
6. Li Y, Mick R, Heitjan D. A Bayesian approach for unplanned sample sizes in phase II cancer clinical trials. Clin Trials. 2012; 9:293–302.
7. Zeng D, Gao F, Hu K, Jia C, Ibrahim J. Hypothesis testing for two-stage designs with over or under enrollment. Stat Med. 2015. In press.
8. Jung S, Kim K. On the estimation of the binomial probability in multistage clinical trials. Stat Med. 2004; 23:881–96.
9. Emerson S, Fleming T. Parameter estimation following group sequential hypothesis testing. Biometrika. 1990; 77:875–92.
10. Rosner G, Tsiatis A. Exact confidence intervals following a group sequential trial: A comparison of methods. Biometrika. 1988; 75:723–9.
11. Whitehead J. On the bias of maximum likelihood estimation following a sequential test. Biometrika. 1986; 73:573–81.
12. Chang M, O’Brien P. Confidence intervals following group sequential tests. Controlled Clin Trials. 1986; 7:18–26.
13. Chang M, Wieand H, Chang V. The bias of the sample proportion following a group sequential phase II clinical trial. Stat Med. 1989; 8:563–70.
14. Jennison C, Turnbull B. Confidence intervals for a binomial parameter following a multistage test with application to MIL-STD 105D and medical trials. Technometrics. 1983; 25:49–58.
15. Hoosier Cancer Research Network. Erlotinib in Combination With Docetaxel in Advanced Hepatocellular and Biliary Tract Carcinomas. https://clinicaltrials.gov/ct2/show/NCT00532441.
16. Chen T, Ng T. Optimal flexible designs in phase II clinical trials. Stat Med. 1998; 17:2301–312.
17. Banerjee A, Tsiatis A. Adaptive two-stage designs in phase II clinical trials. Stat Med. 2006; 25:3382–395.
Acknowledgements
This work was supported by Chinese National Science Foundation Projects 81470737, 81400496, and 81300911.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
JZ analysed the data and conducted simulations. MY motivated the idea of the manuscript and drafted the manuscript. XPF analysed the data, drafted the manuscript and interpreted the results. All authors read and approved the final manuscript.
Keywords
 Clinical trials
 Simon’s two-stage designs
 Likelihood
 Phase II studies