Statistical inference for extended or shortened phase II studies based on Simon’s two-stage designs

Background: Simon’s two-stage designs are popular choices for conducting phase II clinical trials, especially in oncology, because they reduce the number of patients placed on ineffective experimental therapies. Recently, Koyama and Chen (2008) discussed how to conduct proper inference for such studies, because they found that inference procedures used with Simon’s designs almost always ignore the actual sampling plan used. In particular, they proposed an inference method for studies in which the actual second stage sample size differs from the planned one.
Methods: We consider an alternative inference method based on the likelihood ratio. In particular, we order permissible sample paths under Simon’s two-stage designs using their corresponding conditional likelihood. In this way, we can calculate p-values using the common definition: the probability of obtaining a test statistic value at least as extreme as that observed under the null hypothesis.
Results: In addition to providing inference for a couple of scenarios where Koyama and Chen’s method can be difficult to apply, the resulting estimate based on our method appears to have certain advantages in terms of inference properties in many numerical simulations. It generally led to smaller biases and narrower confidence intervals while maintaining similar coverage. We also illustrate the two methods in a real data setting.
Conclusions: Inference procedures used with Simon’s designs almost always ignore the actual sampling plan. Reported p-values, point estimates and confidence intervals for the response rate are not usually adjusted for the design’s adaptiveness. Proper statistical inference procedures should be used.

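Koyama and Chen [2] (hereafter KC) pointed out that reported p-values, point estimates and confidence intervals for the response rate are not usually adjusted for the design's adaptiveness. They outlined proper statistical inference procedures for studies based on Simon's two-stage designs.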
Because the actual stage 2 sample size may differ from the planned one for various reasons, KC also proposed a way to conduct hypothesis testing when the stage 2 sample size is changed in a Simon's design. They focused on the case of non-informative sample size change at the second stage. In other words, the actual stage 1 sample size always equals the planned stage 1 sample size, but the actual stage 2 sample size can differ from the planned one. In addition, the decision to use a different sample size must be independent of the observed outcome data. Inference then needs to be made based on the actual data. This is in contrast to adaptive designs, which can alter the sample size based on interim results. We restrict our attention to the same setting as KC, although we believe our method can be extended.
Non-informative sample size changes or protocol deviations can arise quite frequently in practice. Shortening of stage 2 can occur when a study is terminated early because of lack of funding, slow accrual, non-informative drop-outs, accrual of ineligible subjects, etc. Such shortening of the stage 2 sample size can reasonably be assumed to be independent of the outcomes of the study. Extension of stage 2 can occur because of site coordination errors, overcompensation for unevaluable or dropout patients, or administrative reasons.
In applying KC's method, we encountered difficulties in the calculation for certain scenarios due to the discrete nature of the binomial distribution. One such scenario arises when the number of responders $x_1$ at the first stage exceeds the final boundary $r_t$, which can happen with an (unexpectedly) efficacious treatment. Because Simon's two-stage design does not stop for early efficacy [1], the study would continue to the second stage. In this case, KC's method breaks down. Another possible problem arises when there are no responders at the second stage, that is, $x_2 = 0$. We give a detailed explanation after reviewing their method in the next section. We therefore introduce a different method for inference based on conditional likelihood. Besides enabling proper inference in the settings where KC's method may be difficult to apply, our method is also seen to improve on statistical properties in many of the settings we have investigated.
Porcher and Desseaux [3] considered different approaches for point and confidence interval estimation, as well as computation of p-values, in the same setting as KC. In their methods, the rankings used for computing p-values were based on estimators instead of the likelihood. They recommended the uniformly minimum variance unbiased estimator (UMVUE) as it exhibited good properties. In particular, they pointed out that when the second stage sample size is unaltered, the method based on the UMVUE is equivalent to KC [3]. For this reason, our method should also improve on their methods.
In addition to [2,3], other related works exist. Green and Dahlberg [4] were among the first to consider settings that accommodate a modified sample size in both stages, even though the proposed analysis method was ad hoc. Masaki et al. [5] considered designs for a range of possible stage 1 and total sample size deviations from the planned study. Li et al. [6] formulated a Bayesian approach with a modified sample size. Their method can have desirable frequentist properties under certain types of priors. Recently, Zeng et al. [7] considered computational improvements and proposed a normal approximation that is accurate even under small sample sizes.

Review of Koyama and Chen (2008)
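The KC method centers mainly on the calculation of p-values. Throughout, we use $P_\pi(E)$ to represent the probability of the event $E$ at a specific $\pi$. The design has planned stage sample sizes $n_1$ and $n_2$ (total $n_t = n_1 + n_2$), a first stage futility boundary $r_1$ and a final boundary $r_t$. Let $x_1$ and $x_2$ denote the observed numbers of responders at stages 1 and 2 of a study based on Simon's two-stage design.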
If $x_1 \le r_1$, the trial is stopped early at the first stage due to futility. In this case, the p-value is given by $P_{\pi_0}[X_1 \ge x_1 \mid n_1]$, which can be easily computed from the binomial distribution with size $n_1$ and success probability $\pi_0$.
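If $x_1 > r_1$, the trial continues to the second stage. In this case, the p-value calculation is based on the observed sample paths and is given by
$$p = \sum_{x = r_1 + 1}^{n_1} P_{\pi_0}[X_1 = x \mid n_1]\, P_{\pi_0}[X_2 \ge x_1 + x_2 - x \mid n_2], \tag{1}$$
where $P_{\pi_0}[X_2 \ge x_1 + x_2 - x \mid n_2]$ represents the sample paths more 'extreme' than the observed one given that $x > r_1$ responses are observed at stage 1. The actual type I error and power are evaluated through
$$\sum_{x = r_1 + 1}^{n_1} P_{\pi}[X_1 = x \mid n_1]\, P_{\pi}[X_2 > r_t - x \mid n_2]$$
at $\pi = \pi_0$ (under $H_0$) and $\pi = \pi_1$ (under $H_1$), respectively. Let $A(x, n_2, \pi) \equiv P_{\pi}[X_2 > r_t - x \mid n_2]$ be the conditional rejection rate of $H_0$ at the end of stage 2 given $X_1 = x$. Then the rejection rule at the end of stage 2, $x_1 + x_2 > r_t$, is equivalent to
$$P_{\pi_0}[X_2 \ge x_2 \mid n_2] \le A(x_1, n_2, \pi_0),$$
where $A(x_1, n_2, \pi_0)$ serves as a conditional critical value.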
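The quantities in this paragraph can be computed directly from binomial probabilities. The following is a minimal Python sketch, not the authors' code: the function names are hypothetical, and the design values are borrowed from the real example reported later in the paper ($n_1 = 19$, $r_1 = 3$, $n_t = 39$, $r_t = 8$, $\pi_0 = 0.15$).

```python
from math import comb

def binom_pmf(k, n, p):
    """P[X = k] for X ~ Binomial(n, p); zero outside 0..n."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_tail(k, n, p):
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(binom_pmf(j, n, p) for j in range(max(k, 0), n + 1))

def pvalue_planned(x1, x2, n1, n2, r1, pi0):
    """p-value when stage 2 is run at its planned size n2."""
    if x1 <= r1:  # trial stopped for futility at stage 1
        return binom_tail(x1, n1, pi0)
    total = x1 + x2
    return sum(binom_pmf(x, n1, pi0) * binom_tail(total - x, n2, pi0)
               for x in range(r1 + 1, n1 + 1))

def cond_reject_rate(x, n2, rt, pi):
    """A(x, n2, pi) = P_pi[X2 > rt - x | n2]."""
    return binom_tail(rt - x + 1, n2, pi)

def rejection_prob(n1, n2, r1, rt, pi):
    """Type I error (pi = pi0) or power (pi = pi1) of the design."""
    return sum(binom_pmf(x, n1, pi) * cond_reject_rate(x, n2, rt, pi)
               for x in range(r1 + 1, n1 + 1))

# Illustrative design values (taken from the paper's real example).
n1, r1, n2, rt, pi0 = 19, 3, 20, 8, 0.15
alpha_actual = rejection_prob(n1, n2, r1, rt, pi0)
# A path that just crosses the final boundary: x1 + x2 = rt + 1.
p_boundary = pvalue_planned(5, 4, n1, n2, r1, pi0)
print(alpha_actual, p_boundary)
```

By construction, the p-value of a path that just crosses the final boundary coincides with the actual type I error, which is one way to sanity-check an implementation.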
When the actual sample size of stage 2, denoted by $n_2^*$, deviates from $n_2$, $A(x_1, n_2, \pi)$ can still be used as a conditional criterion for decision making. That is, $H_0$ is rejected when
$$P_{\pi_0}[X_2 \ge x_2 \mid n_2^*] \le A(x_1, n_2, \pi_0).$$
However, in the presence of a second stage sample size deviation, the p-value cannot be directly extended from (1) because the observed total number of responses $x_1 + x_2$ is no longer a good ranking determinant of 'extremeness'. In particular, KC gave a concrete example in which two different sample paths $(x_1, x_2)$ and $(x_1^*, x_2^*)$ with the same total number of responses ($x_1^* + x_2^* = x_1 + x_2$) and the same deviated stage 2 sample size $n_2^*$ may lead to different conclusions about the hypothesis. Therefore, Koyama and Chen [2] proposed the following way of calculating the p-value.
One difficulty with this calculation arises when $x_1 > r_t$. Although infrequent, this happens when the investigational treatment is unexpectedly efficacious. Because Simon's two-stage designs do not stop for early efficacy [1], the study continues to the second stage. In this case, we have $A(x_1, n_2, \pi) \equiv 1$ for any $\pi$. Therefore $\pi^*$ cannot be determined from step (a) above and the algorithm breaks down.
Another possible problem arises when $x_2 = 0$. In this case, $P_{\pi_0}[X_2 \ge x_2 \mid n_2^*] \equiv 1$ for any $n_2^*$. When $x_1 \le r_t$, this corresponds to the solution $\pi^* = 1$. Therefore the corresponding p-value is independent of $n_2^*$ and equals $n_1$. This may not be sensible, as it is independent of both the observed number of responses $x_1$ and the actual second stage sample size $n_2^*$. We therefore introduce a different method for inference based on likelihood.
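Both difficulties can be checked numerically. The short Python sketch below uses hypothetical helper names and design values borrowed from the real example later in the paper ($n_2 = 20$, $r_t = 8$). It shows that $A(x_1, n_2, \pi)$ equals 1 for every $\pi$ once $x_1 > r_t$, so no single $\pi^*$ is identified, and that the stage 2 tail probability at $x_2 = 0$ equals 1 whatever the realized $n_2^*$ is.

```python
from math import comb

def binom_tail(k, n, p):
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(max(k, 0), n + 1))

def cond_reject_rate(x, n2, rt, pi):
    """A(x, n2, pi) = P_pi[X2 > rt - x | n2]."""
    return binom_tail(rt - x + 1, n2, pi)

n2, rt = 20, 8

# Case 1: x1 = 9 > rt, so the final boundary is already crossed at stage 1.
a_values = [cond_reject_rate(9, n2, rt, pi) for pi in (0.05, 0.15, 0.30, 0.60, 0.95)]

# Case 2: x2 = 0, for several realized stage 2 sample sizes.
tail_values = [binom_tail(0, n_star, 0.15) for n_star in (3, 6, 20, 30)]

print(a_values)
print(tail_values)
```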

Likelihood based construction of confidence intervals
We extend the existing likelihood based inference for two-stage and multiple stage trials [8][9][10][11][12] to our setting for the construction of p-values and confidence intervals. In particular, we order permissible sample paths under Simon's two-stage designs using their corresponding conditional likelihood. In this way, we can calculate p-values using the common definition: the probability of obtaining a test statistic value at least as extreme as that observed under $H_0$.
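Let $M$ denote the stopping stage, and let $S_M$ denote the total number of responders accumulated up to the stopping stage. That is, $S_M = X_1$ when $M = 1$ and $S_M = X_1 + X_2$ when $M = 2$. Similarly, let $N_M$ be the total sample size of the study. The probability mass function of the random vector $(M, S_M)$ is given by
$$f(m, s \mid \pi) = \begin{cases} \binom{n_1}{s}\, \pi^{s} (1 - \pi)^{n_1 - s}, & m = 1,\ 0 \le s \le r_1, \\ \left[ \sum_{x_1 = (r_1 + 1) \vee (s - n_2)}^{n_1 \wedge s} \binom{n_1}{x_1} \binom{n_2}{s - x_1} \right] \pi^{s} (1 - \pi)^{n_1 + n_2 - s}, & m = 2,\ r_1 + 1 \le s \le n_1 + n_2, \end{cases} \tag{2}$$
where $\wedge$ takes the minimum and $\vee$ takes the maximum of its arguments. Jung and Kim [8] showed that $(M, S_M)$ is complete and sufficient for $\pi$. The MLE of $\pi$ is $\hat{\pi} = S_M / N_M$. However, the MLE is biased [11,13]. Based on the fact that $X_1 / n_1$ is always an unbiased estimator of the true probability $\pi$, Jung and Kim [8] derived the UMVUE of $\pi$ to be
$$\tilde{\pi}(m, s) = \begin{cases} s / n_1, & m = 1, \\ \dfrac{\sum_{x_1 = (r_1 + 1) \vee (s - n_2)}^{n_1 \wedge s} \binom{n_1 - 1}{x_1 - 1} \binom{n_2}{s - x_1}}{\sum_{x_1 = (r_1 + 1) \vee (s - n_2)}^{n_1 \wedge s} \binom{n_1}{x_1} \binom{n_2}{s - x_1}}, & m = 2. \end{cases} \tag{3}$$
The existence of the UMVUE $\tilde{\pi}$ also facilitates the determination of confidence intervals. In particular, an exact $100(1 - \alpha)\%$ confidence interval $(\pi_L, \pi_U)$ for $\pi$ is given by
$$P_{\pi_L}[\tilde{\pi}(M, S_M) \ge \tilde{\pi}(m, s)] = \alpha / 2, \qquad P_{\pi_U}[\tilde{\pi}(M, S_M) \le \tilde{\pi}(m, s)] = \alpha / 2.$$
Jung and Kim [8] showed that such ordering of the sample space by the UMVUE is the same as that by Jennison and Turnbull [14]. Chang and O'Brien [12] showed that the likelihood ratio based construction is more efficient and led to smaller average CI length.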
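The joint distribution of $(M, S_M)$, the MLE and the UMVUE can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, the formulas are written out from the design itself (futility stop when $X_1 \le r_1$) and from conditioning $X_1 / n_1$ on $(M, S_M)$, and the second stage size passed in may be either the planned $n_2$ or a realized $n_2^*$.

```python
from math import comb

def path_range(s, n1, n2, r1):
    """Stage 1 counts x1 compatible with continuing and with total s."""
    return range(max(r1 + 1, s - n2), min(n1, s) + 1)

def pmf(m, s, pi, n1, n2, r1):
    """f(m, s | pi): probability of stopping at stage m with s responders."""
    if m == 1:
        if 0 <= s <= r1:
            return comb(n1, s) * pi**s * (1 - pi)**(n1 - s)
        return 0.0
    paths = sum(comb(n1, x1) * comb(n2, s - x1) for x1 in path_range(s, n1, n2, r1))
    return paths * pi**s * (1 - pi)**(n1 + n2 - s)

def sample_space(n1, n2, r1):
    """All attainable (m, s) pairs."""
    stage1 = [(1, s) for s in range(r1 + 1)]
    stage2 = [(2, s) for s in range(r1 + 1, n1 + n2 + 1)]
    return stage1 + stage2

def mle(m, s, n1, n2):
    """Naive estimate S_M / N_M."""
    return s / (n1 if m == 1 else n1 + n2)

def umvue(m, s, n1, n2, r1):
    """E[X1 / n1 | M = m, S_M = s]."""
    if m == 1:
        return s / n1
    xs = path_range(s, n1, n2, r1)
    num = sum(comb(n1 - 1, x1 - 1) * comb(n2, s - x1) for x1 in xs)
    den = sum(comb(n1, x1) * comb(n2, s - x1) for x1 in xs)
    return num / den

def expectation(estimator, pi, n1, n2, r1):
    """Exact expectation of an estimator over the sample space."""
    return sum(pmf(m, s, pi, n1, n2, r1) * estimator(m, s)
               for m, s in sample_space(n1, n2, r1))

# Illustrative values: stage 1 of the paper's real example, stage 2 of size 6.
n1, r1, n2 = 19, 3, 6
for pi in (0.15, 0.30, 0.50):
    e_mle = expectation(lambda m, s: mle(m, s, n1, n2), pi, n1, n2, r1)
    e_umvue = expectation(lambda m, s: umvue(m, s, n1, n2, r1), pi, n1, n2, r1)
    print(pi, e_mle - pi, e_umvue - pi)
```

Taking exact expectations over the sample space, as in the loop above, gives a direct check that the UMVUE has zero bias while the MLE does not.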
When there is study extension or shortening, the second stage sample size $n_2$ becomes a random variable. The likelihood can depend on the probability that $n_2$ takes a specific value $n_2^*$. However, when such a change of sample size is not related to $\pi$, the above likelihood can be viewed as the conditional likelihood given the observed value $n_2^*$ and can therefore be used to make inference. The UMVUE takes the same form as in (3) except with $n_2^*$ in place of $n_2$.
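The likelihood ratio test of $H_0: \pi = \pi_0$ vs. $H_1: \pi \ne \pi_0$ is based on the likelihood ratio statistic for $(M, S_M)$: sample paths are ordered by this statistic, and the p-value $P_{\pi_0}$ is the probability under $H_0$ of obtaining a sample path at least as extreme as the one observed. The acceptance region defined as $\{\pi_0 : P_{\pi_0} \ge \alpha\}$ can be used to form the limits of a $100(1 - \alpha)\%$ confidence interval for $\pi$. Note that the region so defined may not be an interval. However, such cases are rare and have minimal impact on the confidence interval performance [12].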
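A likelihood ratio ordering of this kind can be sketched in Python as below. The exact form of the statistic is an assumption of this sketch: it uses $\Lambda(m, s) = f(m, s \mid \pi_0) / f(m, s \mid \hat{\pi})$ with $\hat{\pi} = s / N_m$, treats smaller $\Lambda$ as more extreme, and obtains confidence limits by scanning a grid of $\pi_0$ values. All function names are hypothetical, and the numbers are the observed data of the real example later in the paper.

```python
from math import comb

def n_paths(m, s, n1, n2, r1):
    """Number of sample paths leading to the outcome (m, s)."""
    if m == 1:
        return comb(n1, s)
    lo, hi = max(r1 + 1, s - n2), min(n1, s)
    return sum(comb(n1, x) * comb(n2, s - x) for x in range(lo, hi + 1))

def kernel(s, n, pi):
    """pi^s (1 - pi)^(n - s); the path count cancels in the ratio."""
    return pi**s * (1 - pi)**(n - s)

def outcomes(n1, n2, r1):
    """All attainable (m, s, N_m) triples."""
    stage1 = [(1, s, n1) for s in range(r1 + 1)]
    stage2 = [(2, s, n1 + n2) for s in range(r1 + 1, n1 + n2 + 1)]
    return stage1 + stage2

def lr_stat(s, n, pi0):
    """Assumed statistic: likelihood at pi0 over likelihood at the MLE s/n."""
    return kernel(s, n, pi0) / kernel(s, n, s / n)

def lr_pvalue(m_obs, s_obs, pi0, n1, n2, r1):
    """Null probability of outcomes whose ratio is no larger than observed."""
    n_obs = n1 if m_obs == 1 else n1 + n2
    observed = lr_stat(s_obs, n_obs, pi0)
    return sum(n_paths(m, s, n1, n2, r1) * kernel(s, n, pi0)
               for m, s, n in outcomes(n1, n2, r1)
               if lr_stat(s, n, pi0) <= observed * (1 + 1e-9))  # tolerance for ties

def lr_conf_limits(m_obs, s_obs, alpha, n1, n2, r1, step=0.001):
    """Smallest and largest grid values of pi0 that are not rejected."""
    grid = [i * step for i in range(1, round(1 / step))]
    kept = [p for p in grid if lr_pvalue(m_obs, s_obs, p, n1, n2, r1) >= alpha]
    return min(kept), max(kept)

# Observed data of the real example: n1 = 19, r1 = 3, n2* = 6, x1 + x2 = 12.
p_null = lr_pvalue(2, 12, 0.15, 19, 6, 3)
limits = lr_conf_limits(2, 12, 0.10, 19, 6, 3)
print(p_null, limits)
```

Reporting only the smallest and largest accepted grid values is a simplification: as noted above, the acceptance region need not be an interval.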

Simulation study
We conduct simulation studies to evaluate the likelihood ratio test based CI construction and the conditional likelihood based UMVUE, and compare their performance with the approach of Koyama and Chen [2]. In particular, we selected the designs from Tables 1 and 2 in Simon's paper [1] and simulated 5,000 data sets based on various values of $\pi$. If a simulated study continues to the second stage under the specified design, the actual second stage sample size $n_2^*$ is generated from an equal-probability multinomial distribution (that is, uniformly) over the range from $n_2/3$ to $1.5 n_2$. We have also examined other possible ranges of $n_2^*$ and found similar results. We report only the 90 % CI widths and coverage as well as the actual power of the two methods in Tables 1, 2 and 3, and visualize the comparison of the corresponding CI widths, CI coverage, and bias in Figs. 1, 2, 3 and 4. Since the two methods yield the same CIs in the first stage, we only present the CI width comparison for studies that reach the second stage in our simulation. From the tables, we see that the average CI widths based on conditional likelihood are either similar to or smaller than those based on Koyama and Chen [2] in most cases. In some cases, the improvement can be substantial (Figs. 1, 2, 3 and 4).
We also compare CI coverage and bias based on all simulated studies, including those stopped after the first stage. The CI coverage is similar between the two methods. The conditional likelihood UMVUE has uniformly smaller biases than the estimate based on Koyama and Chen [2], especially when the underlying true probability is large.
Figs. 1, 2, 3 and 4 (shared caption) Confidence interval width comparison is based on studies that reached the second stage; coverage is to be compared with 90 %; bias is the absolute value of the difference between the estimate and the true probability of response
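The data-generating step of such a simulation can be sketched in Python as follows. This is a minimal sketch with several assumptions that are not stated in the text: the range endpoints are rounded to $\lceil n_2/3 \rceil$ and $\lfloor 1.5 n_2 \rfloor$, the design is the one from the real example rather than one from Simon's tables, and the names and the seed are hypothetical.

```python
import random

def simulate_trial(pi, n1, r1, n2, rng):
    """Simulate one trial whose stage 2 size is changed non-informatively."""
    x1 = sum(rng.random() < pi for _ in range(n1))
    if x1 <= r1:  # futility stop after stage 1
        return {"stage": 1, "x1": x1, "n2_star": 0, "x2": 0}
    lo = -(-n2 // 3)      # ceil(n2 / 3); the rounding is an assumption
    hi = (3 * n2) // 2    # floor(1.5 * n2); the rounding is an assumption
    # Drawn independently of the outcomes, with equal probability on lo..hi.
    n2_star = rng.randint(lo, hi)
    x2 = sum(rng.random() < pi for _ in range(n2_star))
    return {"stage": 2, "x1": x1, "n2_star": n2_star, "x2": x2}

rng = random.Random(2024)  # arbitrary seed for reproducibility
trials = [simulate_trial(0.15, 19, 3, 20, rng) for _ in range(5000)]
stopped_early = sum(t["stage"] == 1 for t in trials) / len(trials)
print(stopped_early)
```

Each simulated data set would then be passed to the two inference methods to obtain the estimates and confidence intervals that are compared.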

Real example
Advanced hepatobiliary cancers have a poor prognosis, in part complicated by underlying liver dysfunction. Although surgical resection and liver transplantation can be curative for select patients, those with advanced disease have few treatment options, with survival rates of 6-12 months. GI06-101 was a multi-institutional study conducted by the Hoosier Oncology Group that aimed to assess the efficacy of erlotinib (Tarceva, OSI-774; OSI Pharmaceuticals, Melville, NY) in combination with docetaxel in refractory hepatobiliary cancers [15]. Due to similarly poor outcomes and few existing treatment options for refractory disease at the time of this study's design in 2006, both hepatocellular cancers and biliary tract cancers were included.
The primary end point of this trial was the rate of progression free survival (PFS) at 16 weeks. PFS was defined as the time from the start of treatment until disease progression or death from any cause, whichever occurred first. A Simon optimal two-stage design tested the null hypothesis that the 16-week PFS rate is $\pi \le \pi_0 = 15\%$ (clinically inactive) versus the alternative $\pi \ge \pi_1 = 30\%$ (warranting further study). The design used a significance level of 0.10 and a power of 80 %. This led to $n_1 = 19$, $r_1 = 3$, $n_t = 39$, and $r_t = 8$.
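The operating characteristics of this design can be recomputed from binomial probabilities. The Python sketch below uses hypothetical names and only the design values given above. It returns the probability of early termination, the expected sample size, and the probability of rejecting $H_0$ at a given $\pi$, which can be compared with the nominal level and power.

```python
from math import comb

def binom_pmf(k, n, p):
    """P[X = k] for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def operating_characteristics(n1, r1, nt, rt, pi):
    """Early-termination probability, expected sample size and rejection rate."""
    n2 = nt - n1
    pet = sum(binom_pmf(x, n1, pi) for x in range(r1 + 1))
    reject = 0.0
    for x1 in range(r1 + 1, n1 + 1):
        need = max(rt - x1 + 1, 0)  # stage 2 responders needed to exceed rt
        tail = sum(binom_pmf(x2, n2, pi) for x2 in range(need, n2 + 1))
        reject += binom_pmf(x1, n1, pi) * tail
    return {"pet": pet, "expected_n": n1 + (1 - pet) * n2, "reject": reject}

under_h0 = operating_characteristics(19, 3, 39, 8, 0.15)  # pi = pi0
under_h1 = operating_characteristics(19, 3, 39, 8, 0.30)  # pi = pi1
print(under_h0)
print(under_h1)
```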
Among the 19 patients of the first stage, 8 were progression free at 16 weeks. The study went on to the second stage and was terminated due to lack of funding after recruiting 6 patients. Among these 6 patients, 4 were progression free at 16 weeks. Therefore we have $n_2^* = 6$, $x_1 = 8$, and $x_2 = 4$. The resulting estimate of the 16-week PFS rate is 0.435 with 90 % confidence interval (0.271, 0.605) based on Koyama and Chen's method, compared with 0.48 with 90 % confidence interval (0.322, 0.646) based on the conditional likelihood method. The conditional likelihood based estimate is larger and its CI is slightly narrower (width 0.324 versus 0.334).

Conclusions
Koyama and Chen [2] considered the statistical inference problem for phase II studies based on Simon's two-stage designs when there are study deviations at the second stage. We propose an alternative method for this problem based on the likelihood. In addition to providing inference for a couple of scenarios where Koyama and Chen's method breaks down, the resulting estimate appears to have certain advantages in terms of bias magnitude and confidence interval width in many cases.
Sample size changes can also happen in the first stage [4,16]. Our method of inference should be applicable if such a change is not related to the actual outcome. There is also recent research on adaptive Simon's two-stage designs [17], in which the second stage sample size is decided at the end of stage 1 based on the observed responses. The decision can be to extend the study because there are fewer positive responses than expected, or to shorten the study because there are more positive responses than expected. Our method should also be applicable in that setting. However, the full likelihood, which incorporates the mechanism of the second stage sample size determination, needs to be used.