A comparison of a new multinomial stopping rule with stopping rules of Fleming and Gehan in single arm phase II cancer clinical trials

Background Response rate (RR) alone may be insensitive to drug activity in phase II trials. Early progressive disease (EPD) could improve sensitivity as well as increase stage I stopping rates. This study compares the previously developed dual endpoint stopping rule (DESR), which incorporates both RR and EPD into a two-stage, phase II trial, with rules using only RR. Methods Stopping rules according to the DESR were compared with studies conducted under the Fleming (16 trials) or Gehan (23 trials) designs. The RR hypothesis for the DESR was consistent with the comparison studies (ralt = 0.2, rnul = 0.05). Two parameter sets were used for EPD rates of interest and disinterest respectively (epdalt, epdnul): (0.4, 0.6) and (0.3, 0.5). Results Compared with Fleming, the DESR was more likely to allow stage two of accrual and to reject the null hypothesis (Hnul) after stage two, with rejection being more common with EPD parameters (0.4, 0.6) than (0.3, 0.5). Compared with Gehan, both DESR parameter sets accepted Hnul in 15 trials after stage I compared with 8 trials by Gehan, with consistent conclusions in all 23 trials after stage II. Conclusions The DESR may reject Hnul when EPD rates alone are low, and thereby may improve phase II trial sensitivity to active, cytostatic drugs having limited response rates. Conversely, the DESR may invoke early stopping when response rates are low and EPD rates are high, thus shortening trials when drug activity is unlikely. EPD parameters should be chosen specific to each trial.


Background
The increase in drugs available for study along with the human and resource costs for the conduct of clinical trials requires investigators to revisit trial design [1,2]. Nowhere is this more evident than in oncology, which must contend with more first-in-class drugs, longer development times, more drugs entering large phase III studies, and generally greater costs than other therapeutic areas [3]. In addition, the development of targeted drugs, which may induce limited tumour response, demands phase II trial designs which both minimize resource use and are sensitive and specific to signals of drug activity [4].
When response rate (RR) is used as a single primary endpoint, two sets of stopping rules have served as the basis for many prior two-stage phase II trials. The stopping rules of Gehan stop trials at the first stage when no response is observed [5]. The sample size for the first stage is based on a specified RR of interest and a beta error rate. If at least one response is observed, the second stage accrues using a sample size based on the desired standard error for the RR estimation and the number of responses observed in stage one. For the stopping rules of Fleming, the investigator specifies RRs of interest and disinterest as well as desired alpha and beta error rates [6]. Calculations determine the sample size in each stage and the minimum responses in stage one required to proceed to the second stage. The trial may be stopped after stage I of accrual to accept or reject the null hypothesis. Variations of the two-stage rules, such as those of Simon [7], have been designed to minimize the expected number of enrolled patients when the drug is inactive. Despite the introduction of new study methods, the designs of Gehan, Fleming, and Simon are still in common use [8,9].
Although RR remains the most common primary endpoint in phase II trials [8], disease stabilization may be a more appropriate endpoint for some agents and has also been associated with improved survival [10,11]. Similarly, a high rate of early progressive disease (EPD), defined here as progression at the first tumour measurement after initiation of treatment, correlates with poor survival [12,13]. Conversely, a low EPD rate may suggest drug activity, and could serve as a warning against early discard of a new agent. A combination of response and EPD as a multinomial endpoint would identify an active drug which produces a high response rate or low EPD rate.
Zee et al first derived stopping rules for a two-stage clinical trial with a multinomial endpoint of RR and EPD [14]. However, it was found that these stopping rules only achieved the desired power for an alternate hypothesis requiring both a sufficiently high RR and a sufficiently low EPD, whereas the study had sought power for an alternate hypothesis allowing for either a favourable RR or a favourable EPD [15]. Recently, a new rule set [16], the Dual Endpoint Stopping Rule (DESR), was derived to address this problem. The new stopping rules offer the desired power as well as high rates of early stopping for drugs meeting the null hypothesis, but have not been applied to real data from phase II clinical trials. The objective of this paper is to compare the DESR with the stopping rules of Fleming and Gehan in a series of phase II trials as summarized by Dent et al [14,17].

Methods
The Dual Endpoint Stopping Rule (DESR) for phase II trials with endpoints of response and early progressive disease (EPD) rates is described briefly here; a detailed description, including variations on the rules and sensitivity testing, has been published previously [16]. Specifically, the DESR is based on testing of the following hypotheses: H nul : r ≤ r nul and epd ≥ epd nul versus H alt : r ≥ r alt or epd ≤ epd alt , where the response rates (r nul , r alt ) and early progressive disease rates (epd nul , epd alt ) of interest are prespecified. These hypotheses imply that a new drug would be considered of interest for further study if either the response rate, r, was sufficiently high or the early progressive disease rate, epd, was sufficiently low; it is not necessary that both outcomes occur.
After additional study parameters, including the sample sizes for stage I (n 1 ) and stage II (n 2 ) of the trial and the desired alpha error rate and power, are provided, stopping rules are generated by simulations performed using TreeAge Pro Healthcare software (Williamstown, Massachusetts) with the Borderline Value Method [16], which assumes that response and EPD rates of the desirable drugs are not better than r = r alt or epd = epd alt . With the DESR, the trial would be stopped at the first stage after n 1 subjects are entered if n 1r ≤ n 1r-nul and n 1p ≥ n 1p-nul , where n 1r and n 1p are respectively the numbers of stage I patients who responded and who had early progression, and n 1r-nul and n 1p-nul are thresholds of the DESR. Barring stopping, n 2 more patients are recruited into the second stage. The null hypothesis will be rejected at the end of the second stage if n 1r + n 2r ≥ n 1r-alt + n 2r-alt or n 1p + n 2p ≤ n 1p-alt + n 2p-alt , where n 2r and n 2p are respectively the numbers of patients who responded and who had early progression at stage II, n 1r-alt + n 2r-alt represents the threshold number of responders required after stage II to conclude H alt , and n 1p-alt + n 2p-alt is similarly the threshold for the stage I and stage II subjects with early progression to conclude H alt .
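The two decision points described above can be written compactly. The following is only a minimal sketch of the decision logic, not the simulation used to derive the thresholds; the class and function names are illustrative, and the threshold values in the usage lines are those quoted in the text for the first row of Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesrThresholds:
    """Thresholds of a two-stage dual endpoint rule (field names are illustrative)."""
    stage1_max_responses: int  # n 1r-nul: stage I responses at or below this ...
    stage1_min_epd: int        # n 1p-nul: ... together with EPD count at or above this => stop
    final_min_responses: int   # n 1r-alt + n 2r-alt: total responses at or above this => reject H nul
    final_max_epd: int         # n 1p-alt + n 2p-alt: total EPD count at or below this => reject H nul


def stop_at_stage1(t: DesrThresholds, responses1: int, epd1: int) -> bool:
    """Stop (accept H nul) only if responses are low AND early progression is high."""
    return responses1 <= t.stage1_max_responses and epd1 >= t.stage1_min_epd


def reject_null_after_stage2(t: DesrThresholds, responses1: int, epd1: int,
                             responses2: int, epd2: int) -> bool:
    """Reject H nul if pooled responses are high OR pooled early progression is low."""
    total_responses = responses1 + responses2
    total_epd = epd1 + epd2
    return total_responses >= t.final_min_responses or total_epd <= t.final_max_epd


# Thresholds as quoted in the text for the first row of Table 1 (n 1 = n 2 = 15).
table1_row1 = DesrThresholds(stage1_max_responses=1, stage1_min_epd=8,
                             final_min_responses=4, final_max_epd=14)

# Hypothetical counts, for illustration only.
print(stop_at_stage1(table1_row1, responses1=1, epd1=9))
print(reject_null_after_stage2(table1_row1, responses1=0, epd1=5,
                               responses2=1, epd2=6))
```

The asymmetry between "and" at stage I and "or" at stage II mirrors the hypotheses: the null requires both a poor response rate and a poor EPD rate, whereas either favourable endpoint suffices for the alternative.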
Data from two sets of phase II trials previously studied by Dent et al [17] were used to evaluate the DESR and compare it with the stopping rules of Fleming and Gehan. The first set of these phase II trials was undertaken by the National Cancer Institute of Canada Clinical Trials Group, using the two-stage stopping rule of Fleming. Trials were designed based on testing of hypotheses H nul : r ≤ 5% and H alt : r ≥ 20%, which allows for continuation to the second stage of accrual (with n 2 = 15) if one or more responses are observed among the first n 1 = 15 patients. At the second stage, H nul is rejected if four or more responses are found. The second set of phase II trials was performed by the EORTC using the stopping rule of Gehan. The response rate of interest and beta error rate for the first stage were prespecified respectively as 20% and 0.05, which led to the sample size n 1 = 14. Recruitment to the second stage occurs if at least one response is seen, with the size of n 2 varying with the number of responses seen in the first stage in conjunction with a desired standard error. For comparison purposes, (r nul , r alt ) was selected as (0.05, 0.2) to derive DESR thresholds. Based on the work of Zee et al and others [Zee, 1999; Sekine, 1999], two plausible parameter sets were selected for EPD, (epd nul , epd alt ) = (0.6, 0.4) or (0.5, 0.3), to assess the impact of EPD on early stopping.
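The stage I sample size quoted for the Gehan design can be checked with a short calculation. The sketch below assumes the usual reading of Gehan's first stage: n 1 is the smallest number of patients for which the probability of observing no responses, (1 − p) raised to the power n 1, does not exceed beta when the true response rate equals the rate of interest p. The function name is illustrative.

```python
def gehan_stage1_size(rate_of_interest: float, beta: float) -> int:
    """Smallest n with (1 - rate_of_interest)**n <= beta.

    Assumes the conventional Gehan first-stage criterion: if the drug truly
    has the response rate of interest, the chance of seeing zero responses
    in stage I (and so discarding the drug) is at most beta.
    """
    if not 0.0 < rate_of_interest < 1.0:
        raise ValueError("rate_of_interest must lie strictly between 0 and 1")
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    n = 1
    while (1.0 - rate_of_interest) ** n > beta:
        n += 1
    return n


# Response rate of interest 20%, beta error 0.05, as in the EORTC trials.
print(gehan_stage1_size(0.2, 0.05))  # 14
```

With these inputs, 0.8 to the power 13 is about 0.055 and 0.8 to the power 14 is about 0.044, so 14 is the first stage size at which the criterion is met, in agreement with the n 1 = 14 stated above.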
The alpha error rate and power used to derive DESR thresholds were respectively 0.05 and 0.8, although actual error rates vary from these values according to the final thresholds selected by the program [Goffin, 2008]. For comparisons with the Fleming stopping rules, the sample sizes for both stages were set equal to those in the Fleming design; for comparisons with the Gehan stopping rules, they were set equal to the actual recruitment to the various EORTC studies. Table 1 shows the thresholds of the DESR for the null and alternate hypothesis corresponding with the studies utilizing the rules of Fleming. The table is read along the first row of results as follows: with desired study parameters of r nul = 0.05, r alt = 0.2, epd nul = 0.6, epd alt = 0.4, alpha error 0.05, power 0.8, and two stages of accrual of 15 patients each, the trial would be stopped at the first stage to reject the drug (accept the null hypothesis) if there were 1 or fewer responding patients and 8 or more patients with early progressive disease. Otherwise, the second stage would accrue, at the end of which the drug would be accepted (null hypothesis rejected) if 4 or more patients had responded to the drug or 14 or fewer progressed. This stopping rule would have an actual power of 0.796, alpha error of 0.025, and an expected number of 16.4 patients accrued if the drug under study was uninteresting (i.e. drug meeting H nul ). Two pairs of null and alternate values for epd are shown.
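The probability of stopping at stage I under a given pair of true rates can be computed exactly if each patient is assumed to fall independently into one of three categories (response, early progression, neither). This trinomial model and the function names below are assumptions made for illustration. The published operating characteristics were obtained by simulation with the Borderline Value Method [16], so values from this sketch need not coincide with those in Table 1.

```python
from math import comb


def prob_stop_stage1(n1: int, r: float, epd: float,
                     max_responses: int, min_epd: int) -> float:
    """P(responses <= max_responses and EPD count >= min_epd) among n1 patients.

    Assumes a trinomial model: each patient independently responds with
    probability r, progresses early with probability epd, or does neither.
    """
    other = 1.0 - r - epd
    if r < 0 or epd < 0 or other < -1e-12:
        raise ValueError("r and epd must be non-negative and sum to at most 1")
    other = max(other, 0.0)
    total = 0.0
    for k_r in range(0, min(max_responses, n1) + 1):
        for k_p in range(min_epd, n1 - k_r + 1):
            k_o = n1 - k_r - k_p
            ways = comb(n1, k_r) * comb(n1 - k_r, k_p)
            total += ways * r ** k_r * epd ** k_p * other ** k_o
    return total


def expected_sample_size(n1: int, n2: int, p_stop: float) -> float:
    """Expected accrual when stage II (n2 patients) runs only if stage I does not stop."""
    return n1 + (1.0 - p_stop) * n2


# Stage I rule from the first row of Table 1, evaluated at r = r nul, epd = epd nul.
p_stop = prob_stop_stage1(n1=15, r=0.05, epd=0.6, max_responses=1, min_epd=8)
print(round(p_stop, 3), round(expected_sample_size(15, 15, p_stop), 1))
```

The same function can be evaluated at other true rates, for example r = r alt with epd = epd nul, to see how often an active drug would be discarded at stage I under the assumed model.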

Results
Thresholds for DESR trials sized to match the studies conducted under the rules of Gehan are shown in Tables 2 and 3. Table 2 gives values for epd alt = 0.4, epd nul = 0.6, while Table 3 gives values for epd alt = 0.3, epd nul = 0.5.

Comparison with the Stopping Rules of Fleming
The comparison of the DESR and Fleming stopping rules for first stage stopping and second stage rejection of the null hypothesis is shown in Table 4. The DESR was more permissive at the first stage. For the EPD parameters epd alt = 0.4, epd nul = 0.6, the DESR allowed 6 of the 10 studies stopped by the Fleming rule to continue to the second stage of accrual, all on the basis of an acceptably low EPD rate. Using the EPD parameters epd alt = 0.3, epd nul = 0.5, the DESR allowed only 2 of these same 10 studies to continue to the second stage. In all cases where the DESR allowed accrual to the second stage but the rules of Fleming did not, the final conclusions about activity of the drugs from DESR were unknown since there was no data from the second stage of the trials and we could find no published phase III trial and no U.S. Food and Drug Administration (FDA) indication for the drugs and diseases under study in these phase II trials.
While six studies (Trials 11 through 16) were permitted to accrue to the second stage according to the Fleming rule, one study (Trial 11) was stopped by the investigators and this same study would have been stopped at stage one by the DESR. In the remaining five studies, H nul was rejected at end of study by the Fleming rule in two (12 and 16). By comparison, for the EPD parameters epd alt = 0.4, epd nul = 0.6, the DESR rejected H nul in all five trials at the end of stage II as a result of acceptable rates of EPD. Conversely, for the EPD parameters epd alt = 0.3, epd nul = 0.5, the DESR stopped three of the five trials at stage I, and rejected H nul after stage II in two trials (studies 12 and 15), with one consistent with the conclusion from Fleming rule (Trial 12). The differences again lay in the threshold for epd in the hypotheses under testing, with the EPD parameter set (epd alt = 0.3, epd nul = 0.5) requiring a lower observed rate of EPD for rejection of H nul than the EPD parameter set (epd alt = 0.4, epd nul = 0.6). In all cases where the DESR rejected H nul but Fleming did not, we found no phase III trial to confirm or deny drug activity, and no disease-specific FDA indication was found. The same lack of confirmation was found for study 16 which rejected H nul by the Fleming rule but not by the DESR with EPD parameters epd alt = 0.3, epd nul = 0.5.

Comparison with the Stopping Rules of Gehan
Comparing the DESR rules based on two sets of EPD parameters in the cohort of phase II trials conducted under the Gehan design, the choice of null and alternate values for epd did not alter the likelihood of early stopping or rejection of the null hypothesis by the DESR, in part as a result of consistently high rates of EPD in trials 1-15 (see Table 5).
Of the 23 trials conducted using the Gehan stopping rules, eight would have been stopped at stage I for acceptance of H nul by both Gehan and the DESR. In actuality, investigators continued seven of those trials (studies 1-7) through the second stage, although in all cases the studies were ultimately negative.
In the other 15 trials (9 to 23), accrual to the second stage was permitted under the stopping rules of Gehan. Of these, seven trials would have been stopped at the first stage by the DESR as a result of high epd rates in conjunction with only a single responding subject in each trial, and in all seven of these trials the rules of Gehan found the same results after accrual of the second stage (i.e., H nul accepted). In the final eight trials, H nul was rejected after the second stage by both the Gehan stopping rule and the DESR.

Discussion
The DESR uses the signal provided by the rate of early progressive disease in an attempt to better discern drug effectiveness compared with response alone [16]. It has been demonstrated that rules can be generated that meet the specified alpha error rate and power; this study assesses the relevance of the DESR when applied to actual patient data from phase II clinical trials [17]. Compared with the stopping rules of Fleming, the DESR was more likely to allow accrual of the second stage. This was more common with the rules specifying epd nul = 0.6 than epd nul = 0.5, as a higher EPD rate was tolerated without early drug rejection in the former case. At the second stage, the DESR with design parameters epd alt = 0.4, epd nul = 0.6 rejected H nul more frequently than either the Fleming stopping rules or the DESR with parameters epd alt = 0.3, epd nul = 0.5.
A somewhat different result was seen when comparing the DESR and the stopping rules of Gehan. In this instance, 15 studies were stopped at the first stage by the DESR (using both epd design parameter pairs), while only 8 were stopped by Gehan at the first stage, with high rates of EPD triggering the more frequent early stopping by the DESR. The discrepant seven studies ultimately accepted H nul at the end of the second stage under Gehan stopping rules. For the remaining eight studies allowed to continue to the second stage by the Gehan stopping rules and the DESR, conclusions on H nul were consistent between the rules.
The DESR is designed to find drugs that have either a desirable rate of response or a desirably low level of early progression. However, because it is designed to find the 'good' drugs among a mixed (50/50) population of drugs having either good response or early progression rates, it appears to require a higher response rate at the end of stage one to allow recruitment of stage two than that required if response is considered in isolation. For this reason, compared with the Gehan stopping rule, the DESR was more likely to stop trials after the first stage of accrual despite a single response being observed in stage I. Conversely, as noted above, the DESR was less likely than the Fleming rules to stop a study at stage I despite a lack of any response, as EPD rates were low enough that the drugs under study might have met the specified level for an interesting agent.
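The two directions of disagreement described above can be illustrated with hypothetical stage I counts. The sketch below treats the response-only designs, as described in the Methods, as stopping when no response is seen in stage I, and uses the DESR stage I thresholds quoted for the first row of Table 1 (n 1 = 15). The counts are invented for illustration, and the thresholds actually used for the Gehan-design trials (Tables 2 and 3) are not reproduced here.

```python
def response_only_stops(responses1: int) -> bool:
    """Stage I futility rule as described for Gehan and Fleming: stop if no response."""
    return responses1 == 0


def desr_stops(responses1: int, epd1: int,
               max_responses: int = 1, min_epd: int = 8) -> bool:
    """DESR stage I rule; defaults are the Table 1, first-row thresholds (assumed here)."""
    return responses1 <= max_responses and epd1 >= min_epd


# Hypothetical stage I outcomes among 15 patients: (responses, EPD count).
scenarios = {
    "one response, many early progressions": (1, 10),
    "no response, few early progressions": (0, 4),
}

for label, (responses1, epd1) in scenarios.items():
    print(label,
          "| response-only stops:", response_only_stops(responses1),
          "| DESR stops:", desr_stops(responses1, epd1))
```

In the first scenario a single response keeps a response-only trial open while the DESR stops it on the strength of the EPD count; in the second, the response-only rule stops but the DESR continues because early progression is infrequent.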
For trials in which response is the clear priority, a set of rules devoted to response only may be more appropriate. However, in the present age of molecularly targeted anti-cancer agents, the likelihood of an investigational agent inducing tumour shrinkage or preventing tumour growth is often unclear prior to initiating phase II studies.
In the absence of suitable rules, examples are readily found of investigators setting a primary endpoint of response, a drug failing to meet that response, but the drug being declared interesting for further study based on other desirable characteristics [18,19].
Other authors have investigated the use of multiple endpoints in phase II trials. Zee et al generated a set of stopping rules similar to the DESR, but later found that the rules generated had poorer power than intended [14,15]. However, when each was compared with Gehan's stopping rules in the same data set, the DESR and the stopping rules of Zee gave very similar results [17]. Although only the design parameter pair epd alt = 0.4, epd nul = 0.6 was considered in the paper which applied their rules [17], both the DESR and the stopping rules of Zee et al stop the first 15 trials at stage I and reject H nul after stage II in the remaining trials, with high EPD rates being the common reason for early stopping. Conversely, considering drugs studied under the Fleming stopping rules, the DESR was less likely to accept H nul at the end of stage I, and so more likely to recruit to stage II. The conclusions at the end of stage II were more difficult to compare, as many of the actual trials did not recruit to the second stage. While the DESR remained more likely to reject H nul for the design parameter pair epd alt = 0.4, epd nul = 0.6, it may have been less likely to reject H nul with the pair epd alt = 0.3, epd nul = 0.5, suggesting the sensitivity of the results to changes in the design EPD parameters.
In an analogous paper, Panageas et al consider a rule set where response is divided into complete and partial response, and levels of interest and disinterest are again specified for the null and alternate hypothesis [20]. This rule set is potentially attractive for highly responsive cancers such as germ cell tumours, where complete responses are more frequent. However, it may be less applicable in the setting of most phase II trials involving previously treated malignancies and targeted drugs with uncertain tumour effects. In this setting, complete responses may be infrequent, and modest response rates or non-progression may suggest drug activity and lead to drug approval [8]. A slight modification to this design can be made which substitutes response and stable disease for complete response and partial response, similar to the DESR design. However, the study power calculated when using the Panageas design may actually be overestimated, thus underestimating the number of patients needed. This is because power is calculated assuming r alt and epd alt are simultaneously at the exact minimum response rate and maximum early progressive disease rate of interest for further study for the novel agent. The DESR design using the borderline method varies r alt and epd alt while maintaining power. Both endpoints do not have to be simultaneously at the boundary of interest, potentially giving a more accurate estimate of statistical power.
One limitation to the present study is that it applies arbitrary epd alt and epd nul pairs to existing data. Individualized epd rates may be more relevant to a given drug and give different results, although the pairs chosen were felt to be commonly plausible. Additionally, although the results presented are only for trials in which the H nul for response rate is 0.05, the DESR method can be implemented for trials with higher null response rates. This comparison was not performed due to a critical lack of published phase II trials which present response and EPD rates at both stage I and II. It is also unknown whether actual efficacy might have been seen when the DESR rejected H nul but the Fleming rule did not, as subsequent phase III studies were not conducted.

Conclusion
In conclusion, while the number of trials in our study is small, different patterns of early stopping and final rejection of H nul are evident with the addition of EPD as an endpoint. With limited follow-up in terms of phase III studies, the final benefit in terms of drug development is not certain. However, the DESR may shorten studies where response rates are low but high EPD rates suggest the ultimate efficacy will be poor. Conversely, the DESR will allow accrual to the second stage in the absence of response when there are few patients with EPD, and this may allow more sensitive detection of drug activity. Based on the comparisons in this paper, the epd alt = 0.3, epd nul = 0.5 pair appears to offer the better balance of these outcomes, but the design parameters for a particular trial should be individualized.