A multi-arm multi-stage clinical trial design for binary outcomes with application to tuberculosis

Background Randomised controlled trials are becoming increasingly costly and time-consuming. In 2011, Royston and colleagues proposed a particular class of multi-arm multi-stage (MAMS) designs intended to speed up the evaluation of new treatments in phase II and III clinical trials. Their design, which controls the type I error rate and power for each pairwise comparison, discontinues randomisation to poorly performing arms at interim analyses if they fail to show a pre-specified level of benefit over the control arm. Arms in which randomisation is continued to the final stage of the trial are compared against the control on a definitive time-to-event outcome measure. To increase efficiency, interim comparisons can be made on an intermediate time-to-event outcome which is on the causal pathway to the definitive outcome.
Methods We adapt Royston's MAMS design to binary outcomes observed at the end of a fixed follow-up period and analysed using an absolute difference in proportions. We apply the design to tuberculosis (TB), an area where many new drugs are in development, and demonstrate how it can greatly accelerate the evaluation of new TB regimens. We use simulations to support the extensions to the methodology and to investigate the amount of bias in the estimated treatment effects of arms in which randomisation is ceased at the first interim analysis and arms which continue to the final stage of the trial.
Results The proposed seamless phase II/III TB trial designs are shown to greatly reduce sample size requirements and trial duration compared to conducting separate phase II and III trials. The bias in the estimated treatment effects for the definitive outcome is shown to be small, especially when treatment selection is based on an intermediate outcome or when a reanalysis is performed at the planned end of the trial after all recruited patients have completed follow-up.
Conclusions The proposed designs are practical and could be used in a variety of disease areas. They hold considerable promise for speeding up the evaluation of new treatments particularly in TB where many new regimens will soon be available for testing in phase II and phase III trials.


Background
In recent years the pace of drug development in some disease areas has rapidly increased. Despite this, there has been a slowdown in the rate of new therapies reaching patients [1]. This is largely due to the increasing cost and inefficiency of the drug development process and that most new treatments have no clear benefit over standard care. As a result, the US Food and Drug Administration http://www.biomedcentral.com/1471-2288/ 13/139 in the near future is vast. Evaluating each new treatment against a control in separate two-arm trials will not only require a huge amount of resources but may deny patients access to the most effective, simplest and shortest new regimens as early as possible. Innovative trial designs which are able to efficiently assess multiple new treatments simultaneously are therefore urgently needed.
In discussing this issue, Phillips et al. [5] have suggested the use of the multi-arm multi-stage (MAMS) design described by Royston et al. [6]. This particular type of MAMS design, which controls the type I error rate and power for each pairwise comparison, streamlines treatment evaluation in two ways. First, comparing multiple new regimens against a single, common control arm removes the need for separate control arms in multiple two-arm trials and reduces the overall required sample size. For example, comparing four experimental arms in parallel to a single control (a five-arm trial) reduces the required sample size by 37.5% compared to four separate two-arm trials if no adjustments for multiple testing are made. In general, comparing K experimental arms to a single control reduces the overall sample size by a proportion (K − 1)/(2K) compared to K separate two-arm trials [7].
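The saving from a shared control arm can be checked with a few lines of arithmetic. The sketch below assumes equal allocation and the same per-arm sample size in every trial; the function name `multiarm_saving` is ours and is purely illustrative.

```python
def multiarm_saving(k: int) -> float:
    """Fractional reduction in total sample size from sharing one control arm.

    Assumes equal allocation and the same per-arm sample size in every trial,
    so totals are proportional to the number of arms.
    """
    if k < 1:
        raise ValueError("need at least one experimental arm")
    separate_arms = 2 * k   # K separate two-arm trials
    shared_arms = k + 1     # one trial with K experimental arms and one control
    return (separate_arms - shared_arms) / separate_arms  # equals (K - 1) / (2K)


if __name__ == "__main__":
    for k in (1, 2, 4, 6):
        print(f"K = {k}: saving = {multiarm_saving(k):.1%}")
```

With K = 4 the function returns 3/8, which is the five-arm example in the text.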
Secondly, the analysis of a MAMS trial is conducted in stages. At the end of each stage recruitment to an experimental arm is stopped if it fails to show sufficient evidence of an advantage over the control arm (lack-of-benefit). If an experimental arm passes the final stage of the study then it is deemed to be superior (or non-inferior, depending on the objective) to the control. The efficiency of this procedure can be greatly increased by using an outcome in the intermediate stages which is observed earlier and on the causal pathway to the final, definitive outcome of the trial, although it does not necessarily have to be a surrogate [8,9]. For example, the MAMS design may be used for a seamless phase II/III trial where the intermediate outcome is that used in a phase II trial while a phase III outcome is of primary interest in the final stage. Using an intermediate outcome in this way allows interim analyses to be conducted sooner and so recruitment to poorly performing arms can be stopped much earlier than if the primary outcome of the trial was used throughout. If a suitable intermediate outcome is unavailable then the MAMS design may still be used, for example, as a standalone phase II or III trial. The multi-stage aspect of the design removes the need to recruit a fixed sample size to all experimental arms in the trial and can further reduce the sample size compared to multi-arm fixed sample designs.
The sample size calculation for the MAMS design described by Royston et al. [6] is only applicable to time-to-event outcomes where a hazard ratio is typically the summary statistic used to compare an experimental treatment against a control. It is therefore applicable to trials in oncology, for example, where time to an event such as death is often used as a primary endpoint. The STAMPEDE trial in prostate cancer [8], for instance, uses this particular type of MAMS design. However, if it is to be more widely used in other disease areas then the methodology needs extending to other types of outcome.
In TB, a commonly used outcome measure for phase II trials is the absolute difference in the proportion of patients who have a negative culture status eight weeks after commencing therapy [10][11][12]. In phase III, the absolute difference in the proportion of patients who either fail to respond to their allocated treatment or relapse after completing treatment is the outcome of choice and is usually assessed 1-2 years after randomisation [13]. In this paper we use these examples as motivation for extending the design to binary intermediate and definitive outcomes observed at the end of fixed follow-up periods and analysed using an absolute difference in proportions. The benefits of this design and issues surrounding it are explored and simulation studies using examples in a TB context are used to verify the methodology and to investigate the bias in treatment effect estimates.

Overview of proposed design
Let I denote the intermediate and D the definitive outcome of a MAMS trial. The same null and alternative hypotheses are used for all experimental arms to allow interim analyses to be conducted simultaneously. The sample size requirement is therefore the same for each pairwise comparison in each stage and so the sample size calculation can be developed by first considering a single experimental arm, E, against a control, C.
For a MAMS trial with s stages, let $\pi_i^E$ and $\pi_i^C$ denote the true event rates in the ith stage of the trial in an experimental arm and the control arm respectively ($i = 1, \dots, s$). If the same outcome is used throughout the trial ($I = D$) then $\pi_i^E$ and $\pi_i^C$ are constant for all i. If the intermediate and definitive outcomes differ ($I \neq D$), the values $\pi_s^E$ and $\pi_s^C$ are the true event rates for the definitive outcome, while $\pi_i^E$ and $\pi_i^C$ are constant for all $i < s$ and correspond to the intermediate outcome.
The null and alternative hypotheses at the ith interim analysis are stated, without loss of generality, for the true absolute risk difference $\theta_i = \pi_i^E - \pi_i^C$ relative to a null value $\theta_i^0$. The value $\theta_i^0$ is constant for all i if the same outcome measure is used throughout the trial ($I = D$). Otherwise $\theta_s^0$ corresponds to the definitive outcome and $\theta_i^0$ is constant for all $i < s$ for the intermediate outcome. In superiority analyses, $\theta_i^0$ is usually taken to be 0 to represent no difference under the null hypothesis. By contrast, non-inferiority analyses use a value of $\theta_i^0$ to represent that E is slightly inferior to C under the null hypothesis.
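As a sketch only, one way to write such a pair of hypotheses is shown below. It assumes a one-sided formulation in which larger values of the risk difference favour the experimental arm; this orientation is our assumption, chosen to match the simulation scenarios later in the paper, where the effects under the alternative exceed those under the null.

```latex
H_0 : \theta_i \le \theta_i^0
\qquad \text{versus} \qquad
H_1 : \theta_i > \theta_i^0 ,
\qquad i = 1, \dots, s .
```

The opposite orientation is obtained by reversing both inequalities, which is why the direction can be fixed without loss of generality.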
Having specified the null and alternative hypotheses above, the one-sided significance level, $\alpha_i$, and power, $\omega_i$, for each pairwise comparison are chosen for each stage of the trial. It is recommended to use a high power in each stage, for example, 90% or 95%, in order to achieve high overall power for the trial [6]. A large significance level should be used in the first stage to allow the first interim analysis to occur early on in the trial. Over subsequent stages significance levels are decreased to avoid stages becoming redundant. For trials with 6 or fewer stages Royston et al. [6] suggest a 'rule of thumb' of $\alpha_i = 0.5^i$ for stages $i = 1, \dots, s - 1$ and $\alpha_s = 0.025$ in the final stage to mimic a conventional two-sided test at the 5% level. However, further research by Barthel et al. [14] and Choodari-Oskooei et al. [15] has suggested using a significance level between 0.2 and 0.3 in the first stage to reduce bias and error rates.
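The rule of thumb can be tabulated as follows. This is a minimal sketch: the function name is hypothetical, and the limit of six stages simply mirrors the range for which the rule is suggested in the text.

```python
def rule_of_thumb_alphas(stages: int, final_alpha: float = 0.025) -> list[float]:
    """Stagewise one-sided significance levels: 0.5**i for i < s, then final_alpha."""
    if not 1 <= stages <= 6:
        raise ValueError("the rule of thumb is suggested for trials with 1 to 6 stages")
    # Intermediate stages i = 1, ..., s - 1 use 0.5**i; the final stage uses final_alpha.
    return [0.5 ** i for i in range(1, stages)] + [final_alpha]


if __name__ == "__main__":
    for s in range(1, 5):
        print(s, rule_of_thumb_alphas(s))
```

A designer following the later recommendation of a first-stage level between 0.2 and 0.3 would replace the first entry by hand rather than use this rule.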
At the ith interim analysis recruitment continues to experimental arms whose treatment effect estimate on the intermediate outcome is significant at the $100\alpha_i\%$ level; otherwise, consideration is given to ceasing further randomisation to the arm. If the treatment effect estimate on the definitive outcome is significant at the $100\alpha_s\%$ level in the final analysis then the experimental treatment is declared superior to the control arm (or non-inferior, depending on the objective).

Sample size calculation
Since each stage has its own significance level and power we can effectively consider each stage as an independent trial. Common formulae can therefore be used to obtain the required sample size for each interim analysis. For example, the required sample size for the control arm in the ith interim analysis, $n_i^C$, can be calculated using equation (1) [16,17], where $\theta_i^1$ is the minimum effect that one would like to find with high probability for the outcome in the ith stage (usually the minimally clinically important difference), $\pi_i^1 = \pi_i^C + \theta_i^1$ is the target event rate in the experimental arm under $H_1$, $z_k$ is the kth quantile of the standard normal distribution and the E : C allocation ratio is A : 1, so that A patients are randomised to each experimental arm for every patient allocated to control.
For a MAMS trial with $K_i$ experimental arms present in stage i, the total sample size required for the ith interim analysis is then

$$n_i = n_i^C \, (1 + A K_i). \qquad (2)$$
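To make the calculation concrete, the sketch below uses a standard normal-approximation formula for a one-sided test of a difference in proportions, with the variance evaluated under the alternative. This specific form is an assumption on our part and is not necessarily the expression cited as [16,17]; the function names and the numerical inputs are illustrative only.

```python
from math import ceil
from statistics import NormalDist


def control_arm_n(pi_c: float, theta0: float, theta1: float,
                  alpha: float, power: float, ratio: float = 1.0) -> int:
    """Control-arm sample size for one stage (assumed normal-approximation formula).

    pi_c   : control arm event rate
    theta0 : risk difference under the null hypothesis
    theta1 : target risk difference under the alternative
    alpha  : one-sided significance level for the stage
    power  : power for the stage
    ratio  : A, patients per experimental arm for each control patient
    """
    z = NormalDist().inv_cdf
    pi_1 = pi_c + theta1  # target event rate in the experimental arm
    # Variance of the estimated risk difference, scaled by the control-arm size.
    variance = pi_c * (1 - pi_c) + pi_1 * (1 - pi_1) / ratio
    return ceil((z(1 - alpha) + z(power)) ** 2 * variance / (theta1 - theta0) ** 2)


def stage_total_n(n_control: int, arms: int, ratio: float = 1.0) -> int:
    """Total patients for the stage: control arm plus `arms` experimental arms."""
    return ceil(n_control * (1 + ratio * arms))


if __name__ == "__main__":
    n_c = control_arm_n(pi_c=0.8, theta0=0.0, theta1=0.1, alpha=0.025, power=0.9)
    print(n_c, stage_total_n(n_c, arms=4))
```

The second function is just the control-arm size multiplied by one plus A times the number of experimental arms in the stage.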

Consequences of a fixed follow-up period
Often in clinical trials, patients are followed up for a set period of time after randomisation before outcomes are observed. For example, in phase II TB trials the endpoint of interest is often culture status 8 weeks after randomisation. An immediate consequence of delayed observations is that patients may withdraw or become lost to follow-up before their outcome is observed. If it is likely that outcome data will not be available for some proportion of patients, $\lambda_i$, on the outcome in the ith stage of the study, then the required sample size calculated using (2) should be multiplied by $1/(1 - \lambda_i)$ to maintain the desired level of power for a complete-case analysis. It should be noted that such an analysis assumes that missing data occur completely at random, which might not be plausible, in which case appropriate imputation techniques should be applied [18]. For simplicity, the loss-to-follow-up rate $\lambda_i$ is assumed to be constant throughout the trial for each outcome. One might normally expect a higher loss rate for D than I as it requires a longer follow-up period. However, it may be easier to obtain the former, particularly if it can be ascertained from medical records (for example, death), in which case a lower attrition rate on D may be a plausible assumption.
Another consequence of delayed observations is that interim analyses cannot take place as soon as the required sample size has been recruited and randomised. Since recruitment is continuous, the delay in obtaining data on an outcome means that there will be patients at each interim analysis who have been recruited to the trial but who have yet to have their outcome observed. For example, if the follow-up period is six months and the recruitment rate is 100 patients per year, then an extra 50 patients will be recruited to the trial but will not complete follow-up by the time of the database freeze for the interim analysis.
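Both consequences of a fixed follow-up period reduce to simple arithmetic, sketched below. The function names are hypothetical, and the second function assumes a constant recruitment rate over the follow-up period.

```python
from math import ceil


def inflate_for_loss(n_required: int, loss_rate: float) -> int:
    """Scale a required sample size by 1 / (1 - loss_rate) for a complete-case analysis."""
    if not 0 <= loss_rate < 1:
        raise ValueError("loss_rate must lie in [0, 1)")
    return ceil(n_required / (1 - loss_rate))


def recruited_during_follow_up(rate_per_year: float, follow_up_years: float) -> float:
    """Patients randomised while the last required patient completes follow-up."""
    return rate_per_year * follow_up_years


if __name__ == "__main__":
    # The example from the text: six months of follow-up at 100 patients per year.
    print(recruited_during_follow_up(100, 0.5))
    # An illustrative inflation with a 25% loss rate (not a value from the paper).
    print(inflate_for_loss(300, 0.25))
```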
These extra patients who are randomised to arms which are subsequently dropped at the interim analysis will also not contribute towards any future interim analyses. However, for reasons concerning bias (see 'Results' section) these patients should still be followed up for their intermediate and definitive outcomes and included in a final analysis of their allocated arm against all control arm patients randomised concurrently at the planned end of the trial.
The delay in starting the next stage of the trial caused by data cleaning, analysis, various committee meetings and changing the randomisation codes (if necessary) increases the number of patients allocated to an arm which may imminently be dropped from the trial. A possible solution to avoid randomising patients during this interval and the follow-up period is to suspend recruitment once the required sample size has accrued and then recommence it at the start of the next stage. However, this is not recommended since it is likely to prolong the duration of the trial by slowing the overall recruitment rate [5].

Calculating the stage durations
The total expected delay, $\gamma_i$, between recruiting the last of the $n_i$ patients required for the ith interim analysis and the beginning of the next stage of the trial incorporates the follow-up period for the outcome plus the additional delays caused by the analysis. Let $N_i$ denote the total number of patients recruited to the arms remaining in the study by the end of stage i, with $N_0 = 0$, and let $K_i$ be the number of active experimental arms in the ith stage of the study. The number of patients that need to be recruited during stage i for the upcoming interim analysis is denoted $\tilde{n}_i$. The duration of stage i, $d_i$, follows from these quantities and $r_i$, the overall recruitment rate in the ith stage (assumed to be constant within each stage), as does the cumulative number of patients allocated to all treatment arms still recruiting at the end of each intermediate stage. In the final stage recruitment to the trial may be terminated as soon as $N_s = n_s/(1 - \lambda_s)$ patients have been allocated to the remaining treatment arms. It would not be necessary to continue recruitment beyond this point, since there are no more analyses planned beyond the final analysis.
The stage end-times, $t_i$, are obtained by summing the durations of stages 1 to i: $t_i = \sum_{k=1}^{i} d_k$. These values are particularly useful as they roughly predict when interim analyses will occur and so help to organise data monitoring and trial steering committee meetings in advance.
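A rough timeline can be sketched as below. The code assumes that each stage lasts for the time needed to recruit its patients at a constant rate plus the expected delay before the next stage begins; this decomposition is our reading of the description above rather than a formula quoted from the paper, and all inputs are illustrative.

```python
from itertools import accumulate


def stage_schedule(recruits_per_stage, rates_per_year, delays_years):
    """Return (durations, end_times) in years for each stage.

    Assumed duration of a stage: recruitment time (patients / rate) plus the delay
    between recruiting the last required patient and the start of the next stage.
    """
    if not (len(recruits_per_stage) == len(rates_per_year) == len(delays_years)):
        raise ValueError("all inputs need one entry per stage")
    durations = [n / r + g
                 for n, r, g in zip(recruits_per_stage, rates_per_year, delays_years)]
    end_times = list(accumulate(durations))  # running sum of stage durations
    return durations, end_times


if __name__ == "__main__":
    durations, end_times = stage_schedule([100, 400], [200, 800], [0.25, 1.5])
    print(durations, end_times)
```

The end-times are simply the running sum of the durations, so a change to an early stage shifts every later analysis date.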

Probability of passing each stage
Using similar formulae to Royston et al. [6], the probabilities of an experimental arm passing the first i stages of a MAMS trial under $H_0$ and $H_1$, denoted $A_i$ and $\Omega_i$ respectively, depend on the correlation between the treatment effects in each pair of stages j and k under hypothesis $H_h$. The calculation of these correlations is outlined in the appendix.
Clearly, $A_1 = \alpha_1$ and $\Omega_1 = \omega_1$. The most important values are $A_s$ and $\Omega_s$, which are the overall type I error rate, α, and power, ω, respectively for a single experimental arm compared to the control. Note that here we have only calculated the pairwise type I error rate and power; the issue of familywise error rates for trials with multiple experimental arms is raised in the discussion.
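For a two-stage design, the probability of passing both stages can be sketched numerically. The code below assumes that the standardised test statistics at the two analyses are bivariate normal with a given between-stage correlation, so that the joint probability is a bivariate normal distribution function evaluated at the quantiles of the stagewise levels (under the null) or powers (under the alternative). This formulation follows our understanding of the Royston-style calculation and should be read as an assumption; the correlation values are placeholders, not values from the appendix.

```python
from math import sqrt
from statistics import NormalDist

_STD = NormalDist()


def prob_pass_two_stages(p1: float, p2: float, rho: float, steps: int = 4000) -> float:
    """P(Z1 < z_p1, Z2 < z_p2) for a standard bivariate normal with correlation rho.

    Pass stagewise significance levels for the pairwise type I error rate, or
    stagewise powers for the pairwise power. Uses the trapezoidal rule on the
    conditional distribution of Z2 given Z1.
    """
    if not -1 < rho < 1:
        raise ValueError("rho must lie strictly between -1 and 1")
    z1, z2 = _STD.inv_cdf(p1), _STD.inv_cdf(p2)
    lower = -8.0  # the normal density is negligible below this point
    if z1 <= lower:
        return 0.0
    scale = sqrt(1 - rho * rho)

    def integrand(x: float) -> float:
        # Density of Z1 at x times P(Z2 < z2 | Z1 = x).
        return _STD.pdf(x) * _STD.cdf((z2 - rho * x) / scale)

    h = (z1 - lower) / steps
    total = 0.5 * (integrand(lower) + integrand(z1))
    for k in range(1, steps):
        total += integrand(lower + k * h)
    return total * h


if __name__ == "__main__":
    for rho in (0.0, 0.3, 0.6):
        print(rho, prob_pass_two_stages(0.5, 0.025, rho),
              prob_pass_two_stages(0.95, 0.90, rho))
```

With zero correlation the result is the product of the stagewise values, and it increases towards the smaller of the two as the correlation approaches 1, which matches the direction of the effect described in the text.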
Other values of interest, particularly in a seamless phase II/III design, are $A_{s-1}$ and $\Omega_{s-1}$, which denote the probability of continuing recruitment to an arm in the final (phase III) stage of the trial under $H_0$ and $H_1$ respectively. Phase III trials are often resource intensive and lengthy and the same may be true for the final stage of a MAMS trial if the intermediate and definitive outcomes differ. Therefore it is important to have a reasonably small value of $A_{s-1}$ and a large value of $\Omega_{s-1}$ to increase the chance of only recruiting patients to effective treatments in the final stage.
As shown in the appendix, the calculation of $A_s$ and $\Omega_s$ when $I \neq D$ requires estimates of either the probability of a patient experiencing both outcomes or the probability of experiencing the definitive outcome given they have had the intermediate outcome (positive predictive value, PPV). The latter is arguably easier to specify as it only requires an assumption for a single outcome, D (given that I has occurred), rather than two (both I and D). As the correlations between treatment effects, and therefore $A_s$ and $\Omega_s$, increase as either of these probabilities tends to 1, we recommend slightly overestimating them to obtain a conservative estimate of the pairwise type I error rate.

Application to tuberculosis
To illustrate how this design might be applied and assess the benefits of the MAMS design in a TB setting we used the methodology above to calculate the sample size for phase II and seamless phase II/III two-arm two-stage TB trials. Seamless designs are an effective tool for streamlining treatment evaluation as they remove the long interlude between phases which is required for presenting the phase II results and for designing, approving and funding the phase III trial. Furthermore, by reusing phase II patients in the analysis of the phase III outcome, seamless designs offer greater efficiency over the traditional approach of conducting phase II and III trials separately [19].
The phase II two-arm two-stage designs were based upon a recent study by Dorman et al. [10] that substituted moxifloxacin for isoniazid in the standard TB regimen during the intensive phase (first two months) of treatment. The outcome in this study was culture status (a marker for whether a patient has TB or not) 8 weeks after randomisation and was also used as the basis for the intermediate outcome in the seamless phase II/III designs. The phase III aspect was based on the ongoing REMox TB trial (controlled comparison of two moxifloxacin containing treatment shortening regimens in pulmonary tuberculosis) that investigates the effect of two four month regimens against the standard six month regimen on relapse rates 18 months after randomisation [20]. This trial uses a Bonferroni-adjusted one-sided significance level of 1.25% for each treatment arm to ensure the overall type I error rate is no higher than 2.5%. For this example we considered only one experimental arm from REMox and thus used a one-sided significance level of 2.5%. The designs of these standalone phase II and phase III trials are summarised in Table 1.
Examples of two-arm two-stage phase II and phase II/III TB trials were generated using a conventional significance level (2.5%) and power (90%) in the final stage. Significance levels of 20% and 50% and powers of 90% and 95% were explored in the first stage. Delays of 4 and 14 weeks for observing a patient's culture status after randomisation were used to explore their effect on the efficiency of a trial. The latter was chosen as it is the current delay in observing a patient's culture status after randomisation due to the 8 week follow-up period plus 6 week wait for detecting absence of TB (in liquid medium). A 4 week delay was also chosen as it is not yet certain whether culture status at 8 weeks is an appropriate intermediate outcome for long-term relapse, and observing it after 4 weeks may be more suitable. Furthermore, the 6 week wait is unlikely to exist in future as techniques for immediate detection of TB are developed [21] and so observing status at 4 weeks may represent the shortest possible delay for this outcome. Examples of two-arm two-stage phase II and phase II/III TB trial designs based on these parameters are shown in Tables 2 and 3 respectively and discussed in the 'Results' section.
Table 1. Design parameters for a phase II (based on Dorman et al. [10]) and a phase III (based on REMox [20]) TB trial. *Calculated assuming independence between trials. **An additional 6 week delay is typically required to determine culture status. ***Sample sizes estimated using equations (1) and (2). NI = non-inferiority.
The efficiency of each design was measured by its expected sample size (ESS), that is, the mean number of patients recruited to the trial before it is terminated [22], calculated under the null hypothesis. ESS was compared between designs with roughly similar overall operating characteristics to determine which is likely to require fewer resources when the experimental treatment is ineffective. For a single-stage trial, such as those in Table 1, the ESS is equal to the overall sample size since there is no opportunity for stopping before the planned end of the trial (except perhaps in extreme circumstances such as overwhelming efficacy of an arm).
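For a two-arm two-stage design the expected sample size has a simple form, sketched below. The simplifying assumptions are ours: the trial ends if the experimental arm is dropped at the interim analysis, and the number of patients already recruited by the time that decision takes effect is known. The function name and the numbers are illustrative and are not taken from Tables 2 or 3.

```python
def expected_sample_size(n_at_interim: float, n_max: float, prob_continue: float) -> float:
    """Expected sample size of a two-arm two-stage trial.

    n_at_interim  : patients recruited by the time the interim decision takes effect
    n_max         : maximum sample size if the trial runs to the final stage
    prob_continue : probability of passing the interim analysis; under the null
                    hypothesis this is the first-stage significance level
    """
    if not 0 <= prob_continue <= 1:
        raise ValueError("prob_continue must be a probability")
    return n_at_interim + prob_continue * (n_max - n_at_interim)


if __name__ == "__main__":
    for alpha_1 in (0.2, 0.5):
        print(alpha_1, expected_sample_size(200, 600, alpha_1))
```

Lowering the first-stage significance level reduces the weight on the maximum sample size, at the cost of a larger first stage for the same power.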
To calculate the overall operating characteristics in the seamless designs an estimate of the positive predictive value, that is, the probability of a patient not relapsing or being classed as a treatment failure given that they have a negative culture, was obtained from a meta-analysis by Horne et al. [23], who estimated it to be 95% (95% CI: 95% to 96%) for cultures taken at 2 months. This value was assumed to be the same under $H_0$ and $H_1$ for each intermediate outcome.

Simulation study
Performing an analysis in a MAMS trial which ignores the stopping guidelines for lack-of-benefit may result in biased treatment effect estimates [15]. Choodari-Oskooei et al. [15] investigated the extent of this bias for two-arm multi-stage trials with time-to-event outcomes. For arms stopped at the first interim analysis for lack-of-benefit they showed that on average the estimated treatment effects appeared slightly less effective than their corresponding true values. However, the bias was markedly reduced by continuing to follow-up patients on the intermediate and definitive outcomes and reanalysing the data at the planned end of the trial. For truly effective arms, they showed that the bias in the estimated treatment effects on the definitive outcome at the final stage analysis was of no practical importance.
In the time-to-event case, interim analyses occur when a pre-specified number of events have been observed in the control arm. In arms in which recruitment is stopped early there is scope for continuing to follow-up patients who have not yet experienced the event(s) of interest and including them in a reanalysis at the planned end of the trial to obtain a less biased estimate of the treatment effect. This is also applicable when outcomes are observed at the end of a fixed follow-up period since not all patients will have had both their intermediate and definitive outcomes observed by each interim analysis.
A simulation study was conducted using the two-stage phase II and phase II/III TB trial designs shown in Tables 2 and 3 respectively, to quantify the bias of treatment effects estimated on the definitive outcome at:
(a) the first interim analysis, for arms in which recruitment is stopped at that analysis;
(b) a reanalysis of those arms after all remaining recruited patients have completed follow-up;
(c) the final analysis, for arms which continue recruiting to the final stage of the trial.
In addition to bias, the proportion of arms for which recruitment is stopped at the first interim analysis and the proportion which continue recruiting to the final stage of the trial, as well as the pairwise type I error rate, power and correlation between stages were determined in the simulations and compared to their corresponding calculated values. For each design shown in Tables 2 and 3, the bias associated with the following four pairs of underlying treatment effects for the culture status (CS) and relapse (R) outcomes was investigated in the simulations:
A. $\theta_{CS} = -5\%$, $\theta_R = -10\%$ (treatment effects worse than those under $H_0$)
B. $\theta_{CS} = 0\%$, $\theta_R = -6\%$ (treatment effects under $H_0$ (see Table 1))
C. $\theta_{CS} = 8\%$, $\theta_R = -3\%$ (treatment effects between those under $H_0$ and $H_1$)
D. $\theta_{CS} = 13\%$, $\theta_R = 0\%$ (treatment effects under $H_1$ (see Table 1))
By assessing bias in scenarios (a), (b) and (c) for this variety of treatment effects, recommendations can be made for designing multi-stage trials which reduce bias, thus improving the accuracy of treatment effect estimates which might be used, for example, in future meta-analyses, policy-making decisions or the design of future trials.

Simulation methods
To perform the bias assessment and assess the accuracy of the calculation of the pairwise operating characteristics, individual patient data were simulated for each phase II and phase II/III design under treatment effects A-D. In each case 40,000 replicates were generated to estimate pass/fail rates to an accuracy of at least 0.5% at the 5% significance level. For each patient, missing value indicators for the I and D outcomes were drawn from Bernoulli distributions with parameters derived from Table 1. In the designs where $I \neq D$ the probability of observing the definitive outcome was not conditional on observing the intermediate outcome. Although this reduces the correlation between stages compared to the calculation given in the appendix, where all patients with a missing intermediate outcome are also assumed to have a missing definitive outcome, these different assumptions will indicate the robustness of the calculation of the overall type I error rate and power.
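The choice of 40,000 replicates can be checked with the usual binomial standard error. The sketch below reads "accuracy of at least 0.5% at the 5% significance level" as a 95% confidence interval half-width of at most 0.5 percentage points for an estimated proportion; that reading is our interpretation, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mc_half_width(p: float, replicates: int, level: float = 0.95) -> float:
    """Half-width of a normal-approximation confidence interval for a simulated proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return z * sqrt(p * (1 - p) / replicates)


if __name__ == "__main__":
    # The half-width is largest at p = 0.5, so this is the worst case.
    print(mc_half_width(0.5, 40_000))
    print(mc_half_width(0.025, 40_000))
```

Because the worst case occurs at a proportion of 0.5, rates near the 2.5% type I error level are estimated considerably more precisely than this bound.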
Patient outcomes were drawn from Bernoulli distributions with control arm event rates derived from Table 1. The underlying event rates for experimental arms with underlying effects A-D were found by adding on the corresponding treatment effects shown above. Since the phase III outcome (relapse) is dependent on culture status, the event rate will differ according to whether a patient's culture status is positive (CS = 0), negative (CS = 1) or missing. The positive predictive value (PPV = P(R = 1|CS = 1)) is the event rate on the phase III outcome for patients with a negative culture status, and the estimate from Horne et al. (95%) [23] was assumed for all arms. The probability P(R = 1|CS = 0) for each treatment arm was then found by rearranging the formula P(R = 1) = P(R = 1|CS = 1)P(CS = 1) + P(R = 1|CS = 0)P(CS = 0). Unconditional event rates were used for patients with missing intermediate outcomes.
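The rearrangement of the law of total probability can be sketched as follows. The numerical inputs are illustrative placeholders (Table 1 values are not reproduced here), and the function name is ours.

```python
def rate_given_positive_culture(p_r: float, p_cs_negative: float, ppv: float) -> float:
    """Solve P(R=1) = P(R=1|CS=1) P(CS=1) + P(R=1|CS=0) P(CS=0) for P(R=1|CS=0).

    p_r           : unconditional event rate on the phase III outcome, P(R = 1)
    p_cs_negative : probability of a negative culture, P(CS = 1)
    ppv           : positive predictive value, P(R = 1 | CS = 1)
    """
    p_cs_positive = 1 - p_cs_negative
    if p_cs_positive <= 0:
        raise ValueError("P(CS = 0) must be positive to condition on it")
    value = (p_r - ppv * p_cs_negative) / p_cs_positive
    if not 0 <= value <= 1:
        raise ValueError("inputs are mutually inconsistent: result is not a probability")
    return value


if __name__ == "__main__":
    cond = rate_given_positive_culture(p_r=0.90, p_cs_negative=0.85, ppv=0.95)
    # Recombine the two conditional rates to confirm the unconditional rate is recovered.
    print(cond, 0.95 * 0.85 + cond * 0.15)
```

The consistency check matters in practice: for a fixed PPV, some combinations of unconditional event rate and culture conversion rate imply a conditional rate outside [0, 1] and cannot be simulated.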
When simulating each trial, analyses were triggered once the pre-determined number of control arm patients had their outcome of interest observed. The pairwise type I error rate and power for each design was calculated as the proportion of arms simulated under $H_0$ (treatment arm B) and $H_1$ (treatment arm D) respectively which passed all stages of the trial. For each underlying treatment effect in each design, the absolute bias in scenarios (a), (b) and (c) was calculated as the average deviation of all treatment effect estimates from the true value.

Examples of phase II TB trials
Table 2 summarises the sample sizes and durations of phase II two-arm two-stage trials which use culture status at 4 or 8 weeks of follow-up as the primary endpoint for both the intermediate and definitive outcomes. A constant recruitment rate of 200 patients/year was assumed in both stages.
The results show that the maximum sample sizes of the two-stage designs shown in Table 2 are higher than the corresponding fixed sample sizes. However, their expected sample sizes are much lower as they allow recruitment to be stopped early if the experimental treatment does not show sufficient benefit at the first stage. Increasing the power in the first stage reduces the difference between the maximum and fixed sample sizes, but this also increases the expected sample sizes due to a larger first stage. Thus, a balance needs to be found between the two measures. As expected, the correlation between stages increases as the gap between analyses decreases, although this only marginally increases the type I error rate and power.
Although designs (ii) and (iv) have similar overall operating characteristics, the design which uses a first stage significance level of $\alpha_1 = 50\%$ (design (ii)) has a much smaller expected sample size. On the other hand, designs (i) and (iii) also have similar overall operating characteristics but are approximately equally efficient. Unsurprisingly, the ESS is smaller when using a shorter follow-up period since fewer patients are recruited during the first stage of the trial. All two-stage designs have the same maximum sample size as they use the same final stage operating characteristics.
There appears little advantage in using these two-stage designs over a single-stage design for two-arm phase II TB trials. However, if multiple treatments are to be evaluated in a single trial then stopping guidelines for lack-of-benefit will become much more useful. Due to the current length of follow-up for culture status (8 weeks) and short length of phase II trials, using more than two stages is unlikely to improve efficiency.

Examples of seamless phase II/III TB trials
Examples of seamless two-stage TB trials are presented in Table 3. A constant recruitment rate of 200 patients/year was assumed for the intermediate (phase II) stage and a much higher recruitment rate of 800 patients/year was used for the second (phase III) stage. Under these assumptions the maximum duration of each design is no longer than 5 years. If similar recruitment rates are assumed for the fixed sample designs shown in Table 1 then the maximum duration of conducting both trials separately is approximately 8.5 years assuming a modest delay between phases of 3 years. Furthermore, the overall power of the seamless designs (over 80%) is much higher than that for conducting trials separately (68%) and maximum sample sizes are over 100 patients lower.
The between-stage correlations in these designs are much lower than those in the phase II designs for two reasons. Firstly, the positive predictive value is effectively 1 in designs with $I = D$ (see Appendix) whereas the seamless designs use a slightly lower value (0.95). Secondly, the interim and final analyses are much further apart in terms of sample size than in the phase II designs, which further reduces the correlation. Although not problematic, the immediate consequence of lower between-stage correlation is a reduction in both the pairwise type I error rate and power, and so the stagewise operating characteristics may have to be increased to achieve the desired level for each measure.
A downside of the seamless designs presented in Table 3, as illustrated by the high ESS, is that ineffective arms have a reasonable chance of proceeding to the final stage of the trial due to the high significance level used in the first stage. To combat this, the large gap between the first and final analyses means that an extra intermediate stage could be added to the trial. For example, adding a second intermediate stage with 95% power and a 10% significance level to design (vi) in Table 3 reduces the ESS to 377 with only a 3% reduction in overall power. This loss can be recovered by slightly increasing the stagewise powers. Identifying MAMS designs which maintain the overall operating characteristics but have desirable properties such as minimising the expected or maximum sample sizes is an area of ongoing research.
Clearly there is much more benefit in using the MAMS design for seamless phase II/III TB trials than for phase II alone. We have demonstrated the savings in time and resources that can be achieved in using seamless two-arm two-stage trials over conducting each phase separately. For multi-arm multi-stage seamless trials, the savings will potentially be much greater compared to conducting separate phase II and phase III trials for each experimental treatment.
Table 4 shows that the overall type I error rate, power and correlation between stages estimated from the simulations of the designs shown in Tables 2 and 3 agree very well with the corresponding calculated values. As expected, when $I \neq D$ the correlation between stages estimated from the simulations is slightly lower than the calculated values. However, this leads to only a negligible difference between the overall type I error rates and powers, showing that the calculation is robust to the degree of dependence between observing each outcome.

Bias in arms dropped at the first analysis
Table 5 summarises the simulation results for the proportion of arms dropped at the end of the first stage and the absolute bias in their treatment effect estimates on the definitive outcome at the interim analysis and after all remaining patients have completed follow-up. The proportion of arms dropped under $H_0$ (treatment effect B) and $H_1$ (treatment effect D) is as expected given the significance level and power in this stage.
The results show that, on average, treatment effects are underestimated in arms which do not show sufficient benefit at the first interim analysis. When I = D the absolute bias in such arms is particularly high when a high significance level (50%) and relatively low power (90%) are used (design (i) in Table 2), in other words, when the interim analysis occurs early. In this design the magnitude of the absolute bias is over 9% under H0. However, the bias is markedly reduced in a reanalysis after all remaining patients have had their outcome observed, with a greater reduction in bias when using a longer follow-up period or, more generally, when more patients can be added to the reanalysis. In this particular example, the magnitude of the absolute bias under H0 decreases from 9.5% to 6.5% for 4-week follow-up and to 4.6% if outcome observation is delayed by 14 weeks after randomisation.
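The direction of this bias can be illustrated with a small simulation. The following is a minimal sketch, not the simulation code used for Table 5: the function name `dropped_arm_bias`, the control arm rate, the stage 1 sample size and the pooled-variance test statistic are all assumptions chosen for illustration, and higher event rates are taken to be favourable. It simulates the first stage of a two-arm trial under H0, drops the experimental arm if it fails to reach the one-sided stage 1 critical value, and averages the estimated treatment effect over the dropped arms.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def dropped_arm_bias(pi_c, theta, n, alpha1, n_trials=2000, seed=1):
    """Simulate stage 1 of a two-arm trial and summarise the arms dropped.

    Returns (proportion of trials dropped, mean estimate minus true theta).
    Higher event rates are treated as better (an assumption of this sketch).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha1)  # one-sided critical value
    dropped = []
    for _ in range(n_trials):
        # n patients per arm, each with an independent binary outcome
        events_c = sum(rng.random() < pi_c for _ in range(n))
        events_e = sum(rng.random() < pi_c + theta for _ in range(n))
        estimate = (events_e - events_c) / n
        pooled = (events_c + events_e) / (2 * n)
        se = sqrt(2 * pooled * (1 - pooled) / n)
        z = estimate / se if se > 0 else 0.0
        if z < z_crit:  # insufficient benefit: randomisation is stopped
            dropped.append(estimate)
    return len(dropped) / n_trials, mean(dropped) - theta


# Hypothetical inputs (not taken from the paper): 80% control arm rate,
# 50 patients per arm at the interim analysis, no true treatment effect.
prop_high, bias_high = dropped_arm_bias(pi_c=0.8, theta=0.0, n=50, alpha1=0.5)
prop_low, bias_low = dropped_arm_bias(pi_c=0.8, theta=0.0, n=50, alpha1=0.2)
print(f"alpha1 = 0.5: dropped {prop_high:.2f}, bias {bias_high:+.3f}")
print(f"alpha1 = 0.2: dropped {prop_low:.2f}, bias {bias_low:+.3f}")
```

Under these assumptions the mean estimate among dropped arms is negative for both significance levels, and larger in magnitude for the 50% level, because a smaller and more extreme subset of trials is selected.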
When using a relatively low significance level in the first stage (e.g. 20%) the bias is of no practical importance in arms which are likely to be stopped at that analysis, particularly after follow-up is complete. When I ≠ D, the bias in the treatment effect estimates for D is much lower than when the same outcome is used throughout the trial, even when using a high significance level in the first stage.

http://www.biomedcentral.com/1471-2288/13/139

Table 4 Overall type I error rates, powers and correlations between stages obtained from simulations of designs (i)-(viii) in Tables 2 and 3. Key: α1 = stage 1 significance level, ω1 = stage 1 power, ρ|Hh = correlation between stages under hypothesis Hh, α = overall type I error rate, ω = overall power. Hats indicate values estimated from simulations.

Table 5 Simulation results showing the proportion of trials stopped at the first interim analysis and the absolute bias for such arms in the estimated treatment effect on D at the interim analysis and after all remaining patients have been followed up. Key: α1 = significance level in stage 1, θD = underlying treatment effect on the definitive outcome. *Plus an additional 6 week delay to determine culture status.

Bias in arms reaching the final analysis
Table 6 shows that treatment effects estimated at the final planned analysis of the trial are overestimated on average, although the bias is generally not as large as it is for arms dropped at the first analysis. The results suggest that bias decreases the further the interim analysis is, in terms of sample size, from the final analysis (i.e. as the correlation between stages decreases) and when the chance of proceeding to the final stage of the trial is higher, as is the case for effective arms.
In the examples used in Table 6 the bias is practically zero in all cases when I ≠ D, even for ineffective arms. This is due to the very low correlation between stages in these designs (roughly 0.1). However, even when the correlation is higher, for example when I = D, the bias is still approximately zero for arms which are likely to proceed to the final stage. Bias is higher for ineffective arms; however, in a well-designed MAMS trial such arms should have little chance of reaching the final stage.

Discussion
We have successfully adapted the MAMS design initially developed by Royston et al. [6] to binary outcomes which are observed at the end of a fixed follow-up period and analysed using an absolute difference in proportions. Throughout this paper we have used TB as an example of a disease area where a MAMS approach could dramatically speed up treatment evaluation compared to the traditional approach of separate, two-arm phase II and III trials. Savings in time and resources are particularly large when using the MAMS design to incorporate both phase II and phase III into a single seamless trial; however, savings are still likely to be made when using it to design multi-arm phase II trials. Many new and repurposed drugs are currently in clinical development for TB and so a huge number of new regimens are likely to be available for testing in phase II and III trials in the near future. Evaluating them in separate, single-stage trials will not only be costly but will also delay the discovery of a simpler and shorter effective regimen by decades. Use of novel trial designs such as the MAMS design is therefore urgently required.
Further work is needed to determine the best intermediate outcome for long-term relapse before the MAMS design described here can be used to evaluate TB treatments in a seamless phase II/III trial. The methods used by Barthel et al. [14], who evaluated the performance of the MAMS design for time-to-event outcomes in four cancer trials, could be applied to past TB trials. If the rate at which trials are incorrectly stopped for lack-of-benefit on culture status at eight weeks is high then other intermediate outcomes will need to be considered, such as culture status at other time points. Another candidate for the intermediate outcome is time to culture conversion, which is increasingly being used in phase II trials and is arguably a more reliable surrogate endpoint than culture status at a single time point [24].
Although surrogacy is not a requirement for an intermediate outcome, it is likely that a surrogate outcome will be a reliable choice. An ongoing trial conducted by the PanACEA consortium with a MAMS design (ClinicalTrials.gov identifier NCT01785186) is using time to culture conversion as its intermediate outcome but, since this is a phase II trial, the definitive outcome is also time to culture conversion. Incorporating this outcome into a MAMS design with a binary definitive outcome will require further extensions to the methodology, which we are currently developing.
The amount of bias likely to be generated in various examples of phase II and phase II/III TB trials was investigated and was shown to be of no practical importance in arms reaching the final analysis, particularly in effective arms or when treatment selection is based on an intermediate outcome. In general, the bias at the final analysis increases as the treatment effects estimated at each stage become more correlated. This is caused by having short stage durations in which only a small amount of new data can be collected. Ensuring that stages are adequately spaced is not only practical from the perspective of everyone involved in the trial but it will also limit the amount of bias likely to be generated.
Like Choodari-Oskooei et al. [15], we found that having an early first interim analysis increased the bias of treatment effect estimates in arms dropped at this analysis, particularly when the intermediate and definitive outcomes were identical. Bias was markedly reduced in a reanalysis after all patients had completed follow-up. It should be noted, however, that the average estimated treatment effect in arms which are stopped early for lack-of-benefit (i.e. are statistically non-significant) will necessarily be less favourable than the true value [25]. Freidlin and Korn [26] suggest that the most appropriate comparator for the x% of trials stopped at the first interim analysis is the average treatment effect estimate of the same outcome in the corresponding x% most extreme trials in the fixed sample-size design (the design that has no interim analyses). When taking this into consideration the bias estimates in Table 5 are nearly halved (data not shown).
A calculation for the overall type I error rate for a single experimental arm was described, thus allowing control of this measure. However, in a multi-arm trial it may be more important to control the familywise type I error rate (FWER), that is, the probability of rejecting at least one true null hypothesis at the end of the trial. Freidlin et al. [7] argue that this decision depends on the clinical questions that the trial is addressing. For example, if a multi-armed trial was used purely for efficiency reasons and the interpretation of the results of one arm has no influence over the results of other arms then they argue that no control of the FWER needs to be made. On the other hand, if in some way the treatment arms are related, such as different doses or schedules of the same treatment, then multiplicity adjustment should be made. Others have said that if a multi-arm design is to be used in a confirmatory trial then FWER control is a requirement [27,28].
For the MAMS design described here, a crude method for ensuring that the FWER is no higher than some prespecified level is to apply a Bonferroni correction to the pairwise type I error rate: i.e. in a trial with K experimental arms, a pairwise α equal to FWER/K could be used. However, such a correction can be too conservative and may result in a trial which is much larger than necessary, thus losing efficiency. More accurate methods for controlling the FWER in the strong sense (i.e. under any parameter configuration) are therefore required and are a subject of ongoing research. Alternatively, other MAMS designs which allow stopping for lack-of-benefit and control the FWER are available [29][30][31][32].
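The Bonferroni correction described above amounts to a one-line calculation. The following is a minimal sketch; the function names `bonferroni_pairwise_alpha` and `fwer_upper_bound` and the example values (K = 4, FWER = 5%) are illustrative assumptions and are not taken from the paper or from the authors' Stata program.

```python
def bonferroni_pairwise_alpha(fwer, k):
    """Pairwise type I error rate that keeps the FWER at or below `fwer`
    when k experimental arms are each compared with control."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not 0 < fwer < 1:
        raise ValueError("fwer must lie strictly between 0 and 1")
    return fwer / k


def fwer_upper_bound(pairwise_alpha, k):
    """Bonferroni (union) bound on the FWER for k pairwise comparisons."""
    return min(1.0, k * pairwise_alpha)


# Hypothetical example: four experimental arms, 5% familywise error rate.
alpha_pairwise = bonferroni_pairwise_alpha(fwer=0.05, k=4)
print(alpha_pairwise)                       # pairwise alpha for each comparison
print(fwer_upper_bound(alpha_pairwise, 4))  # bound on the resulting FWER
```

The bound holds whatever the dependence between comparisons, which is why it tends to be conservative when the comparisons share a control arm.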

Conclusions
The methodology presented in this paper is aimed at reducing the amount of time and resources required to obtain reliable results from clinical studies. A Stata program for designing MAMS trials with binary outcomes is available from the authors upon request. Further work is ongoing into finding MAMS designs which are the most efficient in terms of the expected or maximum sample size, or a mixture of the two, for a given overall pairwise or familywise type I error rate and power. In TB, the MAMS design will have the greatest impact in phase II/III seamless designs; however, considerable savings are also likely to be made in other disease areas.

Appendix: Estimating the correlation matrices
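Before the overall type I error rate and power can be calculated, the correlation matrices $R^0_i$ and $R^1_i$, whose $(j,k)$th entries are the correlations between the treatment effects in stages $j$ and $k$ under $H_0$ and $H_1$ respectively, are required. We begin with a general case where the binary outcomes of interest in stages $j$ and $k$ are different. Suppose outcome $X$ is the outcome of interest in stage $j$ and outcome $Y$ is of interest in stage $k$ with $j < k$, and denote the observed treatment effects by $\hat{\theta}_j = \hat{\pi}^E_j - \hat{\pi}^C_j$ and $\hat{\theta}_k = \hat{\pi}^E_k - \hat{\pi}^C_k$ respectively. If $\pi^h_i = \pi^C_i + \theta^h_i$ is the target experimental arm event rate under hypothesis $H_h$ then, writing $n^C_i$ and $n^E_i$ for the numbers of control and experimental arm patients included in the stage $i$ analysis, the standard deviation of $\hat{\theta}_i$ under $H_h$ in its normal approximation is

$$\sigma^h_i = \sqrt{\frac{\pi^C_i (1 - \pi^C_i)}{n^C_i} + \frac{\pi^h_i (1 - \pi^h_i)}{n^E_i}}.$$

Assuming success rates between treatment arms are independent, the correlation between $\hat{\theta}_j$ and $\hat{\theta}_k$ under hypothesis $H_h$ ($h = 0, 1$), denoted by $\rho^h_{(j,k)}$, is

$$\rho^h_{(j,k)} = \frac{\mathrm{Cov}(\hat{\pi}^C_j, \hat{\pi}^C_k) + \mathrm{Cov}(\hat{\pi}^E_j, \hat{\pi}^E_k)}{\sigma^h_j \sigma^h_k}.$$

Denote by $X^C_m$ and $Y^C_m$ the observed $X$ and $Y$ outcomes respectively for the $m$th patient in the control arm ($X^C_m, Y^C_m \in \{0, 1\}$), where $X^C_m$ is observed during or before stage $j$ and $Y^C_m$ is observed during or before stage $k$ ($j < k$). The covariance between the control arm event rates in stage $j$ on the $X$ outcome and stage $k$ on the $Y$ outcome is

$$\mathrm{Cov}(\hat{\pi}^C_j, \hat{\pi}^C_k) = \mathrm{Cov}\left(\frac{1}{n^C_j} \sum_{l=1}^{n^C_j} X^C_l,\ \frac{1}{n^C_k} \sum_{m=1}^{n^C_k} Y^C_m\right) = \frac{1}{n^C_j n^C_k} \sum_{l=1}^{n^C_j} \sum_{m=1}^{n^C_k} \left[ E(X^C_l Y^C_m) - E(X^C_l) E(Y^C_m) \right].$$

Assuming observations from different patients are independent implies $E(X^C_l Y^C_m) = E(X^C_l) E(Y^C_m)$ if $l \neq m$. Only the $n^C_j$ terms with $l = m$ remain, since the patients analysed at stage $j$ are among those analysed at stage $k$, and so

$$\mathrm{Cov}(\hat{\pi}^C_j, \hat{\pi}^C_k) = \frac{\pi^C_{(j,k)} - \pi^C_j \pi^C_k}{n^C_k},$$

where $\pi^C_{(j,k)}$ is the probability of a patient experiencing both the $X$ and $Y$ outcomes in the control arm. A similar argument for the covariance of event rates between stages in an experimental arm under $H_h$ gives

$$\mathrm{Cov}(\hat{\pi}^E_j, \hat{\pi}^E_k) = \frac{\pi^h_{(j,k)} - \pi^h_j \pi^h_k}{n^E_k}.$$

It follows that

$$\rho^h_{(j,k)} = \frac{1}{\sigma^h_j \sigma^h_k} \left[ \frac{\pi^C_{(j,k)} - \pi^C_j \pi^C_k}{n^C_k} + \frac{\pi^h_{(j,k)} - \pi^h_j \pi^h_k}{n^E_k} \right]. \tag{3}$$

The values $\pi^C_{(j,k)}$ and $\pi^h_{(j,k)}$ may be estimated from prior knowledge or, if estimates of the positive predictive value in each arm are available, that is, the probability of a patient having a $Y$ event given that they have had an $X$ event, then from the definition of conditional probability $\pi^C_{(j,k)} = P(Y^C_m = 1 \mid X^C_m = 1)\, \pi^C_j$ and $\pi^h_{(j,k)} = P(Y^h_m = 1 \mid X^h_m = 1)\, \pi^h_j$.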
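As a numerical illustration, the correlation between the stagewise treatment effect estimates can be computed as the sum of the within-arm covariances divided by the product of the two stagewise standard deviations. The following is a minimal sketch under stated assumptions: analyses are cumulative, so patients analysed at stage j are also analysed at stage k; the function names, argument names and all input values are hypothetical and not taken from the paper.

```python
from math import sqrt


def stage_sd(pi_c, pi_e, n_c, n_e):
    """Normal-approximation SD of a difference in proportions at one stage."""
    return sqrt(pi_c * (1 - pi_c) / n_c + pi_e * (1 - pi_e) / n_e)


def joint_prob_from_ppv(ppv, pi_x):
    """P(X = 1 and Y = 1) = P(Y = 1 | X = 1) * P(X = 1)."""
    return ppv * pi_x


def stage_correlation(pi_c_j, pi_c_k, pi_e_j, pi_e_k,
                      pi_c_jk, pi_e_jk,
                      n_c_j, n_c_k, n_e_j, n_e_k):
    """Correlation between treatment effect estimates at stages j < k.

    pi_*_j and pi_*_k are the event rates on the stage j and stage k
    outcomes; pi_*_jk is the probability of experiencing both outcomes.
    """
    cov_c = (pi_c_jk - pi_c_j * pi_c_k) / n_c_k  # control arm covariance
    cov_e = (pi_e_jk - pi_e_j * pi_e_k) / n_e_k  # experimental arm covariance
    sd_j = stage_sd(pi_c_j, pi_e_j, n_c_j, n_e_j)
    sd_k = stage_sd(pi_c_k, pi_e_k, n_c_k, n_e_k)
    return (cov_c + cov_e) / (sd_j * sd_k)


# Hypothetical example under H0: different outcomes at stages j and k,
# with an assumed positive predictive value of 0.95 in both arms.
rho_diff = stage_correlation(
    pi_c_j=0.85, pi_c_k=0.90, pi_e_j=0.85, pi_e_k=0.90,
    pi_c_jk=joint_prob_from_ppv(0.95, 0.85),
    pi_e_jk=joint_prob_from_ppv(0.95, 0.85),
    n_c_j=100, n_c_k=400, n_e_j=100, n_e_k=400)

# Same outcome at both stages: the joint probability equals the event rate.
rho_same = stage_correlation(0.85, 0.85, 0.85, 0.85, 0.85, 0.85,
                             100, 400, 100, 400)
print(rho_diff, rho_same)
```

With these inputs the same-outcome correlation equals the square root of the ratio of the stage j to stage k sample sizes, and the different-outcome correlation is lower, in line with the lower bias reported for designs that select on an intermediate outcome.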
If the outcomes of interest in stages $j$ and $k$ are the same then equation (3) simplifies. Clearly the positive predictive value is now 1 and so $\pi^C_{(j,k)} = \pi^C_j$ and $\pi^h_{(j,k)} = \pi^h_j$. Then