Testing non-inferiority of a new treatment in three-arm clinical trials with binary endpoints

Background A two-arm non-inferiority trial without a placebo is usually adopted to demonstrate that an experimental treatment is not worse than a reference treatment by a small pre-specified non-inferiority margin due to ethical concerns. Selection of the non-inferiority margin and establishment of assay sensitivity are two major issues in the design, analysis and interpretation for two-arm non-inferiority trials. Alternatively, a three-arm non-inferiority clinical trial including a placebo is usually conducted to assess the assay sensitivity and internal validity of a trial. Recently, some large-sample approaches have been developed to assess the non-inferiority of a new treatment based on the three-arm trial design. However, these methods behave badly with small sample sizes in the three arms. This manuscript aims to develop some reliable small-sample methods to test three-arm non-inferiority. Methods Saddlepoint approximation, exact and approximate unconditional, and bootstrap-resampling methods are developed to calculate p-values of the Wald-type, score and likelihood ratio tests. Simulation studies are conducted to evaluate their performance in terms of type I error rate and power. Results Our empirical results show that the saddlepoint approximation method generally behaves better than the asymptotic method based on the Wald-type test statistic. For small sample sizes, approximate unconditional and bootstrap-resampling methods based on the score test statistic perform better in the sense that their corresponding type I error rates are generally closer to the prespecified nominal level than those of other test procedures. Conclusions Both approximate unconditional and bootstrap-resampling test procedures based on the score test statistic are generally recommended for three-arm non-inferiority trials with binary outcomes.


Background
The objective of a non-inferiority trial is to demonstrate that the efficacy of an experimental treatment is not inferior to that of a reference treatment by more than some pre-specified non-inferiority margin. Many authors have considered two-arm non-inferiority trials without a placebo, since the comparison between the experimental and reference treatments is direct and the potential ethical problems encountered in traditional placebo-controlled trials are avoided (for example, see Dunnett and Gent [1], Tango [2], and Tang et al. [3]). However, there are two major concerns for two-arm non-inferiority trials [4]. The first issue is the choice of the non-inferiority margin, which is the clinically acceptable amount or a combination of statistical reasoning and clinical judgement. The other issue is the evaluation of assay sensitivity, which refers to the ability of a trial to differentiate an effective treatment from a less effective or ineffective treatment [5]. Without a placebo arm, the assay sensitivity of a trial is not demonstrable from the trial data, and one must rely on some external information (e.g., historical placebo trials) for the reference treatment [4]. Without assay sensitivity, any non-inferiority result from the comparison of the experimental and reference treatments becomes unconvincing. There are some indications where it is considered ethically acceptable to continue to randomize patients to placebo despite the fact that an effective treatment exists, and there is interest in seeing not only whether the new treatment works at all but also how it measures up to accepted therapy. In this case, a three-arm non-inferiority clinical trial including the experimental treatment, an active reference treatment and a placebo is usually conducted to assess the assay sensitivity and internal validity of the trial [6]. Indeed, three-arm trials are recommended in the guidelines of the ICH (The International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use) and EMEA/CPMP (European Medicines Agency/Committee for Proprietary Medicinal Products) as a useful approach to the assessment of assay sensitivity and internal validity (e.g., see [7]).
*Correspondence: nstang@ynu.edu.cn. Department of Statistics, Yunnan University, No.2 Cuihu North Road, 650091 Kunming, China. Full list of author information is available at the end of the article. http://www.biomedcentral.com/1471-2288/14/134
Statistical inference based on three-arm non-inferiority clinical trials with normally distributed outcomes has received considerable attention in recent years. For example, Koch and Tangen [8] and Pigeot et al. [9] considered the problem of three-arm non-inferiority testing for normally distributed endpoints with a common but unknown variance. Koti [10] presented a new approach for normally distributed endpoints based on the Fieller-Hinkley distribution. Hasler, Vonk and Hothorn [11] proposed the usage of the t-distribution in the presence of heteroscedasticity. Hida and Tango [7] proposed a test procedure for assessing the assay sensitivity with a pre-specified margin defined as a difference between treatments in the presence of homoscedasticity. Ghosh, Nathoo, Gönen and Tiwari [12] developed a Bayesian approach in the presence of heteroscedasticity by incorporating both parametric and semi-parametric models. Gamalo, Muthukumarana, Ghosh and Tiwari [13] extended the existing generalized p-value approach for assessing the non-inferiority of a new treatment in a three-arm trial.
Recently, some statistical methods have also been developed for three-arm non-inferiority testing with binary endpoints. For example, Tang and Tang [14] proposed two asymptotic approaches for testing three-arm non-inferiority via the rate difference based on Wald-type and score test statistics. Kieser and Friede (2007) revisited the performance of Tang and Tang's [14] asymptotic test statistics via simulation studies and derived approximate sample size formulae for achieving the desired power. Munk, Mielke, Skipka and Freitag [15] developed likelihood ratio tests. Li and Gao [4] used the closed testing principle to establish a hierarchical testing procedure and proposed a group sequential type design. Liu, Tzeng and Tsou [16] presented a three-step testing procedure and derived an optimal sample size allocation rule, in an ethical and reliable manner, that minimizes the total sample size.
All aforementioned approaches for testing non-inferiority of a new treatment in a three-arm clinical trial with binary endpoints are based on large-sample theory, and their accuracy has long been suspected and criticized when sample sizes are small or the data structure is sparse. To the best of our knowledge, limited work has been done to address these issues. Motivated by Jensen [17], we derive saddlepoint approximations to the cumulative distribution functions of the Wald-type, score and likelihood ratio test statistics. Inspired by Tang and Tang [18], we also propose exact unconditional, approximate unconditional and bootstrap-resampling p-value calculation procedures for testing three-arm non-inferiority with small sample sizes.
The rest of this article is organized as follows. We first review three test statistics for assessing non-inferiority of a new treatment in three-arm clinical trials with binary endpoints. We then propose saddlepoint approximation, exact and approximate unconditional, and bootstrap-resampling approaches for calculating p-values. Simulation studies are conducted to investigate the performance of all test statistics based on different p-value calculation approaches in terms of type I error rate and power. An example is analyzed to demonstrate our methodologies. Finally, we discuss the performance of our proposed methodologies and present some conclusions.

Model
Consider a clinical trial with test (T), reference (R) and placebo (P) treatments, and assume that the primary clinical outcomes $X_T$, $X_R$ and $X_P$ are independent and binomially distributed as $X_T \sim \mathrm{Bin}(n_T, \pi_T)$, $X_R \sim \mathrm{Bin}(n_R, \pi_R)$ and $X_P \sim \mathrm{Bin}(n_P, \pi_P)$, respectively. Here, $X_T$, $X_R$ and $X_P$ are the numbers of responses in groups T, R and P, respectively; $\pi_T$, $\pi_R$ and $\pi_P$ represent the corresponding response probabilities, with a higher probability indicating a more favorable outcome; and $n_T$, $n_R$ and $n_P$ denote the corresponding sample sizes. Thus, the joint probability density function of $(x_T, x_R, x_P)$ is given by

$$f(x_T, x_R, x_P \mid \pi_T, \pi_R, \pi_P) = \prod_{k \in \{T, R, P\}} \binom{n_k}{x_k} \pi_k^{x_k} (1 - \pi_k)^{n_k - x_k}. \qquad (2.1)$$

It can be easily shown from Equation (2.1) that the maximum likelihood estimates (MLEs) of $\pi_T$, $\pi_R$ and $\pi_P$ are given by $\hat\pi_T = x_T/n_T$, $\hat\pi_R = x_R/n_R$ and $\hat\pi_P = x_P/n_P$, respectively.

Test statistics
Following Hida and Tango [7], to test the non-inferiority of the experimental treatment to the reference with assay sensitivity in a three-arm trial, we have to simultaneously demonstrate (i) the superiority of the experimental treatment to the placebo, (ii) the non-inferiority of the experimental treatment to the reference with a non-inferiority margin $\Delta > 0$, and (iii) the superiority of the reference treatment to the placebo by more than $\Delta$. That is, $\pi_T$, $\pi_R$ and $\pi_P$ must satisfy the inequalities $\pi_P < \pi_R - \Delta < \pi_T$, which can be written as the following two hypotheses:

$$H_0: \pi_T \le \pi_R - \Delta \ \text{ versus } \ H_1: \pi_T > \pi_R - \Delta, \qquad K_0: \pi_R - \Delta \le \pi_P \ \text{ versus } \ K_1: \pi_R - \Delta > \pi_P.$$

Similar to Pigeot et al. [9], we take the margin $\Delta$ as a fraction $f$ of the effect size of the reference treatment, i.e., $\Delta = f(\pi_R - \pi_P)$. Common choices are $f = 1/2$ and $1/3$ [14]. Thus, the second hypothesis can be expressed as $K_0: \pi_R \le \pi_P$ versus $K_1: \pi_R > \pi_P$. If $K_0$ is rejected, letting $f = 1 - \theta$ yields the following non-inferiority hypothesis:

$$H_0: \pi_T - \pi_P \le \theta(\pi_R - \pi_P) \ \text{ versus } \ H_1: \pi_T - \pi_P > \theta(\pi_R - \pi_P), \qquad (2.2)$$

where $\theta \in (0, 1)$ is a fixed retention fraction [8]. Rejecting $H_0$ implies that the test treatment preserves at least $100\theta\%$ of the efficacy of the reference treatment compared to placebo [19]. Similar to Tang and Tang [14], we only consider hypothesis $H_0$ and assume that $K_0$ is rejected at some pre-given significance level. Thus, the non-inferiority hypothesis (2.2) can be rewritten as

$$H_0: \frac{\pi_T - \pi_P}{\pi_R - \pi_P} \le \theta \ \text{ versus } \ H_1: \frac{\pi_T - \pi_P}{\pi_R - \pi_P} > \theta. \qquad (2.3)$$

The non-inferiority hypothesis (2.3) can be expressed as

$$H_0: \psi \le 0 \ \text{ versus } \ H_1: \psi > 0, \qquad (2.4)$$

where $\psi = \pi_T - \theta\pi_R - (1 - \theta)\pi_P$. The restricted maximum likelihood estimates (RMLEs) of $\pi_T$, $\pi_R$ and $\pi_P$ (denoted by $\tilde\pi_T$, $\tilde\pi_R$, $\tilde\pi_P$) can be computed as follows. If the MLEs $\hat\pi_T$, $\hat\pi_R$, $\hat\pi_P$ satisfy the conditions $\hat\pi_T - \theta\hat\pi_R - (1 - \theta)\hat\pi_P \le 0$ and $\hat\pi_R - \hat\pi_P > 0$, we take $\tilde\pi_T = \hat\pi_T$, $\tilde\pi_R = \hat\pi_R$ and $\tilde\pi_P = \hat\pi_P$; otherwise, the RMLEs can be calculated by setting $\pi_T = \theta\pi_R + (1 - \theta)\pi_P$ in the likelihood function (2.1) and maximizing it with respect to $\pi_R$ and $\pi_P$.
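For the latter, it follows from Equation (2.1) that the RMLEs of $\pi_R$ and $\pi_P$ can be obtained by simultaneously solving the following equations in the parameter space $\Omega = \{(\pi_P, \pi_R) : 0 \le \pi_P < \pi_R \le 1\}$:

$$\theta\left[\frac{x_T}{\pi_T} - \frac{n_T - x_T}{1 - \pi_T}\right] + \frac{x_R}{\pi_R} - \frac{n_R - x_R}{1 - \pi_R} = 0, \qquad (1 - \theta)\left[\frac{x_T}{\pi_T} - \frac{n_T - x_T}{1 - \pi_T}\right] + \frac{x_P}{\pi_P} - \frac{n_P - x_P}{1 - \pi_P} = 0,$$

with $\pi_T = \theta\pi_R + (1 - \theta)\pi_P$. It is possible that no point $(\pi_P, \pi_R) \in \Omega$ satisfies the above equations, which implies that the likelihood function given in Equation (2.1) attains its maximum on the boundary of the parameter space $\Omega$.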
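The RMLE rule described above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it replaces the score equations with a coarse-to-fine grid search over $(\pi_P, \pi_R)$, which is reasonable because the constrained log-likelihood is concave in these two parameters. The function names (`arm_loglik`, `boundary_mle`, `rmle`), the grid settings and the tuple ordering (T, R, P) are assumptions made for this sketch.

```python
import math

def arm_loglik(x, n, p):
    """Binomial log-likelihood of one arm (binomial coefficient dropped)."""
    ll = 0.0
    if x > 0:
        ll += x * math.log(p) if p > 0 else -math.inf
    if n - x > 0:
        ll += (n - x) * math.log(1 - p) if p < 1 else -math.inf
    return ll

def boundary_mle(x, n, theta, rounds=8, grid=20):
    """Maximise the likelihood subject to pi_T = theta*pi_R + (1-theta)*pi_P
    and pi_P <= pi_R, using a coarse-to-fine grid search (an assumption of
    this sketch; the paper solves the score equations instead)."""
    (x_t, x_r, x_p), (n_t, n_r, n_p) = x, n

    def objective(pp, pr):
        pt = theta * pr + (1 - theta) * pp
        return (arm_loglik(x_t, n_t, pt) + arm_loglik(x_r, n_r, pr)
                + arm_loglik(x_p, n_p, pp))

    lo_p, hi_p, lo_r, hi_r = 0.0, 1.0, 0.0, 1.0
    best_val, best_pp, best_pr = -math.inf, 0.5, 0.5
    for _ in range(rounds):
        step_p, step_r = (hi_p - lo_p) / grid, (hi_r - lo_r) / grid
        for i in range(grid + 1):
            for j in range(grid + 1):
                pp = min(hi_p, lo_p + i * step_p)
                pr = min(hi_r, lo_r + j * step_r)
                if pp > pr:  # stay inside the closure of the parameter space
                    continue
                val = objective(pp, pr)
                if val > best_val:
                    best_val, best_pp, best_pr = val, pp, pr
        # Shrink the search window around the current best point.
        lo_p, hi_p = max(0.0, best_pp - step_p), min(1.0, best_pp + step_p)
        lo_r, hi_r = max(0.0, best_pr - step_r), min(1.0, best_pr + step_r)
    return theta * best_pr + (1 - theta) * best_pp, best_pr, best_pp

def rmle(x, n, theta):
    """RMLE under H0: psi <= 0, returned as (pi_T, pi_R, pi_P)."""
    pt, pr, pp = (xk / nk for xk, nk in zip(x, n))
    if pt - theta * pr - (1 - theta) * pp <= 0 and pr - pp > 0:
        return pt, pr, pp  # the MLE already satisfies the null constraints
    return boundary_mle(x, n, theta)
```

For example, `rmle((8, 6, 2), (10, 10, 10), 0.5)` falls back to the boundary search because the MLE gives $\hat\psi > 0$. A root finder applied to the score equations would be the more direct route when an interior solution exists.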
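Following Tang and Tang [14], $\psi$ can be estimated by $\hat\psi = \hat\pi_T - \theta\hat\pi_R - (1 - \theta)\hat\pi_P$, and its variance is given by

$$\mathrm{var}(\hat\psi) = \sigma^2(\pi) = \frac{\pi_T(1 - \pi_T)}{n_T} + \frac{\theta^2 \pi_R(1 - \pi_R)}{n_R} + \frac{(1 - \theta)^2 \pi_P(1 - \pi_P)}{n_P},$$

which can be estimated by $\sigma^2(\check\pi)$, where $\check\pi$ is some appropriate estimate of $\pi = (\pi_T, \pi_R, \pi_P)$, for example, the MLE $\hat\pi = (\hat\pi_T, \hat\pi_R, \hat\pi_P)$ or the RMLE $\tilde\pi = (\tilde\pi_T, \tilde\pi_R, \tilde\pi_P)$ of $\pi$. Thus, the statistics for testing hypothesis (2.4) are given by

$$T_W = \frac{\hat\psi}{\sigma(\hat\pi)}, \qquad T_R = \frac{\hat\psi}{\sigma(\tilde\pi)},$$

which are asymptotically distributed as the standard normal distribution under $H_0$ when $n_T$, $n_R$ and $n_P$ are sufficiently large. Hence, non-inferiority can be claimed if $T_j$ ($j = W, R$) exceeds the upper $\alpha$ critical value of the standard normal distribution. [...] [20], and $T_R$ is the test statistic given by Farrington and Manning [21] for two-arm non-inferiority trials.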
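As a small worked illustration of the statistics above, the sketch below computes $\hat\psi$, the variance formula, and the standardized statistic for a given plug-in estimate. Passing the MLE gives the Wald-type statistic $T_W$; passing an RMLE computed elsewhere gives the score statistic $T_R$. The function names and the handling of a zero variance (returning $\pm\infty$ or 0) are assumptions of this sketch, not conventions stated in the paper.

```python
import math

def psi_hat(x, n, theta):
    """Point estimate of psi; x and n are (T, R, P) tuples."""
    (x_t, x_r, x_p), (n_t, n_r, n_p) = x, n
    return x_t / n_t - theta * x_r / n_r - (1 - theta) * x_p / n_p

def psi_variance(probs, n, theta):
    """Variance of psi_hat evaluated at the probabilities `probs` (T, R, P)."""
    (p_t, p_r, p_p), (n_t, n_r, n_p) = probs, n
    return (p_t * (1 - p_t) / n_t
            + theta ** 2 * p_r * (1 - p_r) / n_r
            + (1 - theta) ** 2 * p_p * (1 - p_p) / n_p)

def standardized_statistic(x, n, theta, plug_in=None):
    """psi_hat / sigma(plug_in).

    plug_in=None uses the MLE (Wald-type statistic T_W); supplying the RMLE
    gives the score-type statistic T_R.
    """
    if plug_in is None:
        plug_in = tuple(xk / nk for xk, nk in zip(x, n))
    psi = psi_hat(x, n, theta)
    var = psi_variance(plug_in, n, theta)
    if var <= 0:
        # Degenerate tables: a convention assumed for this sketch.
        return 0.0 if psi == 0 else math.copysign(math.inf, psi)
    return psi / math.sqrt(var)

# Hypothetical counts, for illustration only.
t_w = standardized_statistic((8, 6, 2), (10, 10, 10), theta=0.5)
```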
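The signed root of the likelihood ratio statistic for testing hypothesis (2.4) is given by

$$T_L = \mathrm{sgn}(\hat\psi)\sqrt{2\{\ell(\hat\pi_T, \hat\pi_R, \hat\pi_P) - \ell(\tilde\pi_T, \tilde\pi_R, \tilde\pi_P)\}},$$

where $\ell$ denotes the log-likelihood function obtained from Equation (2.1). $T_L$ is asymptotically distributed as the standard normal distribution under $H_0$ when $n_T$, $n_R$ and $n_P$ are sufficiently large.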

p-value calculation methods
Non-inferiority under hypothesis (2.2) can be claimed via the p-value method with the rule: $H_0$ is rejected if the p-value is less than or equal to the prespecified significance level $\alpha$. In what follows, we introduce five approaches for calculating p-values.

(1) Asymptotic method (AM)
It follows from the above arguments that all statistics $T_j$ ($j = W, R, L$) asymptotically follow the standard normal distribution under the null hypothesis $H_0: \psi \le 0$. Thus, the asymptotic p-value for testing hypothesis (2.2) via statistic $T_j$ is $1 - \Phi(t_j^o)$, where $t_j^o$ is the observed value of $T_j$ and $\Phi(\cdot)$ is the standard normal cumulative distribution function. The above asymptotic approach for calculating the p-value of testing hypothesis (2.2) via statistic $T_j$ ($j = W, R, L$) is established under large-sample theory. Its accuracy has long been suspected and criticized, especially when $n_T$, $n_R$ and/or $n_P$ are small, since the skewness of the underlying binomial distributions is not taken into consideration. Some higher-order corrections such as the saddlepoint approximation [17] have been proposed to improve the accuracy of the normal approximation. In what follows, we derive saddlepoint approximations to the distributions of the three test statistics.
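The asymptotic p-value is simply the upper tail of the standard normal distribution at the observed statistic. A minimal sketch, with the function name `asymptotic_p_value` chosen here for illustration:

```python
import math

def asymptotic_p_value(t_obs):
    """One-sided p-value 1 - Phi(t_obs) for H0: psi <= 0 versus H1: psi > 0."""
    # erfc gives the normal upper tail without loss of precision for large t_obs.
    return 0.5 * math.erfc(t_obs / math.sqrt(2.0))

def reject_h0(t_obs, alpha=0.05):
    """Decision rule: reject H0 when the p-value is at most alpha."""
    return asymptotic_p_value(t_obs) <= alpha
```

Using `math.erfc` rather than `1 - Phi(t)` avoids cancellation when the observed statistic is large.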

(2) Saddlepoint approximation method (SAM)
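Since $X_T$, $X_R$ and $X_P$ are independent and $X_i \sim \mathrm{Bin}(n_i, \pi_i)$ ($i = T, R, P$), the moment generating function of $\hat\psi$ is given by

$$M(t) = E\{\exp(t\hat\psi)\} = \left(1 - \pi_T + \pi_T e^{t/n_T}\right)^{n_T} \left(1 - \pi_R + \pi_R e^{-\theta t/n_R}\right)^{n_R} \left(1 - \pi_P + \pi_P e^{(\theta - 1)t/n_P}\right)^{n_P},$$

with the cumulant generating function being

$$K(t) = n_T \log\left(1 - \pi_T + \pi_T e^{t/n_T}\right) + n_R \log\left(1 - \pi_R + \pi_R e^{-\theta t/n_R}\right) + n_P \log\left(1 - \pi_P + \pi_P e^{(\theta - 1)t/n_P}\right),$$

where $-1 \le t \le 1$. Thus, the first two derivatives of the cumulant generating function $K(t)$ are given by

$$\dot K(t) = \frac{\pi_T e^{t/n_T}}{1 - \pi_T + \pi_T e^{t/n_T}} - \frac{\theta \pi_R e^{-\theta t/n_R}}{1 - \pi_R + \pi_R e^{-\theta t/n_R}} + \frac{(\theta - 1)\pi_P e^{(\theta - 1)t/n_P}}{1 - \pi_P + \pi_P e^{(\theta - 1)t/n_P}},$$

and

$$\ddot K(t) = \frac{\pi_T(1 - \pi_T) e^{t/n_T}}{n_T\left(1 - \pi_T + \pi_T e^{t/n_T}\right)^2} + \frac{\theta^2 \pi_R(1 - \pi_R) e^{-\theta t/n_R}}{n_R\left(1 - \pi_R + \pi_R e^{-\theta t/n_R}\right)^2} + \frac{(1 - \theta)^2 \pi_P(1 - \pi_P) e^{(\theta - 1)t/n_P}}{n_P\left(1 - \pi_P + \pi_P e^{(\theta - 1)t/n_P}\right)^2},$$

respectively. To obtain the saddlepoint approximation to $P(\hat\psi \ge b)$, we need to solve the saddlepoint equation $\dot K(t) = b$, whose unique solution is denoted by $\hat t$. Following Jing and Robinson [22], the saddlepoint approximation to the cumulative distribution function of the statistic $\hat\psi$ is given by [...], where $\omega = \mathrm{sgn}(\hat t)\sqrt{2\{\hat t b - K(\hat t)\}}$ and $\upsilon = \hat t \sqrt{\ddot K(\hat t)}$. Thus, the saddlepoint approximation to $P(T_j \ge t_j^o \mid H_0)$ ($j = W, R, L$) is given by [...], $\hat A_j$ is the unique solution to the equation [...], and $H_2 = n_R n_P \tilde\pi_R(1 - \tilde\pi_R)\tilde\pi_P(1 - \tilde\pi_P)$.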
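To make the saddlepoint recipe concrete, the sketch below evaluates the cumulant generating function of $\hat\psi$ and its derivatives, solves $\dot K(t) = b$ by bisection, and then applies the classical Lugannani-Rice tail formula $1 - \Phi(\omega) + \phi(\omega)(1/\upsilon - 1/\omega)$. The choice of the Lugannani-Rice form is an assumption of this sketch: it is a standard formula built from the same $\omega$ and $\upsilon$, but it may differ from the exact expression used in the paper, and it ignores the lattice nature of $\hat\psi$. The function names and the normal fallback near the mean are also assumptions.

```python
import math

def normal_sf(z):
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def cgf_and_derivatives(t, n, probs, theta):
    """Return K(t), K'(t), K''(t) for psi_hat; n and probs are (T, R, P)."""
    coefs = (1.0, -theta, theta - 1.0)  # weights of X_k / n_k in psi_hat
    k0 = k1 = k2 = 0.0
    for nk, pk, ck in zip(n, probs, coefs):
        a = ck / nk
        e = pk * math.exp(a * t)
        d = 1.0 - pk + e
        k0 += nk * math.log(d)
        k1 += ck * e / d
        k2 += nk * a * a * e * (1.0 - pk) / (d * d)
    return k0, k1, k2

def saddlepoint_tail(b, n, probs, theta, tol=1e-12):
    """Approximate P(psi_hat >= b), assuming the Lugannani-Rice formula.

    Intended for moderate b; extreme b may overflow in exp().
    """
    if not -1.0 < b < 1.0:
        raise ValueError("b must lie strictly between -1 and 1")

    def k1(t):
        return cgf_and_derivatives(t, n, probs, theta)[1]

    lo, hi = -1.0, 1.0
    for _ in range(60):          # expand the bracket to the left if needed
        if k1(lo) <= b:
            break
        lo *= 2.0
    for _ in range(60):          # expand the bracket to the right if needed
        if k1(hi) >= b:
            break
        hi *= 2.0
    for _ in range(200):         # bisection: K' is strictly increasing
        mid = 0.5 * (lo + hi)
        if k1(mid) < b:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    t_hat = 0.5 * (lo + hi)
    k0, _, k2 = cgf_and_derivatives(t_hat, n, probs, theta)
    omega = math.copysign(math.sqrt(max(0.0, 2.0 * (t_hat * b - k0))), t_hat)
    if abs(omega) < 1e-2:
        # Near the mean the formula is numerically unstable; use a crude
        # normal approximation instead (an assumption of this sketch).
        _, mean, var = cgf_and_derivatives(0.0, n, probs, theta)
        return normal_sf((b - mean) / math.sqrt(var))
    upsilon = t_hat * math.sqrt(k2)
    return normal_sf(omega) + normal_pdf(omega) * (1.0 / upsilon - 1.0 / omega)
```

In practice the unknown probabilities would be replaced by estimates under the null hypothesis, for example the RMLEs.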

(3) Exact unconditional method (EUM)
When the sample sizes (i.e., $n_T$, $n_R$, $n_P$) are small, asymptotic methods may yield inflated type I error rates, and their exact versions may provide a reliable alternative. Under $H_0: \psi \le 0$ with $\pi_P < \pi_R$, the parameters $\pi_R$ and $\pi_P$ must belong to the following constrained parameter space: [...] and empty set otherwise}. Under the null hypothesis, the probability density function (2.1) can be re-expressed via $\pi_T = \psi + \theta\pi_R + (1 - \theta)\pi_P$, with $\pi_R$, $\pi_P$ and $\psi$ being nuisance parameters. These nuisance parameters can be eliminated by maximizing the null likelihood over the complete domain [...]. Similar to Tang and Tang [18], the exact unconditional p-value for testing $H_0$: [...]

(4) Approximate unconditional method (AUM)
According to Tang and Tang [18] and Tang, Tang and Rosner [23], the exact unconditional test is always conservative, i.e., its corresponding type I error rate is always less than or equal to the prespecified significance level. Following Tang and Tang [18], the nuisance parameters can instead be eliminated by evaluating them at their corresponding RMLEs under $\psi = 0$. The approximate unconditional p-value for testing $H_0: \psi \le 0$ via statistic $T_j$ [...]
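The approximate unconditional idea can be sketched as a full enumeration: fix the nuisance parameters at their estimates under $\psi = 0$, then sum the joint binomial probabilities of all tables whose statistic is at least as large as the observed one. The sketch below uses the Wald-type statistic for brevity and takes the null probabilities as an input (`null_probs`, assumed to be computed elsewhere, e.g. as RMLEs under $\psi = 0$). Function names and the zero-variance convention are assumptions of this sketch.

```python
import math
from itertools import product

def binom_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def wald_statistic(x, n, theta):
    """Wald-type statistic; x and n are (T, R, P) tuples."""
    pt, pr, pp = (xk / nk for xk, nk in zip(x, n))
    n_t, n_r, n_p = n
    psi = pt - theta * pr - (1 - theta) * pp
    var = (pt * (1 - pt) / n_t + theta ** 2 * pr * (1 - pr) / n_r
           + (1 - theta) ** 2 * pp * (1 - pp) / n_p)
    if var <= 0:
        # Degenerate tables: a convention assumed for this sketch.
        return 0.0 if psi == 0 else math.copysign(math.inf, psi)
    return psi / math.sqrt(var)

def unconditional_p_value(x_obs, n, theta, null_probs, stat=wald_statistic):
    """Sum of null probabilities over all tables with stat >= observed stat."""
    t_obs = stat(x_obs, n, theta)
    # Precompute each arm's binomial probabilities under the null estimates.
    pmfs = [[binom_pmf(x, nk, pk) for x in range(nk + 1)]
            for nk, pk in zip(n, null_probs)]
    total = 0.0
    for x in product(*(range(nk + 1) for nk in n)):
        if stat(x, n, theta) >= t_obs:
            total += pmfs[0][x[0]] * pmfs[1][x[1]] * pmfs[2][x[2]]
    return total
```

The enumeration visits $(n_T + 1)(n_R + 1)(n_P + 1)$ tables, so it is practical only for small sample sizes. An exact unconditional version would additionally maximize this sum over the nuisance parameter space, which is what makes it slower.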

(5) Bootstrap-resampling method (BTM)
Hypothesis testing based on the bootstrap-resampling method is usually recommended when the sample sizes (i.e., $n_T$, $n_R$ and $n_P$) are small [24] or the data structure is sparse (e.g., $x_T$, $x_R$ or $x_P$ is close to zero or to $n_T$, $n_R$ or $n_P$, respectively). Given the observation $(x_T^o, x_R^o, x_P^o)$, we compute the RMLEs $\tilde\pi_T$, $\tilde\pi_R$ and $\tilde\pi_P$ of the parameters $\pi_T$, $\pi_R$ and $\pi_P$, and calculate the observed value $t_j^o$ of statistic $T_j$ ($j = W, R, L$). Based on the RMLEs $\tilde\pi_T$, $\tilde\pi_R$ and $\tilde\pi_P$, we generate $B$ bootstrap samples $(x_T^b, x_R^b, x_P^b)$, $b = 1, \ldots, B$, from the following distribution: $x_k^b \sim \mathrm{Bin}(n_k, \tilde\pi_k)$ for $k = T$, $R$ and $P$. For each of the $B$ bootstrap samples, we compute the observed value $t_j^b$ of statistic $T_j$ ($j = W, R, L$). Hence, an approximate p-value for testing $H_0: \psi \le 0$ via statistic $T_j$ based on [...]
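The bootstrap steps above can be sketched as a parametric bootstrap. The sketch below assumes the p-value is the proportion of bootstrap statistics that are at least as large as the observed one, which is the usual convention but is not spelled out in the surviving text. It uses the Wald-type statistic for brevity and takes the null estimates as an input (`null_probs`, assumed to be the RMLEs computed elsewhere). The function names, the seed handling and the default `n_boot` are illustrative choices.

```python
import math
import random

def wald_statistic(x, n, theta):
    """Wald-type statistic; x and n are (T, R, P) tuples."""
    pt, pr, pp = (xk / nk for xk, nk in zip(x, n))
    n_t, n_r, n_p = n
    psi = pt - theta * pr - (1 - theta) * pp
    var = (pt * (1 - pt) / n_t + theta ** 2 * pr * (1 - pr) / n_r
           + (1 - theta) ** 2 * pp * (1 - pp) / n_p)
    if var <= 0:
        # Degenerate tables: a convention assumed for this sketch.
        return 0.0 if psi == 0 else math.copysign(math.inf, psi)
    return psi / math.sqrt(var)

def draw_binomial(rng, n, p):
    """Binomial draw as a sum of Bernoulli trials (fine for small n)."""
    return sum(rng.random() < p for _ in range(n))

def bootstrap_p_value(x_obs, n, theta, null_probs, n_boot=2000, seed=0):
    """Proportion of bootstrap statistics >= the observed statistic."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    t_obs = wald_statistic(x_obs, n, theta)
    exceed = 0
    for _ in range(n_boot):
        x_b = tuple(draw_binomial(rng, nk, pk) for nk, pk in zip(n, null_probs))
        if wald_statistic(x_b, n, theta) >= t_obs:
            exceed += 1
    return exceed / n_boot
```

Unlike full enumeration, the cost here grows with `n_boot` rather than with the number of possible tables, at the price of Monte Carlo error in the p-value.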

Simulation study
Simulation studies are conducted to investigate the performance of the various test statistics together with the five p-value calculation methods in small-sample designs (e.g., $n = 30$ and $60$, where $n = n_P + n_R + n_T$, with the allocation ratios $\lambda_P : \lambda_R : \lambda_T = 1 : n_R/n_P : n_T/n_P$ taken to be 1:1:1, 1:2:2 and 1:2:3) in terms of type I error rate and power. For each $(n_P, n_R, n_T)$, we consider the following probability settings [19]: $\pi_P = 0.05, 0.10, 0.15, \ldots, 0.50$, $\pi_R = \pi_P + 0.05, \pi_P + 0.10, \ldots, 0.95$, and $\pi_T = \theta\pi_R + (1 - \theta)\pi_P$, which corresponds to a total of 11,340 configurations of $(\pi_P, \pi_R, \pi_T)$, and the following two non-inferiority margins: $\theta = 0.6$ and $0.8$. The nominal level is taken to be $\alpha = 0.05$. For the given values of $n$ and allocation ratio $\lambda_P : \lambda_R : \lambda_T$, $n_k$ is given by $n_k = n\lambda_k/(\lambda_P + \lambda_R + \lambda_T)$ for $k = P$, $R$ and $T$. Thus, given $n$, the allocation ratio and $(\pi_P, \pi_R, \pi_T)$, the type I error rate for testing hypothesis $H_0: \psi \le 0$ versus $H_1: \psi > 0$ via test statistic $T_j$ ($j = W, R, L$) at the significance level $\alpha$ is calculated by [...]
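The design arithmetic and an exact rejection-rate calculation can be sketched as follows. The sketch assumes that the type I error rate is obtained by summing the joint binomial probabilities of all tables on which the test rejects, which is consistent with the "exact powers" reported later but is not confirmed by the truncated formula. The decision rule is passed in as a user-supplied function `reject`; the names `arm_sizes` and `exact_rejection_rate` are hypothetical.

```python
import math
from itertools import product

def arm_sizes(total, ratio):
    """n_k = n * lambda_k / (lambda_P + lambda_R + lambda_T).

    `ratio` is given in the order P:R:T, as in the text.
    """
    denom = sum(ratio)
    sizes = {}
    for arm, lam in zip(("P", "R", "T"), ratio):
        if (total * lam) % denom != 0:
            raise ValueError("total is not divisible by the allocation ratio")
        sizes[arm] = total * lam // denom
    return sizes

def binom_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def exact_rejection_rate(n, probs, reject):
    """Probability that `reject(x)` is true when X_k ~ Bin(n_k, probs_k).

    n and probs are (T, R, P) tuples; `reject` maps a table (x_T, x_R, x_P)
    to True or False. With probs on the boundary psi = 0 this is a type I
    error rate; with psi > 0 it is a power.
    """
    pmfs = [[binom_pmf(x, nk, pk) for x in range(nk + 1)]
            for nk, pk in zip(n, probs)]
    rate = 0.0
    for x in product(*(range(nk + 1) for nk in n)):
        if reject(x):
            rate += pmfs[0][x[0]] * pmfs[1][x[1]] * pmfs[2][x[2]]
    return rate

# Sample sizes for n = 30 under the 1:2:3 allocation (P:R:T).
sizes = arm_sizes(30, (1, 2, 3))
```

Any of the p-value procedures can be plugged in through `reject`, for instance a function that returns whether the computed p-value is at most $\alpha$.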

Simulation study
To compare the performance of AM, SAM, EUM, AUM and BTM together with test statistics $T_W$, $T_R$ and $T_L$ under the balanced and unbalanced designs, Figure 1 presents boxplots of their corresponding type I error rates for $n = 30$ and $60$, and $\lambda_P : \lambda_R : \lambda_T$ = 1:1:1, 1:2:2 and 1:2:3, where AMk, SAk, EUk, AUk and BTk represent AM, SAM, EUM, AUM and BTM for test statistic $T_k$ with $k$ = W, R and L, respectively. Here, each boxplot in Figure 1 contains 2 (i.e., the number of non-inferiority margins) × 11,340 (i.e., the number of configurations for $(\pi_P, \pi_R, \pi_T)$) = 22,680 data points. From Figure 1, [...] (i) AUM and BTM outperform the other three p-value calculation procedures, and (ii) $T_R$ behaves better than the other two test statistics regardless of the p-value calculation procedure. Fifth, the median of the type I error rates becomes closer to the prespecified nominal level as the total sample size $n$ increases, whilst at the same time the variability of the type I error rates decreases. Sixth, the variability of the type I error rates for unbalanced designs is not significantly different from that for the balanced designs.
To investigate the sensitivity of the various p-value calculation procedures (i.e., AM, SAM, EUM, AUM and BTM) to different test statistics, Figure 2 presents boxplots of their corresponding type I error rates against $\pi_P$ for test statistics $T_W$, $T_R$ and $T_L$. Examination of Figure 2 shows that there is no significant effect of $\pi_P$ on the type I error rate.
We also calculate the powers of the five p-value calculation procedures together with the three test statistics at the nominal level $\alpha = 0.05$ when $\pi_T = \pi_R$ and $\theta = 0.6$ with the following settings: $n = 30$ and $60$, $\pi_P = 0.15$ and $0.3$, and $\pi_R = 0.5$, $0.8$ and $0.95$ for the balanced allocation 1:1:1 and the unbalanced allocation 1:2:3. Results are reported in Table 1. Examination of Table 1 indicates that (i) $T_R$ is generally more powerful than $T_W$ and $T_L$ for the EUM [...]

Table 1 Exact powers (%) of various test procedures together with three statistics when $\pi_T = \pi_R$ with $n = 30$ and $60$, $\theta = 0.6$ and $\alpha = 0.05$

Real data example
An example from a pharmacological study of patients with functional dyspepsia (FD) and a placebo-controlled trial of subjects with acute migraine is used to illustrate our proposed methodologies. This example has been analyzed by Holtmann et al. [25] and Tang and Tang [14]. In this example, cisapride and simethicone can be regarded as the existing reference and new experimental treatments, respectively. In that study, among $n = 178$ patients with FD, $n_P = 61$, $n_R = 59$ and $n_T = 58$ were randomized and treated, using a double-dummy technique, with placebo, cisapride and simethicone, respectively; adverse events (e.g., diarrhea and pain) occurred in $x_P = 7$, $x_R = 10$ and $x_T = 12$ patients treated with placebo, cisapride and simethicone, respectively. It is of interest to test whether simethicone is non-inferior to cisapride in terms of the rate of reported adverse events in the presence of placebo. Given $\theta = 0.6$ and $0.8$, the corresponding p-values for testing $H_0: (\pi_T - \pi_P)/(\pi_R - \pi_P) \le \theta$ versus $H_1: (\pi_T - \pi_P)/(\pi_R - \pi_P) > \theta$ based on the five p-value calculation procedures and three test statistics are reported in Table 2. According to Table 2, there is no evidence that simethicone is non-inferior to cisapride in the presence of placebo at the nominal level $\alpha = 0.05$, which is consistent with the conclusion of Tang and Tang [14].

Discussion
Simulation results demonstrate that our proposed score test statistic outperforms the other test statistics in terms of type I error rate and power under the settings considered. The approximate unconditional and bootstrap-resampling methods perform better than the other p-value calculation procedures in the sense that their type I error rates are closer to the prespecified nominal level and their powers are larger. The exact unconditional method is conservative and time-consuming when sample sizes are large (e.g., see the 6th column in Table 3). The asymptotic tests are liberal, since their type I error rates are greater than the prespecified nominal level $\alpha = 0.05$ in most cases. Comparing the approximate and exact unconditional methods, the approximate unconditional method provides a good alternative to the exact unconditional method in terms of computing time (e.g., see the 6th and 7th columns in Table 3) and type I error rate when sample sizes are large. In contrast, the computing burden of the bootstrap-resampling method is heavier than that of the approximate unconditional method (e.g., see the last two columns in Table 3).
In this article, we concentrate on a three-arm non-inferiority trial with binary endpoints in which the margin is defined as a fraction of the unknown difference in response probabilities between reference and placebo. The corresponding hypothesis (i.e., $H_0: (\pi_T - \pi_P)/(\pi_R - \pi_P) \le \theta$ or $H_0: \pi_T - \theta\pi_R - (1 - \theta)\pi_P \le 0$) is considered since it is simple and only a single hypothesis is involved (e.g., see [6,9,14]). However, three-arm non-inferiority hypotheses with the margin defined as a prespecified difference between treatments have received considerable attention in recent years (e.g., see [5,7]). They can be generally classified as the union type hypotheses (i.e., $H_{U0}: \pi_R \ge h_P(\pi_P)$ or $\pi_R \ge h_T(\pi_T)$) or the intersection type hypotheses (i.e., $H_{U0}: \pi_R \ge h_P(\pi_P)$ and $\pi_R \ge h_T(\pi_T)$), where $h_P(\cdot)$ and $h_T(\cdot)$ are any functions [15]. For specific choices of $h_P(\cdot)$ and $h_T(\cdot)$, this includes, for example, hypotheses on the differences, the relative risks or the odds ratios of the proportions. While the union type hypotheses are suitable for showing both the superiority of the standard treatment as compared to placebo and the inferiority of the test treatment as compared to the standard treatment, the intersection type hypotheses are suitable for showing that the test treatment is as effective as the standard or placebo treatments. We are working on statistical inference for a three-arm non-inferiority trial with the margin being a prespecified difference between treatments when the primary endpoints are binary.

Table 3 Computing time (minutes) of the type I error rates for 11,340 configurations of $(\pi_P, \pi_R, \pi_T)$ together with three test statistics under five test methods

Conclusions
According to the aforementioned observations, we can draw the following conclusions. In terms of type I error rate and power, the approximate unconditional and bootstrap-resampling methods with the score test statistic are recommended for hypothesis testing when sample sizes are small in a three-arm non-inferiority trial. When computing time is considered together with type I error rate and power, the approximate unconditional method with the score test statistic behaves the best among the p-value calculation procedures and test statistics considered.