- Research article
- Open Access

# Testing non-inferiority of a new treatment in three-arm clinical trials with binary endpoints

*BMC Medical Research Methodology*
**volume 14**, Article number: 134 (2014)

## Abstract

### Background

A two-arm non-inferiority trial without a placebo is usually adopted, owing to ethical concerns, to demonstrate that an experimental treatment is not worse than a reference treatment by more than a small pre-specified *non-inferiority margin*. *Selection of the non-inferiority margin* and *establishment of assay sensitivity* are two major issues in the design, analysis and interpretation of two-arm non-inferiority trials. Alternatively, a three-arm non-inferiority clinical trial including a placebo is usually conducted to assess the assay sensitivity and internal validity of a trial. Recently, some large-sample approaches have been developed to assess the non-inferiority of a new treatment based on the three-arm trial design. However, these methods behave badly with small sample sizes in the three arms. This manuscript aims to develop some reliable small-sample methods to test three-arm non-inferiority.

### Methods

Saddlepoint approximation, exact and approximate unconditional, and bootstrap-resampling methods are developed to calculate p-values of the Wald-type, score and likelihood ratio tests. Simulation studies are conducted to evaluate their performance in terms of type I error rate and power.

### Results

Our empirical results show that the saddlepoint approximation method generally behaves better than the asymptotic method based on the Wald-type test statistic. For small sample sizes, approximate unconditional and bootstrap-resampling methods based on the score test statistic perform better in the sense that their corresponding type I error rates are generally closer to the prespecified nominal level than those of other test procedures.

### Conclusions

Both approximate unconditional and bootstrap-resampling test procedures based on the score test statistic are generally recommended for three-arm non-inferiority trials with binary outcomes.

## Background

The objective of a non-inferiority trial is to demonstrate that the efficacy of an experimental treatment is not inferior to that of a reference treatment by more than some pre-specified non-inferiority margin. Many authors have considered two-arm non-inferiority trials without a placebo, since the comparison between the experimental and reference treatments is direct and the potential ethical problems encountered in traditional placebo-controlled trials are avoided (for example, see Dunnett and Gent [1], Tango [2], and Tang et al. [3]). However, there are two major concerns for two-arm non-inferiority trials [4]. The first issue is the choice of the non-inferiority margin, which is either a clinically acceptable amount or a combination of statistical reasoning and clinical judgement. The other issue is the evaluation of assay sensitivity, which refers to the ability of a trial to differentiate an effective treatment from a less effective or ineffective treatment [5]. Without a placebo arm, the assay sensitivity of a trial is not demonstrable from the trial data, and one must rely on some external information (e.g., historical placebo trials) for the reference treatment [4]. Without assay sensitivity, any non-inferiority result from the comparison of the experimental and reference treatments becomes unconvincing. There are some indications where it is considered ethically acceptable to continue to randomize patients to placebo despite the fact that an effective treatment exists, and there is interest in seeing not only whether the new treatment works at all but also how it measures up to accepted therapy. In this case, a three-arm non-inferiority clinical trial including the experimental treatment, an active reference treatment and a placebo is usually conducted to assess the assay sensitivity and internal validity of a trial [6]. Indeed, three-arm trials are recommended in the guidelines of the ICH (The International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use) and EMEA/CPMP (European Medicines Agency/Committee for Proprietary Medicinal Products) as a useful approach to the assessment of assay sensitivity and internal validity (e.g., see [7]).

Statistical inference based on three-arm non-inferiority clinical trials with normally distributed outcomes has received considerable attention in recent years. For example, Koch and Tangen [8] and Pigeot et al. [9] considered the problem of three-arm non-inferiority testing for normally distributed endpoints with a common but unknown variance. Koti [10] presented a new approach for normally distributed endpoints based on the Fieller-Hinkley distribution. Hasler, Vonk and Hothorn [11] proposed the usage of the *t*-distribution in the presence of heteroscedasticity. Hida and Tango [7] proposed a test procedure for assessing the assay sensitivity with a pre-specified margin defined as a difference between treatments in the presence of homoscedasticity. Ghosh, Nathoo, Gönen and Tiwari [12] developed a Bayesian approach in the presence of heteroscedasticity by incorporating both parametric and semi-parametric models. Gamalo, Muthukumarana, Ghosh and Tiwari [13] extended the existing generalized p-value approach for assessing the non-inferiority of a new treatment in a three-arm trial.

Recently, some statistical methods have also been developed for three-arm non-inferiority testing with binary endpoints. For example, Tang and Tang [14] proposed two asymptotic approaches for testing three-arm non-inferiority via rate difference based on Wald-type and score test statistics. Kieser and Friede (2007) revisited the performance of Tang and Tang’s [14] asymptotic test statistics via simulation studies and derived approximate sample size formulae for achieving the desired power. Munk, Mielke, Skipka and Freitag [15] developed likelihood ratio tests. Li and Gao [4] used the closed testing principle to establish the hierarchical testing procedure and proposed a group sequential type design. Liu, Tzeng and Tsou [16] presented a three-step testing procedure and derived an optimal sample size allocation rule in an ethical and reliable manner that minimizes the total sample size.

All aforementioned approaches for testing non-inferiority of a new treatment in a three-arm clinical trial with binary endpoints are based on large-sample theory, and their accuracy has long been suspected and criticized when sample sizes are small or the data structure is sparse. To the best of our knowledge, limited work has been done to address these issues. Motivated by Jensen [17], we derive saddlepoint approximations to the cumulative distribution functions of the Wald-type, score and likelihood ratio test statistics. Inspired by Tang and Tang [18], we also propose exact unconditional, approximate unconditional and bootstrap-resampling *p*-value calculation procedures for testing three-arm non-inferiority with small sample sizes.

The rest of this article is organized as follows. We first review three test statistics for assessing non-inferiority of a new treatment in three-arm clinical trials with binary endpoints. We also propose saddlepoint approximation, exact and approximate unconditional, and bootstrap-resampling approaches for calculating *p*-values. Simulation studies are conducted to investigate the performance of all test statistics based on different *p*-value calculation approaches in terms of type I error rate and power. An example is analyzed to demonstrate our methodologies. Finally, we discuss the performance of our proposed methodologies and present some conclusions.

## Methods

### Model

Consider a clinical trial with the test (T), reference (R) and placebo (P) treatments, and assume that their primary clinical outcomes $X_T$, $X_R$ and $X_P$ are independent and binomially distributed as $X_T \sim \text{Bin}(n_T, \pi_T)$, $X_R \sim \text{Bin}(n_R, \pi_R)$ and $X_P \sim \text{Bin}(n_P, \pi_P)$, respectively. Here, $X_T$, $X_R$ and $X_P$ are the numbers of responses in groups T, R and P, respectively; $\pi_T$, $\pi_R$ and $\pi_P$ represent their corresponding response probabilities, with a higher probability indicating a more favorable outcome; and $n_T$, $n_R$ and $n_P$ denote their corresponding sample sizes. Thus, the joint probability density function of $(x_T, x_R, x_P)$ is given by

$$f(x_T, x_R, x_P \mid \pi_T, \pi_R, \pi_P) = \prod_{k \in \{T, R, P\}} \binom{n_k}{x_k} \pi_k^{x_k} (1 - \pi_k)^{n_k - x_k}. \tag{2.1}$$

It can be easily shown from Equation (2.1) that the maximum likelihood estimates (MLEs) of $\pi_T$, $\pi_R$ and $\pi_P$ are given by $\widehat{\pi}_T = x_T/n_T$, $\widehat{\pi}_R = x_R/n_R$ and $\widehat{\pi}_P = x_P/n_P$, respectively.
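The model can be illustrated with a minimal Python sketch. The function names `joint_pmf` and `mle` are hypothetical and not part of the article; the code only multiplies the three binomial probabilities and returns the observed proportions as the MLEs, with all triples ordered as (T, R, P).

```python
from math import comb


def joint_pmf(x, n, pi):
    """Joint probability of x = (x_T, x_R, x_P) for three independent binomial arms."""
    prob = 1.0
    for x_k, n_k, pi_k in zip(x, n, pi):
        prob *= comb(n_k, x_k) * pi_k ** x_k * (1.0 - pi_k) ** (n_k - x_k)
    return prob


def mle(x, n):
    """Unrestricted MLEs: the observed response proportions in each arm."""
    return tuple(x_k / n_k for x_k, n_k in zip(x, n))


if __name__ == "__main__":
    # Illustrative counts only (assumed values, not taken from the article).
    print(mle((9, 6, 3), (10, 10, 10)))
    print(joint_pmf((9, 6, 3), (10, 10, 10), (0.9, 0.6, 0.3)))
```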

### Test statistics

Following Hida and Tango [7], to test the non-inferiority of the experimental treatment to the reference with assay sensitivity in a three-arm trial, we have to simultaneously demonstrate (i) the superiority of the experimental treatment to the placebo, (ii) the non-inferiority of the experimental treatment to the reference with a non-inferiority margin $\Delta > 0$, and (iii) the superiority of the reference treatment to the placebo by more than $\Delta$. That is, $\pi_T$, $\pi_R$ and $\pi_P$ must satisfy the inequalities $\pi_P < \pi_R - \Delta < \pi_T$, which can be written as the following two hypotheses:

$$H_0: \pi_T \le \pi_R - \Delta \ \text{ versus } \ H_1: \pi_T > \pi_R - \Delta,$$

$$K_0: \pi_R - \Delta \le \pi_P \ \text{ versus } \ K_1: \pi_R - \Delta > \pi_P.$$

Similar to Pigeot et al. [9], we take the margin $\Delta$ as a fraction $f$ of the effect size of the reference treatment, i.e., $\Delta = f(\pi_R - \pi_P)$. Generally, one can select $f = 1/2$ or $1/3$ [14]. Thus, the second hypothesis can be expressed as $K_0: \pi_R \le \pi_P$ versus $K_1: \pi_R > \pi_P$. If $K_0$ is rejected, letting $f = 1 - \theta$ yields the following non-inferiority hypothesis:

$$H_0: \frac{\pi_T - \pi_P}{\pi_R - \pi_P} \le \theta \ \text{ versus } \ H_1: \frac{\pi_T - \pi_P}{\pi_R - \pi_P} > \theta, \tag{2.2}$$

where $\theta \in (0, 1)$ is a fixed retention fraction [8]. Rejecting $H_0$ implies that the test treatment preserves at least $100\theta\%$ of the efficacy of the reference treatment compared to placebo [19]. Similar to Tang and Tang [14], we only consider hypothesis $H_0$ and assume that $K_0$ is rejected at some pre-given significance level. Thus, the non-inferiority hypothesis (2.2) can be rewritten as

$$H_0: \pi_T - \theta\pi_R - (1 - \theta)\pi_P \le 0 \ \text{ versus } \ H_1: \pi_T - \theta\pi_R - (1 - \theta)\pi_P > 0. \tag{2.3}$$

Let $\psi = \pi_T - \theta\pi_R - (1 - \theta)\pi_P$. The non-inferiority hypothesis (2.3) can be expressed as

$$H_0: \psi \le 0 \ \text{ versus } \ H_1: \psi > 0. \tag{2.4}$$

The restricted maximum likelihood estimates (RMLEs) of $\pi_T$, $\pi_R$ and $\pi_P$, denoted by $\tilde{\pi}_T$, $\tilde{\pi}_R$ and $\tilde{\pi}_P$, can be computed as follows. If the MLEs $\widehat{\pi}_T$, $\widehat{\pi}_R$ and $\widehat{\pi}_P$ satisfy the conditions $\widehat{\pi}_T - \theta\widehat{\pi}_R - (1 - \theta)\widehat{\pi}_P \le 0$ and $\widehat{\pi}_R - \widehat{\pi}_P > 0$, we take $\tilde{\pi}_T = \widehat{\pi}_T$, $\tilde{\pi}_R = \widehat{\pi}_R$ and $\tilde{\pi}_P = \widehat{\pi}_P$; otherwise, the RMLEs can be calculated by setting $\pi_T = \theta\pi_R + (1 - \theta)\pi_P$ in the likelihood function (2.1) and maximizing it with respect to $\pi_R$ and $\pi_P$. For the latter, it follows from Equation (2.1) that the RMLEs of $\pi_R$ and $\pi_P$ can be obtained by simultaneously solving the following equations in the parameter space $\Theta = \{(\pi_P, \pi_R): 0 \le \pi_P < \pi_R \le 1\}$:

$$\frac{x_R}{\pi_R} - \frac{n_R - x_R}{1 - \pi_R} + \theta\left\{\frac{x_T}{\theta\pi_R + (1 - \theta)\pi_P} - \frac{n_T - x_T}{1 - \theta\pi_R - (1 - \theta)\pi_P}\right\} = 0,$$

$$\frac{x_P}{\pi_P} - \frac{n_P - x_P}{1 - \pi_P} + (1 - \theta)\left\{\frac{x_T}{\theta\pi_R + (1 - \theta)\pi_P} - \frac{n_T - x_T}{1 - \theta\pi_R - (1 - \theta)\pi_P}\right\} = 0.$$

It is possible that no point $(\pi_P, \pi_R) \in \Theta$ satisfies the above equations, which implies that the likelihood function given in Equation (2.1) attains its maximum on the boundary of the parameter space $\Theta$.
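The RMLE computation can be sketched numerically. The following Python sketch is an illustration under stated assumptions rather than the article's procedure: instead of solving the likelihood equations, it maximizes the restricted log-likelihood directly with a nested ternary search, which is valid because the log-likelihood is concave in $(\pi_R, \pi_P)$ once $\pi_T = \theta\pi_R + (1-\theta)\pi_P$ is substituted. The names `rmle`, `argmax_concave` and `EPS` are hypothetical. Restricting the inner search to $\pi_P \le \pi_R$ lets a boundary maximum be returned when no interior solution exists.

```python
from math import log

EPS = 1e-9  # assumed guard that keeps probabilities strictly inside (0, 1)


def loglik(x, n, pi):
    """Binomial log-likelihood without the constant term, triples ordered (T, R, P)."""
    return sum(x_k * log(p) + (n_k - x_k) * log(1.0 - p)
               for x_k, n_k, p in zip(x, n, pi))


def argmax_concave(f, lo, hi, iters=100):
    """Ternary search for the maximizer of a concave function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


def rmle(x, n, theta):
    """Restricted MLE (pi_T, pi_R, pi_P) under H0: psi <= 0 with pi_P <= pi_R."""
    pt, pr, pp = (x_k / n_k for x_k, n_k in zip(x, n))
    if pt - theta * pr - (1.0 - theta) * pp <= 0.0 and pr > pp:
        return (pt, pr, pp)  # the MLE already satisfies the null constraints

    def profile(pi_r, want_arg=False):
        # Best pi_P for a fixed pi_R on the boundary psi = 0.
        def f(pi_p):
            pi_t = theta * pi_r + (1.0 - theta) * pi_p
            return loglik(x, n, (pi_t, pi_r, pi_p))
        pi_p = argmax_concave(f, EPS, pi_r)
        return pi_p if want_arg else f(pi_p)

    pi_r = argmax_concave(profile, 2.0 * EPS, 1.0 - EPS)
    pi_p = profile(pi_r, want_arg=True)
    return (theta * pi_r + (1.0 - theta) * pi_p, pi_r, pi_p)


if __name__ == "__main__":
    # Illustrative counts only (assumed values, not taken from the article).
    print(rmle((9, 6, 3), (10, 10, 10), theta=0.5))
```

A Newton-type solver applied to the likelihood equations would be faster; the derivative-free search is used here only because it needs no starting values and handles the boundary case without special treatment.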

Following Tang and Tang [14], $\psi$ can be estimated by $\widehat{\psi} = \widehat{\pi}_T - \theta\widehat{\pi}_R - (1 - \theta)\widehat{\pi}_P$, and its variance is given by $\text{var}(\widehat{\psi}) = \pi_T(1 - \pi_T)/n_T + \theta^2\pi_R(1 - \pi_R)/n_R + (1 - \theta)^2\pi_P(1 - \pi_P)/n_P$, which can be estimated by

$$\sigma^2(\breve{\pi}) \triangleq \widehat{\text{var}}(\widehat{\psi}) = \frac{\breve{\pi}_T(1 - \breve{\pi}_T)}{n_T} + \frac{\theta^2\breve{\pi}_R(1 - \breve{\pi}_R)}{n_R} + \frac{(1 - \theta)^2\breve{\pi}_P(1 - \breve{\pi}_P)}{n_P},$$

where $\breve{\pi} = (\breve{\pi}_T, \breve{\pi}_R, \breve{\pi}_P)$ is some appropriate estimate of $\pi = (\pi_T, \pi_R, \pi_P)$, for example, the MLE $\widehat{\pi} = (\widehat{\pi}_T, \widehat{\pi}_R, \widehat{\pi}_P)$ or the RMLE $\tilde{\pi} = (\tilde{\pi}_T, \tilde{\pi}_R, \tilde{\pi}_P)$. Thus, the statistics for testing hypothesis (2.4) are given by

$$T_W = \frac{\widehat{\psi}}{\sigma(\widehat{\pi})} \quad \text{and} \quad T_R = \frac{\widehat{\psi}}{\sigma(\tilde{\pi})},$$

which are asymptotically distributed as the standard normal distribution under $H_0$ when $n_T$, $n_R$ and $n_P$ are sufficiently large. Hence, non-inferiority can be claimed if $T_W > z_{1-\alpha}$ (or $T_R > z_{1-\alpha}$), where $z_{1-\alpha}$ is the $(1 - \alpha)$-quantile of the standard normal distribution. When $\pi_P = 0$, $T_W$ is the Wald-type statistic proposed in Blackwelder [20] and $T_R$ is the test statistic given by Farrington and Manning [21] for two-arm non-inferiority trials.

The signed root of the likelihood ratio statistic for testing hypothesis (2.4) is given by

$$T_L = \text{sgn}(\widehat{\psi})\sqrt{2\{\ell(\widehat{\pi}) - \ell(\tilde{\pi})\}},$$

which is asymptotically distributed as the standard normal distribution under $H_0$ when $n_T$, $n_R$ and $n_P$ are sufficiently large, where $\ell(\pi) = x_T\log(\pi_T) + (n_T - x_T)\log(1 - \pi_T) + x_R\log(\pi_R) + (n_R - x_R)\log(1 - \pi_R) + x_P\log(\pi_P) + (n_P - x_P)\log(1 - \pi_P) + \mathcal{C}$ with $\mathcal{C} = \log\{n_T!\,n_R!\,n_P!\} - \log\{x_T!\,x_R!\,x_P!\,(n_T - x_T)!\,(n_R - x_R)!\,(n_P - x_P)!\}$. Thus, non-inferiority can be claimed if $T_L > z_{1-\alpha}$.
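The three statistics can be computed in a few lines once the RMLE is available. The sketch below is a hedged illustration: the function `three_statistics` and its argument `pi_tilde` are hypothetical names, and the RMLE is assumed to be supplied by the caller. Tables for which the estimated variance is zero are not handled here and would raise a division error.

```python
from math import copysign, log, sqrt


def psi(pi, theta):
    """psi = pi_T - theta * pi_R - (1 - theta) * pi_P for pi ordered (T, R, P)."""
    pi_t, pi_r, pi_p = pi
    return pi_t - theta * pi_r - (1.0 - theta) * pi_p


def sigma(pi, n, theta):
    """Estimated standard deviation of psi_hat, evaluated at pi."""
    weights = (1.0, theta, 1.0 - theta)
    return sqrt(sum(w * w * p * (1.0 - p) / n_k
                    for w, p, n_k in zip(weights, pi, n)))


def loglik(x, n, pi):
    """Log-likelihood without the constant C; 0 * log(0) is treated as 0."""
    total = 0.0
    for x_k, n_k, p in zip(x, n, pi):
        if x_k > 0:
            total += x_k * log(p)
        if n_k - x_k > 0:
            total += (n_k - x_k) * log(1.0 - p)
    return total


def three_statistics(x, n, theta, pi_tilde):
    """Return (T_W, T_R, T_L) given counts x and a caller-supplied RMLE pi_tilde."""
    pi_hat = tuple(x_k / n_k for x_k, n_k in zip(x, n))
    psi_hat = psi(pi_hat, theta)
    t_w = psi_hat / sigma(pi_hat, n, theta)    # Wald-type: variance at the MLE
    t_r = psi_hat / sigma(pi_tilde, n, theta)  # score-type: variance at the RMLE
    lr = max(0.0, 2.0 * (loglik(x, n, pi_hat) - loglik(x, n, pi_tilde)))
    t_l = copysign(sqrt(lr), psi_hat)          # signed root of the likelihood ratio
    return t_w, t_r, t_l


if __name__ == "__main__":
    # Illustrative inputs only; pi_tilde is an assumed point on psi = 0, not a fitted RMLE.
    print(three_statistics((9, 6, 3), (10, 10, 10), 0.5, (0.6, 0.75, 0.45)))
```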

### *p*-value calculation methods

The non-inferiority hypothesis (2.2) can be claimed via the *p*-value method with the rule: $H_0$ is rejected if the *p*-value is less than or equal to the prespecified significance level $\alpha$. In what follows, we introduce five approaches for calculating *p*-values based on $t_j^o$, which is the observed value of test statistic $T_j$ ($j = W, R, L$) for the observed value $(x_T^o, x_R^o, x_P^o)$ of $(X_T, X_R, X_P)$.

#### (1) Asymptotic method (AM)

It follows from the above arguments that all statistics $T_j$ ($j = W, R, L$) asymptotically follow the standard normal distribution under the null hypothesis $H_0: \psi \le 0$. Thus, the asymptotic *p*-value for testing hypothesis (2.2) via statistic $T_j$ ($j = W, R, L$) based on $(x_T^o, x_R^o, x_P^o)$ can be calculated by $p_j^{AM}(x_T^o, x_R^o, x_P^o) = P(T_j \ge t_j^o \mid H_0) = 1 - \Phi(t_j^o)$, where $\Phi(\cdot)$ is the standard normal distribution function.

The above asymptotic approach for calculating the *p*-value of testing hypothesis (2.2) via statistic $T_j$ ($j = W, R, L$) is established under large-sample theory. Its accuracy has long been suspected and criticized, especially when $n_T$, $n_R$ and/or $n_P$ are small, since the skewness of the underlying binomial distributions is not taken into consideration. Some higher-order corrections such as the saddlepoint approximation [17] have been proposed to improve the accuracy of the normal approximation. In what follows, we derive saddlepoint approximations to the distributions of the three test statistics.

#### (2) Saddlepoint approximation method (SAM)

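Since $X_T$, $X_R$ and $X_P$ are independent and $X_i \sim \text{Bin}(n_i, \pi_i)$ ($i = T, R, P$), the moment generating function of $\widehat{\psi}$ is given by

$$M(t) = E\{\exp(t\widehat{\psi})\} = \left(1 - \pi_T + \pi_T e^{t/n_T}\right)^{n_T}\left(1 - \pi_R + \pi_R e^{-\theta t/n_R}\right)^{n_R}\left(1 - \pi_P + \pi_P e^{-(1 - \theta)t/n_P}\right)^{n_P},$$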

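with the cumulant generating function being

$$K(t) = \log M(t) = n_T\log\left(1 - \pi_T + \pi_T e^{t/n_T}\right) + n_R\log\left(1 - \pi_R + \pi_R e^{-\theta t/n_R}\right) + n_P\log\left(1 - \pi_P + \pi_P e^{-(1 - \theta)t/n_P}\right),$$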

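where −1≤*t*≤1. Thus, the first two derivatives of the cumulant generating function $K(t)$ are given by

$$\dot{K}(t) = \frac{\pi_T e^{t/n_T}}{1 - \pi_T + \pi_T e^{t/n_T}} - \frac{\theta\pi_R e^{-\theta t/n_R}}{1 - \pi_R + \pi_R e^{-\theta t/n_R}} - \frac{(1 - \theta)\pi_P e^{-(1 - \theta)t/n_P}}{1 - \pi_P + \pi_P e^{-(1 - \theta)t/n_P}}$$

and

$$\ddot{K}(t) = \frac{\pi_T(1 - \pi_T)e^{t/n_T}}{n_T\left(1 - \pi_T + \pi_T e^{t/n_T}\right)^2} + \frac{\theta^2\pi_R(1 - \pi_R)e^{-\theta t/n_R}}{n_R\left(1 - \pi_R + \pi_R e^{-\theta t/n_R}\right)^2} + \frac{(1 - \theta)^2\pi_P(1 - \pi_P)e^{-(1 - \theta)t/n_P}}{n_P\left(1 - \pi_P + \pi_P e^{-(1 - \theta)t/n_P}\right)^2},$$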

respectively. To obtain the saddlepoint approximation to $P(\widehat{\psi} \ge b)$, we need to solve the saddlepoint equation $\dot{K}(t) = b$, whose unique solution is denoted by $\hat{t}$. Following Jing and Robinson [22], the saddlepoint approximation to the cumulative distribution function of statistic $\widehat{\psi}$ is given by

where $\omega = \text{sgn}(\hat{t})\sqrt{2\{\hat{t}b - K(\hat{t})\}}$ and $\upsilon = \hat{t}\sqrt{\ddot{K}(\hat{t})}$. Thus, the saddlepoint approximation to $P(T_j \ge t_j^o \mid H_0)$ ($j = W, R, L$) is given by

where $\omega_j^o = \text{sgn}(\widehat{A}_j)\sqrt{2\{\widehat{A}_j t_j^o - K(\widehat{A}_j/B_j)\}}$ and $\upsilon_j^o = \widehat{A}_j B_j^{-1}\sqrt{\ddot{K}(\widehat{A}_j/B_j)}$, with $\widehat{A}_j$ being the unique solution to the equation $\dot{K}(\widehat{A}_j/B_j) = t_j^o B_j$ for $j = W, R$, $B_W = \sigma(\widehat{\pi})$ and $B_R = \sigma(\tilde{\pi})$; and $\omega_L^o = \text{sgn}(\widehat{\psi})\sqrt{2\{\ell(\widehat{\pi}) - \ell(\tilde{\pi})\}}$ and $\upsilon_L^o = \widehat{\psi}\sqrt{n_T\mathscr{H}_1/\mathscr{H}_2}$, with $\mathscr{H}_1 = n_T n_R n_P\{\theta\widehat{\pi}_R + (1 - \theta)\widehat{\pi}_P\}\{1 - \theta\widehat{\pi}_R - (1 - \theta)\widehat{\pi}_P\}\widehat{\pi}_R(1 - \widehat{\pi}_R)\widehat{\pi}_P(1 - \widehat{\pi}_P)$ and $\mathscr{H}_2 = n_R n_P\tilde{\pi}_R(1 - \tilde{\pi}_R)\tilde{\pi}_P(1 - \tilde{\pi}_P)$.
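The saddlepoint steps for $\widehat{\psi}$ can be sketched as follows. This is only an illustration under explicit assumptions: the tail formula used is the Lugannani-Rice form $1 - \Phi(\omega) + \phi(\omega)(1/\upsilon - 1/\omega)$, which is a standard way of combining $\omega$ and $\upsilon$ and may differ from the expression the article takes from Jing and Robinson [22]; no lattice or continuity correction is applied; and the names `cgf` and `saddlepoint_tail` are hypothetical. The saddlepoint equation is solved by bisection, which works because $\dot{K}$ is strictly increasing.

```python
import math


def cgf(t, n, probs, theta):
    """Return K(t), K'(t), K''(t) for psi_hat with arms ordered (T, R, P)."""
    coefs = (1.0, -theta, -(1.0 - theta))  # psi_hat = sum_k coefs_k * X_k / n_k
    k0 = k1 = k2 = 0.0
    for a, n_k, p in zip(coefs, n, probs):
        e = math.exp(a * t / n_k)
        denom = 1.0 - p + p * e
        k0 += n_k * math.log(denom)
        k1 += a * p * e / denom
        k2 += a * a * p * (1.0 - p) * e / (n_k * denom * denom)
    return k0, k1, k2


def saddlepoint_tail(b, n, probs, theta):
    """Lugannani-Rice approximation (assumed form) to P(psi_hat >= b), -1 < b < 1."""
    lo, hi = -1.0, 1.0
    while cgf(lo, n, probs, theta)[1] > b:   # expand until K'(lo) <= b
        lo *= 2.0
    while cgf(hi, n, probs, theta)[1] < b:   # expand until K'(hi) >= b
        hi *= 2.0
    for _ in range(200):                     # bisection on the increasing K'
        mid = 0.5 * (lo + hi)
        if cgf(mid, n, probs, theta)[1] < b:
            lo = mid
        else:
            hi = mid
    t_hat = 0.5 * (lo + hi)
    if abs(t_hat) < 1e-8:
        raise ValueError("b equals the mean of psi_hat; the formula is singular there")
    k0, _, k2 = cgf(t_hat, n, probs, theta)
    omega = math.copysign(math.sqrt(max(0.0, 2.0 * (t_hat * b - k0))), t_hat)
    upsilon = t_hat * math.sqrt(k2)
    phi = math.exp(-0.5 * omega * omega) / math.sqrt(2.0 * math.pi)
    upper_normal = 0.5 * math.erfc(omega / math.sqrt(2.0))  # 1 - Phi(omega)
    return upper_normal + phi * (1.0 / upsilon - 1.0 / omega)


if __name__ == "__main__":
    # Illustrative setting only (assumed values, not taken from the article).
    print(saddlepoint_tail(0.2, (20, 20, 20), (0.5, 0.5, 0.5), theta=0.8))
```

For the statistics $T_W$ and $T_R$, the same routine would be called with $b = t_j^o B_j$, since $T_j \ge t_j^o$ is equivalent to $\widehat{\psi} \ge t_j^o B_j$ when $B_j$ is treated as fixed.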

#### (3) Exact unconditional method (EUM)

When the sample sizes $n_T$, $n_R$ and $n_P$ are small, asymptotic methods may yield inflated type I error rates, and their exact versions may provide a reliable alternative. Under $H_0: \psi \le 0$ with $\pi_P < \pi_R$, the parameters $\pi_R$ and $\pi_P$ must belong to the constrained parameter space $\Omega = \{(\pi_P, \pi_R): 0 \le \pi_P < \pi_R \le 1 \text{ if } -\theta\pi_R < \psi < 0;\ (-\psi - \theta\pi_R)/(1 - \theta) \le \pi_P < \pi_R < 1 \text{ if } -\pi_R < \psi \le -\theta\pi_R;\ \text{and the empty set otherwise}\}$. Under the null hypothesis, the probability density function (2.1) can be re-expressed through $\pi_T = \psi + \theta\pi_R + (1 - \theta)\pi_P$, with $\pi_R$, $\pi_P$ and $\psi$ being nuisance parameters. These nuisance parameters can be eliminated by maximizing the null likelihood over the complete domain $\Omega$. Similar to Tang and Tang [18], the exact unconditional *p*-value for testing $H_0: \psi \le 0$ via statistic $T_j$ ($j = W, R, L$) based on $(x_T^o, x_R^o, x_P^o)$ is defined as

where

and $I\{T_j(x_T, x_R, x_P) \ge t_j^o\}$ is 1 if $T_j(x_T, x_R, x_P) \ge t_j^o$ and 0 otherwise.

#### (4) Approximate unconditional method (AUM)

According to Tang and Tang [18] and Tang, Tang and Rosner [23], the exact unconditional test is always conservative, i.e., its corresponding type I error rate is always less than or equal to the prespecified significance level. Following Tang and Tang [18], the nuisance parameters can instead be eliminated by evaluating them at their corresponding RMLEs under $\psi = 0$. The approximate unconditional *p*-value for testing $H_0: \psi \le 0$ via statistic $T_j$ ($j = W, R, L$) based on $(x_T^o, x_R^o, x_P^o)$ can be defined as $p_j^{AU}(x_T^o, x_R^o, x_P^o) = P(T_j \ge t_j^o \mid \psi = 0, \pi_R = \tilde{\pi}_R, \pi_P = \tilde{\pi}_P)$.
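The two unconditional *p*-values can be sketched by full enumeration of the sample space. The Python code below is a simplified illustration and rests on several assumptions that are not in the article: only the Wald-type statistic is used; tables with zero estimated variance are assigned $\pm\infty$ (or 0 when $\widehat{\psi} = 0$); the supremum for the exact method is approximated on a coarse grid restricted to the boundary $\psi = 0$; and the RMLEs for the approximate method are supplied by the caller. All function names are hypothetical.

```python
from itertools import product
from math import comb, inf, sqrt


def wald(x, n, theta):
    """Wald-type statistic T_W for counts ordered (T, R, P)."""
    p = [x_k / n_k for x_k, n_k in zip(x, n)]
    psi_hat = p[0] - theta * p[1] - (1.0 - theta) * p[2]
    w = (1.0, theta, 1.0 - theta)
    var = sum(w_k * w_k * p_k * (1.0 - p_k) / n_k for w_k, p_k, n_k in zip(w, p, n))
    if var > 0.0:
        return psi_hat / sqrt(var)
    # Degenerate tables (assumed convention): statistic pushed to +/- infinity.
    if abs(psi_hat) < 1e-12:
        return 0.0
    return inf if psi_hat > 0.0 else -inf


def pmf(x, n, pi):
    """Joint binomial probability of the table x."""
    prob = 1.0
    for x_k, n_k, pi_k in zip(x, n, pi):
        prob *= comb(n_k, x_k) * pi_k ** x_k * (1.0 - pi_k) ** (n_k - x_k)
    return prob


def tail_prob(x_obs, n, theta, pi_r, pi_p):
    """P(T_W >= t_obs | psi = 0, pi_R, pi_P), summing over every possible table."""
    pi = (theta * pi_r + (1.0 - theta) * pi_p, pi_r, pi_p)
    t_obs = wald(x_obs, n, theta)
    tables = product(*(range(n_k + 1) for n_k in n))
    return sum(pmf(x, n, pi) for x in tables if wald(x, n, theta) >= t_obs)


def exact_unconditional_p(x_obs, n, theta, grid=20):
    """Grid approximation to the supremum over pi_P < pi_R on psi = 0."""
    pts = [i / grid for i in range(grid + 1)]
    return max(tail_prob(x_obs, n, theta, r, p) for r in pts for p in pts if p < r)


def approx_unconditional_p(x_obs, n, theta, pi_r_tilde, pi_p_tilde):
    """Plug-in version: nuisance parameters fixed at caller-supplied RMLEs."""
    return tail_prob(x_obs, n, theta, pi_r_tilde, pi_p_tilde)


if __name__ == "__main__":
    # Illustrative inputs only; 0.6 and 0.3 stand in for fitted RMLEs.
    print(exact_unconditional_p((3, 2, 1), (4, 4, 4), 0.5))
    print(approx_unconditional_p((3, 2, 1), (4, 4, 4), 0.5, 0.6, 0.3))
```

The enumeration visits $(n_T + 1)(n_R + 1)(n_P + 1)$ tables for every grid point, which shows why the exact method becomes expensive as the sample sizes grow while the plug-in version needs only a single pass.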

#### (5) Bootstrap-resampling method (BTM)

Hypothesis testing based on the bootstrap-resampling method is usually recommended when the sample sizes $n_T$, $n_R$ and $n_P$ are small [24] or the data structure is sparse (e.g., $x_T$, $x_R$ or $x_P$ is close to zero or to $n_T$, $n_R$ or $n_P$, respectively). Given the observation $(x_T^o, x_R^o, x_P^o)$, we compute the RMLEs $\tilde{\pi}_T$, $\tilde{\pi}_R$ and $\tilde{\pi}_P$ of parameters $\pi_T$, $\pi_R$ and $\pi_P$, and calculate the observed value $t_j^o$ of statistic $T_j$ ($j = W, R, L$). Based on the RMLEs $\tilde{\pi}_T$, $\tilde{\pi}_R$ and $\tilde{\pi}_P$, we generate $B$ bootstrap samples $\{(x_T^b, x_R^b, x_P^b): b = 1, \dots, B\}$ from the distribution $x_k^b \sim \text{Bin}(n_k, \tilde{\pi}_k)$ for $k = T, R$ and $P$. For each of the $B$ bootstrap samples, we compute the observed value $t_j^b$ of statistic $T_j$ ($j = W, R, L$). Hence, an approximate *p*-value for testing $H_0: \psi \le 0$ via statistic $T_j$ based on $(x_T^o, x_R^o, x_P^o)$ is given by $\widehat{p}_j^{BT}(x_T^o, x_R^o, x_P^o) = \frac{1}{B}\sum_{b=1}^{B} I(t_j^b \ge t_j^o)$.
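The resampling steps can be sketched in Python as follows. The sketch makes assumptions beyond the text: it uses the Wald-type statistic only, assigns $\pm\infty$ to tables with zero estimated variance, takes the RMLE `pi_tilde` as an input rather than computing it, and draws binomial counts as sums of Bernoulli trials so that only the standard library is needed. The names `bootstrap_p` and `wald`, and the defaults `B=2000` and `seed=0`, are hypothetical.

```python
import random
from math import inf, sqrt


def wald(x, n, theta):
    """Wald-type statistic T_W for counts ordered (T, R, P)."""
    p = [x_k / n_k for x_k, n_k in zip(x, n)]
    psi_hat = p[0] - theta * p[1] - (1.0 - theta) * p[2]
    w = (1.0, theta, 1.0 - theta)
    var = sum(w_k * w_k * p_k * (1.0 - p_k) / n_k for w_k, p_k, n_k in zip(w, p, n))
    if var > 0.0:
        return psi_hat / sqrt(var)
    # Degenerate tables (assumed convention): statistic pushed to +/- infinity.
    if abs(psi_hat) < 1e-12:
        return 0.0
    return inf if psi_hat > 0.0 else -inf


def bootstrap_p(x_obs, n, theta, pi_tilde, B=2000, seed=0):
    """Proportion of bootstrap statistics at least as large as the observed one."""
    rng = random.Random(seed)
    t_obs = wald(x_obs, n, theta)
    hits = 0
    for _ in range(B):
        # One bootstrap table: x_k^b ~ Bin(n_k, pi_tilde_k) for k = T, R, P.
        x_b = tuple(sum(rng.random() < p for _ in range(n_k))
                    for n_k, p in zip(n, pi_tilde))
        if wald(x_b, n, theta) >= t_obs:
            hits += 1
    return hits / B


if __name__ == "__main__":
    # Illustrative inputs only; pi_tilde is an assumed point on psi = 0, not a fitted RMLE.
    print(bootstrap_p((9, 6, 3), (10, 10, 10), 0.5, (0.6, 0.75, 0.45)))
```

Fixing the seed makes the estimate reproducible; the Monte Carlo error of the estimated *p*-value shrinks at rate $1/\sqrt{B}$.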

For any given observation $(x_T^o, x_R^o, x_P^o)$, test statistic $T_j$ ($j = W, R, L$) and *p*-value calculation method, we reject the null hypothesis $H_0$ at the significance level $\alpha$ if $p_j^k(x_T^o, x_R^o, x_P^o) \le \alpha$ for $k$ = AM, SA, EU, AU and BT.

### Simulation study

Simulation studies are conducted to investigate the performance of the various test statistics together with the five *p*-value calculation methods in small-sample designs in terms of type I error rate and power. We take $n = 30$ and 60, where $n = n_P + n_R + n_T$, with the allocation ratios $\lambda_P : \lambda_R : \lambda_T = 1 : n_R/n_P : n_T/n_P$ taken to be 1:1:1, 1:2:2 and 1:2:3. For each $(n_P, n_R, n_T)$, we consider the following probability settings [19]: $\pi_P = 0.05, 0.10, 0.15, \dots, 0.50$, $\pi_R = \pi_P + 0.05, \pi_P + 0.10, \dots, 0.95$, and $\pi_T = \theta\pi_R + (1 - \theta)\pi_P$, which corresponds to a total of 11,340 configurations of $(\pi_P, \pi_R, \pi_T)$, and the following two non-inferiority margins: $\theta = 0.6$ and 0.8. The nominal level is taken to be $\alpha = 0.05$. For given values of $n$ and the allocation ratio $\lambda_P : \lambda_R : \lambda_T$, $n_k$ is given by $n_k = n\lambda_k/(\lambda_P + \lambda_R + \lambda_T)$ for $k = P, R$ and $T$. Thus, given $n$, the allocation ratio and $(\pi_P, \pi_R, \pi_T)$, the type I error rate for testing hypothesis $H_0: \psi \le 0$ versus $H_1: \psi > 0$ via test statistic $T_j$ ($j = W, R, L$) at the significance level $\alpha$ is calculated by

$$\sum_{x_T^o = 0}^{n_T}\sum_{x_R^o = 0}^{n_R}\sum_{x_P^o = 0}^{n_P} f(x_T^o, x_R^o, x_P^o \mid \pi_T, \pi_R, \pi_P, H_0)\, I\{p_j^k(x_T^o, x_R^o, x_P^o) \le \alpha\}$$

for $k$ = AM, SAM, EUM, AUM and BTM, whilst the corresponding power can be evaluated by replacing $H_0$ in $f(x_T^o, x_R^o, x_P^o \mid \pi_T, \pi_R, \pi_P, H_0)$ by $H_1$.
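The type I error calculation can be sketched for one configuration as follows. The sketch is an illustration under assumptions: it covers only the asymptotic method with the Wald-type statistic, assigns $\pm\infty$ to tables with zero estimated variance, and uses the hypothetical names `allocate` and `type_one_error`. Sample sizes are returned in the order (T, R, P) used by the other helper functions.

```python
from itertools import product
from math import comb, erfc, inf, sqrt


def wald(x, n, theta):
    """Wald-type statistic T_W for counts ordered (T, R, P)."""
    p = [x_k / n_k for x_k, n_k in zip(x, n)]
    psi_hat = p[0] - theta * p[1] - (1.0 - theta) * p[2]
    w = (1.0, theta, 1.0 - theta)
    var = sum(w_k * w_k * p_k * (1.0 - p_k) / n_k for w_k, p_k, n_k in zip(w, p, n))
    if var > 0.0:
        return psi_hat / sqrt(var)
    # Degenerate tables (assumed convention): statistic pushed to +/- infinity.
    if abs(psi_hat) < 1e-12:
        return 0.0
    return inf if psi_hat > 0.0 else -inf


def pmf(x, n, pi):
    """Joint binomial probability of the table x."""
    prob = 1.0
    for x_k, n_k, pi_k in zip(x, n, pi):
        prob *= comb(n_k, x_k) * pi_k ** x_k * (1.0 - pi_k) ** (n_k - x_k)
    return prob


def allocate(n_total, ratio):
    """Turn ratio = (lam_P, lam_R, lam_T) into sample sizes ordered (n_T, n_R, n_P)."""
    n_p, n_r, n_t = (n_total * lam // sum(ratio) for lam in ratio)
    return (n_t, n_r, n_p)


def type_one_error(n, theta, pi_r, pi_p, alpha=0.05):
    """Exact rejection probability of the asymptotic Wald test on the boundary psi = 0."""
    pi = (theta * pi_r + (1.0 - theta) * pi_p, pi_r, pi_p)
    rate = 0.0
    for x in product(*(range(n_k + 1) for n_k in n)):
        p_value = 0.5 * erfc(wald(x, n, theta) / sqrt(2.0))  # 1 - Phi(t)
        if p_value <= alpha:
            rate += pmf(x, n, pi)
    return rate


if __name__ == "__main__":
    n = allocate(30, (1, 1, 1))
    print(n, type_one_error(n, theta=0.6, pi_r=0.5, pi_p=0.2))
```

Replacing the asymptotic *p*-value inside the loop with any of the other four *p*-value functions gives the corresponding error rate, at a computational cost that grows with the cost of that *p*-value.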

## Results

### Simulation study

To compare the performance of AM, SAM, EUM, AUM and BTM together with test statistics $T_W$, $T_R$ and $T_L$ under the balanced and unbalanced designs, Figure 1 presents boxplots of their corresponding type I error rates for $n = 30$ and 60, and $\lambda_P : \lambda_R : \lambda_T$ = 1:1:1, 1:2:2 and 1:2:3, where AMk, SAk, EUk, AUk and BTk represent AM, SAM, EUM, AUM and BTM for test statistic $T_k$ with k = W, R and L, respectively. Here, each boxplot in Figure 1 contains 2 (i.e., the number of non-inferiority margins) × 11,340 (i.e., the number of configurations for $(\pi_P, \pi_R, \pi_T)$) = 22,680 data points. From Figure 1, we have the following findings. First, the medians of the type I error rates based on AUM and BTM are closer to the prespecified nominal level $\alpha = 0.05$ than those based on the other three *p*-value calculation methods for all three test statistics under consideration. Second, for AUM and BTM, the medians of the type I error rates for test statistics $T_W$ and $T_R$, which are 0.0495 and 0.0501 for AUM and 0.0494 and 0.0494 for BTM, respectively, are closer to $\alpha = 0.05$ than those for test statistic $T_L$, which are 0.0442 for AUM and 0.0442 for BTM. Third, for AM, SAM and EUM, the corresponding medians of type I error rates are 0.0649, 0.0455 and 0.0260 for test statistic $T_W$, 0.0504, 0.0455 and 0.0488 for test statistic $T_R$, and 0.0663, 0.1285 and 0.0332 for test statistic $T_L$, respectively, which indicate that (i) the AM is liberal for test statistics $T_W$ and $T_L$, whilst it is valid for test statistic $T_R$; (ii) the SAM can improve the accuracy of the normal approximation for test statistics $T_W$ and $T_R$; and (iii) the EUM is conservative for all test statistics. Fourth, the proportions of configurations whose type I error rates lie in the interval (0.045, 0.055) for AM, SAM, EUM, AUM and BTM are 0.0747, 0.4691, 0.0710, 0.5154 and 0.7994 for $T_W$, 0.5605, 0.4605, 0.4753, 0.7167 and 0.8370 for $T_R$, and 0.0784, 0.0800, 0.0691, 0.4056 and 0.4889 for $T_L$, respectively, which show that (i) AUM and BTM outperform the other three *p*-value calculation procedures, and (ii) $T_R$ behaves better than the other two test statistics regardless of the *p*-value calculation procedure. Fifth, the median of the type I error rates becomes closer to the prespecified nominal level as the total sample size $n$ increases, whilst at the same time the variability of the type I error rates decreases. Sixth, the variability of the type I error rates for unbalanced designs is not significantly different from that for the balanced designs.

To investigate the sensitivity of the various *p*-value calculation procedures (i.e., AM, SAM, EUM, AUM and BTM) to different test statistics, Figure 2 presents boxplots of their corresponding type I error rates against $\pi_P$ for test statistics $T_W$, $T_R$ and $T_L$. Examination of Figure 2 shows that there is no significant effect of $\pi_P$ on the type I error rate.

We also calculate the powers of the five *p*-value calculation procedures together with the three test statistics at the nominal level $\alpha = 0.05$ when $\pi_T = \pi_R$ and $\theta = 0.6$ with the following settings: $n = 30$ and 60, $\pi_P = 0.15$ and 0.3, and $\pi_R = 0.5$, 0.8 and 0.95 for the balanced allocation 1:1:1 and the unbalanced allocation 1:2:3. Results are reported in Table 1. Examination of Table 1 indicates that (i) $T_R$ is generally more powerful than $T_W$ and $T_L$ for the EUM except for $\pi_R = 0.95$ with the unbalanced designs, (ii) $T_W$ and $T_R$ have similar powers for AM, AUM and BTM under our considered settings, (iii) a slight power difference is observed between $T_R$ and $T_L$ for AUM and BTM, (iv) there is a slight power difference between balanced and unbalanced designs, and (v) power increases as $n$ increases regardless of the *p*-value calculation procedure or test statistic. Hence, we would recommend both AUM and BTM with $T_R$ for hypothesis testing.

### Real data example

An example from a pharmacological study of patients with functional dyspepsia (FD) and a placebo-controlled trial of subjects with acute migraine is used to illustrate our proposed methodologies. This example has been analyzed by Holtmann et al. [25] and Tang and Tang [14]. In this example, cisapride and simethicone can be regarded as the existing reference and new experimental treatments, respectively. In that study, $n = 178$ patients with FD were randomized, and $n_P = 61$, $n_R = 59$ and $n_T = 58$ of them were treated in a double-dummy technique with placebo, cisapride and simethicone, respectively; adverse events (e.g., diarrhea and pain) occurred in $x_P = 7$, $x_R = 10$ and $x_T = 12$ patients treated with placebo, cisapride and simethicone, respectively. It is of interest to test whether simethicone is non-inferior to cisapride in terms of the rate of reported adverse events in the presence of placebo. Given $\theta = 0.6$ and 0.8, the corresponding *p*-values for testing $H_0: (\pi_T - \pi_P)/(\pi_R - \pi_P) \le \theta$ versus $H_1: (\pi_T - \pi_P)/(\pi_R - \pi_P) > \theta$ based on the five *p*-value calculation procedures and three test statistics are reported in Table 2. From Table 2, there is no evidence that simethicone is non-inferior to cisapride in the presence of placebo at the nominal level $\alpha = 0.05$, which is consistent with the finding of Tang and Tang [14].
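As a rough check on the counts quoted above, the Wald-type statistic and its asymptotic *p*-value can be computed directly. The sketch below is only an illustration with the hypothetical function name `wald_analysis`; it covers one of the fifteen statistic and method combinations and is not intended to reproduce the entries of Table 2.

```python
from math import erfc, sqrt


def wald_analysis(x, n, theta):
    """Return (psi_hat, T_W, asymptotic p-value) for counts ordered (T, R, P)."""
    p_t, p_r, p_p = (x_k / n_k for x_k, n_k in zip(x, n))
    psi_hat = p_t - theta * p_r - (1.0 - theta) * p_p
    var = (p_t * (1.0 - p_t) / n[0]
           + theta ** 2 * p_r * (1.0 - p_r) / n[1]
           + (1.0 - theta) ** 2 * p_p * (1.0 - p_p) / n[2])
    t_w = psi_hat / sqrt(var)
    return psi_hat, t_w, 0.5 * erfc(t_w / sqrt(2.0))  # p = 1 - Phi(T_W)


if __name__ == "__main__":
    x = (12, 10, 7)    # adverse events: simethicone (T), cisapride (R), placebo (P)
    n = (58, 59, 61)
    for theta in (0.6, 0.8):
        psi_hat, t_w, p_value = wald_analysis(x, n, theta)
        print(f"theta={theta}: psi_hat={psi_hat:.4f}, T_W={t_w:.3f}, p={p_value:.3f}")
```

For both values of $\theta$ the asymptotic Wald *p*-value exceeds 0.05, in line with the conclusion that non-inferiority cannot be claimed.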

## Discussion

Simulation results demonstrate that our proposed score test statistic outperforms other test statistics in terms of type I error rate and power under our considered settings. The approximate unconditional and bootstrap-resampling methods perform better than other *p*-value calculation procedures in the sense that their corresponding type I error rates are closer to the prespecified nominal level and their corresponding powers are larger than those of other *p*-value calculation procedures. The exact unconditional method is conservative and time-consuming when sample sizes are large (e.g., see the 6th column in Table 3). The asymptotic tests are liberal since their type I error rates are greater than the prespecified nominal level *α*=0.05 in most cases. Comparing the approximate and exact unconditional methods, the approximate unconditional method provides a good alternative to the exact unconditional method in terms of computing time (e.g., see the 6th and 7th columns in Table 3) and type I error rate when sample sizes are large. In contrast, the computing burden of the bootstrap-resampling method is heavier than that of the approximate unconditional method (e.g., see the last two columns in Table 3).

In this article, we concentrate on a three-arm non-inferiority trial with binary endpoints in which the margin is defined as a fraction of the unknown difference in response probabilities between reference and placebo. The corresponding hypothesis (i.e., $H_0: \frac{\pi_T - \pi_P}{\pi_R - \pi_P} \le \theta$ or $H_0: \pi_T - \theta\pi_R - (1 - \theta)\pi_P \le 0$) is considered since it is simple and only a single hypothesis is involved (e.g., see [6, 9, 14]). However, three-arm non-inferiority hypotheses with the margin defined as a prespecified difference between treatments have received considerable attention in recent years (e.g., see [5, 7]). They can be generally classified as the union type hypotheses (i.e., $H_{U0}: \pi_R \ge h_P(\pi_P)$ or $\pi_R \ge h_T(\pi_T)$) or the intersection type hypotheses (i.e., $H_{U0}: \pi_R \ge h_P(\pi_P)$ and $\pi_R \ge h_T(\pi_T)$), where $h_P(\cdot)$ and $h_T(\cdot)$ are any functions [15]. For specific choices of $h_P(\cdot)$ and $h_T(\cdot)$, this includes, for example, hypotheses on the differences, the relative risks or the odds ratio of the proportions. While the union type hypotheses are suitable for showing both the superiority of the standard treatment as compared to placebo and the inferiority of the test treatment as compared to the standard treatment, the intersection type hypotheses are suitable for showing that the test treatment is as effective as the standard or placebo treatments. We are working on statistical inference for a three-arm non-inferiority trial with the margin being a prespecified difference between treatments when the primary endpoints are binary.

## Conclusions

Based on the above observations, we draw the following conclusions. In terms of type I error rate and power, the approximate unconditional and bootstrap-resampling methods with the score test statistic are recommended for hypothesis testing in a three-arm non-inferiority trial when sample sizes are small. In terms of computing time, type I error rate and power together, the approximate unconditional method with the score test statistic performs best among the *p*-value calculation procedures and test statistics considered.

## References

1. Dunnett CW, Gent M: Significance testing to establish equivalence between treatments with special reference to data in the form of 2 × 2 tables. Biometrics. 1977, 33: 593-602. 10.2307/2529457.

2. Tango T: Equivalence test and confidence interval for the difference in proportions for the paired-sample design. Stat Med. 1998, 17: 891-908. 10.1002/(SICI)1097-0258(19980430)17:8<891::AID-SIM780>3.0.CO;2-B.

3. Tang NS, Tang ML, Chan ISF: On tests of equivalence via non-unity relative risk for matched-pair design. Stat Med. 2003, 22: 1217-1233. 10.1002/sim.1213.

4. Li G, Gao S: A group sequential type design for three-arm non-inferiority trials with binary endpoints. Biom J. 2010, 52: 504-518. 10.1002/bimj.200900188.

5. Hida E, Tango T: Three-arm noninferiority trials with a prespecified margin for inference of the difference in the proportions of binary endpoints. J Biopharm Stat. 2013, 23: 774-789. 10.1080/10543406.2013.789893.

6. Koch GG, Röhmel J: Hypothesis testing in the gold standard design for proving the efficacy of an experimental treatment relative to placebo and a reference. J Biopharm Stat. 2004, 14: 315-325. 10.1081/BIP-120037182.

7. Hida E, Tango T: On the three-arm non-inferiority trial including a placebo with a prespecified margin. Stat Med. 2011, 30: 224-231. 10.1002/sim.4099.

8. Koch GG, Tangen CM: Nonparametric analysis of covariance and its role in non-inferiority clinical trials. Drug Inf J. 1999, 33: 1145-1159.

9. Pigeot I, Schäfer J, Röhmel J, Hauschke D: Assessing non-inferiority of a new treatment in a three-arm clinical trial including a placebo. Stat Med. 2003, 22: 883-899. 10.1002/sim.1450.

10. Koti KM: Use of the Fieller-Hinkley distribution of the ratio of random variables in testing for noninferiority. J Biopharm Stat. 2007, 17: 215-228. 10.1080/10543400601177335.

11. Hasler M, Vonk R, Hothorn LA: Assessing non-inferiority of a new treatment in a three-arm trial in the presence of heteroscedasticity. Stat Med. 2008, 27: 490-503. 10.1002/sim.3052.

12. Ghosh P, Nathoo F, Gönen M, Tiwari RC: Assessing noninferiority in a three-arm trial using the Bayesian approach. Stat Med. 2011, 30: 1795-1808. 10.1002/sim.4244.

13. Gamalo MA, Muthukumarana S, Ghosh P, Tiwari RC: A generalized p-value approach for assessing noninferiority in a three-arm trial. Stat Methods Med Res. 2013, 22: 261-277. 10.1177/0962280210395739.

14. Tang ML, Tang NS: Tests of non-inferiority via rate difference for three-arm clinical trials with placebo. J Biopharm Stat. 2004, 14: 337-347. 10.1081/BIP-120037184.

15. Munk A, Mielke M, Skipka G, Freitag G: Testing noninferiority in three-armed clinical trials based on likelihood ratio statistics. Can J Stat. 2007, 35: 413-431. 10.1002/cjs.5550350306.

16. Liu JT, Tzeng CS, Tsou HH: Establishing non-inferiority of a new treatment in a three-arm trial: apply a step-down hierarchical model in a papulopustular acne study and an oral prophylactic antibiotics study. Intl J Stat Med Res. 2014, 3: 11-20.

17. Jensen J: Saddlepoint Approximations. 1995, Oxford: Oxford Science Publications

18. Tang NS, Tang ML: Exact unconditional inference for risk ratio in a correlated 2 × 2 table with structural zero. Biometrics. 2002, 58: 972-980. 10.1111/j.0006-341X.2002.00972.x.

19. Kieser M, Friede T: Planning and analysis of three-arm non-inferiority trials with binary endpoints. Stat Med. 2007, 26: 253-273. 10.1002/sim.2543.

20. Blackwelder WC: Proving the null hypothesis in clinical trials. Control Clin Trials. 1982, 3: 345-353. 10.1016/0197-2456(82)90024-1.

21. Farrington CP, Manning G: Test statistics and sample size formulae for comparative binomial trials with null hypothesis of non-zero risk difference or non-unity relative risk. Stat Med. 1990, 9: 1447-1454. 10.1002/sim.4780091208.

22. Jing BY, Robinson J: Saddlepoint approximations for marginal and conditional probabilities of transformed variables. Ann Stat. 1994, 22: 1115-1132. 10.1214/aos/1176325620.

23. Tang ML, Tang NS, Rosner B: Statistical inference for correlated data in ophthalmologic studies. Stat Med. 2006, 25: 2271-2783.

24. Efron B, Tibshirani RJ: An Introduction to the Bootstrap. 1993, Boca Raton: Chapman & Hall

25. Holtmann G, Gschossmann J, Mayr P, Talley NJ: A randomized placebo-controlled trial of simethicone and cisapride for the treatment of patients with functional dyspepsia. Aliment Pharmacol Ther. 2002, 16: 1641-1648. 10.1046/j.1365-2036.2002.01322.x.

### Pre-publication history

The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/14/134/prepub

## Acknowledgements

This work was supported by grants from the National Science Foundation of China (11225103) and the Research Fund for the Doctoral Program of Higher Education of China (20115301110004). The work of the third author was partially supported by the General Research Fund from the Research Grants Council of the Hong Kong Special Administrative Region, China (UGC/FDS14/P01/14).

## Additional information

### Competing interests

The authors declare that they have no competing interests.

### Authors’ contributions

NST conceived of research questions, developed methods and revised the manuscript; BY carried out statistical analysis and drafted the manuscript; MLT interpreted results and revised the manuscript. All authors commented on successive drafts, and read and approved the final manuscript.

## Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

## About this article

### Cite this article

Tang, NS., Yu, B. & Tang, ML. Testing non-inferiority of a new treatment in three-arm clinical trials with binary endpoints.
*BMC Med Res Methodol* **14**, 134 (2014). https://doi.org/10.1186/1471-2288-14-134

DOI: https://doi.org/10.1186/1471-2288-14-134

### Keywords

- Approximate unconditional test
- Bootstrap-resampling test
- Non-inferiority trial
- Rate difference
- Saddlepoint approximation
- Three-arm design