Testing non-inferiority of a new treatment in three-arm clinical trials with binary endpoints
© Tang et al.; licensee BioMed Central. 2014
Received: 4 August 2014
Accepted: 12 December 2014
Published: 18 December 2014
Due to ethical concerns, a two-arm non-inferiority trial without a placebo is usually adopted to demonstrate that an experimental treatment is not worse than a reference treatment by more than a small pre-specified non-inferiority margin. Selection of the non-inferiority margin and establishment of assay sensitivity are two major issues in the design, analysis and interpretation of two-arm non-inferiority trials. Alternatively, a three-arm non-inferiority clinical trial including a placebo is usually conducted to assess the assay sensitivity and internal validity of a trial. Recently, some large-sample approaches have been developed to assess the non-inferiority of a new treatment based on the three-arm trial design. However, these methods behave badly with small sample sizes in the three arms. This manuscript aims to develop some reliable small-sample methods to test three-arm non-inferiority.
Saddlepoint approximation, exact and approximate unconditional, and bootstrap-resampling methods are developed to calculate p-values of the Wald-type, score and likelihood ratio tests. Simulation studies are conducted to evaluate their performance in terms of type I error rate and power.
Our empirical results show that the saddlepoint approximation method generally behaves better than the asymptotic method based on the Wald-type test statistic. For small sample sizes, approximate unconditional and bootstrap-resampling methods based on the score test statistic perform better in the sense that their corresponding type I error rates are generally closer to the prespecified nominal level than those of other test procedures.
Both approximate unconditional and bootstrap-resampling test procedures based on the score test statistic are generally recommended for three-arm non-inferiority trials with binary outcomes.
The objective of a non-inferiority trial is to demonstrate that the efficacy of an experimental treatment is not inferior to that of a reference treatment by more than some pre-specified non-inferiority margin. Many authors considered two-arm non-inferiority trials without a placebo since the comparison between the experimental and reference treatments is direct and the potential ethical problems encountered in traditional placebo-controlled trials are avoided (for example, see Dunnett and Gent, Tango, and Tang et al.). However, there are two major concerns for two-arm non-inferiority trials. The first issue is the choice of the non-inferiority margin, which is either a clinically acceptable amount or the result of a combination of statistical reasoning and clinical judgement. The other issue is the evaluation of assay sensitivity, which refers to the ability of a trial to differentiate an effective treatment from a less effective or ineffective treatment. Without a placebo arm, the assay sensitivity of a trial is not demonstrable from the trial data and one must rely on some external information (e.g., historical placebo trials) for the reference treatment. Without assay sensitivity, any non-inferiority testing results from the comparison of the experimental and reference treatments become unconvincing. There are some indications where it is considered ethically acceptable to continue to randomize patients to placebo despite the fact that an effective treatment exists, and there is interest in seeing not only whether the new treatment works at all but also how it measures up to accepted therapy. In this case, a three-arm non-inferiority clinical trial including the experimental treatment, an active reference treatment and a placebo is usually conducted to assess the assay sensitivity and internal validity of the trial.
Indeed, three-arm trials are recommended in the guidelines of the ICH (The International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use) and EMEA/CPMP (European Medicines Agency/Committee for Proprietary Medicinal Products) as a useful approach to the assessment of assay sensitivity and internal validity (e.g., see ).
Statistical inference based on three-arm non-inferiority clinical trials with normally distributed outcomes has received considerable attention in recent years. For example, Koch and Tangen  and Pigeot et al.  considered the problem of three-arm non-inferiority testing for normally distributed endpoints with a common but unknown variance. Koti  presented a new approach for normally distributed endpoints based on the Fieller-Hinkley distribution. Hasler, Vonk and Hothorn  proposed the usage of the t-distribution in the presence of heteroscedasticity. Hida and Tango  proposed a test procedure for assessing the assay sensitivity with a pre-specified margin defined as a difference between treatments in the presence of homoscedasticity. Ghosh, Nathoo, Gönen and Tiwari  developed a Bayesian approach in the presence of heteroscedasticity by incorporating both parametric and semi-parametric models. Gamalo, Muthukumarana, Ghosh and Tiwari  extended the existing generalized p-value approach for assessing the non-inferiority of a new treatment in a three-arm trial.
Recently, some statistical methods have also been developed for three-arm non-inferiority testing with binary endpoints. For example, Tang and Tang  proposed two asymptotic approaches for testing three-arm non-inferiority via rate difference based on Wald-type and score test statistics. Kieser and Friede (2007) revisited the performance of Tang and Tang’s  asymptotic test statistics via simulation studies and derived approximate sample size formulae for achieving the desired power. Munk, Mielke, Skipka and Freitag  developed likelihood ratio tests. Li and Gao  used the closed testing principle to establish the hierarchical testing procedure and proposed a group sequential type design. Liu, Tzeng and Tsou  presented a three-step testing procedure and derived an optimal sample size allocation rule in an ethical and reliable manner that minimizes the total sample size.
All aforementioned approaches for testing non-inferiority of a new treatment in a three-arm clinical trial with binary endpoints are based on large sample theory, and their accuracy has long been suspected and criticized when sample sizes are small or the data structure is sparse. To the best of our knowledge, limited work has been done to address these issues. Motivated by Jensen, we derive saddlepoint approximations to the cumulative distribution functions of the Wald-type, score and likelihood ratio test statistics. Inspired by Tang and Tang, we also propose exact unconditional, approximate unconditional and bootstrap-resampling p-value calculation procedures for testing three-arm non-inferiority with small sample sizes.
The rest of this article is organized as follows. We first review three test statistics for assessing non-inferiority of a new treatment in three-arm clinical trials with binary endpoints. We also propose saddlepoint approximation, exact and approximate unconditional, and bootstrap-resampling approaches for calculating p-values. Simulation studies are conducted to investigate the performance of all test statistics based on different p-value calculation approaches in terms of type I error rate and power. An example is analyzed to demonstrate our methodologies. Finally, we discuss the performance of our proposed methodologies and present some conclusions.
It can be easily shown from Equation (2.1) that the maximum likelihood estimates (MLEs) of $\pi_T$, $\pi_R$ and $\pi_P$ are given by $\hat{\pi}_T = x_T/n_T$, $\hat{\pi}_R = x_R/n_R$ and $\hat{\pi}_P = x_P/n_P$, respectively.
It is possible that there is no point $(\pi_P, \pi_R) \in \Theta$ satisfying the above equations, which implies that the likelihood function given in Equation (2.1) attains its maximum on the boundary of the parameter space $\Theta$.
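As a rough illustration of how restricted estimates of this kind might be obtained numerically, the sketch below maximizes the three-arm binomial log-likelihood subject to $\pi_T = \theta\pi_R + (1-\theta)\pi_P$ (that is, $\psi = 0$). The function names and the coordinate-wise ternary search are assumptions made for illustration only; they are not the estimating equations of this article, although the search does cope with maxima that lie on the boundary of the parameter space.

```python
import math


def log_lik(x, n, p):
    """Binomial log-likelihood kernel for one arm; -inf if p rules out the data."""
    total = 0.0
    if x > 0:
        if p <= 0.0:
            return -math.inf
        total += x * math.log(p)
    if n - x > 0:
        if p >= 1.0:
            return -math.inf
        total += (n - x) * math.log(1.0 - p)
    return total


def argmax_1d(f, lo=0.0, hi=1.0, iters=60):
    """Ternary search for the maximiser of a concave function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


def restricted_mle(x, n, theta, sweeps=100):
    """Maximise the likelihood over (pi_R, pi_P) with pi_T tied to them by psi = 0.

    x and n are (treatment, reference, placebo) triples of counts and sizes.
    Returns the restricted estimates as (pi_T, pi_R, pi_P).
    """
    x_t, x_r, x_p = x
    n_t, n_r, n_p = n

    def ll(pi_r, pi_p):
        pi_t = theta * pi_r + (1.0 - theta) * pi_p
        return (log_lik(x_t, n_t, pi_t)
                + log_lik(x_r, n_r, pi_r)
                + log_lik(x_p, n_p, pi_p))

    # The log-likelihood is concave in (pi_R, pi_P), so alternating
    # one-dimensional maximisations converge to the constrained maximum.
    pi_r, pi_p = 0.5, 0.5
    for _ in range(sweeps):
        pi_r = argmax_1d(lambda v: ll(v, pi_p))
        pi_p = argmax_1d(lambda v: ll(pi_r, v))
    return theta * pi_r + (1.0 - theta) * pi_p, pi_r, pi_p
```

When the unrestricted MLEs already satisfy $\psi = 0$, the restricted and unrestricted estimates coincide, which gives a simple sanity check for the routine.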
The Wald-type statistic $T_W$ and the score statistic $T_R$ are asymptotically distributed as the standard normal distribution under $H_0$ when $n_T$, $n_R$ and $n_P$ are sufficiently large. Hence, non-inferiority can be claimed if $T_W > z_{1-\alpha}$ (or $T_R > z_{1-\alpha}$), where $z_{1-\alpha}$ is the $(1-\alpha)$-quantile of the standard normal distribution. When $\pi_P = 0$, $T_W$ is the Wald-type statistic proposed in Blackwelder and $T_R$ is the test statistic given by Farrington and Manning for two-arm non-inferiority trials.
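For orientation, one common way of writing Wald-type and score-type statistics for $H_0: \psi \le 0$ with $\psi = \pi_T - \theta\pi_R - (1-\theta)\pi_P$ is sketched below. Hats denote unrestricted MLEs and tildes denote restricted MLEs under $\psi = 0$; this notation and the exact form of the variance estimators are assumptions here and may differ in detail from the definitions used in this article.

```latex
\hat{\psi} = \hat{\pi}_T - \theta\,\hat{\pi}_R - (1-\theta)\,\hat{\pi}_P ,
\qquad
T_W = \frac{\hat{\psi}}
{\sqrt{\dfrac{\hat{\pi}_T(1-\hat{\pi}_T)}{n_T}
 + \theta^2\,\dfrac{\hat{\pi}_R(1-\hat{\pi}_R)}{n_R}
 + (1-\theta)^2\,\dfrac{\hat{\pi}_P(1-\hat{\pi}_P)}{n_P}}} ,
\qquad
T_R = \frac{\hat{\psi}}
{\sqrt{\dfrac{\tilde{\pi}_T(1-\tilde{\pi}_T)}{n_T}
 + \theta^2\,\dfrac{\tilde{\pi}_R(1-\tilde{\pi}_R)}{n_R}
 + (1-\theta)^2\,\dfrac{\tilde{\pi}_P(1-\tilde{\pi}_P)}{n_P}}} .
```

The two statistics share the same numerator and differ only in whether the variance is estimated without restriction or under the null boundary.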
The likelihood ratio statistic $T_L$ is asymptotically distributed as the standard normal distribution under $H_0$ when $n_T$, $n_R$ and $n_P$ are sufficiently large. Thus, non-inferiority can be claimed if $T_L > z_{1-\alpha}$.
p-value calculation methods
The non-inferiority hypothesis (2.2) can be tested via the p-value method with the rule: $H_0$ is rejected if the p-value is less than or equal to the prespecified significance level $\alpha$. In what follows, we introduce five approaches for calculating p-values based on the observed value of the test statistic $T_j$ ($j = W, R, L$), computed from the observed value of $(X_T, X_R, X_P)$.
(1) Asymptotic method (AM)
It follows from the above arguments that all statistics $T_j$ ($j = W, R, L$) asymptotically follow the standard normal distribution under the null hypothesis $H_0: \psi \le 0$. Thus, writing $t_j$ for the observed value of $T_j$, the asymptotic p-value for testing hypothesis (2.2) via statistic $T_j$ can be calculated as $1 - \Phi(t_j)$, where $\Phi(\cdot)$ is the standard normal distribution function.
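The asymptotic calculation can be sketched in a few lines. The example below assumes the usual Wald-type form for $\psi = \pi_T - \theta\pi_R - (1-\theta)\pi_P$ with unrestricted variance estimates; the function names, the handling of a zero variance estimate and the counts in the usage note are illustrative assumptions, not taken from this article.

```python
import math
from statistics import NormalDist


def wald_statistic(x, n, theta):
    """Wald-type statistic for psi = pi_T - theta*pi_R - (1 - theta)*pi_P.

    x and n are (treatment, reference, placebo) triples of counts and sizes.
    """
    (x_t, x_r, x_p), (n_t, n_r, n_p) = x, n
    p_t, p_r, p_p = x_t / n_t, x_r / n_r, x_p / n_p
    psi_hat = p_t - theta * p_r - (1.0 - theta) * p_p
    var = (p_t * (1.0 - p_t) / n_t
           + theta ** 2 * p_r * (1.0 - p_r) / n_r
           + (1.0 - theta) ** 2 * p_p * (1.0 - p_p) / n_p)
    if var <= 0.0:
        # Degenerate table (all estimates 0 or 1): an assumed convention.
        return 0.0 if psi_hat == 0.0 else math.copysign(math.inf, psi_hat)
    return psi_hat / math.sqrt(var)


def asymptotic_p_value(t_obs):
    """One-sided p-value 1 - Phi(t_obs); large statistics favour non-inferiority."""
    return 1.0 - NormalDist().cdf(t_obs)
```

With hypothetical counts `x = (18, 17, 8)` out of `n = (30, 30, 30)` and `theta = 0.6`, the statistic is about 1.40, so the asymptotic p-value is roughly 0.08 and non-inferiority would not be claimed at the 5% level.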
The above asymptotic approach for calculating the p-value for testing hypothesis (2.2) via statistic $T_j$ ($j = W, R, L$) is established under large sample theory. Its accuracy has long been suspected and criticized, especially when $n_T$, $n_R$ and/or $n_P$ are small, since the skewness of the underlying binomial distributions is not taken into consideration. Some higher order corrections such as the saddlepoint approximation have been proposed to improve the accuracy of the normal approximation. In what follows, we will derive saddlepoint approximations to the distributions of the three test statistics.
(2) Saddlepoint approximation method (SAM)
where and , is the unique solution to equation: for j=W,R with and , and with , and .
(3) Exact unconditional method (EUM)
and is 1 if and 0 otherwise.
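As a hedged sketch of the general shape of an exact unconditional p-value, one standard construction takes the supremum, over the nuisance parameters on the null boundary $\psi = 0$, of the probability that the statistic is at least as large as its observed value. The symbol $t_j$ for the observed value and the restriction of the supremum to the boundary are assumptions made here for illustration.

```latex
p_j^{EU} \;=\; \sup_{(\pi_R,\,\pi_P)\,:\,\psi = 0}\;
\sum_{x_T=0}^{n_T}\sum_{x_R=0}^{n_R}\sum_{x_P=0}^{n_P}
\Bigg[\prod_{k \in \{T,R,P\}} \binom{n_k}{x_k}\,\pi_k^{x_k}(1-\pi_k)^{n_k-x_k}\Bigg]\,
I\{T_j(x_T,x_R,x_P) \ge t_j\},
\qquad \pi_T = \theta\pi_R + (1-\theta)\pi_P .
```

Because the supremum is taken over all admissible nuisance values, a test based on such a p-value cannot exceed the nominal level, at the price of some conservatism and a heavier computational burden.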
(4) Approximate unconditional method (AUM)
According to Tang and Tang and Tang, Tang and Rosner, the exact unconditional test is always conservative, i.e., its type I error rate is always less than or equal to the prespecified significance level. Following Tang and Tang, the nuisance parameters can be eliminated by evaluating them at their corresponding RMLEs under $\psi = 0$. The approximate unconditional p-value for testing $H_0: \psi \le 0$ via statistic $T_j$ ($j = W, R, L$) is then the probability, computed with the nuisance parameters fixed at these RMLEs, that $T_j$ is at least as large as its observed value.
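The plug-in idea can be sketched by full enumeration of the sample space. In the example below the null-restricted probabilities are passed in as `pi_null` (a hypothetical argument standing in for the RMLEs), and a Wald-type statistic of the usual form is used purely for illustration; any of the three statistics could be supplied through the `statistic` argument.

```python
import math
from itertools import product


def binom_pmf(k, n, p):
    """Binomial probability mass function."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def wald_statistic(x, n, theta):
    """Wald-type statistic for psi = pi_T - theta*pi_R - (1 - theta)*pi_P."""
    (x_t, x_r, x_p), (n_t, n_r, n_p) = x, n
    p_t, p_r, p_p = x_t / n_t, x_r / n_r, x_p / n_p
    psi_hat = p_t - theta * p_r - (1.0 - theta) * p_p
    var = (p_t * (1.0 - p_t) / n_t
           + theta ** 2 * p_r * (1.0 - p_r) / n_r
           + (1.0 - theta) ** 2 * p_p * (1.0 - p_p) / n_p)
    if var <= 0.0:
        # Degenerate table: an assumed convention for illustration.
        return 0.0 if psi_hat == 0.0 else math.copysign(math.inf, psi_hat)
    return psi_hat / math.sqrt(var)


def approx_unconditional_p_value(x_obs, n, theta, pi_null,
                                 statistic=wald_statistic):
    """Sum the null probabilities of all tables at least as extreme as x_obs.

    pi_null holds (pi_T, pi_R, pi_P) evaluated under psi = 0, e.g. the RMLEs.
    """
    t_obs = statistic(x_obs, n, theta)
    pmfs = [[binom_pmf(k, n_k, p_k) for k in range(n_k + 1)]
            for n_k, p_k in zip(n, pi_null)]
    p_value = 0.0
    for x in product(*(range(n_k + 1) for n_k in n)):
        # Small tolerance so that ties with the observed table are included.
        if statistic(x, n, theta) >= t_obs - 1e-12:
            p_value += pmfs[0][x[0]] * pmfs[1][x[1]] * pmfs[2][x[2]]
    return p_value
```

The enumeration visits $(n_T+1)(n_R+1)(n_P+1)$ tables, which is cheap for the small sample sizes this method targets and, unlike the exact unconditional method, needs no search over the nuisance parameters.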
(5) Bootstrap-resampling method (BTM)
Hypothesis testing based on the bootstrap-resampling method is usually recommended when sample sizes (i.e., $n_T$, $n_R$ and $n_P$) are small or the data structure is sparse (e.g., $x_T$, $x_R$ or $x_P$ is close to zero or to $n_T$, $n_R$ or $n_P$, respectively). Given the observation $(x_T, x_R, x_P)$, we compute the RMLEs of the parameters $\pi_T$, $\pi_R$ and $\pi_P$ under $\psi = 0$, and calculate the observed value of statistic $T_j$ ($j = W, R, L$). Based on these RMLEs, we generate $B$ bootstrap samples by drawing, for $k = T, R$ and $P$, a binomial count with $n_k$ trials and success probability equal to the RMLE of $\pi_k$. For each of the $B$ bootstrap samples, we compute the value of statistic $T_j$. Hence, an approximate p-value for testing $H_0: \psi \le 0$ via statistic $T_j$ is given by the proportion of bootstrap samples whose statistic is at least as large as the observed value.
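The resampling steps above can be sketched as a parametric bootstrap. As before, `pi_null` is a hypothetical argument standing in for the RMLEs under $\psi = 0$, the Wald-type statistic is only one possible choice, and the default number of resamples and the seed are arbitrary illustrative values.

```python
import math
import random


def wald_statistic(x, n, theta):
    """Wald-type statistic for psi = pi_T - theta*pi_R - (1 - theta)*pi_P."""
    (x_t, x_r, x_p), (n_t, n_r, n_p) = x, n
    p_t, p_r, p_p = x_t / n_t, x_r / n_r, x_p / n_p
    psi_hat = p_t - theta * p_r - (1.0 - theta) * p_p
    var = (p_t * (1.0 - p_t) / n_t
           + theta ** 2 * p_r * (1.0 - p_r) / n_r
           + (1.0 - theta) ** 2 * p_p * (1.0 - p_p) / n_p)
    if var <= 0.0:
        # Degenerate table: an assumed convention for illustration.
        return 0.0 if psi_hat == 0.0 else math.copysign(math.inf, psi_hat)
    return psi_hat / math.sqrt(var)


def bootstrap_p_value(x_obs, n, theta, pi_null, n_boot=2000, seed=0,
                      statistic=wald_statistic):
    """Proportion of bootstrap tables whose statistic reaches the observed one.

    Each bootstrap table draws one binomial count per arm with success
    probability pi_null[k], i.e. the estimate restricted to psi = 0.
    """
    rng = random.Random(seed)
    t_obs = statistic(x_obs, n, theta)
    exceed = 0
    for _ in range(n_boot):
        # A binomial draw as a sum of Bernoulli trials (stdlib only).
        x_star = tuple(sum(rng.random() < p_k for _ in range(n_k))
                       for n_k, p_k in zip(n, pi_null))
        if statistic(x_star, n, theta) >= t_obs - 1e-12:
            exceed += 1
    return exceed / n_boot
```

Fixing the seed makes the Monte Carlo p-value reproducible; increasing `n_boot` reduces its simulation error at a proportional cost in computing time.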
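For any given observation $(x_T, x_R, x_P)$, test statistic $T_j$ ($j = W, R, L$) and p-value calculation method, we reject the null hypothesis $H_0$ at the significance level $\alpha$ if the corresponding p-value is less than or equal to $\alpha$, for each of the methods AM, SAM, EUM, AUM and BTM.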
The type I error rate of each procedure is evaluated for $k$ = AM, SAM, EUM, AUM and BTM, whilst the corresponding power is evaluated by replacing $H_0$ with $H_1$.
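To make the evaluation concrete, the sketch below computes the exact rejection probability of the asymptotic Wald-type test by enumerating all possible tables. Evaluated at a parameter point with $\psi = 0$ it gives a type I error rate, and evaluated at a point in $H_1$ it gives power. The function names, the Wald-type form and the parameter values in the usage note are illustrative assumptions rather than the configurations studied in this article.

```python
import math
from itertools import product
from statistics import NormalDist


def binom_pmf(k, n, p):
    """Binomial probability mass function."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def wald_statistic(x, n, theta):
    """Wald-type statistic for psi = pi_T - theta*pi_R - (1 - theta)*pi_P."""
    (x_t, x_r, x_p), (n_t, n_r, n_p) = x, n
    p_t, p_r, p_p = x_t / n_t, x_r / n_r, x_p / n_p
    psi_hat = p_t - theta * p_r - (1.0 - theta) * p_p
    var = (p_t * (1.0 - p_t) / n_t
           + theta ** 2 * p_r * (1.0 - p_r) / n_r
           + (1.0 - theta) ** 2 * p_p * (1.0 - p_p) / n_p)
    if var <= 0.0:
        # Degenerate table: an assumed convention for illustration.
        return 0.0 if psi_hat == 0.0 else math.copysign(math.inf, psi_hat)
    return psi_hat / math.sqrt(var)


def exact_rejection_rate(n, theta, pi, alpha=0.05):
    """Probability, under true rates pi, that the asymptotic test rejects H0."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha)
    pmfs = [[binom_pmf(k, n_k, p_k) for k in range(n_k + 1)]
            for n_k, p_k in zip(n, pi)]
    rate = 0.0
    for x in product(*(range(n_k + 1) for n_k in n)):
        if wald_statistic(x, n, theta) > z_crit:
            rate += pmfs[0][x[0]] * pmfs[1][x[1]] * pmfs[2][x[2]]
    return rate
```

For instance, with `n = (30, 30, 30)` and `theta = 0.6`, calling the function at `pi = (0.44, 0.6, 0.2)` (where $\psi = 0$) returns a type I error rate, while `pi = (0.6, 0.6, 0.2)` (where $\pi_T = \pi_R$) returns the power against that alternative.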
Table: Exact powers (%) of various test procedures together with three statistics when $\pi_T = \pi_R$ with $n = 30$ and 60, $\theta = 0.6$ and $\alpha = 0.05$ (tabulated by $\lambda_P : \lambda_R : \lambda_T$)
Real data example
Table: Various p-values for the pharmacological data set at the nominal level $\alpha = 5\%$
Table: Computing time (minutes) of the type I error rates for 11340 configurations of $(\pi_P, \pi_R, \pi_T)$ together with three test statistics under five test methods (tabulated by $\lambda_P : \lambda_R : \lambda_T$)
In this article, we concentrate on a three-arm non-inferiority trial with binary endpoints in which the margin is defined as a fraction of the unknown difference in response probabilities between reference and placebo. The corresponding hypothesis (i.e., $H_0: \pi_T - \theta\pi_R - (1-\theta)\pi_P \le 0$) is considered since it is simple and involves only a single hypothesis (e.g., see [6, 9, 14]). However, three-arm non-inferiority hypotheses with the margin defined as a prespecified difference between treatments have received considerable attention in recent years (e.g., see [5, 7]). They can be generally classified as union type hypotheses (i.e., $H_{U0}$: $\pi_R \ge h_P(\pi_P)$ or $\pi_R \ge h_T(\pi_T)$) or intersection type hypotheses (i.e., $H_{U0}$: $\pi_R \ge h_P(\pi_P)$ and $\pi_R \ge h_T(\pi_T)$), where $h_P(\cdot)$ and $h_T(\cdot)$ are any functions. For specific choices of $h_P(\cdot)$ and $h_T(\cdot)$, this includes, for example, hypotheses on the differences, the relative risks or the odds ratios of the proportions. While the union type hypotheses are suitable for showing both the superiority of the standard treatment as compared to placebo and the inferiority of the test treatment as compared to the standard treatment, the intersection type hypotheses are suitable for showing that the test treatment is as effective as the standard or placebo treatments. We are working on statistical inference for a three-arm non-inferiority trial with the margin being a prespecified difference between treatments when the primary endpoints are binary.
According to the aforementioned observations, we can draw the following conclusions. In terms of type I error rates and powers, the approximate unconditional and bootstrap-resampling methods with the score test statistic are recommended for hypothesis testing when sample sizes are small in a three-arm non-inferiority trial. In terms of computing time, type I error rates and powers, the approximate unconditional method with the score test statistic behaves the best among the p-value calculation procedures and test statistics considered.
This work was supported by the grants from the National Science Foundation of China (11225103), and Research Fund for the Doctoral Program of Higher Education of China (20115301110004). The work of the third author was partially supported by the General Research Fund from the Research Grants Council of the Hong Kong Special Administrative Region, China (UGC/FDS14/P01/14).
1. Dunnett CW, Gent M: Significance testing to establish equivalence between treatments with special reference to data in the form of 2 × 2 tables. Biometrics. 1977, 33: 593-602. 10.2307/2529457.
2. Tango T: Equivalence test and confidence interval for the difference in proportions for the paired-sample design. Stat Med. 1998, 17: 891-908. 10.1002/(SICI)1097-0258(19980430)17:8<891::AID-SIM780>3.0.CO;2-B.
3. Tang NS, Tang ML, Chan ISF: On tests of equivalence via non-unity relative risk for matched-pair design. Stat Med. 2003, 22: 1217-1233. 10.1002/sim.1213.
4. Li G, Gao S: A group sequential type design for three-arm non-inferiority trials with binary endpoints. Biom J. 2010, 52: 504-518. 10.1002/bimj.200900188.
5. Hida E, Tango T: Three-arm noninferiority trials with a prespecified margin for inference of the difference in the proportions of binary endpoints. J Biopharm Stat. 2013, 23: 774-789. 10.1080/10543406.2013.789893.
6. Koch GG, Röhmel J: Hypothesis testing in the gold standard design for proving the efficacy of an experimental treatment relative to placebo and a reference. J Biopharm Stat. 2004, 14: 315-325. 10.1081/BIP-120037182.
7. Hida E, Tango T: On the three-arm non-inferiority trial including a placebo with a prespecified margin. Stat Med. 2011, 30: 224-231. 10.1002/sim.4099.
8. Koch GG, Tangen CM: Nonparametric analysis of covariance and its role in non-inferiority clinical trials. Drug Inf J. 1999, 33: 1145-1159.
9. Pigeot I, Schäfer J, Röhmel J, Hauschke D: Assessing non-inferiority of a new treatment in a three-arm clinical trial including a placebo. Stat Med. 2003, 22: 883-899. 10.1002/sim.1450.
10. Koti KM: Use of the Fieller-Hinkley distribution of the ratio of random variables in testing for noninferiority. J Biopharm Stat. 2007, 17: 215-228. 10.1080/10543400601177335.
11. Hasler M, Vonk R, Hothorn LA: Assessing non-inferiority of a new treatment in a three-arm trial in the presence of heteroscedasticity. Stat Med. 2008, 27: 490-503. 10.1002/sim.3052.
12. Ghosh P, Nathoo F, Gönen M, Tiwari RC: Assessing noninferiority in a three-arm trial using the Bayesian approach. Stat Med. 2011, 30: 1795-1808. 10.1002/sim.4244.
13. Gamalo MA, Muthukumarana S, Ghosh P, Tiwari RC: A generalized p-value approach for assessing noninferiority in a three-arm trial. Stat Methods Med Res. 2013, 22: 261-277. 10.1177/0962280210395739.
14. Tang ML, Tang NS: Tests of non-inferiority via rate difference for three-arm clinical trials with placebo. J Biopharm Stat. 2004, 14: 337-347. 10.1081/BIP-120037184.
15. Munk A, Mielke M, Skipka G, Freitag G: Testing noninferiority in three-armed clinical trials based on likelihood ratio statistics. Can J Stat. 2007, 35: 413-431. 10.1002/cjs.5550350306.
16. Liu JT, Tzeng CS, Tsou HH: Establishing non-inferiority of a new treatment in a three-arm trial: apply a step-down hierarchical model in a papulopustular acne study and an oral prophylactic antibiotics study. Intl J Stat Med Res. 2014, 3: 11-20.
17. Jensen J: Saddlepoint Approximations. 1995, Oxford: Oxford Science Publications.
18. Tang NS, Tang ML: Exact unconditional inference for risk ratio in a correlated 2 × 2 table with structural zero. Biometrics. 2002, 58: 972-980. 10.1111/j.0006-341X.2002.00972.x.
19. Kieser M, Friede T: Planning and analysis of three-arm non-inferiority trials with binary endpoints. Stat Med. 2007, 26: 253-273. 10.1002/sim.2543.
20. Blackwelder WC: Proving the null hypothesis in clinical trials. Control Clin Trials. 1982, 3: 345-353. 10.1016/0197-2456(82)90024-1.
21. Farrington CP, Manning G: Test statistics and sample size formulae for comparative binomial trials with null hypothesis of non-zero risk difference or non-unity relative risk. Stat Med. 1990, 9: 1447-1454. 10.1002/sim.4780091208.
22. Jing BY, Robinson J: Saddlepoint approximations for marginal and conditional probabilities of transformed variables. Ann Stat. 1994, 22: 1115-1132. 10.1214/aos/1176325620.
23. Tang ML, Tang NS, Rosner B: Statistical inference for correlated data in ophthalmologic studies. Stat Med. 2006, 25: 2271-2783.
24. Efron B, Tibshirani RJ: An Introduction to the Bootstrap. 1993, Boca Raton: Chapman & Hall.
25. Holtmann G, Gschossmann J, Mayr P, Talley NJ: A randomized placebo-controlled trial of simethicone and cisapride for the treatment of patients with functional dyspepsia. Aliment Pharmacol Ther. 2002, 16: 1641-1648. 10.1046/j.1365-2036.2002.01322.x.
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/14/134/prepub
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.