  • Research article
  • Open Access
  • Open Peer Review

Testing non-inferiority of a new treatment in three-arm clinical trials with binary endpoints

BMC Medical Research Methodology 2014, 14:134

https://doi.org/10.1186/1471-2288-14-134

  • Received: 4 August 2014
  • Accepted: 12 December 2014

Abstract

Background

A two-arm non-inferiority trial without a placebo is usually adopted to demonstrate that an experimental treatment is not worse than a reference treatment by a small pre-specified non-inferiority margin due to ethical concerns. Selection of the non-inferiority margin and establishment of assay sensitivity are two major issues in the design, analysis and interpretation for two-arm non-inferiority trials. Alternatively, a three-arm non-inferiority clinical trial including a placebo is usually conducted to assess the assay sensitivity and internal validity of a trial. Recently, some large-sample approaches have been developed to assess the non-inferiority of a new treatment based on the three-arm trial design. However, these methods behave badly with small sample sizes in the three arms. This manuscript aims to develop some reliable small-sample methods to test three-arm non-inferiority.

Methods

Saddlepoint approximation, exact and approximate unconditional, and bootstrap-resampling methods are developed to calculate p-values of the Wald-type, score and likelihood ratio tests. Simulation studies are conducted to evaluate their performance in terms of type I error rate and power.

Results

Our empirical results show that the saddlepoint approximation method generally behaves better than the asymptotic method based on the Wald-type test statistic. For small sample sizes, approximate unconditional and bootstrap-resampling methods based on the score test statistic perform better in the sense that their corresponding type I error rates are generally closer to the prespecified nominal level than those of other test procedures.

Conclusions

Both approximate unconditional and bootstrap-resampling test procedures based on the score test statistic are generally recommended for three-arm non-inferiority trials with binary outcomes.

Keywords

  • Approximate unconditional test
  • Bootstrap-resampling test
  • Non-inferiority trial
  • Rate difference
  • Saddlepoint approximation
  • Three-arm design

Background

The objective of a non-inferiority trial is to demonstrate that the efficacy of an experimental treatment is not inferior to that of a reference treatment by more than some pre-specified non-inferiority margin. Many authors have considered two-arm non-inferiority trials without a placebo, since the comparison between the experimental and reference treatments is direct and the potential ethical problems encountered in traditional placebo-controlled trials are avoided (for example, see Dunnett and Gent [1], Tango [2], and Tang et al. [3]). However, there are two major concerns for two-arm non-inferiority trials [4]. The first issue is the choice of the non-inferiority margin, which is the clinically acceptable amount or a combination of statistical reasoning and clinical judgement. The other issue is the evaluation of assay sensitivity, which refers to the ability of a trial to differentiate an effective treatment from a less effective or ineffective treatment [5]. Without a placebo arm, the assay sensitivity of a trial is not demonstrable from the trial data, and one must rely on external information (e.g., historical placebo trials) for the reference treatment [4]. Without assay sensitivity, any non-inferiority result from the comparison of the experimental and reference treatments is unconvincing. There are some indications where it is considered ethically acceptable to continue to randomize patients to placebo despite the fact that an effective treatment exists, and there is interest in seeing not only whether the new treatment works at all but also how it measures up to accepted therapy. In this case, a three-arm non-inferiority clinical trial including the experimental treatment, an active reference treatment and a placebo is usually conducted to assess the assay sensitivity and internal validity of the trial [6].
Indeed, three-arm trials are recommended in the guidelines of the ICH (The International Conference on Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use) and EMEA/CPMP (European Medicines Agency/Committee for Proprietary Medicinal Products) as a useful approach to the assessment of assay sensitivity and internal validity (e.g., see [7]).

Statistical inference based on three-arm non-inferiority clinical trials with normally distributed outcomes has received considerable attention in recent years. For example, Koch and Tangen [8] and Pigeot et al. [9] considered the problem of three-arm non-inferiority testing for normally distributed endpoints with a common but unknown variance. Koti [10] presented a new approach for normally distributed endpoints based on the Fieller-Hinkley distribution. Hasler, Vonk and Hothorn [11] proposed the usage of the t-distribution in the presence of heteroscedasticity. Hida and Tango [7] proposed a test procedure for assessing the assay sensitivity with a pre-specified margin defined as a difference between treatments in the presence of homoscedasticity. Ghosh, Nathoo, Gönen and Tiwari [12] developed a Bayesian approach in the presence of heteroscedasticity by incorporating both parametric and semi-parametric models. Gamalo, Muthukumarana, Ghosh and Tiwari [13] extended the existing generalized p-value approach for assessing the non-inferiority of a new treatment in a three-arm trial.

Recently, some statistical methods have also been developed for three-arm non-inferiority testing with binary endpoints. For example, Tang and Tang [14] proposed two asymptotic approaches for testing three-arm non-inferiority via rate difference based on Wald-type and score test statistics. Kieser and Friede (2007) revisited the performance of Tang and Tang’s [14] asymptotic test statistics via simulation studies and derived approximate sample size formulae for achieving the desired power. Munk, Mielke, Skipka and Freitag [15] developed likelihood ratio tests. Li and Gao [4] used the closed testing principle to establish the hierarchical testing procedure and proposed a group sequential type design. Liu, Tzeng and Tsou [16] presented a three-step testing procedure and derived an optimal sample size allocation rule in an ethical and reliable manner that minimizes the total sample size.

All aforementioned approaches for testing non-inferiority of a new treatment in a three-arm clinical trial with binary endpoints are based on large-sample theory, and their accuracy has long been questioned when sample sizes are small or the data structure is sparse. To the best of our knowledge, limited work has been done to address these issues. Motivated by Jensen [17], we derive saddlepoint approximations to the cumulative distribution functions of the Wald-type, score and likelihood ratio test statistics. Inspired by Tang and Tang [18], we also propose exact unconditional, approximate unconditional and bootstrap-resampling p-value calculation procedures for testing three-arm non-inferiority with small sample sizes.

The rest of this article is organized as follows. We first review three test statistics for assessing non-inferiority of a new treatment in three-arm clinical trials with binary endpoints. We also propose saddlepoint approximation, exact and approximate unconditional, and bootstrap-resampling approaches for calculating p-values. Simulation studies are conducted to investigate the performance of all test statistics based on different p-value calculation approaches in terms of type I error rate and power. An example is analyzed to demonstrate our methodologies. Finally, we discuss the performance of our proposed methodologies and present some conclusions.

Methods

Model

Consider a clinical trial with test (T), reference (R) and placebo (P) treatments, and assume that the primary clinical outcomes $X_T$, $X_R$ and $X_P$ are independent and binomially distributed as $X_T \sim \mathrm{Bin}(n_T, \pi_T)$, $X_R \sim \mathrm{Bin}(n_R, \pi_R)$ and $X_P \sim \mathrm{Bin}(n_P, \pi_P)$, respectively. Here, $X_T$, $X_R$ and $X_P$ are the numbers of responses in groups T, R and P, respectively; $\pi_T$, $\pi_R$ and $\pi_P$ represent the corresponding response probabilities, with a higher probability indicating a more favorable outcome; and $n_T$, $n_R$ and $n_P$ denote the corresponding sample sizes. Thus, the joint probability density function of $(x_T, x_R, x_P)$ is given by
$$
f(x_T, x_R, x_P \mid \pi_T, \pi_R, \pi_P) = \binom{n_T}{x_T}\binom{n_R}{x_R}\binom{n_P}{x_P}\, \pi_T^{x_T}(1-\pi_T)^{n_T-x_T}\, \pi_R^{x_R}(1-\pi_R)^{n_R-x_R}\, \pi_P^{x_P}(1-\pi_P)^{n_P-x_P}. \tag{2.1}
$$

It can be easily shown from Equation (2.1) that the maximum likelihood estimates (MLEs) of $\pi_T$, $\pi_R$ and $\pi_P$ are given by $\hat\pi_T = x_T/n_T$, $\hat\pi_R = x_R/n_R$ and $\hat\pi_P = x_P/n_P$, respectively.

Test statistics

Following Hida and Tango [7], to test the non-inferiority of the experimental treatment to the reference with assay sensitivity in a three-arm trial, we have to simultaneously demonstrate (i) the superiority of the experimental treatment to the placebo, (ii) the non-inferiority of the experimental treatment to the reference with a non-inferiority margin $\Delta > 0$, and (iii) the superiority of the reference treatment to the placebo by more than $\Delta$. That is, $\pi_T$, $\pi_R$ and $\pi_P$ must satisfy the inequalities $\pi_P < \pi_R - \Delta < \pi_T$, which can be written as the following two hypotheses:
$$H_0: \pi_T \le \pi_R - \Delta \quad \text{versus} \quad H_1: \pi_T > \pi_R - \Delta,$$
$$K_0: \pi_R \le \pi_P + \Delta \quad \text{versus} \quad K_1: \pi_R > \pi_P + \Delta.$$
Similar to Pigeot et al. [9], we take the margin $\Delta$ as a fraction $f$ of the effect size of the reference treatment, i.e., $\Delta = f(\pi_R - \pi_P)$. Generally, one can select $f = 1/2$ or $1/3$ [14]. Thus, the second hypothesis can be expressed as $K_0: \pi_R \le \pi_P$ versus $K_1: \pi_R > \pi_P$. If $K_0$ is rejected, letting $f = 1 - \theta$ yields the following non-inferiority hypothesis:
$$H_0: \frac{\pi_T - \pi_P}{\pi_R - \pi_P} \le \theta \quad \text{versus} \quad H_1: \frac{\pi_T - \pi_P}{\pi_R - \pi_P} > \theta, \tag{2.2}$$
where $\theta \in (0,1)$ is a fixed retention fraction [8]. Rejecting $H_0$ implies that the test treatment preserves at least $100\theta\%$ of the efficacy of the reference treatment compared to placebo [19]. Similar to Tang and Tang [14], we only consider hypothesis $H_0$ and assume that $K_0$ is rejected at some pre-given significance level. Thus, the non-inferiority hypothesis (2.2) can be rewritten as
$$H_0: \pi_T - \theta\pi_R - (1-\theta)\pi_P \le 0 \quad \text{versus} \quad H_1: \pi_T - \theta\pi_R - (1-\theta)\pi_P > 0. \tag{2.3}$$
Let $\psi = \pi_T - \theta\pi_R - (1-\theta)\pi_P$. The non-inferiority hypothesis (2.3) can be expressed as
$$H_0: \psi \le 0 \quad \text{versus} \quad H_1: \psi > 0. \tag{2.4}$$
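The passage from the margin-based hypothesis to (2.3) can be spelled out step by step. The following is a worked restatement that uses only the definitions above, namely $\Delta = f(\pi_R - \pi_P)$ with $f = 1 - \theta$, and assumes $\pi_R > \pi_P$ (that is, $K_0$ has been rejected); the numerical values at the end are purely illustrative.

```latex
\begin{align*}
\pi_T \le \pi_R - \Delta
  &\iff \pi_T \le \pi_R - (1-\theta)(\pi_R - \pi_P) \\
  &\iff \pi_T \le \theta\pi_R + (1-\theta)\pi_P \\
  &\iff \pi_T - \pi_P \le \theta(\pi_R - \pi_P) \\
  &\iff \frac{\pi_T - \pi_P}{\pi_R - \pi_P} \le \theta
        \qquad (\text{since } \pi_R > \pi_P).
\end{align*}
% Illustration: with \theta = 0.8, \pi_P = 0.1 and \pi_R = 0.5, the boundary
% \psi = 0 corresponds to \pi_T = 0.8(0.5) + 0.2(0.1) = 0.42.
```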
The restricted maximum likelihood estimates (RMLEs) (denoted by $\tilde\pi_T$, $\tilde\pi_R$, $\tilde\pi_P$) of $\pi_T$, $\pi_R$ and $\pi_P$ can be computed as follows. If the MLEs $\hat\pi_T$, $\hat\pi_R$, $\hat\pi_P$ satisfy the conditions $\hat\pi_T - \theta\hat\pi_R - (1-\theta)\hat\pi_P \le 0$ and $\hat\pi_R - \hat\pi_P > 0$, we take $\tilde\pi_T = \hat\pi_T$, $\tilde\pi_R = \hat\pi_R$ and $\tilde\pi_P = \hat\pi_P$; otherwise, the RMLEs can be calculated by setting $\pi_T = \theta\pi_R + (1-\theta)\pi_P$ in the likelihood function (2.1) and maximizing it with respect to $\pi_R$ and $\pi_P$. For the latter, it follows from Equation (2.1) that the RMLEs of $\pi_R$ and $\pi_P$ can be obtained by simultaneously solving the following equations in the parameter space $\Theta = \{(\pi_P, \pi_R): 0 \le \pi_P < \pi_R \le 1\}$:
$$
\frac{x_T - n_T(\theta\pi_R + (1-\theta)\pi_P)}{(\theta\pi_R + (1-\theta)\pi_P)(1 - \theta\pi_R - (1-\theta)\pi_P)} = \frac{n_R\pi_R - x_R}{\theta\pi_R(1-\pi_R)} = \frac{n_P\pi_P - x_P}{(1-\theta)\pi_P(1-\pi_P)}.
$$

It is possible that no point $(\pi_P, \pi_R) \in \Theta$ satisfies the above equations, which implies that the likelihood function given in Equation (2.1) attains its maximum on the boundary of the parameter space $\Theta$.
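The RMLE rule described above can be sketched numerically. The code below is a minimal illustration, not the solver used in this article: instead of solving the score equations it maximizes the restricted likelihood directly with a crude zooming grid search over $0 \le \pi_P \le \pi_R \le 1$, which also covers maxima on the boundary. The function names `restricted_mle` and `_boundary_max`, the argument order (T, R, P) and the search settings are assumptions made for this example.

```python
import math

def _log_lik(x, n, p):
    """Binomial log-likelihood for one arm, dropping the constant C."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() finite at 0 and 1
    return x * math.log(p) + (n - x) * math.log(1.0 - p)

def _boundary_max(x, n, theta, rounds=12, grid=21):
    """Maximise the likelihood on psi = 0 over 0 <= pi_P <= pi_R <= 1
    by repeatedly zooming a grid around the best point (a crude sketch)."""
    (x_t, x_r, x_p), (n_t, n_r, n_p) = x, n
    lo_r, hi_r, lo_p, hi_p = 0.0, 1.0, 0.0, 1.0
    for _ in range(rounds):
        step_r = (hi_r - lo_r) / (grid - 1)
        step_p = (hi_p - lo_p) / (grid - 1)
        best = None
        for i in range(grid):
            pi_r = lo_r + i * step_r
            for j in range(grid):
                pi_p = lo_p + j * step_p
                if pi_p > pi_r:
                    continue  # outside the constrained space
                pi_t = theta * pi_r + (1.0 - theta) * pi_p
                value = (_log_lik(x_t, n_t, pi_t) + _log_lik(x_r, n_r, pi_r)
                         + _log_lik(x_p, n_p, pi_p))
                if best is None or value > best[0]:
                    best = (value, pi_r, pi_p)
        _, pi_r, pi_p = best
        # shrink the search window to +/- two grid steps around the best point
        lo_r, hi_r = max(0.0, pi_r - 2 * step_r), min(1.0, pi_r + 2 * step_r)
        lo_p, hi_p = max(0.0, pi_p - 2 * step_p), min(1.0, pi_p + 2 * step_p)
    return pi_r, pi_p

def restricted_mle(x, n, theta):
    """Return (pi_T, pi_R, pi_P) following the rule stated in the text."""
    hat_t, hat_r, hat_p = (xk / nk for xk, nk in zip(x, n))
    if hat_t - theta * hat_r - (1.0 - theta) * hat_p <= 0 and hat_r > hat_p:
        return hat_t, hat_r, hat_p  # MLEs already satisfy the restriction
    pi_r, pi_p = _boundary_max(x, n, theta)
    return theta * pi_r + (1.0 - theta) * pi_p, pi_r, pi_p

if __name__ == "__main__":
    # hypothetical counts, ordered (T, R, P)
    print(restricted_mle((15, 12, 5), (20, 20, 20), theta=0.6))
```

A grid search is slow and only approximate, but it needs no derivatives and does not fail when the maximum lies on the boundary; a Newton-type solver for the score equations would be the natural replacement when speed matters.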

Following Tang and Tang [14], $\psi$ can be estimated by $\hat\psi = \hat\pi_T - \theta\hat\pi_R - (1-\theta)\hat\pi_P$, and its variance is given by
$$\mathrm{var}(\hat\psi) = \frac{\pi_T(1-\pi_T)}{n_T} + \frac{\theta^2\pi_R(1-\pi_R)}{n_R} + \frac{(1-\theta)^2\pi_P(1-\pi_P)}{n_P},$$
which can be estimated by
$$\sigma^2(\breve\pi) \triangleq \widehat{\mathrm{var}}(\hat\psi) = \frac{\breve\pi_T(1-\breve\pi_T)}{n_T} + \frac{\theta^2\breve\pi_R(1-\breve\pi_R)}{n_R} + \frac{(1-\theta)^2\breve\pi_P(1-\breve\pi_P)}{n_P},$$
where $\breve\pi = (\breve\pi_T, \breve\pi_R, \breve\pi_P)$ is some appropriate estimate of $\pi = (\pi_T, \pi_R, \pi_P)$, for example, the MLE $\hat\pi = (\hat\pi_T, \hat\pi_R, \hat\pi_P)$ or the RMLE $\tilde\pi = (\tilde\pi_T, \tilde\pi_R, \tilde\pi_P)$. Thus, the statistics for testing hypothesis (2.4) are given by
$$T_W = \hat\psi/\sigma(\hat\pi) \quad \text{and} \quad T_R = \hat\psi/\sigma(\tilde\pi),$$

which are asymptotically distributed as the standard normal distribution under $H_0$ when $n_T$, $n_R$ and $n_P$ are sufficiently large. Hence, non-inferiority can be claimed if $T_W > z_{1-\alpha}$ (or $T_R > z_{1-\alpha}$), where $z_{1-\alpha}$ is the $(1-\alpha)$-quantile of the standard normal distribution. When $\pi_P = 0$, $T_W$ is the Wald-type statistic proposed in Blackwelder [20] and $T_R$ is the test statistic given by Farrington and Manning [21] for two-arm non-inferiority trials.
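As a small illustration of the two statistics, the sketch below computes $\hat\psi/\sigma(\breve\pi)$ for a chosen plug-in estimate. The function names, the argument order (T, R, P), the counts and the plug-in values used in place of a true RMLE are all hypothetical; with no plug-in supplied the MLE is used, which gives $T_W$.

```python
import math

def plug_in_sigma(pi, n, theta):
    """sigma(pi): square root of the plug-in variance of psi_hat."""
    weights = (1.0, theta ** 2, (1.0 - theta) ** 2)
    return math.sqrt(sum(w * p * (1.0 - p) / m
                         for w, p, m in zip(weights, pi, n)))

def ratio_statistic(x, n, theta, pi_plugin=None):
    """psi_hat / sigma(pi_plugin); pi_plugin=None uses the MLE (T_W),
    passing the RMLE gives T_R. Undefined when the plug-in variance is zero."""
    pi_hat = tuple(xk / nk for xk, nk in zip(x, n))
    psi_hat = pi_hat[0] - theta * pi_hat[1] - (1.0 - theta) * pi_hat[2]
    sigma = plug_in_sigma(pi_hat if pi_plugin is None else pi_plugin, n, theta)
    return psi_hat / sigma

if __name__ == "__main__":
    x, n = (15, 12, 5), (20, 20, 20)          # hypothetical counts (T, R, P)
    print("T_W:", ratio_statistic(x, n, theta=0.6))
    # hypothetical values on psi = 0, standing in for an RMLE
    print("T_R-style:", ratio_statistic(x, n, 0.6, pi_plugin=(0.55, 0.65, 0.40)))
```

Both statistics share the numerator $\hat\psi$; only the variance estimate differs, which is why a single function with an optional plug-in argument is enough for this sketch.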

The signed root of the likelihood ratio statistic for testing hypothesis (2.4) is given by
$$T_L = \mathrm{sgn}(\hat\psi)\sqrt{2\{\ell(\hat\pi) - \ell(\tilde\pi)\}},$$

which is asymptotically distributed as the standard normal distribution under $H_0$ when $n_T$, $n_R$ and $n_P$ are sufficiently large, where $\ell(\pi) = x_T\log(\pi_T) + (n_T - x_T)\log(1-\pi_T) + x_R\log(\pi_R) + (n_R - x_R)\log(1-\pi_R) + x_P\log(\pi_P) + (n_P - x_P)\log(1-\pi_P) + C$ with $C = \log\{n_T!\,n_R!\,n_P!\} - \log\{x_T!\,x_R!\,x_P!\,(n_T-x_T)!\,(n_R-x_R)!\,(n_P-x_P)!\}$. Thus, non-inferiority can be claimed if $T_L > z_{1-\alpha}$.
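The signed-root statistic only needs the log-likelihood evaluated at the MLE and at the RMLE; the constant $C$ cancels in the difference. The sketch below assumes the RMLE is supplied by the caller; the function names and the restricted values used in the example are hypothetical.

```python
import math

def log_lik(x, n, pi):
    """Log-likelihood l(pi) without the constant C; arms ordered (T, R, P)."""
    total = 0.0
    for xk, nk, pk in zip(x, n, pi):
        pk = min(max(pk, 1e-12), 1.0 - 1e-12)  # avoid log(0)
        total += xk * math.log(pk) + (nk - xk) * math.log(1.0 - pk)
    return total

def signed_root_lr(x, n, pi_tilde, theta):
    """T_L = sgn(psi_hat) * sqrt(2 * (l(pi_hat) - l(pi_tilde)))."""
    pi_hat = [xk / nk for xk, nk in zip(x, n)]
    psi_hat = pi_hat[0] - theta * pi_hat[1] - (1.0 - theta) * pi_hat[2]
    gap = max(0.0, log_lik(x, n, pi_hat) - log_lik(x, n, pi_tilde))
    sign = (psi_hat > 0) - (psi_hat < 0)
    return sign * math.sqrt(2.0 * gap)

if __name__ == "__main__":
    # hypothetical counts and a hypothetical restricted estimate on psi = 0
    print(signed_root_lr((15, 12, 5), (20, 20, 20), (0.55, 0.65, 0.40), 0.6))
```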

p-value calculation methods

The non-inferiority hypothesis (2.2) can be tested via the p-value method with the rule: $H_0$ is rejected if the p-value is less than or equal to the prespecified significance level $\alpha$. In what follows, we introduce five approaches for calculating p-values based on $t_j^o$, which is the observed value of test statistic $T_j$ ($j = W, R, L$) for the observed value $(x_T^o, x_R^o, x_P^o)$ of $(X_T, X_R, X_P)$.

(1) Asymptotic method (AM)

It follows from the above arguments that all statistics $T_j$ ($j = W, R, L$) asymptotically follow the standard normal distribution under the null hypothesis $H_0: \psi \le 0$. Thus, the asymptotic p-value for testing hypothesis (2.2) via statistic $T_j$ ($j = W, R, L$) based on $(x_T^o, x_R^o, x_P^o)$ can be calculated by $p_j^{AM}(x_T^o, x_R^o, x_P^o) = P(T_j \ge t_j^o \mid H_0) = 1 - \Phi(t_j^o)$, where $\Phi(\cdot)$ is the standard normal distribution function.

The above asymptotic approach for calculating the p-value of testing hypothesis (2.2) via statistic $T_j$ ($j = W, R, L$) is established under large-sample theory. Its accuracy has long been questioned, especially when $n_T$, $n_R$ and/or $n_P$ are small, since the skewness of the underlying binomial distributions is not taken into consideration. Some higher-order corrections such as the saddlepoint approximation [17] have been proposed to improve the accuracy of the normal approximation. In what follows, we derive saddlepoint approximations to the distributions of the three test statistics.

(2) Saddlepoint approximation method (SAM)

Since $X_T$, $X_R$ and $X_P$ are independent and $X_i \sim \mathrm{Bin}(n_i, \pi_i)$ ($i = T, R, P$), the moment generating function of $\hat\psi$ is given by
$$\varphi(t) = \left(1 - \pi_T + \pi_T e^{t/n_T}\right)^{n_T}\left(1 - \pi_R + \pi_R e^{-\theta t/n_R}\right)^{n_R}\left(1 - \pi_P + \pi_P e^{(\theta-1)t/n_P}\right)^{n_P},$$
with the cumulant generating function being
$$K(t) = n_T\log\left(1 - \pi_T + \pi_T e^{t/n_T}\right) + n_R\log\left(1 - \pi_R + \pi_R e^{-\theta t/n_R}\right) + n_P\log\left(1 - \pi_P + \pi_P e^{(\theta-1)t/n_P}\right),$$
where $-1 \le t \le 1$. Thus, the first two derivatives of the cumulant generating function $K(t)$ are given by
$$\dot K(t) = \frac{\pi_T e^{t/n_T}}{1 - \pi_T + \pi_T e^{t/n_T}} - \frac{\theta\pi_R e^{-\theta t/n_R}}{1 - \pi_R + \pi_R e^{-\theta t/n_R}} + \frac{(\theta-1)\pi_P e^{(\theta-1)t/n_P}}{1 - \pi_P + \pi_P e^{(\theta-1)t/n_P}}$$
and
$$\ddot K(t) = \frac{(1-\pi_T)\pi_T e^{t/n_T}}{n_T\left(1 - \pi_T + \pi_T e^{t/n_T}\right)^2} + \frac{\theta^2(1-\pi_R)\pi_R e^{-\theta t/n_R}}{n_R\left(1 - \pi_R + \pi_R e^{-\theta t/n_R}\right)^2} + \frac{(\theta-1)^2(1-\pi_P)\pi_P e^{(\theta-1)t/n_P}}{n_P\left(1 - \pi_P + \pi_P e^{(\theta-1)t/n_P}\right)^2},$$
respectively. To obtain the saddlepoint approximation to $P(\hat\psi \ge b)$, we need to solve the saddlepoint equation $\dot K(t) = b$, whose unique solution is denoted by $\hat t$. Following Jing and Robinson [22], the saddlepoint approximation to the tail probability of statistic $\hat\psi$ is given by
$$P(\hat\psi \ge b) \approx 1 - \Phi(\omega) + \phi(\omega)\left(1/\upsilon - 1/\omega\right),$$
where $\omega = \mathrm{sgn}(\hat t)\sqrt{2\{\hat t b - K(\hat t)\}}$ and $\upsilon = \hat t\sqrt{\ddot K(\hat t)}$. Thus, the saddlepoint approximation to $P(T_j \ge t_j^o \mid H_0)$ ($j = W, R, L$) is given by
$$p_j^{SA}(x_T^o, x_R^o, x_P^o) = P(T_j \ge t_j^o \mid H_0) \approx 1 - \Phi(\omega_j^o) + \phi(\omega_j^o)\left(1/\upsilon_j^o - 1/\omega_j^o\right),$$

where $\omega_j^o = \mathrm{sgn}(\hat A_j)\sqrt{2\{\hat A_j t_j^o - K(\hat A_j/B_j)\}}$ and $\upsilon_j^o = \hat A_j B_j^{-1}\sqrt{\ddot K(\hat A_j/B_j)}$, $\hat A_j$ is the unique solution to the equation $\dot K(\hat A_j/B_j) = t_j^o B_j$ for $j = W, R$ with $B_W = \sigma(\hat\pi)$ and $B_R = \sigma(\tilde\pi)$, $\omega_L^o = \mathrm{sgn}(\hat\psi)\sqrt{2\{\ell(\hat\pi) - \ell(\tilde\pi)\}}$ and υ L o = ψ ̂ n T 1 / 2 with 1 = n T n R n P ( θ π ̂ R + ( 1 θ ) π ̂ P ) ( 1 θ π ̂ R ( 1 θ ) π ̂ P ) π ̂ R ( 1 π ̂ R ) π ̂ P ( 1 π ̂ P ) , and 2 = n R n P π ~ R ( 1 π ~ R ) π ~ P ( 1 π ~ P ) .
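The tail approximation for $\hat\psi$ itself can be sketched as follows. This is an illustrative implementation of the formula $1 - \Phi(\omega) + \phi(\omega)(1/\upsilon - 1/\omega)$ with the saddlepoint found by bisection; it treats $\hat\psi$ as if it were continuous (no lattice correction) and does not cover the statistic-specific quantities $\omega_j^o$ and $\upsilon_j^o$. The function names, the argument order (T, R, P) and the requirement that every $\pi_k$ lie strictly between 0 and 1 are assumptions of this sketch.

```python
import math

def _log_mix(pi, u):
    """log(1 - pi + pi*exp(u)), computed without overflow."""
    if u > 0:
        return u + math.log(pi + (1.0 - pi) * math.exp(-u))
    return math.log(1.0 - pi + pi * math.exp(u))

def _tilt(pi, u):
    """pi*exp(u) / (1 - pi + pi*exp(u)), computed without overflow."""
    if u > 0:
        return pi / (pi + (1.0 - pi) * math.exp(-u))
    e = math.exp(u)
    return pi * e / (1.0 - pi + pi * e)

def cgf_terms(t, n, pi, theta):
    """Return K(t), K'(t) and K''(t) for psi_hat; requires 0 < pi_k < 1."""
    coef = (1.0, -theta, theta - 1.0)   # weights of pi_hat_T, pi_hat_R, pi_hat_P
    k0 = k1 = k2 = 0.0
    for c, nk, pk in zip(coef, n, pi):
        u = c * t / nk
        q = _tilt(pk, u)
        k0 += nk * _log_mix(pk, u)
        k1 += c * q
        k2 += c * c * q * (1.0 - q) / nk
    return k0, k1, k2

def saddlepoint_tail(b, n, pi, theta):
    """Approximate P(psi_hat >= b) for -1 < b < 1 with b != E[psi_hat]."""
    if not -1.0 < b < 1.0:
        raise ValueError("b must lie strictly between -1 and 1")
    lo, hi = -1.0, 1.0
    while cgf_terms(lo, n, pi, theta)[1] > b:   # widen bracket; K' is increasing
        lo *= 2.0
    while cgf_terms(hi, n, pi, theta)[1] < b:
        hi *= 2.0
    for _ in range(200):                        # bisection for K'(t) = b
        mid = 0.5 * (lo + hi)
        if cgf_terms(mid, n, pi, theta)[1] < b:
            lo = mid
        else:
            hi = mid
    t_hat = 0.5 * (lo + hi)
    if abs(t_hat) < 1e-8:
        raise ValueError("b equals the mean; the formula is singular there")
    k0, _, k2 = cgf_terms(t_hat, n, pi, theta)
    omega = math.copysign(math.sqrt(max(0.0, 2.0 * (t_hat * b - k0))), t_hat)
    upsilon = t_hat * math.sqrt(k2)
    phi = math.exp(-0.5 * omega * omega) / math.sqrt(2.0 * math.pi)
    return 0.5 * math.erfc(omega / math.sqrt(2.0)) + phi * (1.0 / upsilon - 1.0 / omega)

if __name__ == "__main__":
    # hypothetical configuration on the null boundary psi = 0
    print(saddlepoint_tail(0.2, (20, 20, 20), (0.44, 0.6, 0.2), theta=0.6))
```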

(3) Exact unconditional method (EUM)

When sample sizes (i.e., $n_T$, $n_R$, $n_P$) are small, asymptotic methods may yield inflated type I error rates, and their exact versions may provide a reliable alternative. Under $H_0: \psi \le 0$ with $\pi_P < \pi_R$, parameters $\pi_R$ and $\pi_P$ must belong to the following constrained parameter space: $\Omega = \{(\pi_P, \pi_R): 0 \le \pi_P < \pi_R \le 1$ if $-\theta\pi_R < \psi < 0$; $(-\psi - \theta\pi_R)/(1-\theta) \le \pi_P < \pi_R < 1$ if $-\pi_R < \psi \le -\theta\pi_R$; and the empty set otherwise$\}$. Under the null hypothesis, the probability density function (2.1) can be re-expressed via $\pi_T = \psi + \theta\pi_R + (1-\theta)\pi_P$, with $\pi_R$, $\pi_P$ and $\psi$ being nuisance parameters. These nuisance parameters can be eliminated by maximizing the null likelihood over the complete domain $\Omega$. Similar to Tang and Tang [18], the exact unconditional p-value for testing $H_0: \psi \le 0$ via statistic $T_j$ ($j = W, R, L$) based on $(x_T^o, x_R^o, x_P^o)$ is defined as
$$p_j^{EU}(x_T^o, x_R^o, x_P^o) = \sup_{\psi \le 0}\ \sup_{(\pi_R, \pi_P) \in \Omega} P(T_j \ge t_j^o \mid \psi, \pi_R, \pi_P),$$
where
$$
\begin{aligned}
P(T_j \ge t_j^o \mid \psi, \pi_R, \pi_P) = \sum_{x_T=0}^{n_T}\sum_{x_R=0}^{n_R}\sum_{x_P=0}^{n_P}
&\frac{n_T!}{x_T!(n_T-x_T)!}\,\frac{n_R!}{x_R!(n_R-x_R)!}\,\frac{n_P!}{x_P!(n_P-x_P)!} \\
&\times (\psi + \theta\pi_R + (1-\theta)\pi_P)^{x_T}\,(1 - \psi - \theta\pi_R - (1-\theta)\pi_P)^{n_T-x_T} \\
&\times \pi_R^{x_R}(1-\pi_R)^{n_R-x_R}\,\pi_P^{x_P}(1-\pi_P)^{n_P-x_P}\; I\{T_j(x_T, x_R, x_P) \ge t_j^o\},
\end{aligned}
$$

and $I\{T_j(x_T, x_R, x_P) \ge t_j^o\}$ is 1 if $T_j(x_T, x_R, x_P) \ge t_j^o$ and 0 otherwise.
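The inner probability is a finite sum over all $(n_T+1)(n_R+1)(n_P+1)$ outcomes and can be enumerated directly. The sketch below evaluates it for one fixed $(\psi, \pi_R, \pi_P)$; the supremum over $\psi \le 0$ and $\Omega$ is deliberately left out, and would in practice be approximated by repeating this calculation over a search grid. The function names and the toy statistic used in the example are assumptions.

```python
from math import comb

def binom_pmf(x, n, p):
    """Binomial probability of x responses out of n."""
    return comb(n, x) * p ** x * (1.0 - p) ** (n - x)

def tail_probability(statistic, t_obs, n, psi, pi_r, pi_p, theta):
    """P(T >= t_obs | psi, pi_R, pi_P) by full enumeration.

    statistic: callable (x_T, x_R, x_P) -> float; n: sizes ordered (T, R, P).
    """
    n_t, n_r, n_p = n
    pi_t = psi + theta * pi_r + (1.0 - theta) * pi_p
    if not 0.0 <= pi_t <= 1.0:
        raise ValueError("parameters imply pi_T outside [0, 1]")
    total = 0.0
    for x_t in range(n_t + 1):
        w_t = binom_pmf(x_t, n_t, pi_t)
        for x_r in range(n_r + 1):
            w_tr = w_t * binom_pmf(x_r, n_r, pi_r)
            for x_p in range(n_p + 1):
                if statistic(x_t, x_r, x_p) >= t_obs:
                    total += w_tr * binom_pmf(x_p, n_p, pi_p)
    return total

if __name__ == "__main__":
    # toy statistic: the unstandardised estimate psi_hat with theta = 0.6
    def psi_hat(x_t, x_r, x_p):
        return x_t / 10 - 0.6 * x_r / 10 - 0.4 * x_p / 10
    print(tail_probability(psi_hat, 0.2, (10, 10, 10), 0.0, 0.6, 0.2, 0.6))
```

The cost grows with the product of the three sample sizes and is multiplied again by the size of the nuisance-parameter search, which is why the exact unconditional method becomes slow as $n$ increases.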

(4) Approximate unconditional method (AUM)

According to Tang and Tang [18] and Tang, Tang and Rosner [23], the exact unconditional test is always conservative, i.e., its corresponding type I error rate is always less than or equal to the prespecified significance level. Following Tang and Tang [18], the nuisance parameters can instead be eliminated by evaluating them at their corresponding RMLEs under $\psi = 0$. The approximate unconditional p-value for testing $H_0: \psi \le 0$ via statistic $T_j$ ($j = W, R, L$) based on $(x_T^o, x_R^o, x_P^o)$ can be defined as $p_j^{AU}(x_T^o, x_R^o, x_P^o) = P(T_j \ge t_j^o \mid \psi = 0, \pi_R = \tilde\pi_R, \pi_P = \tilde\pi_P)$.

(5) Bootstrap-resampling method (BTM)

Hypothesis testing based on the bootstrap-resampling method is usually recommended when sample sizes (i.e., $n_T$, $n_R$ and $n_P$) are small [24] or the data structure is sparse (e.g., $x_T$, $x_R$ or $x_P$ is close to zero or to $n_T$, $n_R$ or $n_P$, respectively). Given the observation $(x_T^o, x_R^o, x_P^o)$, we compute the RMLEs $\tilde\pi_T$, $\tilde\pi_R$ and $\tilde\pi_P$ of parameters $\pi_T$, $\pi_R$ and $\pi_P$, and calculate the observed value $t_j^o$ of statistic $T_j$ ($j = W, R, L$). Based on the RMLEs $\tilde\pi_T$, $\tilde\pi_R$ and $\tilde\pi_P$, we generate $B$ bootstrap samples $\{(x_T^b, x_R^b, x_P^b): b = 1, \ldots, B\}$ from the following distribution: $x_k^b \sim \mathrm{Bin}(n_k, \tilde\pi_k)$ for $k = T, R$ and $P$. For each of the $B$ bootstrap samples, we compute the observed value $t_j^b$ of statistic $T_j$ ($j = W, R, L$). Hence, an approximate p-value for testing $H_0: \psi \le 0$ via statistic $T_j$ based on $(x_T^o, x_R^o, x_P^o)$ is given by $\hat p_j^{BT}(x_T^o, x_R^o, x_P^o) = \frac{1}{B}\sum_{b=1}^{B} I\{t_j^b \ge t_j^o\}$.
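The resampling steps above can be sketched in a few lines. The code assumes the RMLE and the observed statistic are already available and that the statistic is passed in as a function; the name `bootstrap_p_value`, the default `B=2000`, the fixed seed and the toy statistic in the example are choices made for this illustration only.

```python
import random

def bootstrap_p_value(statistic, t_obs, n, pi_tilde, B=2000, seed=0):
    """Fraction of bootstrap statistics that are >= the observed one.

    statistic: callable (x_T, x_R, x_P) -> float
    n, pi_tilde: sample sizes and RMLEs, both ordered (T, R, P)
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    hits = 0
    for _ in range(B):
        # draw x_k^b ~ Bin(n_k, pi_tilde_k) as a sum of Bernoulli trials
        sample = [sum(rng.random() < p for _ in range(nk))
                  for nk, p in zip(n, pi_tilde)]
        if statistic(*sample) >= t_obs:
            hits += 1
    return hits / B

if __name__ == "__main__":
    n, theta = (20, 20, 20), 0.6

    def psi_hat(x_t, x_r, x_p):  # toy statistic; T_W, T_R or T_L would go here
        return x_t / n[0] - theta * x_r / n[1] - (1 - theta) * x_p / n[2]

    # hypothetical observed counts and hypothetical RMLE on psi = 0
    t_obs = psi_hat(15, 12, 5)
    print(bootstrap_p_value(psi_hat, t_obs, n, (0.55, 0.65, 0.40)))
```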

For any given observation $(x_T^o, x_R^o, x_P^o)$, test statistic $T_j$ ($j = W, R, L$) and p-value calculation method, we reject the null hypothesis $H_0$ at the significance level $\alpha$ if $p_j^k(x_T^o, x_R^o, x_P^o) \le \alpha$ for $k$ = AM, SA, EU, AU and BT.

Simulation study

Simulation studies are conducted to investigate the performance of the various test statistics together with the five p-value calculation methods in small-sample designs (e.g., $n = 30$ and 60, where $n = n_P + n_R + n_T$, with the allocation ratios $\lambda_P : \lambda_R : \lambda_T = 1 : n_R/n_P : n_T/n_P$ taken to be 1:1:1, 1:2:2 and 1:2:3) in terms of type I error rate and power. For each $(n_P, n_R, n_T)$, we consider the following probability settings [19]: $\pi_P = 0.05, 0.10, 0.15, \ldots, 0.50$, $\pi_R = \pi_P + 0.05, \pi_P + 0.10, \ldots, 0.95$, and $\pi_T = \theta\pi_R + (1-\theta)\pi_P$, which corresponds to a total of 11,340 configurations of $(\pi_P, \pi_R, \pi_T)$, and the following two non-inferiority margins: $\theta = 0.6$ and 0.8. The nominal level is taken to be $\alpha = 0.05$. For the given values of $n$ and allocation ratio $\lambda_P : \lambda_R : \lambda_T$, $n_k$ is given by $n_k = n\lambda_k/(\lambda_P + \lambda_R + \lambda_T)$ for $k = P, R$ and $T$. Thus, given $n$, the allocation ratio and $(\pi_P, \pi_R, \pi_T)$, the type I error rate for testing hypothesis $H_0: \psi \le 0$ versus $H_1: \psi > 0$ via test statistic $T_j$ ($j = W, R, L$) at the significance level $\alpha$ is calculated by
$$\alpha_j^k = \sum_{x_T^o=0}^{n_T}\sum_{x_R^o=0}^{n_R}\sum_{x_P^o=0}^{n_P} f(x_T^o, x_R^o, x_P^o \mid \pi_T, \pi_R, \pi_P, H_0)\; I\{p_j^k(x_T^o, x_R^o, x_P^o) \le \alpha\}$$

for $k$ = AM, SAM, EUM, AUM and BTM, whilst the corresponding power can be evaluated by replacing $H_0$ in $f(x_T^o, x_R^o, x_P^o \mid \pi_T, \pi_R, \pi_P, H_0)$ by $H_1$.
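The type I error formula can be illustrated for the simplest case, the asymptotic method with the Wald-type statistic. The sketch below enumerates all outcomes for one configuration on the boundary $\psi = 0$ and sums the probabilities of those rejected. The function names, the argument order (T, R, P), the configuration in the example, and the convention used when the estimated variance is exactly zero are assumptions of this sketch and are not taken from the article.

```python
from math import comb, erfc, sqrt

def wald_asymptotic_p(x, n, theta):
    """Asymptotic p-value 1 - Phi(T_W) for counts x and sizes n, ordered (T, R, P)."""
    pi = [xk / nk for xk, nk in zip(x, n)]
    psi = pi[0] - theta * pi[1] - (1.0 - theta) * pi[2]
    var = (pi[0] * (1 - pi[0]) / n[0] + theta ** 2 * pi[1] * (1 - pi[1]) / n[1]
           + (1 - theta) ** 2 * pi[2] * (1 - pi[2]) / n[2])
    if var == 0.0:
        # assumed convention for degenerate tables: reject only if psi_hat > 0
        return 0.0 if psi > 0 else 1.0
    return 0.5 * erfc(psi / sqrt(var) / sqrt(2.0))

def exact_type_one_error(n, pi_r, pi_p, theta, alpha=0.05):
    """Sum of f(x | pi, H0) over all outcomes with p-value <= alpha, at psi = 0."""
    pi = (theta * pi_r + (1.0 - theta) * pi_p, pi_r, pi_p)

    def pmf(x, m, p):
        return comb(m, x) * p ** x * (1.0 - p) ** (m - x)

    rate = 0.0
    for x_t in range(n[0] + 1):
        for x_r in range(n[1] + 1):
            for x_p in range(n[2] + 1):
                if wald_asymptotic_p((x_t, x_r, x_p), n, theta) <= alpha:
                    rate += (pmf(x_t, n[0], pi[0]) * pmf(x_r, n[1], pi[1])
                             * pmf(x_p, n[2], pi[2]))
    return rate

if __name__ == "__main__":
    # n = 30 with allocation 1:1:1, one hypothetical configuration
    print(exact_type_one_error((10, 10, 10), pi_r=0.6, pi_p=0.2, theta=0.6))
```

Replacing `wald_asymptotic_p` with any other p-value function gives the corresponding rate for that method; evaluating at a configuration with $\psi > 0$ gives power instead.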

Results

Simulation study

To compare the performance of AM, SAM, EUM, AUM and BTM together with test statistics $T_W$, $T_R$ and $T_L$ under balanced and unbalanced designs, Figure 1 presents boxplots of their corresponding type I error rates for $n = 30$ and 60, and $\lambda_P : \lambda_R : \lambda_T$ = 1:1:1, 1:2:2 and 1:2:3, where AMk, SAk, EUk, AUk and BTk represent AM, SAM, EUM, AUM and BTM for test statistic $T_k$ with $k = W$, $R$ and $L$, respectively. Here, each boxplot in Figure 1 contains 2 (i.e., the number of non-inferiority margins) × 11,340 (i.e., the number of configurations for $(\pi_P, \pi_R, \pi_T)$) = 22,680 data points. From Figure 1, we have the following findings. First, the medians of the type I error rates based on AUM and BTM are closer to the prespecified nominal level $\alpha = 0.05$ than those based on the other three p-value calculation methods for all three test statistics under consideration. Second, for AUM and BTM, the medians of the type I error rates for test statistics $T_W$ and $T_R$, which are 0.0495 and 0.0501 for AUM and 0.0494 and 0.0494 for BTM respectively, are closer to $\alpha = 0.05$ than those for test statistic $T_L$, which are 0.0442 for AUM and 0.0442 for BTM. Third, for AM, SAM and EUM, the corresponding medians of type I error rates are 0.0649, 0.0455 and 0.0260 for test statistic $T_W$, 0.0504, 0.0455 and 0.0488 for test statistic $T_R$, and 0.0663, 0.1285 and 0.0332 for test statistic $T_L$, respectively, which indicate that (i) the AM is liberal for test statistics $T_W$ and $T_L$, whilst it is valid for test statistic $T_R$; (ii) the SAM can improve the accuracy of the normal approximation for test statistics $T_W$ and $T_R$; and (iii) the EUM is conservative for all test statistics.
Fourth, the proportions of configurations whose type I error rates lie in the interval (0.045, 0.055) for AM, SAM, EUM, AUM and BTM are 0.0747, 0.4691, 0.0710, 0.5154 and 0.7994 for $T_W$, 0.5605, 0.4605, 0.4753, 0.7167 and 0.8370 for $T_R$, and 0.0784, 0.0800, 0.0691, 0.4056 and 0.4889 for $T_L$, respectively, which show that (i) AUM and BTM outperform the other three p-value calculation procedures, and (ii) $T_R$ behaves better than the other two test statistics regardless of the p-value calculation procedure. Fifth, the median of the type I error rates moves closer to the prespecified nominal level as the total sample size $n$ increases, whilst at the same time the variability of the type I error rates decreases. Sixth, the variability of the type I error rates for unbalanced designs is not significantly different from that for balanced designs.
Figure 1

Boxplots of the type I error rates of various test procedures together with three statistics when testing the non-inferiority hypothesis (2.2) at $\alpha = 0.05$. AMk, SAk, EUk, AUk and BTk represent the AM, SA, EU, AU and BT test procedures with test statistic $T_k$ for $k = W$, $R$ and $L$, respectively.

To investigate the sensitivity of the various p-value calculation procedures (i.e., AM, SAM, EUM, AUM and BTM) to different test statistics, Figure 2 presents boxplots of their corresponding type I error rates against $\pi_P$ for test statistics $T_W$, $T_R$ and $T_L$. Examination of Figure 2 shows that there is no significant effect of $\pi_P$ on the type I error rate.
Figure 2

Boxplots of the type I error rates of various test procedures together with three statistics against $\pi_P$ when testing the non-inferiority hypothesis (2.2) at $\alpha = 0.05$. EUk, AUk and BTk represent the EU, AU and BT test procedures with statistic $T_k$ for $k = W$, $R$ and $L$, respectively.

We also calculate powers of the five p-value calculation procedures together with the three test statistics at the nominal level $\alpha = 0.05$ when $\pi_T = \pi_R$ and $\theta = 0.6$ with the following settings: $n = 30$ and 60, $\pi_P = 0.15$ and 0.3, and $\pi_R = 0.5$, 0.8 and 0.95 for the balanced allocation 1:1:1 and the unbalanced allocation 1:2:3. Results are reported in Table 1. Examination of Table 1 indicates that (i) $T_R$ is generally more powerful than $T_W$ and $T_L$ for the EUM except for $\pi_R = 0.95$ with the unbalanced designs, (ii) $T_W$ and $T_R$ have similar powers for AM, AUM and BTM under our considered settings, (iii) a slight power difference is observed between $T_R$ and $T_L$ for AUM and BTM, (iv) there is a slight power difference between balanced and unbalanced designs, and (v) power increases as $n$ increases regardless of p-value calculation procedures or test statistics. Hence, we would recommend both AUM and BTM with $T_R$ for hypothesis testing.
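Table 1

Exact powers (%) of various test procedures together with three statistics when $\pi_T = \pi_R$ with $n = 30$ and 60, $\theta = 0.6$ and $\alpha = 0.05$

| n | λ_P:λ_R:λ_T | π_P | π_R | AM T_W | AM T_R | AM T_L | SAM T_W | SAM T_R | SAM T_L | EUM T_W | EUM T_R | EUM T_L | AUM T_W | AUM T_R | AUM T_L | BTM T_W | BTM T_R | BTM T_L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30 | 1:1:1 | 0.15 | 0.5 | 13.4 | 12.3 | 13.9 | 43.6 | 21.2 | 22.5 | 5.0 | 18.1 | 15.3 | 18.2 | 18.2 | 14.9 | 17.9 | 18.0 | 16.2 |
| 30 | 1:1:1 | 0.15 | 0.8 | 44.2 | 43.9 | 38.0 | 42.6 | 72.4 | 30.0 | 28.1 | 43.5 | 40.3 | 42.1 | 42.1 | 39.7 | 43.7 | 43.9 | 42.1 |
| 30 | 1:1:1 | 0.15 | 0.95 | 86.0 | 85.8 | 75.2 | 95.6 | 97.7 | 36.8 | 67.8 | 79.1 | 76.0 | 74.2 | 74.2 | 71.3 | 75.8 | 75.4 | 75.0 |
| 30 | 1:1:1 | 0.3 | 0.5 | 8.8 | 7.5 | 8.3 | 39.9 | 23.2 | 14.8 | 2.9 | 10.7 | 8.5 | 10.8 | 10.8 | 9.3 | 11.1 | 11.1 | 9.1 |
| 30 | 1:1:1 | 0.3 | 0.8 | 30.1 | 29.2 | 22.0 | 21.1 | 35.6 | 26.6 | 15.9 | 29.2 | 24.8 | 30.0 | 30.0 | 26.8 | 30.2 | 30.7 | 29.0 |
| 30 | 1:1:1 | 0.3 | 0.95 | 66.3 | 64.1 | 45.6 | 80.6 | 88.3 | 26.4 | 42.5 | 52.7 | 49.8 | 59.4 | 59.4 | 57.1 | 60.0 | 59.7 | 59.4 |
| 30 | 1:2:3 | 0.15 | 0.5 | 14.2 | 11.9 | 18.6 | 33.2 | 28.2 | 24.7 | 13.2 | 17.9 | 13.7 | 21.7 | 21.7 | 19.7 | 20.3 | 20.0 | 18.4 |
| 30 | 1:2:3 | 0.15 | 0.8 | 43.6 | 41.4 | 48.3 | 58.9 | 81.3 | 29.0 | 43.4 | 45.4 | 36.3 | 53.1 | 52.3 | 44.8 | 51.5 | 51.1 | 44.4 |
| 30 | 1:2:3 | 0.15 | 0.95 | 83.0 | 82.9 | 83.9 | 97.1 | 98.8 | 36.0 | 85.6 | 79.6 | 82.6 | 85.9 | 84.3 | 79.9 | 85.4 | 84.9 | 80.6 |
| 30 | 1:2:3 | 0.3 | 0.5 | 8.6 | 7.2 | 10.3 | 36.2 | 25.0 | 18.7 | 7.9 | 10.8 | 7.6 | 12.4 | 12.2 | 10.5 | 11.9 | 11.7 | 9.8 |
| 30 | 1:2:3 | 0.3 | 0.8 | 29.7 | 27.4 | 30.1 | 26.1 | 57.9 | 25.9 | 28.9 | 31.9 | 22.7 | 35.0 | 33.6 | 28.2 | 34.4 | 33.7 | 29.5 |
| 30 | 1:2:3 | 0.3 | 0.95 | 62.4 | 61.8 | 62.8 | 85.5 | 95.3 | 36.9 | 66.4 | 63.0 | 60.9 | 65.8 | 64.5 | 62.3 | 66.3 | 65.9 | 64.0 |
| 60 | 1:1:1 | 0.15 | 0.5 | 19.0 | 19.0 | 18.4 | 33.7 | 47.5 | 24.3 | 10.7 | 11.3 | 14.2 | 29.3 | 29.4 | 28.3 | 28.0 | 28.1 | 27.2 |
| 60 | 1:1:1 | 0.15 | 0.8 | 65.8 | 67.8 | 59.6 | 85.2 | 92.4 | 38.9 | 55.0 | 56.3 | 48.7 | 71.4 | 71.4 | 71.4 | 71.1 | 71.1 | 70.7 |
| 60 | 1:1:1 | 0.15 | 0.95 | 97.9 | 98.6 | 96.6 | 96.6 | 96.7 | 50.3 | 95.9 | 96.7 | 89.3 | 97.7 | 97.7 | 97.7 | 97.7 | 97.7 | 97.7 |
| 60 | 1:1:1 | 0.3 | 0.5 | 9.7 | 9.4 | 9.5 | 43.2 | 21.5 | 16.9 | 4.7 | 5.3 | 4.2 | 17.0 | 17.2 | 15.3 | 14.1 | 14.3 | 13.1 |
| 60 | 1:1:1 | 0.3 | 0.8 | 46.5 | 47.1 | 39.8 | 54.6 | 79.8 | 33.2 | 35.4 | 36.9 | 37.1 | 50.9 | 50.9 | 50.8 | 49.7 | 50.3 | 50.0 |
| 60 | 1:1:1 | 0.3 | 0.95 | 91.3 | 93.3 | 85.7 | 95.3 | 96.0 | 47.0 | 85.9 | 87.8 | 71.3 | 88.0 | 88.0 | 88.0 | 89.6 | 89.3 | 89.5 |
| 60 | 1:2:3 | 0.15 | 0.5 | 20.5 | 20.2 | 22.2 | 29.6 | 53.8 | 27.9 | 24.1 | 22.1 | 24.2 | 31.0 | 30.8 | 28.3 | 31.7 | 31.1 | 28.2 |
| 60 | 1:2:3 | 0.15 | 0.8 | 72.5 | 72.5 | 69.1 | 92.3 | 96.4 | 40.9 | 73.9 | 73.3 | 79.2 | 77.0 | 76.9 | 75.9 | 78.3 | 78.1 | 76.7 |
| 60 | 1:2:3 | 0.15 | 0.95 | 98.6 | 98.6 | 98.0 | 99.9 | 99.9 | 50.6 | 98.6 | 98.9 | 92.4 | 99.1 | 99.0 | 99.0 | 98.5 | 98.5 | 98.4 |
| 60 | 1:2:3 | 0.3 | 0.5 | 10.3 | 10.0 | 10.3 | 42.9 | 25.0 | 20.1 | 12.3 | 10.3 | 10.1 | 15.8 | 15.7 | 13.5 | 15.8 | 15.4 | 13.4 |
| 60 | 1:2:3 | 0.3 | 0.8 | 49.3 | 49.2 | 45.1 | 64.6 | 84.1 | 36.2 | 52.0 | 48.4 | 38.0 | 52.0 | 52.0 | 51.0 | 55.5 | 55.4 | 54.1 |
| 60 | 1:2:3 | 0.3 | 0.95 | 90.4 | 90.3 | 88.7 | 99.1 | 99.4 | 51.3 | 90.9 | 92.5 | 82.4 | 92.1 | 92.0 | 92.0 | 91.8 | 91.7 | 91.7 |
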

Real data example

An example from a pharmacological study of patients with functional dyspepsia (FD) and a placebo-controlled trial of subjects with acute migraine is used to illustrate our proposed methodologies. This example has been analyzed by Holtmann et al. [25] and Tang and Tang [14]. In this example, cisapride and simethicone can be regarded as the existing reference and new experimental treatments, respectively. In that study, among $n = 178$ patients with FD, $n_P = 61$, $n_R = 59$ and $n_T = 58$ were randomized and treated in a double-dummy technique with placebo, cisapride and simethicone, respectively; adverse events (e.g., diarrhea and pain) occurred in $x_P = 7$, $x_R = 10$ and $x_T = 12$ patients treated with placebo, cisapride and simethicone, respectively. It is of interest to test whether simethicone is non-inferior to cisapride in terms of the rate of reported adverse events in the presence of placebo. Given $\theta = 0.6$ and 0.8, the corresponding p-values for testing $H_0: (\pi_T - \pi_P)/(\pi_R - \pi_P) \le \theta$ versus $H_1: (\pi_T - \pi_P)/(\pi_R - \pi_P) > \theta$ based on the five p-value calculation procedures and three test statistics are reported in Table 2. According to Table 2, there is no evidence that simethicone is non-inferior to cisapride in the presence of placebo at the nominal level $\alpha = 0.05$, which is consistent with the conclusion of Tang and Tang [14].
Table 2

Various p-values for the pharmacological data set at the nominal level $\alpha = 5\%$

| Test method | θ=0.6: T_W | θ=0.6: T_R | θ=0.6: T_L | θ=0.8: T_W | θ=0.8: T_R | θ=0.8: T_L |
|---|---|---|---|---|---|---|
| AM | 0.173 | 0.162 | 0.164 | 0.234 | 0.229 | 0.230 |
| SAM | 0.494 | 0.494 | 0.140 | 0.497 | 0.497 | 0.162 |
| EUM | 0.185 | 0.181 | 0.192 | 0.233 | 0.202 | 0.210 |
| AUM | 0.166 | 0.165 | 0.186 | 0.232 | 0.230 | 0.249 |
| BTM | 0.504 | 0.502 | 0.519 | 0.516 | 0.514 | 0.530 |

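As a check on how the simplest entries of Table 2 arise, the sketch below applies the Wald-type statistic with the asymptotic method to the counts quoted in the text ($x_P = 7$, $n_P = 61$; $x_R = 10$, $n_R = 59$; $x_T = 12$, $n_T = 58$). The function name and argument order are assumptions; the output should land close to the AM entries for $T_W$ (0.173 for $\theta = 0.6$ and 0.234 for $\theta = 0.8$). The other rows of the table require the RMLE or resampling and are not reproduced here.

```python
from math import erfc, sqrt

def wald_am_p_value(x, n, theta):
    """Asymptotic p-value 1 - Phi(T_W); x and n are ordered (T, R, P)."""
    pi = [xk / nk for xk, nk in zip(x, n)]
    psi_hat = pi[0] - theta * pi[1] - (1.0 - theta) * pi[2]
    var = (pi[0] * (1 - pi[0]) / n[0]
           + theta ** 2 * pi[1] * (1 - pi[1]) / n[1]
           + (1 - theta) ** 2 * pi[2] * (1 - pi[2]) / n[2])
    t_w = psi_hat / sqrt(var)
    return 0.5 * erfc(t_w / sqrt(2.0))   # upper tail of the standard normal

if __name__ == "__main__":
    x = (12, 10, 7)     # simethicone (T), cisapride (R), placebo (P)
    n = (58, 59, 61)
    for theta in (0.6, 0.8):
        print(theta, round(wald_am_p_value(x, n, theta), 3))
```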

Discussion

Simulation results demonstrate that our proposed score test statistic outperforms other test statistics in terms of type I error rate and power under our considered settings. The approximate unconditional and bootstrap-resampling methods perform better than other p-value calculation procedures in the sense that their corresponding type I error rates are closer to the prespecified nominal level and their corresponding powers are larger than those of other p-value calculation procedures. The exact unconditional method is conservative and time-consuming when sample sizes are large (e.g., see the 6th column in Table 3). The asymptotic tests are liberal since their type I error rates are greater than the prespecified nominal level α=0.05 in most cases. Comparing the approximate and exact unconditional methods, the approximate unconditional method provides a good alternative to the exact unconditional method in terms of computing time (e.g., see the 6th and 7th columns in Table 3) and type I error rate when sample sizes are large. In contrast, the computing burden of the bootstrap-resampling method is heavier than that of the approximate unconditional method (e.g., see the last two columns in Table 3).
Table 3

Computing time (minutes) of the type I error rates for 11,340 configurations of $(\pi_P, \pi_R, \pi_T)$ together with three test statistics under five test methods

| λ_P:λ_R:λ_T | θ | n | AM | SAM | EUM | AUM | BTM |
|---|---|---|---|---|---|---|---|
| 1:2:3 | 0.6 | 30 | 3.3 | 269 | 2920 | 55.75 | 11700 |
| 1:2:3 | 0.6 | 60 | 3.8 | 356 | 130950 | 357.3 | 20700 |


In this article, we concentrate on a three-arm non-inferiority trial with binary endpoints in which the margin is defined as a fraction of the unknown difference in response probabilities between reference and placebo. The corresponding hypothesis (i.e., $H_0: (\pi_T - \pi_P)/(\pi_R - \pi_P) \le \theta$ or $H_0: \pi_T - \theta\pi_R - (1-\theta)\pi_P \le 0$) is considered since it is simple and only one single hypothesis is involved (e.g., see [6, 9, 14]). However, three-arm non-inferiority hypotheses with the margin defined as a prespecified difference between treatments have received considerable attention in recent years (e.g., see [5, 7]). They can generally be classified as union-type hypotheses (i.e., H U0: π R h P (π P ) or π R h T (π T )) or intersection-type hypotheses (i.e., H U0: π R h P (π P ) and π R h T (π T )), where $h_P(\cdot)$ and $h_T(\cdot)$ are any functions [15]. For specific choices of $h_P(\cdot)$ and $h_T(\cdot)$, this includes, for example, hypotheses on the differences, the relative risks or the odds ratios of the proportions. While the union-type hypotheses are suitable for showing both the superiority of the standard treatment as compared to placebo and the non-inferiority of the test treatment as compared to the standard treatment, the intersection-type hypotheses are suitable for showing that the test treatment is as effective as the standard or placebo treatments. We are working on statistical inference for a three-arm non-inferiority trial with the margin being a prespecified difference between treatments when the primary endpoints are binary.

Conclusions

According to the aforementioned observations, we can draw the following conclusions. In terms of type I error rate and power, the approximate unconditional and bootstrap-resampling methods with the score test statistic are recommended for hypothesis testing when sample sizes are small in a three-arm non-inferiority trial. In terms of computing time, type I error rate and power, the approximate unconditional method with the score test statistic behaves the best among the p-value calculation procedures and test statistics considered.

Declarations

Acknowledgements

This work was supported by grants from the National Science Foundation of China (11225103) and the Research Fund for the Doctoral Program of Higher Education of China (20115301110004). The work of the third author was partially supported by the General Research Fund from the Research Grants Council of the Hong Kong Special Administrative Region, China (UGC/FDS14/P01/14).

Authors’ Affiliations

(1)
Department of Statistics, Yunnan University, No.2 Cuihu North Road, 650091 Kunming, China
(2)
Department of Mathematics and Statistics, Hang Seng Management College, Hang Seng Link, Siu Lek Yuen, Shatin NT, Hong Kong, China

References

  1. Dunnett CW, Gent M: Significance testing to establish equivalence between treatments with special reference to data in the form of 2 × 2 tables. Biometrics. 1977, 33: 593-602. 10.2307/2529457.
  2. Tango T: Equivalence test and confidence interval for the difference in proportions for the paired-sample design. Stat Med. 1998, 17: 891-908. 10.1002/(SICI)1097-0258(19980430)17:8<891::AID-SIM780>3.0.CO;2-B.
  3. Tang NS, Tang ML, Chan ISF: On tests of equivalence via non-unity relative risk for matched-pair design. Stat Med. 2003, 22: 1217-1233. 10.1002/sim.1213.
  4. Li G, Gao S: A group sequential type design for three-arm non-inferiority trials with binary endpoints. Biom J. 2010, 52: 504-518. 10.1002/bimj.200900188.
  5. Hida E, Tango T: Three-arm noninferiority trials with a prespecified margin for inference of the difference in the proportions of binary endpoints. J Biopharm Stat. 2013, 23: 774-789. 10.1080/10543406.2013.789893.
  6. Koch GG, Röhmel J: Hypothesis testing in the gold standard design for proving the efficacy of an experimental treatment relative to placebo and a reference. J Biopharm Stat. 2004, 14: 315-325. 10.1081/BIP-120037182.
  7. Hida E, Tango T: On the three-arm non-inferiority trial including a placebo with a prespecified margin. Stat Med. 2011, 30: 224-231. 10.1002/sim.4099.
  8. Koch GG, Tangen CM: Nonparametric analysis of covariance and its role in non-inferiority clinical trials. Drug Inf J. 1999, 33: 1145-1159.
  9. Pigeot I, Schafer J, Rohmel J, Hauschke D: Assessing non-inferiority of a new treatment in a three-arm clinical trial including a placebo. Stat Med. 2003, 22: 883-899. 10.1002/sim.1450.
  10. Koti KM: Use of the Fieller-Hinkley distribution of the ratio of random variables in testing for noninferiority. J Biopharm Stat. 2007, 17: 215-228. 10.1080/10543400601177335.
  11. Hasler M, Vonk R, Hothorn LA: Assessing non-inferiority of a new treatment in a three-arm trial in the presence of heteroscedasticity. Stat Med. 2008, 27: 490-503. 10.1002/sim.3052.
  12. Ghosh P, Nathoo F, Gönen M, Tiwari RC: Assessing noninferiority in a three-arm trial using the Bayesian approach. Stat Med. 2011, 30: 1795-1808. 10.1002/sim.4244.
  13. Gamalo MA, Muthukumarana S, Ghosh P, Tiwari RC: A generalized p-value approach for assessing noninferiority in a three-arm trial. Stat Methods Med Res. 2013, 22: 261-277. 10.1177/0962280210395739.
  14. Tang ML, Tang NS: Tests of non-inferiority via rate difference for three-arm clinical trials with placebo. J Biopharm Stat. 2004, 14: 337-347. 10.1081/BIP-120037184.
  15. Munk A, Mielke M, Skipka G, Freitag G: Testing noninferiority in three-armed clinical trials based on likelihood ratio statistics. Canadian J Stat. 2007, 35: 413-431. 10.1002/cjs.5550350306.
  16. Liu JT, Tzeng CS, Tsou HH: Establishing non-inferiority of a new treatment in a three-arm trial: apply a step-down hierarchical model in a papulopustular acne study and an oral prophylactic antibiotics study. Intl J Stat Med Res. 2014, 3: 11-20.
  17. Jensen J: Saddlepoint Approximations. 1995, Oxford: Oxford Science Publications
  18. Tang NS, Tang ML: Exact unconditional inference for risk ratio in a correlated 2 × 2 table with structural zero. Biometrics. 2002, 58: 972-980. 10.1111/j.0006-341X.2002.00972.x.
  19. Kieser M, Friede T: Planning and analysis of three-arm non-inferiority trials with binary endpoints. Stat Med. 2007, 26: 253-273. 10.1002/sim.2543.
  20. Blackwelder WC: Proving the null hypothesis in clinical trials. Control Clin Trials. 1982, 3: 345-353. 10.1016/0197-2456(82)90024-1.
  21. Farrington CP, Manning G: Test statistics and sample size formulae for comparative binomial trials with null hypothesis of non-zero risk difference or non-unity relative risk. Stat Med. 1990, 9: 1447-1454. 10.1002/sim.4780091208.
  22. Jing BY, Robinson J: Saddlepoint approximations for marginal and conditional probabilities of transformed variables. Ann Stat. 1994, 22: 1115-1132. 10.1214/aos/1176325620.
  23. Tang ML, Tang NS, Rosner B: Statistical inference for correlated data in ophthalmologic studies. Stat Med. 2006, 25: 2771-2783.
  24. Efron B, Tibshirani RJ: An Introduction to the Bootstrap. 1993, Boca Raton: Chapman & Hall
  25. Holtmann G, Gschossmann J, Mayr P, Talley NJ: A randomized placebo-controlled trial of simethicone and cisapride for the treatment of patients with functional dyspepsia. Aliment Pharmacol Ther. 2002, 16: 1641-1648. 10.1046/j.1365-2036.2002.01322.x.
  26. Pre-publication history

    1. The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/14/134/prepub

Copyright

© Tang et al.; licensee BioMed Central. 2014

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
