Conducting the noninferiority test for the means with unknown coefficient of variation in a three-arm trial
BMC Medical Research Methodology volume 23, Article number: 183 (2023)
Abstract
Background
The noninferiority test is a reasonable approach to assessing a new treatment in a three-arm trial. The three-arm trial consists of a placebo, a reference, and an experimental treatment. Noninferiority is often measured by the mean difference between the experimental and the placebo groups relative to the mean difference between the reference and the placebo groups.
Methods
To cope with possible estimation distortion when heteroskedasticity is allowed, we adjust the measure of noninferiority by incorporating the coefficient of variation (CV) of the experimental, the reference and the placebo groups. In this research, we propose a generalized \(p\)-value based method (GPV-based method) to facilitate noninferiority tests for the means with unknown coefficient of variation in a three-arm trial.
Results
The simulation results show that the GPV-based method not only controls the type I error rate at the nominal level better but also provides higher power than the Delta method and the empirical bootstrap method, which verifies the feasibility of our adjustment.
Conclusions
We revise the measurement of noninferiority by deducting the CV of each kind of treatment from the average effect of trials. CVs are included in the noninferiority measure explicitly to help prevent possible estimation distortion if heteroskedasticity is allowed. In the simulation study, the GPV-based method for noninferiority tests for the means with unknown CV in a three-arm trial performs better than the empirical bootstrap method and the Delta method for small, medium and large sample sizes. Hence, the GPV-based method is recommended for conducting the noninferiority test for the means with unknown CV in a three-arm trial. The GPV-based method still performs well in nonnormality cases.
Background
The goal of a noninferiority test is to determine whether the experimental treatment is statistically not inferior to the active control in a clinical trial. The three-arm clinical trial for noninferiority testing is supported by a recommendation from the U.S. Food and Drug Administration (FDA). The three-arm trial, consisting of a placebo, a reference, and an experimental treatment, allows the substantial superiority of the comparator over the placebo to be assessed prior to the comparison of the reference and the new experimental treatment [1]. Pigeot et al. [2] formulated the noninferiority problem in a three-arm trial as a ratio of the mean in the experimental group to the mean in the reference group, with the mean in the placebo group deducted from each. Under a given threshold \(\alpha_{0}\) (say 0.8), if the alternative hypothesis holds, the efficacy of the experimental group relative to that of the placebo group is more than \(\alpha_{0} \times 100\%\) of the efficacy of the reference compound relative to that of the placebo group. Under the assumptions of normality and homogeneous variances, Pigeot et al. [2] developed a \(t\)-distributed test statistic and constructed the confidence interval for the ratio hypothesis by Fieller's method. Meanwhile, Hasler et al. [3] derived a \(t\)-distributed test statistic under the assumption of heteroscedastic variances, together with confidence intervals based on Fieller's method.
In the literature above, the test statistic of a noninferiority test in the three-arm trial is the sample mean difference between the experimental and placebo groups divided by that between the reference and placebo groups. It is well known that the sample mean is an unbiased estimator of the population mean. Setting aside unbiasedness, Searls [4] proposed an estimator of the mean that uses a coefficient of variation (CV) known in advance and has minimum mean square error. Wu and Hsieh [5] estimated the population means of treatment effects in a three-arm trial by Searls' estimator rather than the traditional sample mean and showed that Searls' estimator performs better in terms of empirical size and empirical power. Thangjai et al. [6] derived the expectation and variance of Searls' estimator with unknown CV. Moreover, Thangjai et al. [6] constructed confidence intervals for the mean and the difference of means of normal distributions with unknown coefficients of variation. In this study, we use the concept of Thangjai et al. [6] to propose a noninferiority test procedure in the three-arm trial in which noninferiority is measured as the mean difference with unknown coefficient of variation between the experimental and the placebo groups relative to that between the reference and the placebo groups. Since the assumption of heterogeneous variances complicates the distributions of the estimators of this ratio, it is a challenge to measure the noninferiority of new treatments in the three-arm clinical trial.
Consequently, we propose the generalized \(p\)-value based method (hereafter GPV-based method), a statistical test procedure for assessing noninferiority in the three-arm trial under the assumption of heterogeneous variances with unknown coefficients of variation of the treatments.
Typically, in three-arm noninferiority tests, the variances of the effects of trials are assumed to be homogeneous. When the variances are heterogeneous, the impact of heteroskedasticity on the test results is evaluated less often. Heteroskedasticity is an issue frequently encountered in econometrics; it leads to biased variance estimates and hence distorts the results of hypothesis tests such as Chow's coefficient stability test, Student's \(t\)-test, and Fisher's \(F\)-test [7]. Although earlier studies used tests on variances to detect whether heteroskedasticity exists in the model, Li and Yao [8] and Tovohery et al. [7] used the coefficient of variation (CV) to detect this problem. Inspired by Searls [4], in this research we explicitly incorporate the CV into the mean of the observations of trials, that is, we substitute Searls' estimator for the population mean in measuring noninferiority, to mitigate the impact of heteroskedasticity on the test results.
Tsui and Weerahandi [9] explicitly defined the generalized test variables (GTVs), showing that the generalized \(p\)-value (GPV) is accordingly an exact probability of an extreme region. They also demonstrated how GPVs can provide small-sample solutions in cases where nuisance parameters make testing procedures difficult to conduct. Since GPVs were proposed, they have been applied to several hypothesis testing problems. For instance, Liao et al. [10, 11] applied the GPV to tolerance intervals; McNally et al. [12] conducted individual and population bioequivalence tests by GPVs; Mathew and Webb [13] constructed GPVs and generalized confidence intervals (GCIs) for variance components; Gamage [14] applied GPVs to MANOVA; and Li et al. [15] used the concept of GPVs to measure the difference in paired partial areas under receiver operating characteristic (ROC) curves and construct a noninferiority test for diagnostic accuracy. Gamalo et al. [16] proposed a GPV approach to assessing noninferiority in a three-arm trial, in which the hypothesis test considered is the same as that in Hasler et al. [3].
The article is organized as follows. In the second part of the article, the statistical problem of the noninferiority hypothesis test with unknown CV in a three-arm trial is formulated, and the test procedures based on the bootstrap method and the Delta method are derived. In the same part, we propose the GPV-based test for the ratio of mean differences, which explicitly incorporates the unknown CV, to assess noninferiority in a three-arm trial. The empirical size and power of the proposed testing procedures are then examined in simulation studies under a variety of scenarios. The proposed method is applied to a numerical example from the literature. Conclusions and some remarks are given at the end.
Methods
Let the clinical observations of the experimental treatment, reference, and placebo groups be denoted by \(X_{E,i}\), \(X_{R,j}\) and \(X_{P,k}\), respectively, which are mutually independent and normally distributed with expectations \(\mu_{E}\), \(\mu_{R}\) and \(\mu_{P}\) and unknown variances \(\sigma_{E}^{2}\), \(\sigma_{R}^{2}\) and \(\sigma_{P}^{2}\). Since the variance in the reference group is the gold standard in the three-arm trial, to allow for a fair standard of noninferiority testing, we assume in this study that the variance of the experimental treatment group is equal to that of the reference group but differs from that of the placebo group. Specifically, \(X_{E,i} \sim N\left( \mu_{E}, \sigma_{E}^{2} \right),\ i = 1, \ldots, n_{E}\); \(X_{R,j} \sim N\left( \mu_{R}, \sigma_{R}^{2} \right),\ j = 1, \ldots, n_{R}\); and \(X_{P,k} \sim N\left( \mu_{P}, \sigma_{P}^{2} \right),\ k = 1, \ldots, n_{P}\), where \(\sigma_{E}^{2} = \sigma_{R}^{2}\), and \(n_{E}\), \(n_{R}\) and \(n_{P}\) can be unequal. First, we establish the statistical testing problem
\[ H_{0}: \theta_{E} - \theta_{R} \le \delta_{0} \quad \text{versus} \quad H_{1}: \theta_{E} - \theta_{R} > \delta_{0}, \]
where \(\theta_{E} = \frac{n_{E} \mu_{E}}{n_{E} + \sigma_{E}^{2}/\mu_{E}^{2}}\), \(\theta_{R} = \frac{n_{R} \mu_{R}}{n_{R} + \sigma_{R}^{2}/\mu_{R}^{2}}\), \(\theta_{P} = \frac{n_{P} \mu_{P}}{n_{P} + \sigma_{P}^{2}/\mu_{P}^{2}}\), with \(\sigma_{E}^{2} = \sigma_{R}^{2}\), and \(\delta_{0}\) is a relevant noninferiority threshold. For \(\xi_{0} \in (0,1)\), we specify \(\delta_{0}\) as a proportion of the difference between \(\theta_{R}\) and \(\theta_{P}\) by \(\delta_{0} = (\xi_{0} - 1)(\theta_{R} - \theta_{P})\). Rewriting the hypothesis in terms of the ratio of the differences in means with unknown CV then yields
\[ H_{0}: \frac{\theta_{E} - \theta_{P}}{\theta_{R} - \theta_{P}} \le \xi_{0} \quad \text{versus} \quad H_{1}: \frac{\theta_{E} - \theta_{P}}{\theta_{R} - \theta_{P}} > \xi_{0}, \tag{1} \]
where \(\xi_{0}\) represents the effectiveness threshold between 0 and 1. The value of \(\theta_{R} - \theta_{P}\) is necessarily greater than 0. Because the threshold value \(\xi_{0}\) is defined as a proportion of the difference \(\theta_{R} - \theta_{P}\), it is important to select a proper reference or positive control. In this way, the evaluation of noninferiority in the three-arm trial is specified as a ratio of differences in population means with unknown CV, as discussed in the Background.
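The step from the margin form to the ratio form can be written out explicitly. The following short derivation assumes that the margin form of the alternative is \(\theta_{E} - \theta_{R} > \delta_{0}\) and uses only \(\delta_{0} = (\xi_{0} - 1)(\theta_{R} - \theta_{P})\) and \(\theta_{R} - \theta_{P} > 0\):

\[
\begin{aligned}
\theta_{E} - \theta_{R} > \delta_{0}
&\iff \theta_{E} - \theta_{R} > (\xi_{0} - 1)(\theta_{R} - \theta_{P}) \\
&\iff (\theta_{E} - \theta_{P}) - (\theta_{R} - \theta_{P}) > \xi_{0}(\theta_{R} - \theta_{P}) - (\theta_{R} - \theta_{P}) \\
&\iff \theta_{E} - \theta_{P} > \xi_{0}(\theta_{R} - \theta_{P}) \\
&\iff \frac{\theta_{E} - \theta_{P}}{\theta_{R} - \theta_{P}} > \xi_{0}.
\end{aligned}
\]

The last step divides by \(\theta_{R} - \theta_{P}\), which is why the positivity of that difference matters: with a negative denominator the inequality would reverse.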
Empirical bootstrap method
The bootstrap method has become a widely used technique for statistical inference problems in which either the underlying distribution is not normal or the distribution of the sample statistic under the null hypothesis is not feasible to derive (Efron and Tibshirani [17]). Because the variance of the experimental treatment group is equal to that of the reference group (and differs from that of the placebo group), we use the residual method to construct the empirical bootstrap procedure for assessing the noninferiority of a new treatment in a three-arm trial. The residual method is somewhat similar to the percentile method, except that it is based on the bootstrap distribution of residuals from the original estimate [18]. The empirical bootstrap procedure is as follows.

Step 1: Suppose that \({\mathbf{x}}_{E} = \left( x_{E,1}, \ldots, x_{E,n_{E}} \right)\), \({\mathbf{x}}_{R} = \left( x_{R,1}, \ldots, x_{R,n_{R}} \right)\) and \({\mathbf{x}}_{P} = \left( x_{P,1}, \ldots, x_{P,n_{P}} \right)\) denote the clinical observations for the experimental, reference and placebo groups, respectively. Generate a bootstrap sample \({\mathbf{x}}^{*b} = \left( {\mathbf{x}}_{E}^{*b}, {\mathbf{x}}_{R}^{*b}, {\mathbf{x}}_{P}^{*b} \right)\) from the original sample \({\mathbf{x}} = \left( {\mathbf{x}}_{E}, {\mathbf{x}}_{R}, {\mathbf{x}}_{P} \right)\) by drawing with replacement within each group, with sample sizes \(n_{E}\), \(n_{R}\) and \(n_{P}\), respectively.

Step 2: Compute \(\hat{\xi}^{*b} = \frac{\widehat{\theta}_{E}^{*b} - \widehat{\theta}_{P}^{*b}}{\widehat{\theta}_{R}^{*b} - \widehat{\theta}_{P}^{*b}}\) from the data \({\mathbf{x}}^{*b}\), and calculate \(e^{*b} = \hat{\xi}^{*b} - \widehat{\xi}\) for each bootstrap sample, where \(\hat{\xi}\) is the estimate from the original data.

Step 3: Repeat Steps 1 and 2 independently for \(b = 1, \ldots, B\).

Step 4: Let \(e_{(1 - \alpha)100\%}^{*b}\) be the \((1 - \alpha)100\%\) quantile of the bootstrap values of \(e^{*b}\), and compute \(L_{\widehat{\xi}^{b}} = \widehat{\xi} - e_{(1 - \alpha)100\%}^{*b}\).
Then, noninferiority can be claimed if \(L_{{\widehat{\xi }^{b} }} > \xi_{0}\).
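The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' program: the function names are hypothetical, and the plug-in estimate \(\hat{\theta} = n\bar{x}/(n + s^{2}/\bar{x}^{2})\) computed from each group's own sample mean and sample variance is an assumption, since the text does not spell out the estimator used inside the bootstrap.

```python
import math
import random
import statistics


def theta_hat(x):
    """Plug-in estimate n*xbar / (n + s^2/xbar^2); an assumed estimator."""
    n = len(x)
    xbar = statistics.fmean(x)
    s2 = statistics.variance(x)  # sample variance (divisor n - 1)
    return n * xbar / (n + s2 / xbar**2)


def xi_hat(xe, xr, xp):
    """Ratio (theta_E - theta_P) / (theta_R - theta_P) from the three samples."""
    te, tr, tp = theta_hat(xe), theta_hat(xr), theta_hat(xp)
    return (te - tp) / (tr - tp)


def bootstrap_lower_limit(xe, xr, xp, alpha=0.05, B=1000, rng=None):
    """Lower limit L = xi_hat - (1 - alpha) quantile of bootstrap residuals."""
    rng = rng or random.Random()
    estimate = xi_hat(xe, xr, xp)
    residuals = []
    for _ in range(B):
        # Step 1: resample with replacement within each group.
        be = rng.choices(xe, k=len(xe))
        br = rng.choices(xr, k=len(xr))
        bp = rng.choices(xp, k=len(xp))
        # Step 2: residual of the bootstrap estimate from the original one.
        residuals.append(xi_hat(be, br, bp) - estimate)
    # Step 4: (1 - alpha) quantile taken as an order statistic.
    residuals.sort()
    idx = min(B - 1, math.ceil((1 - alpha) * B) - 1)
    return estimate - residuals[idx]
```

Noninferiority would then be claimed when `bootstrap_lower_limit(xe, xr, xp) > xi0`. The quantile is taken as a simple order statistic; other quantile conventions would give slightly different limits.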
Delta method
Let \(\xi_{1} = \theta_{E} - \theta_{P}\) be the difference of the population means with unknown CV between the experimental and placebo groups, and let \(\xi_{2} = \theta_{R} - \theta_{P}\) be the corresponding difference between the reference and placebo groups. The expectations and variances of \(\hat{\xi}_{1}\) and \(\hat{\xi}_{2}\) can be obtained from Thangjai et al. [6]. The Delta method was proposed in Dorfman [19]. It applies Taylor's theorem (series expansion) to obtain the asymptotic normal distribution of estimators of complex form. Accordingly, the estimator \(\widehat{\xi} = \frac{\widehat{\xi}_{1}}{\widehat{\xi}_{2}}\) is asymptotically distributed as
where
When the null hypothesis holds, for the noninferiority hypothesis test in terms of population means with unknown CV as shown in (1), the rejection region constructed under the Delta method is
where \(z_{\alpha }\) denotes the upper \(\alpha\) critical point of the standard normal distribution.
The GPV-based method
Suppose \({\mathbf{X}}\) is a random variable whose PDF is \(f({\mathbf{X}};\zeta)\), where \(\zeta = (\xi, \eta)\). Here \(\xi = \frac{\theta_{E} - \theta_{P}}{\theta_{R} - \theta_{P}}\) is the parameter of interest and \(\eta\) denotes a vector of nuisance parameters. Let \({\mathbf{x}}\) be the observed value of the random variable \({\mathbf{X}}\). The statistic \(T = T\left( {\mathbf{X}};{\mathbf{x}},\zeta \right)\) is said to be a generalized test variable if the following three properties hold.

Property A: For fixed \({\mathbf{x}}\) and \(\zeta = (\xi_{0}, \eta)\), the distribution of \(T({\mathbf{X}};{\mathbf{x}},\zeta)\) is independent of the nuisance parameters \(\eta\).

Property B: The observed value of \(T({\mathbf{X}};{\mathbf{x}},\zeta)\), \(t_{obs} = T\left( {\mathbf{x}};{\mathbf{x}},\zeta \right)\), does not depend on unknown parameters.

Property C: For given \({\mathbf{x}}\) and \(\eta\), \(P\left( {T({\mathbf{X}};{\mathbf{x}},\zeta ) \ge t} \right)\) is either stochastically increasing or decreasing in \(\xi\) for any given \(t\).
Without loss of generality, consider testing \(H_{0}:\xi \le \xi_{0}\) versus \(H_{1}:\xi > \xi_{0}\), where \(\xi_{0}\) is a specified value. If \(T\) is stochastically increasing in \(\xi\), then the generalized \(p\)-value can be defined as
where \(t_{obs} = T({\mathbf{x}};{\mathbf{x}},\xi_{0} ,\eta )\).
For a test with significance level \(\alpha\), \(H_{0}\) is rejected if \(p < \alpha\). Because the exact distribution of the generalized test variable \(T\) is complex, the generalized \(p\)-value is often computed with a Monte Carlo algorithm.
In the following, we use the concept of the generalized pivotal quantity (GPQ) of Weerahandi [20] to develop the generalized test variables (GTVs) required to assess the noninferiority of a new treatment in a three-arm trial, measured as a ratio of differences in means with the CV of each treatment. To develop the GTV for the hypothesis test in (1), we first define GPQs for \(\mu_{E}\), \(\mu_{R}\), \(\mu_{P}\), \(\sigma_{E}^{2}\), \(\sigma_{R}^{2}\), \(\sigma_{P}^{2}\), \(\theta_{E}\), \(\theta_{R}\) and \(\theta_{P}\) as
Here \(Z_{E} \sim N(0,1)\), \(Z_{R} \sim N(0,1)\), \(Z_{P} \sim N(0,1)\), \(U_{E} \sim \chi^{2}(n_{E} - 1)\), \(U_{R} \sim \chi^{2}(n_{R} - 1)\) and \(U_{P} \sim \chi^{2}(n_{P} - 1)\); \(\overline{x}_{E}\), \(\overline{x}_{R}\) and \(\overline{x}_{P}\) are the observed values of \(\overline{X}_{E}\), \(\overline{X}_{R}\) and \(\overline{X}_{P}\); and \(s_{E}^{2}\), \(s_{R}^{2}\) and \(s_{P}^{2}\) are the observed values of \(S_{E}^{2}\), \(S_{R}^{2}\) and \(S_{P}^{2}\). In addition, we use the pooled estimator \(S_{pooled}^{2}\) to estimate both \(\sigma_{E}^{2}\) and \(\sigma_{R}^{2}\). The pooled estimator is defined as \(S_{pooled}^{2} = \frac{(n_{E} - 1)S_{E}^{2} + (n_{R} - 1)S_{R}^{2}}{n_{E} + n_{R} - 2}\), and \(s_{pooled}^{2}\) is the observed value of \(S_{pooled}^{2}\). Moreover, \(Z_{E}\), \(Z_{R}\), \(Z_{P}\), \(U_{E}\), \(U_{R}\) and \(U_{P}\) are mutually independent.
The GPQ of \(\xi = \frac{\theta_{E} - \theta_{P}}{\theta_{R} - \theta_{P}}\) can thus be defined as
\[ R_{\xi} = \frac{R_{\theta_{E}} - R_{\theta_{P}}}{R_{\theta_{R}} - R_{\theta_{P}}}. \tag{11} \]
Hence, we can construct a GTV for \(\xi\) given by
\[ T_{\xi} = R_{\xi} - \xi. \tag{12} \]
Given the observed data, the observed value of \(R_{\xi}\) is equal to \(\xi\), and \(R_{\xi}\) has a distribution that is free of parameters. Hence, the distribution of \(T_{\xi}\) does not depend on nuisance parameters for any given value \(\xi = \xi_{0}\), and the observed value of \(T_{\xi}\) is equal to zero. Consequently, Property A and Property B are satisfied. Furthermore, the distribution function of \(T_{\xi}\) can be expressed as
Since the distribution function of \(T_{\xi}\) is stochastically increasing in \(\xi\), Property C is also satisfied. By definition, \(T_{\xi}\) is a GTV of \(\xi\). To test the hypothesis \(H_{0}:\xi \le \xi_{0}\) versus \(H_{1}:\xi > \xi_{0}\), the following Monte Carlo algorithm is provided to derive the required GPV.

Step 1: Choose a sufficiently large number of Monte Carlo samples, e.g., \(H = 10000\). For each \(h\), \(1 \le h \le H\), generate mutually independent chi-square variates \(U_{E}\), \(U_{R}\) and \(U_{P}\) (with \(n_{E} - 1\), \(n_{R} - 1\) and \(n_{P} - 1\) degrees of freedom, respectively) and standard normal variates \(Z_{E}\), \(Z_{R}\) and \(Z_{P}\).

Step 2: Use (2) to (10) to calculate \(R_{\mu_{E}}\), \(R_{\mu_{R}}\), \(R_{\mu_{P}}\), \(R_{\sigma_{E}^{2}}\), \(R_{\sigma_{R}^{2}}\), \(R_{\sigma_{P}^{2}}\), \(R_{\theta_{E}}\), \(R_{\theta_{R}}\) and \(R_{\theta_{P}}\).

Step 3: Calculate \(R_{\xi ,h}\) from (11).

Step 4: Finally, \(T_{\xi ,h}\) can be calculated from (12), given \(\xi_{0}\).
Since \(T_{\xi}\) is stochastically increasing in \(\xi\) and the observed value of \(T_{\xi}\) is equal to zero, the GPV is estimated by \(p = \frac{1}{H}\sum\nolimits_{h = 1}^{H} I\left( T_{\xi,h} \le 0 \right)\). Under significance level \(\alpha\), the null hypothesis \(H_{0}:\frac{\theta_{E} - \theta_{P}}{\theta_{R} - \theta_{P}} \le \xi_{0}\) in (1) is rejected whenever \(p < \alpha\).
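The Monte Carlo steps above can be illustrated with a short Python sketch. The explicit GPQ formulas (2) to (10) are not reproduced in this text, so the forms used below are assumptions based on the standard normal-theory construction: \(R_{\sigma^{2}} = (\text{df}) \cdot s^{2}/U\), \(R_{\mu} = \bar{x} - Z\sqrt{R_{\sigma^{2}}/n}\) and \(R_{\theta} = nR_{\mu}/(n + R_{\sigma^{2}}/R_{\mu}^{2})\), with a single variance GPQ shared by the experimental and reference arms built from \(s_{pooled}^{2}\) and \(U_{E} + U_{R}\). The function name and argument layout are hypothetical.

```python
import math
import random


def gpv_noninferiority(means, variances, sizes, xi0=0.8, H=10_000, rng=None):
    """Monte Carlo generalized p-value for H0: xi <= xi0.

    means, variances, sizes are (E, R, P) tuples of observed sample means,
    sample variances and sample sizes.
    """
    rng = rng or random.Random()
    (xe, xr, xp), (s2e, s2r, s2p), (ne, nr, n_p) = means, variances, sizes
    # Observed pooled variance for the E and R arms (sigma_E^2 = sigma_R^2).
    s2_pooled = ((ne - 1) * s2e + (nr - 1) * s2r) / (ne + nr - 2)

    def chi2(df):
        # A chi-square(df) variate is a Gamma(shape=df/2, scale=2) variate.
        return rng.gammavariate(df / 2.0, 2.0)

    def r_theta(xbar, r_var, n):
        # ASSUMED GPQs: R_mu = xbar - Z*sqrt(R_var/n), plugged into theta.
        r_mu = xbar - rng.gauss(0.0, 1.0) * math.sqrt(r_var / n)
        return n * r_mu / (n + r_var / r_mu**2)

    hits = 0
    for _ in range(H):
        # Step 1: independent chi-square draws (normal draws are in r_theta).
        u_e, u_r, u_p = chi2(ne - 1), chi2(nr - 1), chi2(n_p - 1)
        # Step 2: variance GPQs (assumed forms), then mean and theta GPQs.
        r_var_er = (ne + nr - 2) * s2_pooled / (u_e + u_r)
        r_var_p = (n_p - 1) * s2p / u_p
        r_e = r_theta(xe, r_var_er, ne)
        r_r = r_theta(xr, r_var_er, nr)
        r_p = r_theta(xp, r_var_p, n_p)
        # Steps 3 and 4: R_xi as in (11) and T_xi = R_xi - xi0.
        t = (r_e - r_p) / (r_r - r_p) - xi0
        hits += t <= 0
    return hits / H
```

For example, `gpv_noninferiority((40.0, 36.5, 16.5), (56.25, 56.25, 56.25), (60, 40, 20))` returns the estimated GPV for one set of summary statistics; the null hypothesis would be rejected at level \(\alpha\) when the returned value is below \(\alpha\).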
Results
To evaluate the efficacy of the proposed method, three sets of simulation studies are conducted. First, the empirical sizes of the GPV-based method are compared with those of the Delta method and the empirical bootstrap method for various finite sample sizes. Second, we evaluate the empirical power of the three tests and compare the performance of the proposed GPV-based method with that of the other two tests. Third, we show that the GPV-based method can be applied well to nonnormality cases.
Simulation study I: type I error rate
We conducted a simulation study of the type I error rates of the GPV-based, Delta and empirical bootstrap methods. The noninferiority limit is chosen as \(\xi_{0} = 0.8\). We consider the following three cases of \(\Delta = \mu_{R} - \mu_{P}\): (i) \(\Delta = 9\); (ii) \(\Delta = 15\); and (iii) \(\Delta = 20\). The total sample size \(n\) is allocated 3:2:1 to the experimental, reference and placebo groups, and the total sample sizes are chosen as \(n\) = 60, 90, 120, 480 and 900. For cases (i) to (iii), the population mean of the placebo group (\(\mu_{P}\)) is set to 16.5. The population mean of the experimental group is \(\mu_{E} = \xi_{0} \times \Delta + \mu_{P}\) under all scenarios. For cases (i) to (iii), we set \(\tau_{R} = \sigma_{R}^{2}/\sigma_{E}^{2}\) to 1 and \(\tau_{P} = \sigma_{P}^{2}/\sigma_{E}^{2}\) to 0.5, 1.0 and 2.0, respectively. In this way, we keep the variances of the experimental and reference treatments homogeneous while allowing heteroskedasticity for the placebo group. In this simulation study, the standard deviation of the placebo group (\(\sigma_{P}\)) is set to 7.5, and the standard deviations of the reference group (\(\sigma_{R}\)) and the experimental group (\(\sigma_{E}\)) are both equal to \(\sigma_{P}/\sqrt{\tau_{P}}\). In addition, given any pair \((\mu_{i}, \sigma_{i})\), \(i = E, R, P\), \(\theta_{i}\) and hence \(\theta_{E} - \theta_{P}\) and \(\theta_{R} - \theta_{P}\) can be derived.
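The parameter settings described above can be assembled mechanically. The sketch below, with hypothetical function names, derives the group sizes, means, standard deviations and the resulting \(\theta_{i}\) for one scenario, using the formula \(\theta_{i} = n_{i}\mu_{i}/(n_{i} + \sigma_{i}^{2}/\mu_{i}^{2})\) from the Methods section; integer division for the 3:2:1 allocation is an assumption that is exact for the sample sizes listed.

```python
import math


def allocate(n_total, ratio=(3, 2, 1)):
    """Split the total sample size across (E, R, P) in the given ratio."""
    parts = sum(ratio)
    return tuple(n_total * r // parts for r in ratio)


def theta(mu, sigma, n):
    """theta = n*mu / (n + sigma^2/mu^2)."""
    return n * mu / (n + sigma**2 / mu**2)


def null_scenario(delta, tau_p, n_total, xi0=0.8, mu_p=16.5, sigma_p=7.5):
    """Parameters of one type I error scenario (tau_R is fixed at 1)."""
    n_e, n_r, n_p = allocate(n_total)
    mu_r = mu_p + delta            # Delta = mu_R - mu_P
    mu_e = xi0 * delta + mu_p      # mu_E = xi0 * Delta + mu_P
    sigma_e = sigma_r = sigma_p / math.sqrt(tau_p)
    th_e = theta(mu_e, sigma_e, n_e)
    th_r = theta(mu_r, sigma_r, n_r)
    th_p = theta(mu_p, sigma_p, n_p)
    return {
        "n": (n_e, n_r, n_p),
        "mu": (mu_e, mu_r, mu_p),
        "sigma": (sigma_e, sigma_r, sigma_p),
        "theta": (th_e, th_r, th_p),
        "xi_theta": (th_e - th_p) / (th_r - th_p),
    }
```

For instance, `null_scenario(9, 2.0, 60)` gives the setting with \(\Delta = 9\), \(\tau_{P} = 2\) and \(n = 60\). The entry `xi_theta` is the ratio expressed in terms of the \(\theta_{i}\), which need not coincide exactly with \(\xi_{0}\) because each \(\theta_{i}\) shrinks its \(\mu_{i}\) by a group-specific factor.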
Under each parameter specification, the simulation data are independently generated 10,000 times. The empirical size and power are computed as the proportion of the 10,000 simulated \(p\)-values that are less than the 5% significance level. Given this nominal significance level and number of simulated samples, if a testing procedure adequately controls the size at the 5% nominal level, then the empirical sizes should fall within (0.0457, 0.0543). In this simulation study, for each sample, 5000 GPQs are constructed and 1000 bootstrap samples are drawn. We display the simulation results in Table 1.
Table 1 presents the simulated type I error rates based on the ratio of population mean differences with unknown coefficients of variation for assessing the noninferiority of a new treatment in a three-arm trial in the presence of heteroscedasticity, with a noninferiority limit of 0.8 under the normality assumption. The simulation results lead us to the following conclusions.
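The interval (0.0457, 0.0543) matches a normal-approximation band of \(\pm 1.96\) binomial standard errors around the nominal level for 10,000 replications. The text does not state how the interval was obtained, so treating it as this band is an assumption; the helper name below is hypothetical.

```python
import math


def size_band(alpha=0.05, n_sim=10_000, z=1.96):
    """Band alpha +/- z * sqrt(alpha * (1 - alpha) / n_sim) for an empirical size."""
    half_width = z * math.sqrt(alpha * (1 - alpha) / n_sim)
    return alpha - half_width, alpha + half_width


low, high = size_band()
print(round(low, 4), round(high, 4))  # prints 0.0457 0.0543
```

An empirical size inside this band is consistent with a true size of 5% given the Monte Carlo error of 10,000 replications; a size below it indicates a conservative test and a size above it a liberal one.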

(1)
In Table 1, the type I error rates of the GPV-based method range over (0.0475, 0.0518). This range is within (0.0457, 0.0543), and most of the type I error rates of the GPV-based method are quite close to the nominal value of 0.05. Therefore, the GPV-based test procedure adequately maintains the type I error rate close to the nominal level of 5%.

(2)
In addition, from Table 1, the type I error rates of the Delta method range over (0.0001, 0.0058). All of them lie outside (0.0457, 0.0543) and are far below the nominal value of 0.05, so the Delta method is quite conservative. However, in some extreme cases (not shown in Table 1), such as \(\tau_{P} = 0.01\) and \(n = 96,000\), the Delta method controls the type I error rate much better, and the difference in power between the GPV-based and Delta methods shrinks. Such extreme cases are infeasible for practical clinical application.

(3)
On the other hand, the type I error rates of the empirical bootstrap method range over (0.0001, 0.0477). Only 5 of the 45 (11.1%) empirical sizes of the empirical bootstrap method fall within (0.0457, 0.0543). As a result, the empirical bootstrap test procedure is quite conservative, except when \(\mu_{R} - \mu_{P} = 20\), \(n \ge 480\), \(\tau_{R} = 1\) and \(\tau_{P} = 2\). As the mean difference between the reference and placebo groups gets larger, the bootstrap method controls the type I error rate better.
Taken as a whole, the GPV-based method performs extremely well in most cases, and it clearly controls the type I error rate better, especially in small-sample cases.
Simulation study II: empirical power
To study the empirical power of the GPV-based method, we consider a simulation with \(\mu_{R} - \mu_{P} = 9\) and \(\mu_{R} - \mu_{P} = 20\); \(\tau_{R} = 1\) and \(\tau_{P} = 2\); and total sample size = 60, 120 and 480. We allocate the total sample to the experimental, reference and placebo groups by \(n_{E}:n_{R}:n_{P} = 3:2:1\). The noninferiority limit is again chosen as \(\xi_{0} = 0.8\), and the significance level is again set to 0.05. For each combination of parameters, 10,000 random samples are generated. For each random sample, 5000 GPQs are constructed, and 1000 samples are drawn for the bootstrap method. The empirical power curves are provided in Fig. 1.
Figure 1 provides the simulated power of the GPV-based method, the Delta method, and the empirical bootstrap method. Figure 1 shows the power curves as a function of \(\xi = \frac{\theta_{E} - \theta_{P}}{\theta_{R} - \theta_{P}}\) for total sample sizes 60, 120 and 480, respectively. The power increases with increasing values of \(\xi\) and with increasing total sample size. When the mean difference between the reference and placebo groups is 9, the GPV-based method is uniformly more powerful than the Delta method and the empirical bootstrap method. However, when the mean difference between the reference and placebo groups is 20, the empirical power curves of the GPV-based method and the empirical bootstrap method largely overlap when \(\xi\) is larger than 0.9. In other words, the empirical bootstrap method performs as well as the GPV-based method when the mean difference between the reference and placebo groups is equal to 20 and the sample size exceeds 60. In sum, the GPV-based method performs relatively better when the mean difference between the reference and placebo groups and the sample size are small.
Simulation study III: nonnormality cases
In this section, we consider two nonnormal distributions, the lognormal and gamma distributions, to study the robustness of the GPV-based method. When the population is assumed to follow a lognormal distribution, let \(X_{i},\ i = E, R, P\), be mutually independent lognormal variables whose logarithms have means \(\ln(\mu_{i}) - \frac{1}{2}\ln\left( \frac{\sigma_{i}^{2}}{\mu_{i}^{2}} + 1 \right)\) and unknown variances \(\ln\left( \frac{\sigma_{i}^{2}}{\mu_{i}^{2}} + 1 \right)\), respectively. When \(X_{i}\) follows the gamma distribution, denote it by \(gamma\left( \gamma_{i1} = \frac{\mu_{i}^{2}}{\sigma_{i}^{2}},\ \gamma_{i2} = \frac{\sigma_{i}^{2}}{\mu_{i}} \right),\ i = E, R, P\), where \(\gamma_{i1}\) and \(\gamma_{i2}\) represent the shape and scale parameters, respectively. The simulation parameters \(\mu_{R} - \mu_{P}\), \(\tau_{R}\), \(\tau_{P}\) and \(n\) are the same as those in Simulation studies I and II. The simulated type I error rates are displayed in Tables 2 and 3, and the simulated empirical powers are presented in Table 4.
From Tables 2 and 3, when the data follow a lognormal or gamma distribution, the GPV-based method maintains the type I error rate near the nominal level of 0.05 more appropriately than the Delta method and the empirical bootstrap method do. In addition, the Delta method is again quite conservative. Furthermore, under \(\mu_{R} - \mu_{P} = 20\), \(\tau_{R} = 1\), \(\tau_{P} = 2\) and a total sample size greater than 900, the type I error rate of the empirical bootstrap method approaches the claimed significance level of the noninferiority test. Moreover, in Table 4, regardless of the sample size and distribution, the GPV-based method is more powerful than the Delta method and the empirical bootstrap method, especially under \(\mu_{R} - \mu_{P} = 9\), \(\tau_{R} = 1\), \(\tau_{P} = 2\) and a total sample size less than 120.
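Both parameterizations above are moment matches: they are chosen so that the nonnormal variable keeps mean \(\mu_{i}\) and variance \(\sigma_{i}^{2}\). A brief sketch, with hypothetical helper names, that computes the parameters and checks the implied moments with the standard lognormal and gamma moment formulas:

```python
import math


def lognormal_params(mu, sigma2):
    """Log-scale mean and variance giving a lognormal with mean mu, variance sigma2."""
    log_var = math.log(sigma2 / mu**2 + 1.0)
    log_mean = math.log(mu) - 0.5 * log_var
    return log_mean, log_var


def gamma_params(mu, sigma2):
    """Shape and scale giving a gamma with mean mu and variance sigma2."""
    return mu**2 / sigma2, sigma2 / mu


# Check with the placebo setting mu_P = 16.5, sigma_P = 7.5.
mu, sigma2 = 16.5, 7.5**2
m, v = lognormal_params(mu, sigma2)
ln_mean = math.exp(m + v / 2.0)                     # lognormal mean
ln_var = (math.exp(v) - 1.0) * math.exp(2 * m + v)  # lognormal variance
shape, scale = gamma_params(mu, sigma2)
assert math.isclose(ln_mean, mu) and math.isclose(ln_var, sigma2)
assert math.isclose(shape * scale, mu) and math.isclose(shape * scale**2, sigma2)
```

Samples could then be drawn with, for example, `random.lognormvariate(m, math.sqrt(v))` and `random.gammavariate(shape, scale)` from the Python standard library.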
Numerical example: evaluation of the mutagenicity
We adopt the mutagenicity data set in Hauschke et al. [21], which was published by Adler and Kliesch [22] from a micronucleus assay on hydroquinone implementing a positive control of 25 mg/kg cyclophosphamide. The results for male mice at the 24 h sampling time are given in Table 5.
Through comparing the difference between a dose group and a vehicle control with the difference between the positive control and the vehicle control, the noninferiority test can also be adopted to verify the safety in toxicological experiments. Therefore, the above mutagenicity data can be evaluated by such noninferiority test. Hothorn and Hauschke [23] used the concept of the acceptable maximal safe dose by identifying the highest dose that is noninferior to the vehicle control, and as a result all other levels of dose below the highest one are also noninferior. Under the assumption of normality and homogeneous variance, Hauschke et al. [21] built confidence intervals for the ratio of the difference between the dose groups and the vehicle control to the difference between a positive control and the vehicle control, in which the safety threshold is set to be 0.5. Hence, the hypothesis of the corresponding noninferiority test can be characterized as follows.
where the dose group is taken as the experimental group, the vehicle control as the placebo group and the positive control as the reference group. The upper 95% confidence limits for \(\frac{\theta_{E} - \theta_{P}}{\theta_{R} - \theta_{P}}\) calculated by the GPV-based method, the Delta method, and the empirical bootstrap method are presented in Table 6.
From Table 6, one can see that safety is attainable for the two lower doses; therefore the maximal safe dose is 50 mg/kg. The two higher dose levels, 75 and 100 mg/kg, reveal an unacceptable increase. When variance heterogeneity is taken into account by the GPV-based method, the Delta method, and the empirical bootstrap method, the results do not change.
Conclusions and discussions
We propose the GPVbased method to conduct the noninferiority test by the difference of means with unknown coefficient of variations between the experimental and the placebo groups relative to that between the reference and the placebo groups under the normality assumption. The main contribution of this research is that we revise the measurement of noninferiority by considering the coefficient of variation (CV) of each kind of treatment from the average effect of trials. This is slightly different from the traditional noninferiority test that is difference of means between the experimental and the placebo groups relative to that between the reference and the placebo groups. Besides, through the heuristic statistical testing procedure for noninferiority test, we incorporate unknown heterogeneous variance among the three arms. Hence, CVs are included in the noninferiority hypothesis testing explicitly to help prevent possible estimating distortion if heteroskedasticity is allowed.
Empirical results from the simulation studies show that the GPV-based method not only controls the type I error rate adequately at the nominal level but also provides higher power than the Delta method and the empirical bootstrap method. Therefore, the GPV-based method is suitable for conducting the noninferiority test for the means with unknown coefficient of variation in a three-arm trial. The R program for the proposed GPV-based method is available as Supplementary materials 1 and 2.
To further explore the properties of the three methods, we also estimate the noninferiority limit under the same parameter settings as in the simulation studies. The noninferiority limit is set to 0.8. For each specified parameter combination, the data are generated 10,000 times independently. The bias, mean square error (MSE), and coverage probability (CP) of the three methods are shown in Table 7.
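The three summary measures can be computed from the simulation replicates as in the sketch below. It assumes that each replicate yields a point estimate and a two-sided interval for the ratio. The function name `summarize_replicates` and its arguments are hypothetical and are not taken from the supplementary R program.

```python
import numpy as np

def summarize_replicates(estimates, lower, upper, true_value):
    """Return (bias, MSE, coverage probability) over simulation replicates."""
    estimates = np.asarray(estimates, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    # Bias: average estimate minus the true value of the ratio.
    bias = estimates.mean() - true_value
    # MSE: average squared deviation of the estimates from the true value.
    mse = np.mean((estimates - true_value) ** 2)
    # CP: fraction of replicates whose interval contains the true value.
    cp = np.mean((lower <= true_value) & (true_value <= upper))
    return float(bias), float(mse), float(cp)
```

In the setting described above, `true_value` would be 0.8 and each input array would hold 10,000 entries, one per generated data set.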
From Table 7, the biases from the GPV-based method are not much different from those of the Delta method, and most of them are smaller than those of the empirical bootstrap method. Furthermore, when the mean difference between the reference and placebo groups is equal to 9 and the sample size is less than 120, the generalized pivotal quantity (GPQ) from the GPV-based method has a smaller MSE than the estimators from the Delta method and the empirical bootstrap method. The GPV-based method also generally provides coverage probabilities close to the confidence level of 0.95, and they are somewhat better than those of the other two methods regardless of the sample size. Moreover, when the mean difference between the reference and placebo groups is larger than 20, the ratio of the variance of the reference group to that of the experimental group is 1, and the ratio of the variance of the placebo group to that of the experimental group is 2, the coverage probabilities of the empirical bootstrap method are as good as those of the GPV-based method. Additionally, the coverage probabilities of the Delta method are quite conservative.
Under the normality assumption, the required percentiles of the GPQ for \(\frac{\theta_{E} - \theta_{P}}{\theta_{R} - \theta_{P}}\) (our measurement of noninferiority) cannot be obtained in closed form but can be estimated using a Monte Carlo algorithm. In addition, if the data are nonnormal, we recommend applying the power transformation of Box and Cox [24].
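A Monte Carlo estimate of these percentiles can be sketched as follows. The sketch uses the standard generalized pivotal quantities for a normal mean and variance, and it treats the mapping from \((\mu, \sigma^{2}, n)\) to \(\theta\) as a pluggable function, because the exact CV-adjusted form of \(\theta\) is not restated in this section. The default (plain mean) and all names such as `gpq_ratio_draws` are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def gpq_ratio_draws(stats, theta=lambda mu, var, n: mu, n_draws=20000, seed=0):
    """Draw Monte Carlo realizations of a GPQ for (th_E - th_P) / (th_R - th_P).

    `stats` maps the arm labels "E", "R", "P" to (sample mean, sample SD, n).
    `theta` maps GPQ draws of (mu, sigma^2) and n to the parameter of interest;
    the default uses the plain mean (an assumption, replace it with the
    CV-adjusted parameter when its definition is available).
    """
    rng = np.random.default_rng(seed)
    draws = {}
    for arm, (mean, sd, n) in stats.items():
        u = rng.chisquare(n - 1, n_draws)      # U ~ chi-square with n - 1 df
        z = rng.standard_normal(n_draws)       # Z ~ N(0, 1), independent of U
        r_var = (n - 1) * sd ** 2 / u          # GPQ for sigma^2
        r_mu = mean - z * np.sqrt(r_var / n)   # GPQ for mu
        draws[arm] = theta(r_mu, r_var, n)
    return (draws["E"] - draws["P"]) / (draws["R"] - draws["P"])
```

An upper 95% generalized confidence limit is then the empirical quantile of the draws, for example `np.quantile(gpq_ratio_draws(stats), 0.95)`, which can be compared with the noninferiority or safety margin.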
Wu and Hsieh [5], when conducting the noninferiority test in a three-arm trial, estimate the mean by Searls' estimator (mean with CV) rather than the traditional one (pure sample mean), and show that the testing results are better in terms of empirical sizes and empirical powers. In our research, by contrast with the traditional three-arm trial, we conduct the noninferiority test for the means with unknown CVs, and we show that the explicit inclusion of CVs in the measurement of noninferiority can still control the type I error at the nominal level. In sum, when conducting a noninferiority test, we highly recommend including CVs, whether through the estimation of the average effects of trials or through the specification of noninferiority.
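For reference, Searls' estimator [4] shrinks the sample mean when the coefficient of variation \(\tau = \sigma/\mu\) is known, and a plug-in version replaces \(\tau^{2}\) by its sample counterpart when the CV is unknown. The notation below (\(\bar{X}\), \(S^{2}\), \(n\), \(\tau\)) is assumed for illustration and may differ from the notation used elsewhere in the paper:

\[
\hat{\theta}_{\text{known CV}} = \frac{\bar{X}}{1 + \tau^{2}/n},
\qquad
\hat{\theta}_{\text{unknown CV}} = \frac{\bar{X}}{1 + S^{2}/\left(n\bar{X}^{2}\right)} .
\]

Both forms reduce to the sample mean \(\bar{X}\) as \(n\) grows, so the adjustment matters most for small samples or large CVs.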
Availability of data and materials
The numerical example used and analyzed during this study may be obtained from the corresponding author on reasonable request.
Abbreviations
GP: The GPV-based method
DM: The Delta method
EB: The empirical bootstrap method
n: The total sample size
References
Hauschke D, Pigeot I. Establishing efficacy of a new experimental treatment in the “Gold Standard” design. Biom J. 2005;47:782–6.
Pigeot I, Schäfer J, Hauschke D. Assessing noninferiority of a new treatment in a three-arm clinical trial including a placebo. Stat Med. 2003;22:883–9.
Hasler M, Vonk R, Hothorn LA. Assessing noninferiority of a new treatment in a three-arm trial in the presence of heteroscedasticity. Stat Med. 2008;27:490–503.
Searls DT. The utilization of a known coefficient of variation in the estimation procedure. J Am Stat Assoc. 1964;59:1225–6.
Wu WH, Hsieh HN. Assessing the noninferiority of a new treatment in a three-arm trial with unknown coefficient of variation. Commun Stat Simul Comput. 2022. https://doi.org/10.1080/03610918.2022.2051716.
Thangjai W, Niwitpong S, Niwitpong SA. Confidence intervals for mean and difference of means of normal distributions with unknown coefficients of variation. Mathematics. 2017;5:1–23.
Tovohery JM, Totohasina A, Rajaonasy FD. Application of equality test of coefficients of variation to the heteroskedasticity test. Am J Comput Math. 2020;10:73–89.
Li Z, Yao J. Testing for heteroscedasticity in high-dimensional regressions. Econom Stat. 2019;9:122–39.
Tsui K, Weerahandi S. Generalized p-values in significance testing of hypotheses in the presence of nuisance parameters. J Am Statist Assoc. 1989;84:602–7.
Liao CT, Iyer HK. A tolerance interval for the normal distribution with several variance components. Stat Sinica. 2004;14:217–29.
Liao CT, Lin TY, Iyer HK. One- and two-sided tolerance intervals for general balanced mixed models and unbalanced one-way random models. Technometrics. 2005;47:323–35.
McNally RJ, Iyer HK, Mathew T. Tests for individual and population bioequivalence based on generalized p-values. Stat Med. 2003;22:31–53.
Mathew T, Webb DW. Generalized p-values and confidence intervals for variance components: applications to army test and evaluation. Technometrics. 2005;47:312–22.
Gamage J, Mathew T, Weerahandi S. Generalized p-values and generalized confidence regions for the multivariate Behrens–Fisher problem and MANOVA. J Multivar Anal. 2004;88:177–89.
Li CR, Liao CT, Liu JP. A noninferiority test for diagnostic accuracy based on the paired partial areas under ROC curves. Stat Med. 2008;27:1762–76.
Gamalo MA, Muthukumarana S, Ghosh P, Tiwari RC. A generalized p-value approach for assessing noninferiority in a three-arm trial. Stat Methods Med Res. 2013;22:261–77.
Efron B, Tibshirani RJ. An Introduction to the Bootstrap. New York: Chapman and Hall; 1993.
Williams CJ. In Christopher J. Williams' Nonparametric Statistics (STAT 514) Course Notes at the University of Idaho, Fall 2010. Retrieved from https://www.webpages.uidaho.edu/~chrisw/stat514/bootstrapcimethods1.pdf.
Dorfman R. A note on the δ-method for finding variance formulae. The Biometric Bulletin. 1938;1:129–37.
Weerahandi S. Generalized confidence intervals. J Am Statist Assoc. 1993;88:899–905.
Hauschke D, Slacik-Erben R, Hensen S, Kaufmann R. Biostatistical assessment of mutagenicity studies by including the positive control. Biom J. 2005;47:82–7.
Adler ID, Kliesch U. Comparison of single and multiple treatment regimens in the mouse bone marrow micronucleus assay for hydroquinone and cyclophosphamide. Mutat Res. 1990;234:115–23.
Hothorn LA, Hauschke D. Identifying the maximum safe dose: a multiple testing approach. J Biopharm Stat. 2000;10:15–30.
Box GEP, Cox DR. An analysis of transformation. J R Statist Soc Ser B. 1969;26:211–46.
Acknowledgements
We are grateful to anonymous reviewers and editors for their comments on our manuscript.
Funding
This research did not receive specific funding from any institution.
Contributions
M.C.Lee, W.Y.Wu, W.H.Wu proposed concept development and design of study. H.Y.Lu, H.N.Hsieh performed statistical simulations and acquisition of data. M.C.Lee, W.Y.Wu, H.Y.Lu analyzed and interpreted data. H.N.Hsieh, W.H.Wu conducted manuscript drafting and revised the manuscript. All authors reviewed and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Lee, MC., Wu, WY., Lu, HY. et al. Conducting the noninferiority test for the means with unknown coefficient of variation in a three-arm trial. BMC Med Res Methodol 23, 183 (2023). https://doi.org/10.1186/s12874-023-01990-w
DOI: https://doi.org/10.1186/s12874-023-01990-w
Keywords
 Heteroskedasticity
 Coefficient of variation
 Generalized p-value
 Noninferiority test
 Searls' estimator