Comparison of alternative approaches for difference, noninferiority, and equivalence testing of normal percentiles

Background Percentiles are widely used in scientific research for determining the comparative magnitude and reference limit of quantitative measurements. The investigations for point and interval estimation of normal percentiles are well documented in the literature. However, the corresponding statistical tests of hypothesis have received relatively little attention. Methods To facilitate data analysis and design planning of percentile study, this paper aims to present hypothesis testing procedures and associated power functions for assessing the difference, noninferiority, and equivalence of normal percentiles. Results Numerical illustrations about drug dissolution are provided to demonstrate the usefulness of the suggested exact approaches and the deficiency of approximate methods. Conclusions The exact approaches are superior to the approximate methods on the basis of control of Type I errors. Computer algorithms are constructed to implement the recommended test procedures and sample size calculations for percentile analysis.


Background
Percentiles are extremely useful for describing the reference threshold and meaningful magnitude of numerical quantities, such as achievement score, developmental index, medical measurement, and physical dimension. The inferential methods for normal means are well documented in the fundamental texts of statistical analysis. However, the methodological aspects and statistical implications of analyzing normal percentiles have been less discussed. It is essential to note that normal percentiles are a linear function of the mean and standard deviation of the underlying population. Because the sample mean and sample variance are complete and sufficient statistics for the population mean and variance, the minimum variance unbiased estimator of a normal percentile can be readily obtained. Specifically, Royston and Mathews [1] compared the minimum variance unbiased estimator and other useful formulas under the intrinsic criteria of bias and mean square error. More advanced and theoretical treatments of normal percentile estimation are also available in Keating, Mason, and Balakrishnan [2], Keating and Tripathi [3], Parrish [4], Rukhin [5], and Zidek [6,7].
Both exact and approximate confidence intervals of normal percentiles have been considered in several analytical developments. The exact interval estimation of normal percentiles was presented in Meeker, Hahn, and Escobar [8], Johnson, Kotz, and Balakrishnan [9], and Owen [10]. Note that the exact confidence intervals involve the quantiles of a noncentral t distribution. Such critical values are not commonly available in tabulated form, and their implementation requires appropriate computing algorithms. To circumvent the reliance on a noncentral t distribution, approximate methods were developed using the standardization technique and the regular t distribution. Accordingly, the approximate confidence intervals of Bland and Altman [11] and Chakraborti and Li [12] are computationally simple and the interval calculations do not require specialized software. However, the numerical study of Shieh [13] demonstrated that the confidence limits of the approximate methods generally do not preserve the nominal equal-tailed error rates. The finding provides a cautionary counterpoint on the practical value of approximate intervals, especially when the sample sizes are small.
The existing investigations present important inferential methodology for point and interval estimation of normal percentiles. However, the related hypothesis testing problems have not been properly explicated in the literature. It is well known that there is a direct connection between confidence intervals and hypothesis tests, although the two approaches differ philosophically in their emphasis on precision versus power. Accordingly, a significance test for a percentile can alternatively be conducted by examining whether the specified percentile value is contained in the appropriate two- or one-sided confidence interval. It may therefore appear that percentile analysis can be performed without explicitly defining the test statistics and associated rejection regions. However, power evaluation and sample size planning for hypothesis testing methodologically differ from the precision and sample size considerations in the context of interval estimation. Consequently, it is of theoretical importance and practical interest to document the exact test procedures, power calculations, and sample size determinations for percentile studies.
To enhance the usage of percentile analysis, this article describes hypothesis testing procedures and associated power functions for assessing the difference, noninferiority, and equivalence of normal percentiles. The difference and noninferiority procedures closely follow the two- and one-tailed test formulations. In conventional studies of population means, a null hypothesis of zero may be informative for addressing certain essential research questions. The situations associated with percentile assessment are more involved because the target percentile is unlikely to be zero. The percentile tests for difference and noninferiority require researchers to provide a sensible magnitude that corresponds to the percentile threshold for identifying a substantial research finding. Moreover, the importance of establishing equivalence instead of no difference has been emphasized in Blackwelder [14] and Parkhurst [15], among others. Further details on the design and analysis of noninferiority and equivalence studies can be found in Fleming et al. [16] and Wellek [17].
Notably, the binomial test of hypotheses concerning quantiles in Mood, Graybill, and Boes ([18], Section 11.3.2) provides an appealing nonparametric alternative. Although the procedure is applicable to all random samples from a continuous distribution, there are not many feasible alpha values for small sample sizes unless randomized tests are used. In general, nonparametric tests may be more powerful than their parametric counterparts when the normality assumption fails, whereas they are less powerful than the parametric procedures when the conventional assumptions hold. More importantly, the undesirable properties and related problems associated with binomial tests have been addressed in Vos and Hudson [19] and Thulin [20], among others. Comprehensive discussions and reviews of the prevailing Wald large-sample normal test and other alternative interval procedures can be found in Agresti and Coull [21], Newcombe [22], Brown, Cai, and DasGupta [23,24], and the references therein. The illustrations and appraisals in this article are confined to test procedures that assume normality of the sampling distribution.
This paper aims to present the exact test procedures for percentile study under the three structural considerations of difference, noninferiority, and equivalence scenarios. For the purpose of providing profound implications in selecting the most appropriate approach, the approximate techniques of Bland and Altman [11] and Chakraborti and Li [12] are also extended to the percentile testing problem. Specifically, Bland and Altman [11] proposed an approximate t distribution for a convenient transformation of the natural, but biased, estimator of the normal percentile. On the other hand, Chakraborti and Li [12] suggested that a standardized minimum variance unbiased estimator also has an approximate t distribution. Note that the simplified considerations proposed in Bland and Altman [11] and Chakraborti and Li [12] may be appealing for inducing computational shortcuts but they do not necessarily maintain the desired accuracy for all settings, especially when the sample sizes are small. Accordingly, it is essential to discern not only which method is most suitable under what circumstances but also the actual differences between the contending test procedures.
Furthermore, the corresponding power and sample size calculations for advance planning of percentile studies are explicated. A Monte Carlo simulation study was also conducted to compare the accuracy of the exact and approximate procedures with respect to the control of Type I error rate. Although an exact technique is theoretically better than the approximate methods, the actual difference in performance may not be large enough to justify adopting an exact approach that is methodologically sophisticated and computationally demanding. The current study provides detailed analytic explications and numerical evidence to reveal the discrepancy between the exact and approximate procedures for percentile analysis. A drug dissolution problem and accompanying software programs are employed to illustrate the usefulness of the suggested procedures for data analysis and design planning.

Exact test procedures
Assume $X_1, \ldots, X_N$ are a random sample from a $N(\mu, \sigma^2)$ population with unknown mean $\mu$ and variance $\sigma^2$ for $N > 1$. The $100p$th percentile of the normal distribution $N(\mu, \sigma^2)$ is

$$\theta = \mu + z_p \sigma,$$

where $z_p$ is the $(100 \cdot p)$th percentile of the standard normal distribution $N(0, 1)$. An intuitive, but biased, estimator of the percentile $\theta$ is

$$\hat{\theta}_B = \bar{X} + z_p S,$$

where $\bar{X} = \sum_{i=1}^{N} X_i / N$ and $S^2 = \sum_{i=1}^{N} (X_i - \bar{X})^2 / (N - 1)$ are the sample mean and sample variance, respectively. Accordingly, the minimum variance unbiased estimator is

$$\hat{\theta}_M = \bar{X} + z_p c S,$$

where $c = (\nu/2)^{1/2} \Gamma(\nu/2) / \Gamma\{(\nu + 1)/2\}$ and $\nu = N - 1$. Further details about the point estimation properties of $\hat{\theta}_B$ and $\hat{\theta}_M$ are available in Royston and Mathews [1]. Also, the recent study of Shieh [13] compared several confidence interval procedures of $\theta$. In contrast, the focus here is on the hypothesis testing of normal percentiles. Under the prescribed normal setting for the sample $\{X_1, \ldots, X_N\}$, standard derivations show that

$$T_E = \frac{\bar{X} - \theta}{(S^2/N)^{1/2}} \sim t(\nu, -z_p N^{1/2}),$$

where $t(\nu, -z_p N^{1/2})$ is a noncentral t distribution with degrees of freedom $\nu$ and noncentrality parameter $-z_p N^{1/2}$. The fundamental properties and related extensions of the noncentral t distribution can be found in Johnson, Kotz, and Balakrishnan [9].
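The two point estimators can be computed directly from summary statistics. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the inputs (mean 50.10, standard deviation 1.31, N = 15, p = 0.90) are illustrative values chosen for this sketch rather than a prescribed data set.

```python
import math
from statistics import NormalDist


def bias_factor(nu: int) -> float:
    """c = (nu/2)^(1/2) * Gamma(nu/2) / Gamma((nu+1)/2), evaluated on the log scale."""
    return math.sqrt(nu / 2.0) * math.exp(
        math.lgamma(nu / 2.0) - math.lgamma((nu + 1) / 2.0)
    )


def percentile_estimates(xbar: float, s: float, n: int, p: float):
    """Return (theta_B, theta_M): the biased estimator and the
    minimum variance unbiased estimator of the 100p-th normal percentile."""
    z_p = NormalDist().inv_cdf(p)   # standard normal percentile z_p
    c = bias_factor(n - 1)          # correction so that E[c * S] = sigma
    return xbar + z_p * s, xbar + z_p * c * s


# Illustrative (assumed) inputs
theta_b, theta_m = percentile_estimates(xbar=50.10, s=1.31, n=15, p=0.90)
```

Because c exceeds 1 for every finite ν, the unbiased estimator lies farther from the sample mean than the biased one whenever p differs from 0.5.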

Tests for difference
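To detect the magnitude of a percentile in terms of the hypotheses

$$H_0: \theta = \theta_0 \quad \text{versus} \quad H_1: \theta \neq \theta_0,$$

the test statistic is of the form

$$T_{E0} = \frac{\bar{X} - \theta_0}{(S^2/N)^{1/2}},$$

where $\theta_0$ is a constant. The test rejects $H_0$ at the significance level $\alpha$ if $T_{E0} < \tau_{\alpha/2}$ or $T_{E0} > \tau_{1-\alpha/2}$, where $\tau_{\alpha/2}$ and $\tau_{1-\alpha/2}$ are the lower and upper $(100 \cdot \alpha/2)$th quantiles of the distribution $t(\nu, -z_p N^{1/2})$, respectively, for $0 < \alpha < 0.5$. Because $T_{E0} \sim t(\nu, \Delta)$ in general, it can be shown that the power function is of the form

$$\Psi_{DI} = P\{t(\nu, \Delta) < \tau_{\alpha/2}\} + P\{t(\nu, \Delta) > \tau_{1-\alpha/2}\},$$

where $\Delta = (\mu - \theta_0)/(\sigma^2/N)^{1/2}$. Under $H_0$, $\mu - \theta_0 = -z_p \sigma$, so that $\Delta$ reduces to $-z_p N^{1/2}$ and the power equals $\alpha$.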
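The exact difference test needs quantiles of a noncentral t distribution. The sketch below is one possible standard-library implementation and is not the article's SAS/IML program: it evaluates the noncentral t distribution function as the expectation of a normal probability over a chi-distributed scale, using Simpson's rule, and inverts it by bisection. All function names and the integration settings are assumptions made for illustration.

```python
import math
from statistics import NormalDist

_PHI = NormalDist().cdf


def nct_cdf(t, nu, delta, steps=600):
    """P{t(nu, delta) <= t} = E[Phi(t*U/sqrt(nu) - delta)], U ~ chi(nu),
    approximated with Simpson's rule (steps must be even)."""
    centre = math.sqrt(nu)
    lo, hi = max(0.0, centre - 9.0), centre + 9.0
    h = (hi - lo) / steps
    log_norm = (nu / 2.0 - 1.0) * math.log(2.0) + math.lgamma(nu / 2.0)
    total = 0.0
    for i in range(steps + 1):
        u = lo + i * h
        if u == 0.0:
            dens = 0.0 if nu > 1 else math.exp(-log_norm)
        else:
            dens = math.exp((nu - 1) * math.log(u) - u * u / 2.0 - log_norm)
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * dens * _PHI(t * u / centre - delta)
    return total * h / 3.0


def nct_ppf(q, nu, delta):
    """Quantile of t(nu, delta) by bracket expansion and bisection."""
    lo, hi = delta - 1.0, delta + 1.0
    while nct_cdf(lo, nu, delta) > q:
        lo -= 2.0 * (hi - lo)
    while nct_cdf(hi, nu, delta) < q:
        hi += 2.0 * (hi - lo)
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if nct_cdf(mid, nu, delta) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def exact_difference_test(xbar, s, n, p, theta0, alpha=0.05):
    """Return (T_E0, lower critical value, upper critical value, reject?)."""
    nu = n - 1
    ncp0 = -NormalDist().inv_cdf(p) * math.sqrt(n)   # noncentrality under H0
    t_e0 = (xbar - theta0) / (s / math.sqrt(n))
    lower = nct_ppf(alpha / 2.0, nu, ncp0)
    upper = nct_ppf(1.0 - alpha / 2.0, nu, ncp0)
    return t_e0, lower, upper, (t_e0 < lower or t_e0 > upper)


def power_difference(mu, sigma, n, p, theta0, alpha=0.05):
    """Power of the exact two-sided test at the alternative (mu, sigma)."""
    nu = n - 1
    ncp0 = -NormalDist().inv_cdf(p) * math.sqrt(n)
    delta = (mu - theta0) * math.sqrt(n) / sigma
    lower = nct_ppf(alpha / 2.0, nu, ncp0)
    upper = nct_ppf(1.0 - alpha / 2.0, nu, ncp0)
    return nct_cdf(lower, nu, delta) + 1.0 - nct_cdf(upper, nu, delta)
```

With a zero noncentrality parameter the routine reduces to the ordinary t distribution, which gives a convenient check against tabulated t quantiles. Dedicated library routines for the noncentral t distribution, where available, would normally be preferred over this hand-rolled integration.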

Tests for noninferiority
In addition to the regular test of difference, it is of practical importance to test hypotheses of noninferiority. When larger values of $\theta$ are desired, the problem of testing noninferiority of percentiles can be presented by the hypotheses

$$H_0: \theta \le \theta_0 \quad \text{versus} \quad H_1: \theta > \theta_0,$$

where $\theta_0$ is the designated noninferiority threshold. The test procedure rejects the null hypothesis at the significance level $\alpha$ if $T_{E0} > \tau_{1-\alpha}$, and the associated power function is readily obtained as

$$\Psi_{NI} = P\{t(\nu, \Delta) > \tau_{1-\alpha}\}.$$

On the other hand, if smaller values of $\theta$ are preferred, then the following hypotheses should be adopted for the test of noninferiority:

$$H_0: \theta \ge \theta_0 \quad \text{versus} \quad H_1: \theta < \theta_0,$$

where the chosen value $\theta_0$ represents the noninferiority bound. At the significance level $\alpha$, the rejection region for the lower one-sided test is $T_{E0} < \tau_{\alpha}$ and the power function is expressed as

$$P\{t(\nu, \Delta) < \tau_{\alpha}\}.$$

Tests for equivalence
Unlike the traditional difference-based procedures, equivalence testing provides a proper method for demonstrating the comparability of a target percentile. In general, the null and alternative hypotheses of a test of percentile equivalence can be formulated as

$$H_0: \theta - \theta_T \le -\delta \ \text{or} \ \theta - \theta_T \ge \delta \quad \text{versus} \quad H_1: -\delta < \theta - \theta_T < \delta,$$

where $\theta_T$ and $\delta$ ($> 0$) are constants. Accordingly, $\theta_T$ is the target value and $\delta$ represents the minimum threshold for declaring equivalence between the population percentile $\theta$ and $\theta_T$. Following the two one-sided tests procedure proposed by Schuirmann [25] and Westlake [26] for assessing equivalence of mean effects, the null hypothesis is rejected at the significance level $\alpha$ if

$$T_{EL} > \tau_{1-\alpha} \quad \text{and} \quad T_{EU} < \tau_{\alpha},$$

where $T_{EL} = (\bar{X} - \theta_T + \delta)/(S^2/N)^{1/2}$ and $T_{EU} = (\bar{X} - \theta_T - \delta)/(S^2/N)^{1/2}$. It is important to note that the rejection region is an intersection of two one-sided segments in terms of the lower and upper $(100 \cdot \alpha)$th quantiles $\tau_{\alpha}$ and $\tau_{1-\alpha}$ of the noncentral t distribution $t(\nu, -z_p N^{1/2})$. The rejection region of $\bar{X}$ and $S^2/N$ has an isosceles triangular shape similar to those in Meyners [27] and Schuirmann [28] for the equivalence procedure of two treatment means.
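Consequently, the power function of the percentile equivalence test can be written as

$$\Psi_{EQ} = P\{T_{EL} > \tau_{1-\alpha} \ \text{and} \ T_{EU} < \tau_{\alpha}\}.$$

Moreover, it is clear from the fundamental assumption that $Z = (\bar{X} - \mu)/(\sigma^2/N)^{1/2} \sim N(0, 1)$ and $K = \nu S^2/\sigma^2 \sim \chi^2(\nu)$ are independent. Then, the exact power function can be expressed by

$$\Psi_{EQ} = E_K\left[\left\{\Phi\left(-\Delta_U + \tau_{\alpha}(K/\nu)^{1/2}\right) - \Phi\left(-\Delta_L + \tau_{1-\alpha}(K/\nu)^{1/2}\right)\right\} I(K < \kappa_E)\right],$$

where $\Delta_L = (\mu - \theta_T + \delta)/(\sigma^2/N)^{1/2}$, $\Delta_U = (\mu - \theta_T - \delta)/(\sigma^2/N)^{1/2}$, $\kappa_E = \nu\{(\Delta_L - \Delta_U)/(\tau_{1-\alpha} - \tau_{\alpha})\}^2$, $I(\cdot)$ is the indicator function, $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution, and the expectation $E_K$ is taken with respect to the distribution of $K$. It is essential to note that the probability $P\{K \ge \kappa_E\} \doteq 0$ in the subsequent numerical assessments under a wide range of model configurations. This phenomenon is similar to the power computations for the equivalence procedure of two treatment means as noted in Siqueira et al. [29] and Shieh [30]. Therefore, the exact power appraisal can be numerically approximated by

$$\Psi_{AEQ} = P\{t(\nu, \Delta_U) < \tau_{\alpha}\} - P\{t(\nu, \Delta_L) \le \tau_{1-\alpha}\}.$$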
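The role of the bound κ_E can be traced step by step. The derivation below is a sketch under the stated normal model; the auxiliary symbols Δ_L and Δ_U and the statistics T_EL and T_EU (the two one-sided statistics centered at θ_T − δ and θ_T + δ) are notation assumed for this illustration.

```latex
% Assumed notation: Z = (\bar{X}-\mu)/(\sigma^2/N)^{1/2} \sim N(0,1), K = \nu S^2/\sigma^2 \sim \chi^2(\nu), independent;
% \Delta_L = (\mu-\theta_T+\delta)/(\sigma^2/N)^{1/2}, \Delta_U = (\mu-\theta_T-\delta)/(\sigma^2/N)^{1/2}.
\begin{align*}
T_{EL} > \tau_{1-\alpha}
  &\iff \bar{X} > \theta_T - \delta + \tau_{1-\alpha}(S^2/N)^{1/2}
   \iff Z > -\Delta_L + \tau_{1-\alpha}(K/\nu)^{1/2}, \\
T_{EU} < \tau_{\alpha}
  &\iff \bar{X} < \theta_T + \delta + \tau_{\alpha}(S^2/N)^{1/2}
   \iff Z < -\Delta_U + \tau_{\alpha}(K/\nu)^{1/2}.
\end{align*}
% The interval for Z is nonempty only if its upper end exceeds its lower end:
\begin{align*}
-\Delta_U + \tau_{\alpha}(K/\nu)^{1/2} > -\Delta_L + \tau_{1-\alpha}(K/\nu)^{1/2}
  \iff K < \nu\left\{\frac{\Delta_L - \Delta_U}{\tau_{1-\alpha} - \tau_{\alpha}}\right\}^2
  = \nu\left\{\frac{2\delta N^{1/2}}{\sigma(\tau_{1-\alpha} - \tau_{\alpha})}\right\}^2 .
\end{align*}
```

When the sample variance is so large that K reaches this bound, the two one-sided rejection regions no longer overlap and equivalence cannot be declared. If that event has negligible probability, the joint probability reduces to a difference of two noncentral t probabilities, which is the numerical simplification described above.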

Approximate methods
For the purpose of method comparison, two different approaches for testing normal percentiles are also presented next. To construct confidence intervals of normal percentiles, Bland and Altman [11] and Chakraborti and Li [12] considered simple t approximations for the standardized forms of $\hat{\theta}_B$ and $\hat{\theta}_M$, respectively. Their methods are extended and examined here for the three types of difference, noninferiority, and equivalence testing.

The Chakraborti-Li method
In view of the desirable properties of the minimum variance unbiased estimator $\hat{\theta}_M$, Chakraborti and Li [12] suggested an approximate t distribution for the standardized quantity of $\hat{\theta}_M$:

$$T_M = \frac{\hat{\theta}_M - \theta}{(mS^2/N)^{1/2}} \mathrel{\dot\sim} t(\nu),$$

where $m = 1 + N z_p^2 (c^2 - 1)$ and $t(\nu)$ is a t distribution with degrees of freedom $\nu$. Note that $\mathrm{Var}[\hat{\theta}_M] = (m\sigma^2)/N$ and the denominator of $T_M$ is obtained by a direct substitution of $\sigma^2$ with $S^2$ in the standard deviation of $\hat{\theta}_M$.
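The simple formulation of $T_M$ provides an alternative test statistic for judging the magnitude of normal percentiles. For the hypothesis test of difference in terms of $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$, the null hypothesis can be rejected at the significance level $\alpha$ if $T_{M0} < t_{\alpha/2}$ or $T_{M0} > t_{1-\alpha/2}$, where

$$T_{M0} = \frac{\hat{\theta}_M - \theta_0}{(mS^2/N)^{1/2}}$$

and $t_{\alpha/2}$ and $t_{1-\alpha/2}$ are the lower and upper $100(\alpha/2)$th quantiles of a t distribution $t(\nu)$ with degrees of freedom $\nu$, respectively. The corresponding power function is the probability of this rejection region, evaluated under the approximate t assumption:

$$\Omega_{DI} = P\{T_{M0} < t_{\alpha/2}\} + P\{T_{M0} > t_{1-\alpha/2}\}.$$

Similarly, the test statistic $T_{M0}$ can be applied for hypothesis testing of noninferiority of percentiles in terms of $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$. The test procedure rejects the null hypothesis at the significance level $\alpha$ if $T_{M0} > t_{1-\alpha}$ and the associated power function is

$$\Omega_{NI} = P\{T_{M0} > t_{1-\alpha}\}.$$

Moreover, under the hypotheses $H_0: \theta \ge \theta_0$ versus $H_1: \theta < \theta_0$, the null hypothesis of the noninferiority test is rejected if $T_{M0} < t_{\alpha}$ and the corresponding power is given by $P\{T_{M0} < t_{\alpha}\}$. For the case of evaluating percentile equivalence with respect to $H_0: \theta - \theta_T \le -\delta$ or $\theta - \theta_T \ge \delta$ versus $H_1: -\delta < \theta - \theta_T < \delta$, the null hypothesis is rejected at the significance level $\alpha$ if

$$T_{ML} > t_{1-\alpha} \quad \text{and} \quad T_{MU} < t_{\alpha}.$$

Accordingly, the power function can be shown as

$$\Omega_{EQ} = P\{T_{ML} > t_{1-\alpha} \ \text{and} \ T_{MU} < t_{\alpha}\},$$

where $T_{ML} = (\hat{\theta}_M - \theta_T + \delta)/(mS^2/N)^{1/2}$ and $T_{MU} = (\hat{\theta}_M - \theta_T - \delta)/(mS^2/N)^{1/2}$. Numerically, the power calculation can be simplified just as $\Psi_{AEQ}$ given above:

$$\Omega_{AEQ} = P\{T_{MU} < t_{\alpha}\} - P\{T_{ML} \le t_{1-\alpha}\}.$$

The Bland-Altman method
Similar to the test procedures based on the minimum variance unbiased estimator, hypothesis testing of normal percentiles can be conducted with the following transformation of $\hat{\theta}_B$ in Bland and Altman [11]:

$$T_B = \frac{\hat{\theta}_B - \theta}{(bS^2/N)^{1/2}} \mathrel{\dot\sim} t(\nu),$$

where $b = 1 + z_p^2/2$.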
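Specifically, for the hypothesis test of percentile difference in terms of $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$, the null hypothesis is rejected at the significance level $\alpha$ if $T_{B0} < t_{\alpha/2}$ or $T_{B0} > t_{1-\alpha/2}$, where

$$T_{B0} = \frac{\hat{\theta}_B - \theta_0}{(bS^2/N)^{1/2}}.$$

The associated power function is of the form

$$\Xi_{DI} = P\{T_{B0} < t_{\alpha/2}\} + P\{T_{B0} > t_{1-\alpha/2}\},$$

evaluated under the approximate t assumption. To perform the hypothesis testing of noninferiority with $H_0: \theta \le \theta_0$ versus $H_1: \theta > \theta_0$, the test rejects the null hypothesis at the significance level $\alpha$ if $T_{B0} > t_{1-\alpha}$ and the power function is readily obtained as

$$\Xi_{NI} = P\{T_{B0} > t_{1-\alpha}\}.$$

Likewise, under the hypotheses $H_0: \theta \ge \theta_0$ versus $H_1: \theta < \theta_0$, the null hypothesis of the noninferiority test is rejected if $T_{B0} < t_{\alpha}$ and the corresponding power is expressed as $P\{T_{B0} < t_{\alpha}\}$. Moreover, for the equivalence test of normal percentiles under the hypotheses of $H_0: \theta - \theta_T \le -\delta$ or $\theta - \theta_T \ge \delta$ versus $H_1: -\delta < \theta - \theta_T < \delta$, the null hypothesis is rejected at the significance level $\alpha$ if

$$T_{BL} > t_{1-\alpha} \quad \text{and} \quad T_{BU} < t_{\alpha}.$$

In this case, the power function has the following formulation:

$$\Xi_{EQ} = P\{T_{BL} > t_{1-\alpha} \ \text{and} \ T_{BU} < t_{\alpha}\},$$

where $T_{BL} = (\hat{\theta}_B - \theta_T + \delta)/(bS^2/N)^{1/2}$ and $T_{BU} = (\hat{\theta}_B - \theta_T - \delta)/(bS^2/N)^{1/2}$. Similar to the other two cases, the power computation can be well approximated by

$$\Xi_{AEQ} = P\{T_{BU} < t_{\alpha}\} - P\{T_{BL} \le t_{1-\alpha}\}.$$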
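Both approximate statistics depend on the data only through the sample mean and standard deviation, so they are simple to evaluate. The sketch below assumes the standardizations described above (scale factors m = 1 + N z_p²(c² − 1) and b = 1 + z_p²/2); the function name is hypothetical, and the inputs are the dissolution summary values used later in the article, with N = 15 inferred from the reported critical value t 0.975 = 2.1448 (ν = 14) rather than stated explicitly.

```python
import math
from statistics import NormalDist


def approximate_statistics(xbar, s, n, p, theta0):
    """Return (T_M0, T_B0): the Chakraborti-Li and Bland-Altman statistics
    for testing H0: theta = theta0, each referred to a central t(n - 1)."""
    nu = n - 1
    z_p = NormalDist().inv_cdf(p)
    c = math.sqrt(nu / 2.0) * math.exp(
        math.lgamma(nu / 2.0) - math.lgamma((nu + 1) / 2.0)
    )
    m = 1.0 + n * z_p**2 * (c**2 - 1.0)   # variance factor of the unbiased estimator
    b = 1.0 + z_p**2 / 2.0                # Bland-Altman variance factor
    se = s / math.sqrt(n)                 # (S^2/N)^(1/2)
    t_m0 = (xbar + z_p * c * s - theta0) / (math.sqrt(m) * se)
    t_b0 = (xbar + z_p * s - theta0) / (math.sqrt(b) * se)
    return t_m0, t_b0


# Assumed inputs: summary statistics of the dissolution example, N inferred as 15
t_m0, t_b0 = approximate_statistics(xbar=50.10, s=1.31, n=15, p=0.90, theta0=50.8379)
```

Under these inputs the two statistics agree with the values T M0 = 2.0857 and T B0 = 2.0614 reported in the example section.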

Results
Numerical investigations are presented next to examine and compare the fundamental features of the exact and approximate test procedures of percentiles with respect to the control of Type I error rate and accuracy of power and sample size computation.

Tests for difference
For the purpose of illustration, the null N(μ 0 , σ 0 2 ) distribution is set as N(0, 1) and two different mean values are considered for the alternative distribution N(μ, σ 2 ): N(0.4, 1) and N(0.6, 1). The corresponding percentiles θ 0 and θ are simplified as θ 0 = μ 0 + z p σ 0 = z p and θ = μ + z p σ = μ + z p , respectively, with μ = 0.4 and 0.6. For the difference test of percentile in terms of H 0 : θ = θ 0 versus H 1 : θ ≠ θ 0 , the sample sizes needed to attain the specified power 0.80 for the chosen significance level α = 0.05 are determined by the power functions Ψ DI , Ω DI , and Ξ DI for p = 0.1, …, 0.9. The computed sample sizes for the three procedures {T E0 , T M0 , T B0 } are summarized in Table 1 for all eighteen combined cases of μ and p. It should be noted that the parameter settings are chosen so that the resulting sample sizes have a reasonable magnitude that often occurs in practice. Moreover, these situations with small and moderate sample sizes are of great importance because the contending procedures are then most likely to yield distinct outcomes. Monte Carlo simulation studies of 10,000 iterations were conducted to examine the accuracy of the power functions Ψ DI , Ω DI , and Ξ DI . The results reveal that the simulated powers and the attained powers of all three methods agree to the second decimal place for all cases considered here. To save space, the details are not reported.
Due to the approximate nature of the t distribution associated with the two approximations of Chakraborti and Li [12] and Bland and Altman [11], it is of statistical concern to validate the control of the Type I error rates. Note that the sampling distribution of the percentile estimator is skewed when the sample size is small and p deviates considerably from 0.5. This implies that the symmetric t approximation of the two test statistics T M0 and T B0 is presumably unsuitable. In other words, the two critical values t α/2 and t 1 − α/2 are theoretically inaccurate when the one-sided rejection probabilities are evaluated. It is therefore constructive to examine three distinct Type I error rates that correspond to the lower-tail, upper-tail, and two-sided rejection regions of the difference tests of percentile. Accordingly, Monte Carlo simulation studies were also performed to compute the simulated Type I error rates of the exact and approximate test procedures for θ = θ 0 or μ = 0. The simulated Type I error rate was the proportion of the 10,000 replicates whose test statistic fell in the designated rejection region. In the process, the estimates of the lower-tail and upper-tail rejection rates were computed and summed as the overall or two-sided simulated Type I error rate. The accuracy of the control of Type I error rate can be assessed by the differences between the one-sided and two-sided simulation estimates and the nominal values 0.025 and 0.05, respectively. These differences or errors of the three contending test procedures are also reported in Table 1. It can be readily seen from the results in Table 1 that all three test methods have excellent control of the two-sided Type I error rate. The absolute magnitudes of the errors are less than 0.01 for the investigated mean and percentile configurations.
Moreover, the lower-tail and upper-tail rejection rates of the exact approach are also very close to the nominal levels. But the one-sided Type I error rates of the two approximate methods do not maintain the same accuracy, especially for low and high percentiles. Despite the desired performance of the approximate tests in overall Type I error rate, the resulting errors of the lower-tail rejection region tend to be negative for small p while those associated with large p are consistently positive. In contrast, the upper-tail errors have exactly the opposite pattern. For the particular case with μ = 0.6 and p = 0.9, the induced errors for the approximation of Chakraborti and Li [12] are 0.0201 and −0.0174 for the lower and upper rejection regions, respectively. The corresponding relative deviations are 0.0201/0.025 = 80.4% and 0.0174/0.025 = 69.6%. For the approximate method of Bland and Altman [11], the lower-tail and upper-tail errors are 0.0248 and −0.0182 with the relative deviations 0.0248/0.025 = 99.2% and 0.0182/0.025 = 72.8%, respectively.

Table 1 The error between simulated alpha and nominal alpha for the difference tests of percentile H 0 : θ = θ 0 versus H 1 : θ ≠ θ 0 with μ 0 = 0, σ 0 = 1, σ = 1, and α = 0.05
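The imbalance between the two tails can be reproduced with a small simulation. The following sketch estimates the lower-tail and upper-tail rejection rates of the Chakraborti-Li statistic when the null hypothesis is true. The configuration (N = 15, p = 0.90, the tabulated t critical value 2.1448 for 14 degrees of freedom, 20,000 replicates, and the seed) is an assumption chosen for illustration and is not one of the cells of Table 1.

```python
import math
import random
from statistics import NormalDist


def tail_rejection_rates(n=15, p=0.90, t_crit=2.1448, reps=20000, seed=1):
    """Simulate N(0, 1) samples with H0: theta = z_p true and return the
    (lower-tail, upper-tail) rejection rates of T_M0 against +/- t_crit."""
    rng = random.Random(seed)
    nu = n - 1
    z_p = NormalDist().inv_cdf(p)
    c = math.sqrt(nu / 2.0) * math.exp(
        math.lgamma(nu / 2.0) - math.lgamma((nu + 1) / 2.0)
    )
    m = 1.0 + n * z_p**2 * (c**2 - 1.0)
    theta0 = z_p  # true 100p-th percentile of N(0, 1)
    lower = upper = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        xbar = sum(x) / n
        s = math.sqrt(sum((v - xbar) ** 2 for v in x) / nu)
        t_m0 = (xbar + z_p * c * s - theta0) / math.sqrt(m * s * s / n)
        lower += t_m0 < -t_crit   # falls in the lower rejection region
        upper += t_m0 > t_crit    # falls in the upper rejection region
    return lower / reps, upper / reps


lower_rate, upper_rate = tail_rejection_rates()
```

For a high percentile such as p = 0.90, the lower-tail rate should come out above the nominal 0.025 and the upper-tail rate below it, matching the direction of the errors described above.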

Tests for noninferiority
The underlying characteristics of the exact and approximate methods for the noninferiority test of percentile are also assessed. With the same model formulations as in the previous scenario of the difference test, the required sample sizes are computed for the hypotheses H 0 : θ ≤ θ 0 versus H 1 : θ > θ 0 with the power functions Ψ NI , Ω NI , and Ξ NI . As expected, the sample sizes reported in Table 2 are smaller than their counterparts in Table 1 for identical values of μ and p. Moreover, simulation studies were also performed to appraise the actual Type I error performance for θ = θ 0 or μ = 0. The errors between the simulated rejection rates and the nominal value α = 0.05 are presented in Table 2. Unlike the exact procedure, which has good control of the Type I error rate, the two approximate tests do not maintain the required performance. Specifically, when μ = 0.6 and p = 0.9, the absolute errors (absolute error percentages) can be as large as 0.0258 (0.0258/0.05 = 51.6%) and 0.0260 (0.0260/0.05 = 52.0%) for T M0 and T B0 of Chakraborti and Li [12] and Bland and Altman [11], respectively. Although the situation improves with increasing sample size, as in the cases with μ = 0.4, the approximate tests still show some deficiency and are outperformed by the exact test.

Tests for equivalence
For the sake of completeness, numerical examination is extended to the equivalence tests of percentile in terms of H 0 : θ − θ T ≤ −δ or θ − θ T ≥ δ versus H 1 : −δ < θ − θ T < δ. In this case, the target percentile and threshold are set as θ T = z p and δ = 0.6, respectively. The alternative normal distribution is selected as N(μ, 1) and the associated percentile is θ = μ + z p σ = μ + z p . Then, the power functions Ψ AEQ , Ω AEQ , and Ξ AEQ are applied to compute the minimum sample sizes required for attaining the nominal power 0.80 at α = 0.05. The resulting sample sizes are listed in Table 3 for μ = 0 and 0.3 and p = 0.1, …, 0.9. Simulation studies further confirmed that the power and sample size calculations of the three procedures are accurate for all eighteen cases reported here. However, power evaluation is valid and informative only when the critical value satisfies the nominal Type I error rate. Additional simulation studies were employed to assess the control of Type I error rates of the equivalence tests {T EL , T EU }, {T ML , T MU }, and {T BL , T BU } for θ = θ T − δ or μ = −δ = −0.6. The errors between the simulated and nominal Type I error rates are presented in Table 3. The assessments show that the two approximate tests {T ML , T MU } and {T BL , T BU } are not as good as the exact procedure {T EL , T EU }. The deficiency of the two simple t distributions is particularly prominent when the sample size is small and |p − 0.5| is large.

Table 2 The error between simulated alpha and nominal alpha for the non-inferiority tests of percentile H 0 : θ ≤ θ 0 versus H 1 : θ > θ 0 with μ 0 = 0, σ 0 = 1, σ = 1, and α = 0.05

An example
To demonstrate the usefulness of the suggested techniques and accompanying programs, a quality control application in pharmaceutical products is analyzed with the hypothesis testing and sample size procedures. Suppose a sample from the selected batch of tablets is obtained and tested according to the acceptance sampling plan. Specifically, the dissolution performance is assessed in terms of the percentage of tablets that dissolve less than a specified amount within a certain time period. For illustration, the summary statistics of the dissolution values are X̄ = 50.10 and S = 1.31 for a sample of size N = 15. For the difference test of the 90th percentile in terms of H 0 : θ = 50.8379 versus H 1 : θ ≠ 50.8379, the exact test statistic T E0 ≈ −2.18 exceeds the upper critical value τ 0.975 of the noncentral t distribution with ν = 14 degrees of freedom. The null hypothesis is rejected, which implies that the 90th percentile of dissolution amount is not 50.8379 at α = 0.05. Also, it can be shown that the values of the two approximate test statistics for Chakraborti and Li [12] and Bland and Altman [11] are T M0 = 2.0857 and T B0 = 2.0614, respectively, with the critical value t 0.975 = 2.1448. Hence, the two approximate tests suggest that the null hypothesis cannot be rejected for α = 0.05.
Moreover, a noninferiority test can be formed as H 0 : θ ≤ 50.8379 versus H 1 : θ > 50.8379. The exact test statistic T E0 ≈ −2.18 exceeds the upper critical value τ 0.95 = −3.1072, which indicates that the 90th percentile of the dissolution distribution is higher than 50.8379 at the 5% level of significance. In this case, the critical value for the two approximate methods is t 0.95 = 1.7613. Thus, the two approximate tests lead to the same result as the exact test.

Table 3 The error between simulated alpha and nominal alpha for the equivalence tests of percentile H 0 : θ − θ T ≤ −δ or θ − θ T ≥ δ versus H 1 : −δ < θ − θ T < δ with δ = 0.6, θ T = z p , σ = 1, and α = 0.05

For planning a future drug dissolution study, sample size calculations should be considered so that the tests have enough power to confirm a meaningful magnitude of percentile. It is commonly assumed that typical sources such as published findings or expert opinions can offer plausible values for the vital characteristics of a future study. Hence, the summary statistics are used as the parameter values μ = 50.1 and σ = 1.31. To achieve the nominal power 0.80 with α = 0.05, the constructed SAS/IML programs reveal that the required sample sizes are N = 21 and 17 for the test of difference, H 0 : θ = 50.8379 versus H 1 : θ ≠ 50.8379, and the test of noninferiority, H 0 : θ ≤ 50.8379 versus H 1 : θ > 50.8379, respectively. Moreover, for the test of equivalence with θ T = 51.6660 and δ = 1.2, a sample size of N = 25 is needed to attain the nominal power 0.80 at α = 0.05. Note that the exemplifying configurations are included in the user specifications of the SAS/IML programs presented in the supplemental files. Accordingly, users can easily modify the input values in these statements to accommodate their own model specifications.
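The sample size search for the noninferiority test can be sketched as a simple loop over N. The code below is not the article's SAS/IML program; it is a standard-library illustration that assumes the exact test rejects when the statistic exceeds the upper α quantile of the null noncentral t distribution, and that evaluates the noncentral t distribution function by Simpson's rule. All function names and numerical settings are hypothetical.

```python
import math
from statistics import NormalDist

_PHI = NormalDist().cdf


def nct_cdf(t, nu, delta, steps=600):
    """P{t(nu, delta) <= t} = E[Phi(t*U/sqrt(nu) - delta)], U ~ chi(nu)."""
    centre = math.sqrt(nu)
    lo, hi = max(0.0, centre - 9.0), centre + 9.0
    h = (hi - lo) / steps
    log_norm = (nu / 2.0 - 1.0) * math.log(2.0) + math.lgamma(nu / 2.0)
    total = 0.0
    for i in range(steps + 1):
        u = lo + i * h
        if u == 0.0:
            dens = 0.0 if nu > 1 else math.exp(-log_norm)
        else:
            dens = math.exp((nu - 1) * math.log(u) - u * u / 2.0 - log_norm)
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * dens * _PHI(t * u / centre - delta)
    return total * h / 3.0


def nct_ppf(q, nu, delta):
    """Quantile of t(nu, delta) by bracket expansion and bisection."""
    lo, hi = delta - 1.0, delta + 1.0
    while nct_cdf(lo, nu, delta) > q:
        lo -= 2.0 * (hi - lo)
    while nct_cdf(hi, nu, delta) < q:
        hi += 2.0 * (hi - lo)
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if nct_cdf(mid, nu, delta) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def power_noninferiority(mu, sigma, n, p, theta0, alpha=0.05):
    """Exact power of H0: theta <= theta0 versus H1: theta > theta0."""
    nu = n - 1
    ncp0 = -NormalDist().inv_cdf(p) * math.sqrt(n)   # noncentrality under H0
    delta = (mu - theta0) * math.sqrt(n) / sigma     # noncentrality at (mu, sigma)
    return 1.0 - nct_cdf(nct_ppf(1.0 - alpha, nu, ncp0), nu, delta)


def smallest_n(mu, sigma, p, theta0, target=0.80, alpha=0.05, n_max=500):
    """Smallest N in [2, n_max] whose exact power reaches the target."""
    for n in range(2, n_max + 1):
        if power_noninferiority(mu, sigma, n, p, theta0, alpha) >= target:
            return n
    raise ValueError("target power not reached within n_max")


# Planning values taken from the dissolution example
n_req = smallest_n(mu=50.1, sigma=1.31, p=0.90, theta0=50.8379)
```

The article reports N = 17 for this configuration from its SAS/IML program; the sketch is intended to mirror that calculation, though small numerical differences between implementations could shift a borderline case by one unit.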

Discussion
The present investigation generalizes and expands current results in the statistical literature by describing both exact and approximate procedures for the three different percentile tests of difference, noninferiority, and equivalence. The exact approach employs a noncentral t distribution, while the approximate techniques follow the familiar t distribution as considered in Chakraborti and Li [12] and Bland and Altman [11]. Regarding the two approximate procedures, the results of the conventional tests for difference show that the lower critical value t α/2 is generally too small for lower normal percentiles and is typically too large for the higher normal percentiles. On the other hand, the upper critical value t 1 − α/2 overestimates and underestimates the correct one for small and large p, respectively. Even though the overall Type I error rate is not an issue, it is statistically improper to recommend a two-sided test procedure that combines noticeably under- and over-sized critical values and rejection regions. Moreover, despite the relatively involved analytic assessments and computational requirements, the comprehensive numerical appraisals show that the exact approach is superior to the approximate methods on the basis of control of Type I errors.

Conclusion
In view of the conceptual simplicity and context-free feature, percentiles are widely used for determining the relative magnitude and substantial importance of quantitative measurements in all scientific fields. Accordingly, much of the literature has provided the inferential procedures for point and interval estimation of normal percentiles. To extend the applicability of percentile analysis, this article addresses the hypothesis testing problem for the percentiles of a normal distribution. The recommended test procedures and derived power functions are also empirically justified for percentile score assessments and sample size determinations. In order to facilitate data analysis and study planning, specialized computer programs are presented for conducting hypothesis testing and sample size calculation in percentile research.