The appropriateness of Bland-Altman’s approximate confidence intervals for limits of agreement

Background: Percentiles are widely used as reference limits for determining the relative magnitude and substantial importance of quantitative measurements. An important application is the widely advocated Bland-Altman limits of agreement.
Methods: To contribute to the data analysis and design planning of reference limit or percentile research, the purpose of this paper is twofold. The first is to clarify the statistical features of interval estimation procedures for normal percentiles. The second is to provide sample size procedures for precise interval estimation of normal percentiles.
Results: The delineation demonstrates the theoretical connections between different pivotal quantities for obtaining exact confidence intervals. Moreover, the seemingly accurate approximate methods, whose confidence limits are equidistant from the principal estimators, are shown to have undesirable confidence limits. It is found that the optimal sample size is smallest for the median or mean, and increases as the percentile approaches the extremes.
Conclusions: The exact interval procedure should be used in preference to the approximate methods. Computer algorithms are presented to implement the suggested interval precision and sample size calculations for planning percentile research.
Electronic supplementary material: The online version of this article (10.1186/s12874-018-0505-y) contains supplementary material, which is available to authorized users.


Background
A percentile is a numerical measure that represents the reference point below which a given percentage of values in the target population fall. Because of their conceptual simplicity and context-free nature, percentiles are widely used for determining the relative magnitude and substantial importance of quantitative measurements in all scientific fields. For example, children's health conditions are often assessed by comparing their weight and height with the national averages and percentiles found in growth charts. Also, reference limits are extensively applied in medicine and related fields to identify an informative range of measurements from a reference population. The most typical reference limits contain the central 95% of the values in the population of interest. As an important application, the Bland and Altman [1,2] 95% limits of agreement are comprised of the 2.5th percentile and 97.5th percentile for the distribution of the difference between paired measurements.
The practical usage of percentiles is often represented by referring to a normal distribution. In this prominent case, the normal percentile is a linear function of the mean and standard deviation of the designated population. Note that the sample mean and sample variance are complete and sufficient statistics for the population mean and variance. Although estimation of normal percentiles is not discussed in most standard texts, it is straightforward to obtain the minimum variance unbiased estimator of a normal percentile. However, the dominance property does not extend to other principles in decision theoretic analyses such as the mean square error criterion. Among others, Royston and Mathews [3] conducted a comparison of potential point estimators of normal percentiles with respect to bias and mean square error. More advanced and theoretical investigations of normal percentile estimators can be found in Keating, Mason, and Balakrishnan [4], Keating and Tripathi [5], Parrish [6], Rukhin [7], and Zidek [8,9].
In view of the stochastic nature in statistical inference, it is more informative to construct confidence intervals for the target parameters than to provide a single estimate about their values. General expositions and comprehensive guidelines of interval estimation are available in Hahn [10,11], Hahn and Meeker [12], and Vardeman [13]. Accordingly, various interval methods of normal percentiles have been described from different perspectives. The exact interval procedure of normal percentiles has been documented in the literature, for example, see Hahn and Meeker [12], Johnson, Kotz, and Balakrishnan [14], and Owen [15]. Moreover, the one-sided confidence intervals of normal percentiles have a close link to the one-sided tolerance bounds of a normal distribution as noted in David and Nagaraja [16], Krishnamoorthy and Mathew [17], and Odeh and Owen [18].
Notably, Bland and Altman [1,2] suggested the 95% limits of agreement for evaluating the differences between measurements by two methods. The endpoints of the Bland-Altman 95% limits of agreement are the 2.5th percentile and 97.5th percentile for the distribution of the difference between paired measurements. To reflect the uncertainty due to sampling error, approximate interval formulas were presented for estimating the two individual percentiles. The large number of citations reveals that the Bland-Altman analysis has become the major technique for assessing agreement between two methods of clinical measurement. However, the recent work of Carkeet [19] and Carkeet and Goh [20] provided detailed discussions in favor of the exact confidence interval over the approximate procedure considered in Bland and Altman [1,2], especially when the sample sizes are small. Further considerations and reviews of measuring agreement in method comparison studies are available in Barnhart, Haber, and Lin [21], Choudhary and Nagaraja [22], and Lin et al. [23].
Although the practical implementation of the exact interval procedure is well presented in Carkeet [19], the explication of the differences between the exact and approximate methods mainly concentrated on the relative magnitudes and symmetric/asymmetric bounds of the resulting confidence limits. On the other hand, the endpoints of the Bland-Altman 95% limits of agreement are usually viewed as a pair of bounds for measuring agreement in method comparison studies. Accordingly, Carkeet [19] and Carkeet and Goh [20] focused on the comparison of the approximate confidence intervals for the upper and lower limits of agreement as a pair and the exact two-sided tolerance intervals for a normal distribution. Therefore, the distinctive advantage of the exact interval procedures and the potential limitation of the approximate confidence intervals for the individual upper and lower limits of agreement were not fully addressed in Carkeet [19] and Carkeet and Goh [20]. It is of practical importance to conduct a detailed appraisal of the accuracy and discrepancy between the exact and approximate interval procedures for an individual limit of agreement under a wide range of model configurations. The problem of obtaining a single confidence interval to cover both limits of agreement simultaneously is more involved and a detailed discussion of this topic is beyond the scope of the present study.
In addition to the abovementioned studies, a numerical comparison of several interval estimation methods of normal percentiles was presented in Chakraborti and Li [24]. They adopted a standardized minimum variance unbiased estimator as the pivotal quantity and proposed both exact and approximate confidence intervals of normal percentiles. Their simulation study showed that the expected width and coverage probability of the suggested exact and approximate methods are nearly identical to those of the procedure described in Lawless ([25], p. 231). Despite the analytic arguments and empirical findings in Chakraborti and Li [24], two caveats about their presentation should be noted. First, although it was demonstrated that Lawless's [25] confidence intervals are the same as the existing formulas in Owen [15] and Odeh and Owen [18], they did not discuss the theoretical relationship between their exact method and the established exact procedure. Second, in contrast to the asymmetry of the exact confidence intervals, the approximate confidence intervals of Chakraborti and Li [24] are equidistant around the minimum variance unbiased estimate. Note that the two endpoints of a two-sided confidence interval can also be interpreted as the limits of one-sided confidence intervals. Thus, the performance of the two limits of Chakraborti and Li's [24] approximate interval method should be further evaluated with respect to the equal-tailed property. The analytic and numerical results in Chakraborti and Li [24] are not detailed enough to clarify these fundamental issues. It is prudent to elucidate these vital aspects before their methods are accepted as a feasible technique.
To enhance the adoption of appropriate techniques for interval estimation and research design, this paper has two objectives. The first is to appraise the statistical features of interval estimation procedures for normal percentiles. Theoretical justifications are presented to illuminate the statistical connections between different pivotal quantities for obtaining exact confidence intervals. Furthermore, comprehensive empirical assessments are provided to show that the seemingly accurate approximate methods, whose limits are equidistant around the principal estimators, have problematic confidence limits. The second goal is to provide sample size procedures for precise interval estimation of normal percentiles. The required precision of a confidence interval is evaluated with the magnitude of expected width, and the assurance probability of interval width within a designated threshold. In view of the general availability of statistical software packages SAS and R, computer algorithms are developed to facilitate the implementation of the suggested confidence interval and sample size computations.

Methods
Assume $X_1, \ldots, X_N$ are a sample from a $N(\mu, \sigma^2)$ population with unknown mean $\mu$ and variance $\sigma^2$ for $N > 1$. The sample mean and sample variance are defined as $\bar{X} = \sum_{i=1}^{N} X_i/N$ and $S^2 = \sum_{i=1}^{N} (X_i - \bar{X})^2/(N - 1)$, respectively. The 100pth percentile of the distribution $N(\mu, \sigma^2)$ is denoted by $\theta$, where
$$\theta = \mu + z_p\sigma \tag{1}$$
and $z_p$ is the 100pth percentile of the standard normal distribution $N(0, 1)$. To estimate the percentile $\theta$, the intuitive formula
$$\hat{\theta}_B = \bar{X} + z_p S \tag{2}$$
is a biased estimator because $E[S] < \sigma$. As noted in Royston and Mathews [3], the minimum variance unbiased estimator is
$$\hat{\theta}_{MU} = \bar{X} + z_p cS \tag{3}$$
where $c = (\nu/2)^{1/2}\,\Gamma(\nu/2)/\Gamma\{(\nu + 1)/2\}$ and $\nu = N - 1$. Note that $c$ is an adjusting factor so that $cS$ is an unbiased estimator of $\sigma$, or $E[cS] = \sigma$. The relative numerical performance of $\hat{\theta}_B$, $\hat{\theta}_{MU}$, and alternative estimators of $\theta$ can also be found in Royston and Mathews [3].
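As a small illustration of the two point estimators, the following sketch evaluates the adjusting factor c and both estimates of the 97.5th percentile. It is not taken from the supplementary programs; the function names are hypothetical, and the inputs are the blood pressure summary statistics of Bland and Altman [2] that are used later in the text (N = 85, mean difference -16.29, standard deviation 19.61).

```python
import math
from statistics import NormalDist


def unbiasing_factor(nu: int) -> float:
    """c = (nu/2)^(1/2) * Gamma(nu/2) / Gamma((nu + 1)/2), so that E[c*S] = sigma."""
    # Work on the log scale so that a large nu does not overflow the gamma function.
    log_ratio = math.lgamma(nu / 2.0) - math.lgamma((nu + 1) / 2.0)
    return math.sqrt(nu / 2.0) * math.exp(log_ratio)


def percentile_estimates(xbar: float, s: float, n: int, p: float):
    """Return (theta_B, theta_MU) for the 100p-th normal percentile."""
    z_p = NormalDist().inv_cdf(p)
    c = unbiasing_factor(n - 1)
    return xbar + z_p * s, xbar + z_p * c * s


# Illustrative inputs: summary statistics quoted in the text.
theta_b, theta_mu = percentile_estimates(xbar=-16.29, s=19.61, n=85, p=0.975)
print(f"c = {unbiasing_factor(84):.5f}")
print(f"theta_B = {theta_b:.3f}, theta_MU = {theta_mu:.3f}")
```

Because c exceeds 1 and shrinks toward 1 as ν grows, the two estimates differ noticeably only for small samples or extreme percentiles.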
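To obtain confidence intervals for $\theta$, standard derivations show that
$$T^* = \frac{\bar{X} - \theta}{S/N^{1/2}} \sim t(\nu, -z_p N^{1/2}) \tag{4}$$
where $t(\nu, -z_p N^{1/2})$ is a noncentral t distribution with degrees of freedom $\nu$ and noncentrality parameter $-z_p N^{1/2}$ (Johnson, Kotz, & Balakrishnan [14], Chapter 31). Accordingly, $T^*$ yields a pivotal quantity for constructing confidence intervals of normal percentiles. An upper 100(1 − α)% one-sided confidence interval of $\theta$ is expressed as $\{\hat{\theta}_L, \infty\}$ and the lower confidence limit is
$$\hat{\theta}_L = \bar{X} - t_{1-\alpha}(\nu, -z_p N^{1/2})\, S/N^{1/2} \tag{5}$$
where $t_{1-\alpha}(\nu, -z_p N^{1/2})$ is the 100(1 − α)th percentile of the noncentral t distribution $t(\nu, -z_p N^{1/2})$. Also, a lower 100(1 − α)% one-sided confidence interval of $\theta$ is $\{-\infty, \hat{\theta}_U\}$ and the upper confidence limit has the form
$$\hat{\theta}_U = \bar{X} - t_{\alpha}(\nu, -z_p N^{1/2})\, S/N^{1/2}. \tag{6}$$
Furthermore, a 100(1 − α)% two-sided confidence interval of $\theta$ with equal tail probability can be readily obtained as $\{\hat{\theta}_L, \hat{\theta}_U\}$ where
$$\hat{\theta}_L = \bar{X} - t_{1-\alpha/2}(\nu, -z_p N^{1/2})\, S/N^{1/2} \quad \text{and} \quad \hat{\theta}_U = \bar{X} - t_{\alpha/2}(\nu, -z_p N^{1/2})\, S/N^{1/2}. \tag{7}$$
Supplementary SAS/IML and R computer programs are provided to take advantage of the embedded statistical functions for calculating the exact confidence intervals.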
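The exact limits only require quantiles of a noncentral t distribution. The sketch below is one way to obtain them with the Python standard library alone; it is not the supplementary SAS/IML or R code. The helpers `nct_cdf`, `nct_ppf`, and `exact_percentile_ci` are hypothetical names, and the noncentral t probability is computed by Simpson integration over the distribution of S/σ, a choice made here for self-containment and intended for moderate ν (roughly N of 5 or more). In practice a library routine for the noncentral t quantile would normally be preferred.

```python
import math
from statistics import NormalDist

_PHI = NormalDist().cdf


def nct_cdf(t: float, nu: int, delta: float, steps: int = 600) -> float:
    """P{T <= t} for T ~ noncentral t(nu, delta), by Simpson integration.

    Writes T = (Z + delta) / U with U = S/sigma, so that
    P{T <= t} = E[Phi(t*U - delta)], integrated against the density of U.
    """
    log_const = math.log(2.0) + (nu / 2.0) * math.log(nu / 2.0) - math.lgamma(nu / 2.0)
    spread = 12.0 / math.sqrt(2.0 * nu)      # about 12 standard deviations of U
    lo, hi = max(0.0, 1.0 - spread), 1.0 + spread
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        u = lo + i * h
        if u > 0.0:
            dens = math.exp(log_const + (nu - 1) * math.log(u) - nu * u * u / 2.0)
        else:
            dens = math.exp(log_const) if nu == 1 else 0.0
        weight = 1.0 if i in (0, steps) else (4.0 if i % 2 else 2.0)
        total += weight * dens * _PHI(t * u - delta)
    return total * h / 3.0


def nct_ppf(q: float, nu: int, delta: float) -> float:
    """Quantile of noncentral t(nu, delta) by bracketing and bisection."""
    lo = hi = delta
    step = 1.0
    while nct_cdf(lo, nu, delta) > q:
        lo -= step
        step *= 2.0
    step = 1.0
    while nct_cdf(hi, nu, delta) < q:
        hi += step
        step *= 2.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if nct_cdf(mid, nu, delta) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def exact_percentile_ci(xbar, s, n, p, alpha=0.05):
    """Equal-tailed exact 100(1 - alpha)% interval for theta = mu + z_p * sigma."""
    nu = n - 1
    delta = -NormalDist().inv_cdf(p) * math.sqrt(n)   # noncentrality -z_p * N^(1/2)
    scale = s / math.sqrt(n)
    lower = xbar - nct_ppf(1.0 - alpha / 2.0, nu, delta) * scale
    upper = xbar - nct_ppf(alpha / 2.0, nu, delta) * scale
    return lower, upper


# Illustrative inputs: blood pressure summary statistics quoted in the text.
lower, upper = exact_percentile_ci(xbar=-16.29, s=19.61, n=85, p=0.975)
print(f"exact 95% CI for the 97.5th percentile: ({lower:.3f}, {upper:.3f})")
```

The two limits are not equidistant from the point estimate, which reflects the skewness of the noncentral t distribution.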
In addition, it may be more appealing to modify the point estimators $\hat{\theta}_B$ and $\hat{\theta}_{MU}$ to acquire the alternative pivotal quantities $T_B = (\hat{\theta}_B - \theta)/(S/N^{1/2})$ and $T_{MU} = (\hat{\theta}_{MU} - \theta)/(S/N^{1/2})$ for deriving the confidence intervals of $\theta$, respectively. It is easy to see that $T_B = T^* + z_p N^{1/2}$ and $T_{MU} = T^* + z_p cN^{1/2}$. Therefore, $T_B$ and $T_{MU}$ differ from $T^*$ only by a location shift. Because the terms $z_p N^{1/2}$ and $z_p cN^{1/2}$ do not depend on the unknown parameters, $T_B$ and $T_{MU}$ give the same one- and two-sided confidence intervals for $\theta$ described in Eqs. 5-7. As a generalization of the simple location shifts between different pivotal quantities, the prescribed application of a pivotal quantity for exact interval estimation extends to any linear function of $T^*$. For example, Lawless [25] constructed the confidence intervals of normal percentiles through the quantity
$$T_L = \frac{\hat{\theta}_B - \theta}{S}. \tag{8}$$
Evidently, $T_L$ can be expressed as a linear transformation of $T^*$ by $T_L = (T^* + z_p N^{1/2})/N^{1/2}$. If $q_{L,1-\alpha}$ denotes the 100(1 − α)th percentile of $T_L$, it is readily established that
$$q_{L,1-\alpha} = \{t_{1-\alpha}(\nu, -z_p N^{1/2}) + z_p N^{1/2}\}/N^{1/2}. \tag{9}$$
Although the formula in Lawless ([25], p. 231) is written in a different form, the quantity $T_L$ also leads to the same exact confidence interval $\{\hat{\theta}_L, \hat{\theta}_U\}$ for $\theta$.
On the other hand, Chakraborti and Li [24] considered the standardized quantity
$$T_{ST} = \frac{\hat{\theta}_{MU} - \theta}{a^{1/2} S/N^{1/2}} \tag{10}$$
for interval estimation of $\theta$, where $a = 1 + N z_p^2 (c^2 - 1)$. Their method relies on direct computations with the derived probability density function and cumulative distribution function of $T_{ST}$. Therefore, a special purpose algorithm is required to compute the quantiles of $T_{ST}$ and to obtain the suggested confidence intervals of $\theta$. Note that $T_{ST}$ is a linear function of $T^*$, namely $T_{ST} = (T^* + z_p cN^{1/2})/a^{1/2}$. Hence, if $q_{ST,1-\alpha}$ denotes the 100(1 − α)th percentile of $T_{ST}$, it is the same linear transform of the 100(1 − α)th percentile of $T^*$, or $q_{ST,1-\alpha} = \{t_{1-\alpha}(\nu, -z_p N^{1/2}) + z_p cN^{1/2}\}/a^{1/2}$. As noted earlier, the actual value $t_{1-\alpha}(\nu, -z_p N^{1/2})$ can be obtained with the cumulative distribution function of a noncentral t distribution in major statistical packages such as SAS and R. Hence, with the general availability of software systems and the underlying linear relationship between $T_{ST}$ and $T^*$, direct calculation is not required to compute the percentile $q_{ST,1-\alpha}$. More importantly, using the standard pivotal procedure and the prescribed linear transformation of $T^*$, the pivotal quantity $T_{ST}$ leads to the same interval estimators of $\theta$ as $T^*$ and the other three pivotal measures $T_B$, $T_{MU}$, and $T_L$. Although the pivotal quantity $T_L$ was also examined in Chakraborti and Li [24], the resulting interval estimators of $T_L$ and $T_{ST}$ were viewed as two distinct procedures. However, the numerical assessments in Chakraborti and Li [24] reported that the performances of the two interval procedures of $T_L$ and $T_{ST}$ are almost identical. The important connections between the pivotal quantities and the resulting confidence intervals of $\theta$ should be properly recognized. Essentially, the prescribed explication illuminates the conceptual equivalence between the five pivotal quantities $T^*$, $T_B$, $T_{MU}$, $T_L$, and $T_{ST}$ for constructing confidence intervals of $\theta$.
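As a worked check of this equivalence, the lower limit implied by the standardized quantity can be traced step by step. The sketch assumes the form of the standardized quantity implied by the stated relation with T* (the minimum variance unbiased estimate minus θ, divided by a^{1/2}S/N^{1/2}) and abbreviates the noncentral t percentile as t_{1-α}.

```latex
\begin{aligned}
1-\alpha
 &= P\{T_{ST} \le q_{ST,1-\alpha}\}
  = P\left\{\frac{\hat{\theta}_{MU}-\theta}{a^{1/2}S/N^{1/2}}
      \le \frac{t_{1-\alpha} + z_p c N^{1/2}}{a^{1/2}}\right\} \\
 &= P\left\{\theta \ge \hat{\theta}_{MU}
      - \left(t_{1-\alpha} + z_p c N^{1/2}\right) S/N^{1/2}\right\} \\
 &= P\left\{\theta \ge \bar{X} + z_p c S - t_{1-\alpha}\, S/N^{1/2} - z_p c S\right\} \\
 &= P\left\{\theta \ge \bar{X} - t_{1-\alpha}(\nu, -z_p N^{1/2})\, S/N^{1/2}\right\}.
\end{aligned}
```

The factor a^{1/2} and the shift z_p c N^{1/2} cancel, so the resulting lower limit is the exact one-sided limit obtained directly from T*. The same cancellation occurs for the other linear transforms of T*.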

Results
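Along with the exact confidence interval procedure of normal percentiles, Chakraborti and Li [24] also described an approximate interval estimator by assuming $T_{ST}$ has a t distribution with degrees of freedom $\nu$:
$$T_{ST} \mathrel{\dot\sim} t(\nu). \tag{11}$$
Thus, an approximate 100(1 − α)% two-sided equal tail confidence interval $\{\hat{\theta}_{AL}, \hat{\theta}_{AU}\}$ of $\theta$ is immediately constructed as
$$\hat{\theta}_{AL} = \hat{\theta}_{MU} - t_{1-\alpha/2}(\nu)\, a^{1/2} S/N^{1/2} \quad \text{and} \quad \hat{\theta}_{AU} = \hat{\theta}_{MU} + t_{1-\alpha/2}(\nu)\, a^{1/2} S/N^{1/2} \tag{12}$$
where $t_{1-\alpha/2}(\nu)$ is the 100(1 − α/2)th percentile of the t distribution with $\nu$ degrees of freedom. Although the two-sided confidence interval is only an approximation, the simulation study of Chakraborti and Li [24] revealed that $\{\hat{\theta}_{AL}, \hat{\theta}_{AU}\}$ is very competitive with the exact interval estimator $\{\hat{\theta}_L, \hat{\theta}_U\}$ with respect to the coverage probability and interval width.
On the other hand, to construct confidence intervals of limits of agreement or percentiles, Bland and Altman [2] argued that $\mathrm{Var}[S] \doteq \sigma^2/(2\nu)$ and
$$\mathrm{Var}[\hat{\theta}_B] = \mathrm{Var}[\bar{X}] + z_p^2\,\mathrm{Var}[S] \doteq b\sigma^2/N, \quad \text{where } b = 1 + z_p^2/2.$$
With the approximation, they suggested the simplified pivotal quantity
$$T_{BA} = \frac{\hat{\theta}_B - \theta}{b^{1/2} S/N^{1/2}} \mathrel{\dot\sim} t(\nu). \tag{13}$$
Accordingly, the widely used confidence intervals of Bland and Altman [2] can be derived from $T_{BA}$ and they are written as $\{\hat{\theta}_{BAL}, \hat{\theta}_{BAU}\}$ where
$$\hat{\theta}_{BAL} = \hat{\theta}_B - t_{1-\alpha/2}(\nu)\, b^{1/2} S/N^{1/2} \quad \text{and} \quad \hat{\theta}_{BAU} = \hat{\theta}_B + t_{1-\alpha/2}(\nu)\, b^{1/2} S/N^{1/2}. \tag{14}$$
For the particular case of α = 0.05, the general expressions reduce to the confidence intervals for the two endpoints of the 95% limits of agreement considered in Bland and Altman [2]:
$$\{\bar{X} - 1.96S - t_{0.975}(\nu)(2.92)^{1/2} S/N^{1/2},\; \bar{X} - 1.96S + t_{0.975}(\nu)(2.92)^{1/2} S/N^{1/2}\} \tag{15}$$
and
$$\{\bar{X} + 1.96S - t_{0.975}(\nu)(2.92)^{1/2} S/N^{1/2},\; \bar{X} + 1.96S + t_{0.975}(\nu)(2.92)^{1/2} S/N^{1/2}\}, \tag{16}$$
respectively, because $z_{0.025} = -1.96$, $z_{0.975} = 1.96$, and $b = 2.92$.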
For the blood pressure data presented in Bland and Altman [2] with the sample size N = 85, the sample mean difference (observer minus machine) $\bar{X} = -16.29$ mmHg, and the standard deviation of the differences S = 19.61, the 95% confidence intervals of the exact and two approximate methods for the 2. […] .3736}, respectively. Although the differences between these estimates may not be substantial, it is vital to point out that the confidence limits of the 2.5th percentile are in the ascending order of $\hat{\theta}_L < \hat{\theta}_{AL} < \hat{\theta}_{BAL}$ and $\hat{\theta}_U < \hat{\theta}_{AU} < \hat{\theta}_{BAU}$, whereas the confidence limits of the 97.5th percentile are in the reversed order: $\hat{\theta}_{BAL} < \hat{\theta}_{AL} < \hat{\theta}_L$ and $\hat{\theta}_{BAU} < \hat{\theta}_{AU} < \hat{\theta}_U$. The simulation study below confirms that this ordering of the three interval procedures is the usual occurrence.
In general, the actual distribution of the pivotal quantity $T^*$ is skewed, especially when the sample size is small and p deviates considerably from 0.5. This implies that the interval procedure should adopt asymmetric confidence intervals for $\theta$. Notably, the exact two-sided interval estimates $\{\hat{\theta}_L, \hat{\theta}_U\}$ are not equidistant from the sample mean except for the special case p = 0.5. In contrast, the approximate confidence intervals $\{\hat{\theta}_{AL}, \hat{\theta}_{AU}\}$ of Chakraborti and Li [24] are equidistant about the unbiased estimate $\hat{\theta}_{MU}$. Therefore, the interval procedure is presumably inappropriate and the two confidence limits $\hat{\theta}_{AL}$ and $\hat{\theta}_{AU}$ are methodologically inaccurate when one-sided coverage probabilities are considered. But the numerical investigations in Chakraborti and Li [24] did not cover these fundamental issues. Similarly, the confidence intervals $\{\hat{\theta}_{BAL}, \hat{\theta}_{BAU}\}$ of Bland and Altman [2] are symmetric around the estimate $\hat{\theta}_B$ and thus suffer from the same shortcoming as the intervals $\{\hat{\theta}_{AL}, \hat{\theta}_{AU}\}$ of Chakraborti and Li [24].
Note that the lower and upper confidence limits of a 100(1 − α)% two-sided confidence interval are equivalent to the lower and upper confidence limits of the 100(1 − α/2)% one-sided upper and lower confidence intervals, respectively. To demonstrate the potential drawback of the approximate interval procedures of Chakraborti and Li [24] and Bland and Altman [2], a simulation study was conducted to evaluate the coverage performance of their one- and two-sided confidence intervals. Although the approximate interval method of Bland and Altman [2] has been examined in Carkeet and Goh [20] from a different perspective, the method is included in the following appraisal for the sake of completeness and with the intention to explicate additional properties that were not reported before.
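The two approximate intervals can be evaluated with a few lines of arithmetic. The sketch below is an illustration under stated assumptions rather than the authors' program: the function name is hypothetical, the Bland-Altman factor is taken as b = 1 + z_p²/2 (which gives 2.92 at z = 1.96), the Chakraborti-Li factor is a = 1 + N z_p²(c² - 1), and the central t critical value is entered by hand from t tables because the standard library has no t quantile function.

```python
import math
from statistics import NormalDist


def approximate_intervals(xbar, s, n, p, t_crit):
    """Return the Chakraborti-Li and Bland-Altman approximate intervals.

    t_crit is the 100(1 - alpha/2)th percentile of the central t(nu)
    distribution, supplied by the caller (for example from t tables).
    """
    nu = n - 1
    z_p = NormalDist().inv_cdf(p)
    c = math.sqrt(nu / 2.0) * math.exp(math.lgamma(nu / 2.0) - math.lgamma((nu + 1) / 2.0))
    a = 1.0 + n * z_p**2 * (c**2 - 1.0)   # Chakraborti-Li variance factor
    b = 1.0 + z_p**2 / 2.0                # Bland-Altman variance factor
    se = s / math.sqrt(n)
    theta_mu = xbar + z_p * c * s         # minimum variance unbiased estimate
    theta_b = xbar + z_p * s              # intuitive (biased) estimate
    cl = (theta_mu - t_crit * math.sqrt(a) * se, theta_mu + t_crit * math.sqrt(a) * se)
    ba = (theta_b - t_crit * math.sqrt(b) * se, theta_b + t_crit * math.sqrt(b) * se)
    return cl, ba


# t_{0.975}(84) is approximately 1.9886 (table value, entered by hand).
cl, ba = approximate_intervals(xbar=-16.29, s=19.61, n=85, p=0.975, t_crit=1.9886)
print(f"Chakraborti-Li: ({cl[0]:.4f}, {cl[1]:.4f})")
print(f"Bland-Altman:   ({ba[0]:.4f}, {ba[1]:.4f})")
```

Both intervals are symmetric about their respective point estimates by construction, which is the property the following paragraphs examine.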
Specifically, Monte Carlo simulation studies of 10,000 iterations were performed to compute the simulated coverage probability of the exact and approximate confidence intervals for the percentiles of a standard normal distribution N(0, 1). Six sample sizes were considered: N = 10, 20, 30, 50, 100, and 200. Also, a total of eight percentile probabilities were examined: p = 0.025, 0.05, 0.10, 0.20, 0.80, 0.90, 0.95, and 0.975. For each replicate, the lower and upper confidence limits $\{\hat{\theta}_L, \hat{\theta}_U\}$, $\{\hat{\theta}_{AL}, \hat{\theta}_{AU}\}$, and $\{\hat{\theta}_{BAL}, \hat{\theta}_{BAU}\}$ were computed to construct the 95% and 97.5% one-sided confidence intervals and the corresponding 90% and 95% two-sided confidence intervals. The simulated coverage probability was the proportion of the 10,000 replicates whose confidence interval contained the population normal percentile. Then, the adequacy of the one- and two-sided interval procedures is determined by the error = simulated coverage probability − nominal coverage probability. The results are summarized in Tables 1, 2, 3 and 4 for the exact and approximate confidence intervals with two-sided confidence coefficient 1 − α = 0.90 and 0.95, respectively.
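A reduced version of this simulation design can be sketched for a single configuration. The code below is not the study's program: it covers only the Bland-Altman interval for N = 10 and p = 0.975, the function name and seed are arbitrary, and the critical value t_{0.975}(9) = 2.262157 is a table value entered by hand.

```python
import math
import random
from statistics import NormalDist


def ba_coverage(n=10, p=0.975, t_crit=2.262157, reps=10_000, seed=1):
    """Simulated coverage of the Bland-Altman limits for a N(0, 1) percentile.

    Returns the coverage of {lower, infinity}, of {-infinity, upper},
    and of the two-sided interval {lower, upper}.
    """
    rng = random.Random(seed)
    z_p = NormalDist().inv_cdf(p)
    theta = z_p                               # true percentile of N(0, 1)
    half = t_crit * math.sqrt(1.0 + z_p**2 / 2.0) / math.sqrt(n)
    hits_lower = hits_upper = hits_both = 0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        xbar = sum(x) / n
        s = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
        lo = xbar + z_p * s - half * s
        hi = xbar + z_p * s + half * s
        hits_lower += lo <= theta             # {lo, infinity} covers theta
        hits_upper += theta <= hi             # {-infinity, hi} covers theta
        hits_both += lo <= theta <= hi
    return hits_lower / reps, hits_upper / reps, hits_both / reps


cov_lower, cov_upper, cov_two = ba_coverage()
print(f"one-sided (lower limit): {cov_lower:.4f}  nominal 0.975")
print(f"one-sided (upper limit): {cov_upper:.4f}  nominal 0.975")
print(f"two-sided:               {cov_two:.4f}  nominal 0.95")
```

Reporting the two one-sided coverages separately is the point of the design: a two-sided coverage near the nominal level can hide one limit that covers too often and another that covers too rarely.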
It can be seen from the resulting errors of the three types of confidence intervals that the exact approach performs extremely well for all 96 cases presented in Tables 1, 2, 3 and 4. For the two approximate methods of Chakraborti and Li [24] and Bland and Altman [2], the coverage probabilities of their two-sided intervals remain rather close to the nominal confidence levels. However, the corresponding approximate one-sided interval procedures do not preserve the same desired accuracy unless the sample size is large. Because it relies on a greater degree of simplification, the interval procedure of Bland and Altman [2] is inferior to that of Chakraborti and Li [24], especially for small sample sizes. To enhance the explication, the simulated coverage probabilities of the 97.5% one-sided confidence intervals for N = 10 are plotted in Fig. 1. Despite the attractive coverage behavior of the approximate two-sided confidence intervals, the errors of the upper confidence intervals tend to be negative for small p while those associated with large p are consistently positive. The lower confidence intervals reveal exactly the opposite patterns. In other words, the corresponding lower and upper confidence limits are generally too large for the 2.5th, 5th, 10th and 20th normal percentiles and are mostly too small for the 80th, 90th, 95th, and 97.5th normal percentiles. Consequently, the two endpoints of the two-sided confidence intervals generally do not meet the assumption of equal-tailed error rates for the two approximate interval methods. A mere coverage probability assessment of the approximate two-sided confidence intervals may obscure the potential biases of the confidence limits based on the t(ν) approximations described in Eqs. 11 and 13. It is inappropriate to claim that a two-sided interval procedure is accurate on the basis of a combination of some noticeably under- and over-estimated confidence limits. Instead, the exact interval procedure should be used in preference to the approximate methods of Bland and Altman [2] and Chakraborti and Li [24].

Sample size determinations
From a study design viewpoint, it is essential to determine the optimal sample sizes so that the resulting confidence interval will meet the designated precision requirement. Two particularly useful criteria concern the control of the expected width and the assurance probability of the width within a designated bound (Beal [26]; Kupper & Hafner [27]).
The width of the 100(1 − α)% two-sided confidence interval $\{\hat{\theta}_L, \hat{\theta}_U\}$ given in Eq. 7 is
$$W = \{t_{1-\alpha/2}(\nu, -z_p N^{1/2}) - t_{\alpha/2}(\nu, -z_p N^{1/2})\}\, S/N^{1/2}. \tag{17}$$
Accordingly, it is desired to calculate the least sample size such that the expected width of a 100(1 − α)% two-sided confidence interval is within the given threshold:
$$E[W] \le \delta \tag{18}$$
where δ (> 0) is a constant. On the other hand, one may compute the minimum sample size needed to guarantee, with a given assurance probability, that the width of a 100(1 − α)% two-sided confidence interval will not exceed the planned value:
$$P\{W \le \omega\} \ge 1 - \gamma \tag{19}$$
where 1 − γ is the specified assurance level and ω (> 0) is a constant. Under the normal assumption, the assessments of expected width and assurance probability are further simplified for brevity. Note that the expected width E[W] has the alternative form
$$E[W] = \{t_{1-\alpha/2}(\nu, -z_p N^{1/2}) - t_{\alpha/2}(\nu, -z_p N^{1/2})\}\, \sigma/(cN^{1/2}). \tag{20}$$
Hence, the inequality E[W] ≤ δ is expressed as $\{t_{1-\alpha/2}(\nu, -z_p N^{1/2}) - t_{\alpha/2}(\nu, -z_p N^{1/2})\}/(cN^{1/2}) \le \delta/\sigma$. Also, the assurance probability is equivalent to
$$P\{W \le \omega\} = P\{K \le \kappa\} = \Phi(\kappa) \tag{21}$$
where $K = \nu S^2/\sigma^2 \sim \chi^2(\nu)$ has a chi-square distribution with ν degrees of freedom, $\kappa = \{N(N - 1)(\omega/\sigma)^2\}/\{t_{1-\alpha/2}(\nu, -z_p N^{1/2}) - t_{\alpha/2}(\nu, -z_p N^{1/2})\}^2$, and Φ(·) is the cumulative distribution function of the chi-square random variable K. With the exact computational formulas of expected width and assurance probability given in Eqs. 20 and 21, respectively, the sample size N needed to attain the specified precision can be found with a simple iterative search for the chosen parameter values {μ, σ²}, percentile p, and confidence level 1 − α.
Table 1 The error between simulated coverage probability and nominal coverage probability for the 90% two-sided and 95% one-sided confidence intervals when N = 10, 20, and 30
Evidently, the sample size determinations do not depend on the mean value μ and, when p = 0.5, reduce to the sample size procedures of Kupper and Hafner [27] because θ = μ in that case. The precision evaluations of expected width and assurance probability depend on the thresholds δ and ω through the relative magnitude ratios δ/σ and ω/σ, respectively. Accordingly, supplementary SAS/IML and R computer programs are presented to facilitate the required computations. Due to the prospective nature of advance research planning, the general guidelines suggest that typical sources like published findings or expert opinions can offer plausible and reasonable values for the vital characteristics of a future study. For illustration, the sample statistics of the blood pressure data in Bland and Altman [2] are adopted as parameter values μ = −16.29 and σ = 19.61. With δ = ω = (0.5)σ = 9.805 and 1 − γ = 0.9, the optimal sample sizes for precise 95% interval estimation of the 97.5th percentile are 183 and 207 under the expected width and assurance probability criteria, respectively. For ease of illustration, the computed sample sizes are plotted in Fig. 2.
Table 2 The error between simulated coverage probability and nominal coverage probability for the 90% two-sided and 95% one-sided confidence intervals when N = 50, 100, and 200
It is seen from Fig. 2 for the six types of precision that the graphs of the optimal sample size are symmetric with respect to p = 0.5 and are monotonically increasing with the absolute difference |p − 0.5|. Therefore, the required sample size for precise interval estimation of the median or mean is smaller than those of the other normal percentiles. Also, the optimal sample size increases as the width bounds δ and ω decrease when all other factors are fixed. As expected, a larger sample size is needed to attain a higher assurance level 1 − γ when the designated width ω and other configurations remain identical. Regarding the difference between the two precision principles, it typically requires a larger sample size to meet the necessary precision of assurance probability than to control a designated expected width. With the same interval bound δ = ω, the sample sizes […]
Table 3 The error between simulated coverage probability and nominal coverage probability for the 95% two-sided and 97.5% one-sided confidence intervals when N = 10, 20, and 30
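The sample size search can be sketched with the standard library alone. This is an illustration under stated assumptions, not the supplementary SAS/IML or R program: all function names are hypothetical, the noncentral t quantiles and the chi-square probability are obtained by Simpson integration over the distribution of S/σ (adequate for moderate ν only), the thresholds are given in units of σ, and a binary search replaces the simple iterative search on the assumption that each precision measure improves monotonically with N. The text reports 183 and 207 for this configuration when the thresholds equal half of σ, that is 9.805.

```python
import math
from statistics import NormalDist

_PHI = NormalDist().cdf


def u_grid(nu, lo=None, hi=None, steps=400):
    """Simpson nodes u and weights (weight * density of U = S/sigma); assumes nu >= 2."""
    log_const = math.log(2.0) + (nu / 2.0) * math.log(nu / 2.0) - math.lgamma(nu / 2.0)
    spread = 12.0 / math.sqrt(2.0 * nu)
    lo = max(0.0, 1.0 - spread) if lo is None else lo
    hi = 1.0 + spread if hi is None else hi
    h = (hi - lo) / steps
    grid = []
    for i in range(steps + 1):
        u = lo + i * h
        dens = math.exp(log_const + (nu - 1) * math.log(u) - nu * u * u / 2.0) if u > 0.0 else 0.0
        w = 1.0 if i in (0, steps) else (4.0 if i % 2 else 2.0)
        grid.append((u, w * dens * h / 3.0))
    return grid


def t_width_factor(n, p, alpha):
    """t_{1-alpha/2}(nu, delta) - t_{alpha/2}(nu, delta), with delta = -z_p * N^(1/2)."""
    nu = n - 1
    delta = -NormalDist().inv_cdf(p) * math.sqrt(n)
    grid = u_grid(nu)

    def cdf(t):                                  # noncentral t probability P{T <= t}
        return sum(w * _PHI(t * u - delta) for u, w in grid)

    def ppf(q):                                  # bracketing followed by bisection
        lo, hi = delta - 1.0, delta + 1.0
        while cdf(lo) > q:
            lo -= hi - lo
        while cdf(hi) < q:
            hi += hi - lo
        for _ in range(50):
            mid = 0.5 * (lo + hi)
            if cdf(mid) < q:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    return ppf(1.0 - alpha / 2.0) - ppf(alpha / 2.0)


def expected_width(n, p, alpha):
    """E[W]/sigma, following the expected width formula."""
    nu = n - 1
    c = math.sqrt(nu / 2.0) * math.exp(math.lgamma(nu / 2.0) - math.lgamma((nu + 1) / 2.0))
    return t_width_factor(n, p, alpha) / (c * math.sqrt(n))


def assurance(n, p, alpha, omega):
    """P{W <= omega * sigma}, the chi-square probability written in terms of U."""
    nu = n - 1
    x = omega * math.sqrt(n) / t_width_factor(n, p, alpha)   # W <= omega  <=>  U <= x
    spread = 12.0 / math.sqrt(2.0 * nu)
    lo, hi = max(0.0, 1.0 - spread), 1.0 + spread
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return sum(w for _, w in u_grid(nu, lo, x))


def smallest_n(meets, lo=3, hi=10_000):
    """Smallest N in (lo, hi] with meets(N) true, assuming meets is monotone in N."""
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if meets(mid):
            hi = mid
        else:
            lo = mid
    return hi


p, alpha = 0.975, 0.05
n_width = smallest_n(lambda n: expected_width(n, p, alpha) <= 0.5)
n_assure = smallest_n(lambda n: assurance(n, p, alpha, 0.5) >= 0.9)
print(f"expected width criterion: N = {n_width}")
print(f"assurance criterion:      N = {n_assure}")
```

Only the ratios of the thresholds to σ enter the computation, so the mean μ never appears in the code.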

Discussion
In view of the wide application in medical studies, this article aims to explicate the theoretical and empirical features of interval procedures of percentiles. An integrated discussion is presented to address the similarities and differences of exact and approximate confidence intervals constructed with various pivotal quantities described in the literature. Although there are distinct selections of pivotal quantities, it is shown that they yield the same exact confidence intervals. Notably, the exact interval procedure requires the use of the cumulative distribution function of a noncentral t distribution. The difficulty of applying the exact approach has been alleviated because of the availability of specialized routines in popular software packages. In contrast, the approximate interval methods are computationally simple and do not require specialized software because they only involve the quantiles of a regular t distribution. However, the approximate confidence intervals carry the symmetry property of a t distribution whereas the noncentral t distribution is skewed so that the resulting exact confidence intervals are not equidistant around the primary statistic.