Interval estimation and optimal design for the within-subject coefficient of variation for continuous and binary variables

Background In this paper we propose the use of the within-subject coefficient of variation as an index of a measurement's reliability. For continuous variables, and based on its maximum likelihood estimator, we derive a variance-stabilizing transformation and discuss confidence interval construction within the framework of a one-way random effects model. We investigate sample size requirements for the within-subject coefficient of variation for continuous and binary variables. Methods We investigate the validity of the approximate normal confidence interval by Monte Carlo simulations. In designing a reliability study, a crucial issue is the balance between the number of subjects to be recruited and the number of repeated measurements per subject. We discuss efficiency of estimation and cost considerations for the optimal allocation of the sample resources. The approach is illustrated by an example on Magnetic Resonance Imaging (MRI). We also discuss the issue of sample size estimation for dichotomous responses with two examples. Results For the continuous variable, we found that the variance-stabilizing transformation improves the asymptotic coverage probabilities of confidence intervals on the within-subject coefficient of variation. For the binary variable, the maximum likelihood estimation and sample size estimation based on a pre-specified width of the confidence interval are novel contributions to the literature. Conclusion Using the sample size formulas, we hope to help clinical epidemiologists and practicing statisticians to efficiently design reliability studies using the within-subject coefficient of variation, whether the variable of interest is continuous or binary.


Background
Measurement errors can seriously affect statistical analysis and interpretation; it is therefore important to assess the magnitude of such errors by calculating a reliability coefficient and assessing its precision. In medical diagnosis, for instance, clinicians have become cognizant of the paramount importance of obtaining accurate measurements to ensure safe and efficient delivery of care to their patients. Experiments designed to measure the validity and precision of instruments used in biomedical and epidemiological research are ubiquitous. For example, Ashton et al. [1] demonstrated the importance of evaluating the reliability of manual and automated methods for quantifying total white matter lesion burden in multiple sclerosis patients, comparing the coefficients of variation of three methods. In oncology, Schwartz et al. [2] used the coefficient of variation to evaluate the repeatability of bidimensional computed tomography measurements for three techniques: hand-held calipers on film, electronic calipers on a workstation, and an auto-contour technique on a workstation. The coefficient of variation of the auto-contour technique differed significantly from those of the other techniques. The coefficient of variation is often used to compare variables measured on different scales. For example, in the social sciences, when the intent is to compare the variability of school performance with the variability of household income, a comparison of standard deviations makes no sense because income and school performance are measured on different scales. The correct comparison may be based on the coefficient of variation because it adjusts for scale. Other applications of the coefficient of variation are given in Tian [3].
Scientists have developed several indices to assess the reliability and reproducibility of quantitative measurements. The intra-class correlation (ICC), the proportion of the between-subject variance to the total variance, has been widely used as an index of measurement reliability. For a comprehensive review of the ICC and its applications, we refer the reader to Fleiss [4], Dunn [5] and Shoukri [6]. One criticism of the ICC is that its value depends on the population from which the study subjects have been obtained, and this may lead to difficulties in comparing results from different studies. Accordingly, Quan and Shih [7] (QS) considered an alternative to the ICC, the Within-Subject Coefficient of Variation (WSCV), for assessing measurement reproducibility or test-retest reliability. Because of the requirement that repeated observations are made on each subject, they used the one-way random effects model (REM) as a mechanism to describe the data. Although the use of the WSCV as a measure of reproducibility is long standing, the issue of sample size determination has not been adequately investigated. Sample size estimation is one of the most important issues in the design of any study that uses inferential statistics.
When the ICC is used as the index of reliability, Donner and Eliasziw [8] provided contours of exact power for selected numbers of subjects (k) and numbers of replicates (n). These power results were then used to identify optimal designs that minimize the study costs. Assuming a constant number of replicates per subject, Walter et al. [9] considered an approximation to determine the required number of subjects to achieve fixed levels of power. Bonett [10] calculated the sample size required to achieve a prescribed expected width for the confidence interval on the ICC. Shoukri et al. [11] derived the values of k and n that allocate the sample resources optimally and minimize the variance of the estimated ICC under cost constraints. The cost structure that was considered was general and followed the general guidelines identified by Flynn et al. [12].
In this paper, we derive the optimal allocation for the number of subjects and the number of repeated measurements needed to minimize the variance of the maximum likelihood estimator (MLE) of the WSCV. In Section 2 we present the random effects model, the definition of the WSCV, and the asymptotic distribution of its MLE for continuous data. In Section 3, we use the calculus of optimization to find the optimal combinations (n, k) that minimize the variance of the MLE of WSCV for normally distributed variables. The use of the WSCV for dichotomous data has never been investigated before, and a novel contribution in this paper is the estimation of WSCV for binary outcome measurements, and sample size requirements, with emphasis on the case of two ratings per subject (i.e. n = 2). We devote Section 4 to the binary data, and general discussion is presented in Section 5.

Estimating the WSCV for continuous variables
Assumptions
Consider a random sample of k subjects with n repeated measurements of a continuous variable Y, and denote by Y_ij the jth reading made on the ith subject under identical experimental conditions (i = 1, 2, ..., k; j = 1, 2, ..., n). In a test-retest scenario, and under the assumption of no reader effect (i.e. the readings within a specific subject are exchangeable), Y_ij denotes the reading of the jth trial made on the ith subject. A useful model for analyzing such data is given by: Y_ij = µ + s_i + e_ij, (1) where µ is the mean of Y_ij, the random subject effects s_i are normally distributed with mean 0 and variance σ_s², or N(0, σ_s²), the measurement errors e_ij are N(0, σ_e²), and the s_i and e_ij terms are independent. We assume that the subjects are randomly drawn from some population of interest.
Quan and Shih [7] defined the WSCV parameter in the above model as θ = σ_e/µ. (2) With model (1), it is assumed that the within-subject variance is the same for all subjects. Let MSW and MSB denote, respectively, the within-subject and between-subjects mean squares as obtained from the usual one-way ANOVA table; then θ is estimated by θ̂ = √MSW/Ȳ, where Ȳ is the grand mean, and the intra-class correlation ρ = σ_s²/(σ_s² + σ_e²) is estimated by ρ̂ = (MSB − MSW)/(MSB + (n − 1)MSW). Note that the MSB does not exist for k = 1, which means that to obtain a sensible estimate of ρ as an index of reliability, the study should include more than one subject.
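These ANOVA-based point estimates can be computed directly from the data; the following is a minimal sketch, assuming the estimators θ̂ = √MSW/Ȳ and the usual ANOVA estimator of ρ (both are our reading of the method, since the paper's displayed estimators are not reproduced here):

```python
import numpy as np

def wscv_anova(y):
    """Estimate the WSCV (theta) and the ICC (rho) from a k x n array of
    repeated measurements, using one-way ANOVA mean squares."""
    k, n = y.shape
    grand_mean = y.mean()
    subj_means = y.mean(axis=1)
    # within-subject and between-subjects mean squares
    msw = ((y - subj_means[:, None]) ** 2).sum() / (k * (n - 1))
    msb = n * ((subj_means - grand_mean) ** 2).sum() / (k - 1)
    theta_hat = np.sqrt(msw) / grand_mean              # WSCV estimate
    rho_hat = (msb - msw) / (msb + (n - 1) * msw)      # ICC estimate
    return theta_hat, rho_hat
```

For example, `wscv_anova(np.array([[9., 11.], [19., 21.]]))` pools the within-subject spread across the two subjects and scales it by the grand mean.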

Results
The asymptotic variance-covariance matrix of the MLEs is obtained by inverting Fisher's information matrix. The large-sample variance of θ̂ can be obtained using the delta method (see Kendall vol. 1 [13]) and was shown by Quan and Shih [7] to be: var(θ̂) = A(ρ, n, θ)/k, where A(ρ, n, θ) = θ²[1/(2(n − 1)) + θ²(1/n + ρ/(1 − ρ))]. (3) To construct an approximate confidence interval on θ, it is assumed that for large k, √k(θ̂ − θ) follows a normal distribution with mean 0 and variance A(ρ, n, θ). An approximate 100(1 − α)% confidence interval on θ can be given as θ̂ ∓ z_α/2 √(A(ρ̂, n, θ̂)/k), where z_α/2 is the 100(1 − α/2)th percentile of the standard normal distribution.
Due to the dependence of the variance of θ̂ on the true parameter value θ itself, we found that the asymptotic coverage deviates from its nominal level for some values of θ. To improve the coverage probability, we suggest a variance-stabilizing transformation (VST) to remove the dependence of var(θ̂) on θ.
Note that the limits of the interval depend on the unknown value of the intra-class correlation, which can be replaced by its MLE as defined in section 2.1.
To examine the finite-sample behavior of the VST-based confidence interval estimator, a Monte Carlo study was conducted under model (1).
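Such a simulation can be sketched as follows. Note that the asymptotic variance A(ρ, n, θ) = θ²[1/(2(n − 1)) + θ²(1/n + ρ/(1 − ρ))] used below is our reading of the delta-method result, and the untransformed (non-VST) interval is shown, so this is an illustrative sketch rather than the study's exact code:

```python
import numpy as np

def wscv_ci(y, z=1.96):
    # approximate normal 95% CI for the WSCV under the one-way random effects model
    k, n = y.shape
    grand_mean = y.mean()
    subj_means = y.mean(axis=1)
    msw = ((y - subj_means[:, None]) ** 2).sum() / (k * (n - 1))
    msb = n * ((subj_means - grand_mean) ** 2).sum() / (k - 1)
    theta_hat = np.sqrt(msw) / grand_mean
    # clamp rho_hat away from 0 and 1 to keep the variance expression finite
    rho_hat = min(max((msb - msw) / (msb + (n - 1) * msw), 1e-6), 1 - 1e-6)
    a = theta_hat**2 * (1 / (2 * (n - 1)) + theta_hat**2 * (1 / n + rho_hat / (1 - rho_hat)))
    half_width = z * np.sqrt(a / k)
    return theta_hat - half_width, theta_hat + half_width

# empirical coverage of the interval under model (1)
rng = np.random.default_rng(42)
mu, sigma_s, sigma_e, k, n, reps = 10.0, 1.0, 1.0, 50, 5, 2000
theta_true = sigma_e / mu
covered = 0
for _ in range(reps):
    y = mu + rng.normal(0.0, sigma_s, (k, 1)) + rng.normal(0.0, sigma_e, (k, n))
    lo, hi = wscv_ci(y)
    covered += lo <= theta_true <= hi
coverage = covered / reps
```

With k = 50 subjects the empirical coverage should sit near the nominal 95% level; shrinking k exposes the deviation that motivates the VST.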

Example 1
Accurate and reproducible quantification of brain lesion count and volume in multiple sclerosis (MS) patients using magnetic resonance imaging (MRI) is a vital tool for evaluating disease progression and patient response to therapy. Current standard methods for obtaining these data are largely manual and subjective and are therefore error-prone and subject to inter- and intra-operator variability. Therefore, there is a need for a rapid automated lesion quantification method. Ashton et al. [1] compared manual measurements and an automated technique known as Geometrically Constrained Region Growth (GEORG) of the brain lesion volume of 3 MS patients, each measured 10 times by a single operator for each method. The data are presented in Table 5.
Based on the guidelines for the levels of reliability provided by Fleiss [4], a value of an ICC above 80% indicates an excellent reliability, and from Table 3 both methods cross this threshold level. However, based on the WSCV values, the manual method is definitely less reproducible than the automated method (the GEORG is 5 times more reproducible than the manual). This example demonstrates the usefulness of the WSCV over the ICC as a measure of reproducibility. Clearly, one should construct a formal test on the significance of the difference between two correlated within-subject coefficients of variation.
There are several competing methods to construct such a test (e.g. LRT, Wald, and Score tests) but this issue is quite involved and so we intend to report our findings in a future publication.

Sample size estimation
In the following development we discuss the second objective of this paper. We assume that the investigator is interested in choosing the number of replicates n per subject so that the variance of the estimate of θ is minimized, given that the total number of measurements is fixed a priori at N = nk.

Efficiency criterion
For a fixed total number of measurements N = nk, equation (3) gives: var(θ̂) = (θ²/N)[n/(2(n − 1)) + θ²(1 + nρ/(1 − ρ))]. (4) The necessary condition for var(θ̂) to have a unique minimum is that ∂var(θ̂)/∂n = 0; this, together with the additional condition ∂²var(θ̂)/∂n² > 0, is satisfied so long as 0 < ρ < 1. Differentiating (4) with respect to n, equating to zero and solving for n, we obtain n* = 1 + (1/θ)√((1 − ρ)/(2ρ)). (5) The required number of subjects is thus k* = N/n*. Table 4 shows a few optimal allocations of (n, k) for ρ = 0.6, 0.7, and 0.8 and θ = 0.1, 0.2, 0.3 and 0.4, when N = 24.
Note that in practice only integer values of (n, k) are used, and because N = nk is fixed a priori, we first round the optimum value of n to the nearest integer; k = N/n is then rounded to the nearest integer. The values of var(θ̂) at the rounded optimal allocations for different values of ρ, θ and n showed that the net loss or gain in efficiency due to rounding is negligible. It is clear that to efficiently estimate the WSCV for large values of θ we need a smaller number of replicates and a larger number of subjects. Bonett [10] discussed the issue of sample size requirements that achieve a pre-specified expected width for a confidence interval about the ICC. This approach is useful in planning a reliability study in which the focus is on estimation rather than hypothesis testing. He demonstrated that the effect of an inaccurate planning value of the ICC is more serious in hypothesis testing applications. Shoukri et al. [11] argued that the hypothesis testing approach might not be appropriate when planning a reproducibility study, because, in most cases, values of the coefficient under the null and alternative hypotheses may be difficult to specify. An alternative approach is to focus on the width of the CI for θ. Since the approximate width of a (1 − α)100% CI on θ is w = 2z_α/2 var(θ̂)^1/2, an approximate sample size that yields a (1 − α)100% CI for θ with a pre-specified width w is k = 4z²_α/2 A(ρ, n, θ)/w².
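The efficiency criterion and the rounding step can be tabulated with a few lines of code; the closed form n* = 1 + (1/θ)√((1 − ρ)/(2ρ)) and the variance var(θ̂) = θ²[1/(2(n − 1)) + θ²(1/n + ρ/(1 − ρ))]/k used below are our reading of Equations (5) and (3), so treat this as a sketch:

```python
import math

def optimal_n(theta, rho):
    # n* that minimizes var(theta_hat) for a fixed total N = nk
    return 1 + (1 / theta) * math.sqrt((1 - rho) / (2 * rho))

def var_wscv(theta, rho, n, k):
    # asymptotic variance A(rho, n, theta)/k
    a = theta**2 * (1 / (2 * (n - 1)) + theta**2 * (1 / n + rho / (1 - rho)))
    return a / k

N = 24  # total number of measurements fixed a priori
for rho in (0.6, 0.7, 0.8):
    for theta in (0.1, 0.2, 0.3, 0.4):
        n_star = optimal_n(theta, rho)
        n_int = max(2, round(n_star))      # at least 2 replicates (MSW needs n > 1)
        k_int = max(2, round(N / n_int))   # at least 2 subjects (MSB needs k > 1)
        print(f"rho={rho} theta={theta} n*={n_star:.2f} "
              f"(n, k)=({n_int}, {k_int}) var={var_wscv(theta, rho, n_int, k_int):.2e}")
```

Larger θ pushes n* down, matching the observation above that a large WSCV calls for fewer replicates and more subjects.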

Cost criterion
Funding constraints will often determine the cost of recruiting subjects for a reliability study. Although too small a sample may lead to a study that produces an imprecise estimate of the reproducibility coefficient, too large a sample may result in a waste of resources. Thus, an important decision in a typical reliability study is to balance the cost of recruiting subjects with the need for a precise estimate of the parameter summarizing reliability.
In this section, we determine the combinations (n, k) that minimize the variance of θ̂ subject to cost constraints. Constructing a flexible cost function starts with identifying sampling and overhead costs. The sampling cost depends primarily on the size of the sample and includes costs for data collection, compensation to volunteers, management, and evaluation. Overhead costs, on the other hand, are independent of sample size. Following Sukhatme et al. [14], we assume that the overall cost function is given as: C = c_0 + kc_1 + nkc_2, (6) where c_0 is the fixed cost, c_1 is the cost of recruiting a single subject, and c_2 is the cost of making one observation. Using the method of Lagrange multipliers and following Shoukri et al. [11], we write the objective function as Ψ = var(θ̂) + λ(c_0 + kc_1 + nkc_2 − C), (7) where var(θ̂) is given by Equation (3) and λ is the Lagrange multiplier. Differentiating Ψ with respect to n, k and λ and equating to zero, we obtain 2θ²ρ*n⁴ − 4θ²ρ*n³ − (2θ²r + r − 2θ²ρ* + 1)n² + 4θ²rn − 2θ²r = 0, (8) where r = c_1/c_2 and ρ* = ρ/(1 − ρ). Although an explicit solution to (8) is available, the resulting expression is complicated and does not provide any useful insight. The 4th-degree polynomial on the left side of (8) has two imaginary roots, one negative root, and one admissible (positive) root for n. Table 5 summarizes the results of the optimization procedure, where we provide the optimal n for various values of θ, ρ, and r.
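The admissible root of (8) can be found numerically; a sketch using numpy's polynomial root finder, with coefficients transcribed from Equation (8) and a filter reflecting the root structure described above (we keep the real root exceeding one replicate):

```python
import numpy as np

def optimal_n_cost(theta, rho, r):
    """Admissible (real, > 1) root of the quartic (8), where r = c1/c2."""
    rho_star = rho / (1 - rho)
    coeffs = [
        2 * theta**2 * rho_star,                                # n^4
        -4 * theta**2 * rho_star,                               # n^3
        -(2 * theta**2 * r + r - 2 * theta**2 * rho_star + 1),  # n^2
        4 * theta**2 * r,                                       # n^1
        -2 * theta**2 * r,                                      # n^0
    ]
    roots = np.roots(coeffs)
    real = roots[np.abs(roots.imag) < 1e-9].real
    admissible = real[real > 1]
    return float(admissible.max())
```

Setting r = 0 (free subject recruitment) reproduces the cost-free optimum of the efficiency criterion, while increasing r increases the required number of replicates per subject.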

Results
From Table 7, it is apparent that as r = c_1/c_2 increases, the required number of replicates per subject (n) increases, because the cost of recruiting a subject (c_1) rises relative to the cost of making a single observation (c_2). When r is fixed, an increase in ρ results in a decline in the required value of n and accordingly an increase in k. An increase in θ also results in a decrease in n. The general conclusion is that it is sensible to decrease the number of units associated with the higher cost while increasing those with the lower cost.
We note that by setting c_1 = 0 in Equation (8), we obtain n = 1 + (1/θ)√((1 − ρ)/(2ρ)), as in Equation (5). The situation c_1 = 0 is quite plausible, at least approximately, if the major cost lies in actually making the observations (e.g. expensive equipment, or costly interviews with free volunteer subjects). This means that a special cost structure is implicitly assumed by the optimal allocation procedure discussed earlier.

Example 2
To assess the accuracy of Doppler Echocardiography (DE) in determining the aortic valve area (AVA) in a prospective evaluation of patients with aortic stenosis, an investigator wishes to demonstrate a high degree of reliability (ρ = 0.80) in estimating AVA using the "velocity integral method", with a planned value for the WSCV of 0.10. Suppose that the total cost of the study is fixed at $1600. It is assumed that the overhead fixed cost c_0 is
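Example 2 is truncated above, but the budget bookkeeping of Equation (6) can still be illustrated with hypothetical costs (the dollar figures below, other than the $1600 total, are ours, not the paper's):

```python
def subjects_within_budget(C, c0, c1, c2, n):
    # solve the cost identity C = c0 + k*c1 + n*k*c2 for k (Equation 6)
    return (C - c0) / (c1 + n * c2)

# hypothetical split of the $1600 budget: c0 = $100 overhead,
# c1 = $50 per recruited subject, c2 = $25 per measurement
k = subjects_within_budget(1600, 100, 50, 25, n=4)
print(k)  # number of subjects affordable with n = 4 replicates each
```

In practice n would first be chosen from the cost-constrained optimization of (8) with r = c1/c2, and the budget identity then fixes k.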

Estimating the WSCV for dichotomous responses
Assumptions
Consider a random sample of k subjects, each of which is blindly evaluated n times by the same rater. We assume that all subject responses y_ij (j = 1, 2, ..., n) are dichotomous and conditionally independent, with probabilities P(y_ij = 1) = p_i (i = 1, 2, ..., k) and P(y_ij = 0) = 1 − p_i. Thus, for fixed p_i, the conditional distribution of the random variable y_i• = Σ_j y_ij follows a binomial distribution with parameters n and p_i. To account for the variation of response probabilities between subjects, as considered by Mak [15], we assume further that the probabilities p_i are independently and identically distributed as a beta distribution, Beta(α, β), with mean π = α/(α + β) and variance π(1 − π)ρ. Given these assumptions, one can show that the correlation between y_ij and y_il is in fact ρ, and that the within-subject variance is π(1 − π)(1 − ρ). Define ȳ_i• = y_i•/n, and let π̂ and ρ̂ denote the estimators of π and ρ. We therefore estimate the WSCV for binary assessments by ν̂ = √(π̂(1 − π̂)(1 − ρ̂))/π̂. A case of special interest to clinical epidemiologists is n = 2, i.e. a test-retest reliability study involving two readings per subject. For this case we investigate the sample size issue in the following section.
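For n = 2 the estimate can be computed from the three observed cell counts. The sketch below assumes the beta-binomial moments stated above and an intraclass-kappa-type estimator of ρ; the paper's exact estimator and table layout are not reproduced here, so treat the count convention as an assumption:

```python
import math

def wscv_binary(both_pos, discordant, both_neg):
    """WSCV for dichotomous responses with n = 2 readings per subject.
    Arguments are counts of subjects by response pattern."""
    k = both_pos + discordant + both_neg
    pi_hat = (2 * both_pos + discordant) / (2 * k)  # P(positive reading)
    # intraclass-kappa-type estimator: E(# discordant) = 2k*pi*(1-pi)*(1-rho)
    rho_hat = 1 - discordant / (2 * k * pi_hat * (1 - pi_hat))
    # within-subject SD over the mean: sqrt(pi(1-pi)(1-rho))/pi
    nu_hat = math.sqrt(pi_hat * (1 - pi_hat) * (1 - rho_hat)) / pi_hat
    return pi_hat, rho_hat, nu_hat
```

Under this convention ν̂ simplifies algebraically to √(2k·discordant)/(2·both_pos + discordant).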

Example 3
To illustrate the methodology discussed in this section, we use data from an investigation of mammography by Powell et al. [17] concerning the equivalence of film-screen (FS) and digital images (DI). Two readings were made on the presence/absence (1/0) of malignancy by each rater on the same set of k = 58 patients. The data and the results of the analysis are summarized in Table 9. Both methods seem to have the same levels of reliability in terms of the ICC and the WSCV. We note that the 95% confidence interval is relatively wide, which may be because the sample size is not large enough.
Note that if the observed frequencies in the sample of k subjects are given as in Table 10, we can write a simpler estimator of the WSCV as ν̂ = √(2kn_2)/(n_2 + 2n_1), where n_1 is the number of subjects with two positive readings and n_2 is the number with discordant readings. To construct a confidence interval on ν, the MLEs of ρ and π should be substituted into equation (12), where the variance of ρ̂ is as given by Donner and Eliasziw [18].

Sample size estimation
Methods
There has been increasing attention given recently to estimating sample size using a confidence interval rather than a significance testing approach (e.g. Gardner and Altman [19]). This is consistent with arguments made by many authors, including Goodman and Berlin [20], who state that "confidence intervals should play an important role when setting sample size" and that "the size of a confidence interval can be predicted in the planning stages of an experiment and this can be a great help". The required number of subjects k is then obtained from the corresponding power requirement, where z_1−α/2 and z_1−β are the critical values of the standard normal distribution corresponding to α and β.
As an example, suppose it is of interest to test H_0: v = v_0 = 0.04 versus H_1: v = v_1 = 0.10, where v_0 corresponds to high reliability.
To ensure with 80 percent probability a significant result at α = 5% when π = 0.30 and v_1 = 0.10, we compute the required number of subjects from the above equation as k = 986; when π = 0.50, k = 355. For comparison with the fixed-width CI procedure, suppose it is of interest to construct a 95% CI on v with expected width w = 0.10.

Discussion
The ICC has traditionally been used to assess the reliability of a measurement. QS considered the WSCV as an alternative measure of reproducibility for continuous-scale measurements. It should be emphasized that our investigation has not allowed for forms of systematic error (e.g. systematic measurement bias, or a trend unaccounted for in the model); a reviewer of this paper indicated that this is beyond our scope. In this paper we have dealt with the issue of sample size estimation for the WSCV from continuous and binary scale measurements, focusing on random measurement error, in the conventional way that reliability is usually discussed.
As in any reliability study, a crucial decision that a researcher faces in the design stage is the determination of the number of subjects, k, and the number of measurements per subject, n. We have discussed two alternative statistical techniques to determine an optimal allocation. When we have prior knowledge of what constitutes an acceptable level of reproducibility, a hypothesis testing approach may be used. We used this approach in the case of a binary outcome variable, following the goodness-of-fit (GOF) approach proposed by Donner and Eliasziw [18]. The application of the GOF test was straightforward because the number of replicates (n = 2) was fixed. However, there are situations in which appropriate values of the reliability coefficient under the null and alternative hypotheses may be difficult to specify. An alternative to hypothesis testing is the efficient allocation of the sample, and the guidelines provided in this article for continuous-scale measurements allow selection of the pair (n, k) that maximizes the precision of the estimated coefficient under cost constraints. We note that cost implications for dichotomous assessments are quite important, particularly when n is larger than two; we intend to report on this in a future paper.
Finally, we note that in practice the optimal allocation must take integer values, and the net loss or gain in precision as a result of rounding the values of (n, k) was negligible. Ideally, one should adopt one of the available optimization algorithms, often referred to as integer programming models. These models are suited to optimal allocation problems, since the main concern is to find the best solution(s) in a well-defined discrete space.

Conclusion
The WSCV is a useful index of measurement reliability. Investigators may design reliability studies using either efficiency or cost considerations. For continuous measurements, optimal allocation of the sample may be achieved with as few as two replications per subject. For dichotomous data, when each subject is measured twice, investigators may use either a fixed-width confidence interval or power considerations in estimating the sample size. Both methods produce comparable results.