This article has Open Peer Review reports available.
Confidence regions for repeated measures ANOVA power curves based on estimated covariance
- Matthew J Gribbin^{1}Email author,
- Yueh-Yun Chi^{2},
- Paul W Stewart^{3} and
- Keith E Muller^{4}
https://doi.org/10.1186/1471-2288-13-57
© Gribbin et al.; licensee BioMed Central Ltd. 2013
Received: 16 October 2012
Accepted: 15 March 2013
Published: 15 April 2013
Abstract
Background
Using covariance or mean estimates from previous data introduces randomness into each power value in a power curve. Creating confidence intervals about the power estimates improves study planning by allowing scientists to account for the uncertainty in the power estimates. Driving examples arise in many imaging applications.
Methods
We use both analytical and Monte Carlo simulation methods. Our analytical derivations apply to power for tests with the univariate approach to repeated measures (UNIREP). Approximate confidence intervals and regions for power based on an estimated covariance matrix and fixed means are described. Extensive simulations are used to examine the properties of the approximations.
Results
Closed-form expressions are given for approximate power and confidence intervals and regions. Monte Carlo simulations support the accuracy of the approximations for practical ranges of sample size, rank of the design matrix, error degrees of freedom, and the amount of deviation from sphericity. The new methods provide accurate coverage probabilities for all four UNIREP tests, even for small sample sizes. Accuracy is higher for higher power values than for lower power values, making the methods especially useful in practical research conditions. The new techniques allow the plotting of power confidence regions around an estimated power curve, an approach that has been well received by researchers. Free software makes the new methods readily available.
Conclusions
The new techniques allow a convenient way to account for the uncertainty of using an estimated covariance matrix in choosing a sample size for a repeated measures ANOVA design. Medical imaging and many other types of healthcare research often use repeated measures ANOVA.
Keywords
Sample size Replication study Study planning Univariate approach UNIREPBackground
Motivation
Computing power for a linear model involving repeated measures requires specifying a set of means and a covariance matrix. Scientists usually feel comfortable specifying a pattern of means that corresponds to a difference of clinical or scientific importance. However, specifying plausible variance and covariance values usually requires estimates from a previous study.
Using data from a previous study to estimate the covariance matrix makes the power value a random variable. Kraemer, et al. [1] noted that if the estimated variance is too small, i.e., when the pilot study is overly favorable, power will be over-estimated. Conversely, if the estimated variance is too large (pilot study overly conservative), power will be under-estimated. Maxwell [2] conducted simulations to illustrate the amount of bias that can occur. Taylor and Muller [3] and Muller and Pasour [4] derived exact distributions of noncentrality and power in univariate linear models based on all combinations of estimated variance and means. The results account for power computed conditional on a previous result being significant, or conditional on a previous result being non-significant. The former creates optimistic bias, while the latter creates pessimistic bias.
Providing confidence intervals to account for the uncertainty inherent in the random power values would be useful for study planning. For example, a lower bound for power would allow stating that a test has power of at least “P” to detect an effect, with a specified confidence. A confidence region for a power curve would be even more informative.
Medical imaging research motivated the work here because it often generates the type of complete data that can be handled with the univariate approach to repeated measures (UNIREP). Muller, et al. [5] reviewed the advantages gained by being able to use the UNIREP model, a special case of the general linear mixed model. The same authors described accurate and convenient power approximations for UNIREP analysis. The four UNIREP tests, Box conservative, Geisser-Greenhouse, Huynh-Feldt, and the uncorrected, all use the same test statistic. For data analysis, UNIREP tests differ only by their respective degrees of freedom due to different degrees of freedom multipliers, which measure sphericity in the error covariance for the hypothesis variables. Muller and Stewart [6] provided detailed discussion of the basic theory for both the null and non-null cases. Earlier work detailed basic UNIREP theory. Box [7, 8], Geisser and Greenhouse [9, 10] and Huynh and Feldt [11] gave null results. Davies [12] and Muller and Barton [13, 14] treated the non-null case.
Browne [15] evaluated the impact of using a pilot study to estimate the variance for a t-test. More generally, Taylor and Muller [16] demonstrated how to construct exact power confidence intervals for the general linear univariate model for a data-estimated variance and fixed means. The same authors also generalized the result to provide an exact confidence region around a power curve. Parallel results for the UNIREP setting would be equally useful. We generalize the methods in Taylor and Muller [16] to UNIREP tests for repeated measures. We use analytic and simulation results to demonstrate that the techniques allow computing approximate confidence intervals and regions for power with good accuracy for the UNIREP tests, based on an estimated covariance matrix and fixed means.
Existing results
A vector z, (n×1), is lower case bold. A matrix, z, is upper case bold with transpose Z ^{′}, inverse Z ^{−1} and generalized inverse Z ^{−}. Also, 1 _{ n } is an (n×1) vector of 1’s and I _{ n } is an (n×n) identity matrix. A diagonal matrix with (i,i) element z _{ i } is written Dg (z). The expected value, variance, and trace are E(Z), $\mathcal{V}\left(\mathit{Z}\right)$, and tr(z), respectively. Throughout, Z∼χ ^{2}(ν,ω) indicates that Z has a noncentral chi-square distribution with ν degrees of freedom and noncentrality ω, while Z∼χ ^{2}(ν) indicates a central distribution. Similarly, Z∼F(ν _{1},ν _{2},ω) indicates X has a noncentral F distribution with ν _{1} numerator and ν _{2} denominator degrees of freedom, and noncentrality ω with cumulative distribution function F _{ F }(ν _{1},ν _{2},ω). A central F is written Z∼F(ν _{1},ν _{2}) with quantile q indicated ${F}_{F}^{-1}\left(q;{\nu}_{1},{\nu}_{2}\right)$. Writing $\mathit{z}\sim {\mathcal{N}}_{p}\left(\mathit{\mu},\mathbf{\Sigma}\right)$ indicates z (p×1) is Gaussian with mean μ and covariance Σ (p×p). If z (N×p) has independent rows and ${\left[{\text{row}}_{i}\left(\mathit{Z}\right)\right]}^{\prime}\sim {\mathcal{N}}_{p}\left({\mathit{\mu}}_{i},\mathbf{\Sigma}\right)$, then $\mathit{S}={\mathit{Z}}^{\prime}\mathit{Z}\sim {\mathcal{W}}_{p}\left(N,\mathbf{\Sigma},\mathit{\Omega}\right)$ indicates S follows a Wishart distribution with N degrees of freedom, covariance Σ, and noncentrality Ω=E(Z ^{′})E(Z)Σ ^{−1}.
suchthat Cdefinesthebetween-subject effects(rank a) while U defines the within-subject effects (rank b). Requiring estimable Θ and full rank {C,U} ensure a testable hypothesis. Appropriate selections of the contrast matrices (C and U) and null matrix (Θ _{0}) allows testing important one-degree-of-freedom parameters, such as the difference between two means, or a comparison of two trends.
For M = C(X ^{′} X)^{−} C ^{′}, unscaled noncentrality is Δ = (Θ−Θ _{0})^{′} M ^{−1}(Θ−Θ _{0}), scaled noncentrality is $\mathit{\Omega}={\mathit{\Delta \Sigma}}_{\ast}^{-1}$. Here Σ _{∗}=U ^{′} ΣU=ΥDg (λ)Υ ^{′} is the covariance matrix among the hypothesis variables, with ΥΥ ^{′}=Υ ^{′} Υ=I _{ b }, and λ={λ _{ k }} the eigenvalues. Estimates are $\stackrel{~}{\mathit{B}}={\left({\mathit{X}}^{\prime}\mathit{X}\right)}^{-}{\mathit{X}}^{\prime}\mathit{Y}$ and $\hat{\mathbf{\Sigma}}={\mathit{Y}}^{\prime}\left[\mathit{I}-{\left({\mathit{X}}^{\prime}\mathit{X}\right)}^{-}{\mathit{X}}^{\prime}\right]\mathit{Y}/{\nu}_{e}$, with ν _{ e }=N−rank(X), the error degrees of freedom. Furthermore, $\widehat{\mathit{\Theta}}=\mathit{C}\stackrel{~}{\mathit{B}}\mathit{U}$, $\hat{\mathbf{\Delta}}={(\hat{\mathit{\Theta}}-{\mathit{\Theta}}_{0})}^{\prime}{\mathit{M}}^{-1}(\hat{\mathit{\Theta}}-{\mathit{\Theta}}_{0})\sim {\mathcal{W}}_{b}\left(a,{\mathbf{\Sigma}}_{\ast},\mathit{\Omega}\right)$ and ${\hat{\mathbf{\Sigma}}}_{\ast}={\mathit{U}}^{\prime}\hat{\mathbf{\Sigma}}\mathit{U}$, with ${\nu}_{e}{\hat{\mathbf{\Sigma}}}_{\ast}\sim {\mathcal{W}}_{b}\left({\nu}_{e},{\mathbf{\Sigma}}_{\ast}\right)$. The sum of squares hypothesis matrix is ${\mathit{S}}_{H}=\hat{\mathbf{\Delta}}$ and the sum of squares error matrix is ${\mathit{S}}_{E}={\nu}_{e}{\hat{\mathbf{\Sigma}}}_{\ast}$, which are independent of one another. The notation follows that in Muller and Stewart [6]. Additional notation is in Appendix A: Additional notation.
The sphericity parameter, ∊=tr^{2}(Σ _{∗})/[btr(Σ∗2)], quantifies the spread of population eigenvalues and is used to discount the degrees of freedom. The term sphericity reflects the fact that uncorrelated Gaussian variables with equal variances in three dimensions have a spherical scattergram. The eigenvalues of Σ _{∗} are the variances of the (uncorrelated) principal components of the hypothesis response variables. Perfect sphericity requires ∊ = 1, which occurs with all eigenvalues equal. Minimal sphericity has ∊=1/b, which occurs with one nonzero eigenvalue. Other patterns of Σ _{∗} have 1/b<∊<1.
The Box conservative test uses the fixed, lower bound of ∊, while the uncorrected test uses the fixed, upper bound of ∊. With sphericity (∊=1), the uncorrected test is exact and uniformly most powerful (among similarly invariant tests). The Geisser-Greenhouse and Huynh-Feldt tests use the observed data to estimate ∊. The Geisser-Greenhouse estimator, $\hat{\u220a}={\text{tr}}^{2}\left({\hat{\mathbf{\Sigma}}}_{\ast}\right)/b\text{tr}\left({\hat{\mathbf{\Sigma}}}_{\ast}^{2}\right)$, is the maximum likelihood (ML) estimator. The Huynh-Feldt estimator, $\stackrel{~}{\u220a}=\left(\mathit{\text{Nb}}\hat{\u220a}-2\right)/\left[b\left({\nu}_{e}-b\hat{\u220a}\right)\right]$ was proposed as the ratio of two unbiased estimators. Their claim holds only for the special case of rank (X)=1. Lecoutre [17] provided a more general form. In turn, Gribbin [18] and Chi et al. [19] described a rank-adjusted approximately unbiased estimator, ${\stackrel{~}{\u220a}}_{r}=\left[\left({\nu}_{e}+1\right)b\hat{\u220a}-2\right]/\left[b\left({\nu}_{e}-b\hat{\u220a}\right)\right]$, which applies to any general linear multivariate model. The rank-adjusted power approximation was shown through simulations to approximate observed mean power values as well as, or better than, the Huynh-Feldt power approximation (Chi et al. [19]). Only the rank-adjusted Huynh-Feldt estimator will be considered in the remainder of the paper.
Although the four UNIREP tests all use the same test statistic, they each use a different measure of sphericity, here indicated e. For data analysis, all four tests use a critical value $q\left(e\right)={F}_{F}^{-1}\left(1-\alpha ,{\nu}_{1}e,{\nu}_{2}e\right)$. Here ν _{1}=a b and ν _{2}=b ν _{ e }. The Box test uses e=1/b, the GG test uses $e=\hat{\u220a}$, HF uses $e=\stackrel{~}{\u220a}$, and the uncorrected test uses e=1. The p-value is then computed, for observed test statistic t, as p=1−F(t,ν _{1} e,ν _{2} e). In all cases $1/b\le \hat{\u220a}\le \stackrel{~}{\u220a}\le 1$. In turn, the p-values always have the reverse order, with the Box p-value being largest, and the uncorrected being smallest.
the diagonal elements of the scaled noncentrality, Ω _{∗}=Υ ^{′} ΔΥDg (λ)^{−1}=Δ _{∗}Dg (λ)^{−1}.
Methods
Estimating approximate UNIREP power with estimated covariance and fixed means
By extending results in Muller et al. [5], the following lemma helps simplify the F approximations. Appendix B: Supporting lemmas and proofs contains all proofs.
Lemma 1
Sphericity multipliers for UNIREP power approximations for fixed means
Multipliers | ||||||
---|---|---|---|---|---|---|
Covariance | Test | e _{1} | e _{2} | e _{3} | e _{4} | e _{5} |
Σ _{∗}(known) | Un | 1 | 1 | ∊ _{ n } | ∊ _{ d } | ∊ _{ n } |
HF | $\text{E}\left(\stackrel{~}{\u220a}\right)$ | $\text{E}\left(\stackrel{~}{\u220a}\right)$ | ∊ _{ n } | ∊ _{ d } | ∊ _{ n } | |
GG | $\text{E}\left(\hat{\u220a}\right)$ | $\text{E}\left(\hat{\u220a}\right)$ | ∊ _{ n } | ∊ _{ d } | ∊ _{ n } | |
Box | 1/b | 1/b | ∊ _{ n } | ∊ _{ d } | ∊ _{ n } | |
${\hat{\mathbf{\Sigma}}}_{\ast}$(estimated) | Un | 1 | 1 | ${\stackrel{~}{\u220a}}_{n}$ | ${\hat{\u220a}}_{d}$ | ${\stackrel{~}{\u220a}}_{n}$ |
HF | ${\stackrel{~}{\u220a}}_{r}$ | ${\stackrel{~}{\u220a}}_{r}$ | ${\stackrel{~}{\u220a}}_{n}$ | ${\hat{\u220a}}_{d}$ | ${\stackrel{~}{\u220a}}_{n}$ | |
GG | ${\hat{\u220a}}_{d}$ | ${\hat{\u220a}}_{d}$ | ${\stackrel{~}{\u220a}}_{n}$ | ${\hat{\u220a}}_{d}$ | ${\stackrel{~}{\u220a}}_{n}$ | |
Box | 1/b | 1/b | ${\stackrel{~}{\u220a}}_{n}$ | ${\hat{\u220a}}_{d}$ | ${\stackrel{~}{\u220a}}_{n}$ |
In practice, some elements of $\{{e}_{1},{e}_{2},{e}_{3},{e}_{4},{e}_{5},\text{tr}(\mathbf{\Delta}),\overline{\lambda}\}$ may be estimated and hence random. The random elements imply random power values, as with estimated covariance and fixed means, $\{{\hat{\mathbf{\Sigma}}}_{\ast},\mathbf{\Delta}\}$, for ${\hat{\mathbf{\Sigma}}}_{\ast}\phantom{\rule{0.3em}{0ex}}=\phantom{\rule{0.3em}{0ex}}{\hat{\mathit{E}}}^{\prime}\hat{\mathit{E}}/{\nu}_{\text{est}}$, the unbiased restricted maximum likelihood (REML) estimator. A distinction must be carefully maintained between the estimation study and target study. The estimation study provides the covariance estimate and has sample size N _{est}, design matrix rank of rank (X _{est}), and ν _{est}=N _{est}−rank(X _{est}) degrees of freedom. The target study for which power is desired has sample size N, rank (X) and ν _{ e }=N−rank(X) degrees of freedom.
A better choice, given in the following lemma, uses a ratio of unbiased estimators. The result generalizes the rank-adjusted Huynh-Feldt estimator for data analysis. Appendix B: Supporting lemmas and proofs has derivations of moments as well as all proofs.
Lemma 2
The corresponding estimator for the null case is ${\stackrel{~}{\u220a}}_{r}=\left[\left({\nu}_{\text{est}}+1\right)b\hat{\u220a}-2\right]/\left[b\left({\nu}_{\text{est}}-b\hat{\u220a}\right)\right]$, the rank-adjusted Huynh-Feldt sphericity estimator.
with $\overline{\hat{\lambda}}=\text{tr}\left({\hat{\mathbf{\Sigma}}}_{\ast}\right)/b$, and e _{1} through e _{5} estimated if unknown (Table 1). Nearly every combination of ${\hat{\u220a}}_{n}$, ${\stackrel{~}{\u220a}}_{n}$, ${\hat{\u220a}}_{d}$, ${\stackrel{~}{\u220a}}_{r}$, 1 and 1/b was examined for each UNIREP test for the wide range of simulations discussed in Muller et al. [5]. The values chosen provided the most accurate results. In retrospect, they are natural choices as well.
Approximate UNIREP power confidence intervals
Taylor and Muller [16] recommended one-sided power confidence intervals by noting that “the change from a one-sided to a two-sided confidence interval has little effect on the upper bound, but a large effect on the lower bound”. Muller and Fetterman [20] provided examples of a one-sided power confidence interval in the univariate case.
Approximate UNRIEP power confidence regions for power curves
The new methods allow calculating a confidence interval for a single power value. The logic of a proof in Taylor and Muller ([16] equations 2.1-2.13 and surrounding text) guarantees that accurate confidence regions are provided by the point-wise calculations. The proof may be sketched for the present setting as follows. Equations 14-21 establish the validity of the approximate confidence interval for a particular alternative hypothesis, as specified by the scalar constant tr(Δ). The randomness in the noncentrality arises from a scalar random variable, ${\stackrel{~}{\lambda}}_{\ast 1}$, analogous to a variance. Equation 19 describes a single event with a specified probability. The inequality defining the event, and the associated probability, do not change for different values of the scalar constant tr(Δ). The smooth and strictly monotone dependence of power on the noncentrality ensures the validity of equations 20-21. The proof is completed by noting that the monotonicity extends the simultaneity property to the power confidence region.
In some cases scientists prefer to consider sample size as a function of the pattern of mean differences. The theory already presented allows plotting sample size as a function of mean difference, albeit with a shift in algorithm. The power function must be numerically inverted to solve for the sample size desired. Taylor and Muller [16] outlined the steps of algorithm needed for the univariate case. Details are not presented here for the sake of brevity.
Results
Simulation overview
The accuracy of the new approximate confidence intervals isevaluatedfor a widerange of conditions. Appendix C: Simulation details contains more details of the simulations and examples. All simulations were conducted in SAS/IML (SAS 9.1, SAS Institute, 2003) using a version of LINMOD 3.4 modified to include the rank-adjusted Huynh-Feldt estimator and test. Predicted power values and approximate power confidence intervals were computed using a similarly modified version of POWERLIB 2.03. The modified versions of LINMOD and POWERLIB are available at http://www.health-outcomes-policy.ufl.edu/muller/.
Simulation 1 with rank (X)=1(one-group repeated measures ANOVA)
The accuracy of the new approximate confidence intervals were evaluated for a completely within-subject design with p=9 repeated measures, N∈{10,20,40}, and q=rank(X)=1. Values for B, contrast matrices C, U, and Θ _{0} were chosen to test a within-subject interaction for α=0.05. The model was chosen to ensure predicted power values for the Geisser-Greenhouse test of 0.20, 0.50, and 0.80, using the power approximation in Muller et al. [5]. Population covariance matrices were chosen to provide ∊∈{0.282,0.505,0.720,1.00}. The sphericity values were selected to cover a range of eigenvalue patterns (i.e., patterns of the principal component variances) arising from the structure of Σ _{∗}. For example, if b=3, then $\mathit{\lambda}={\left[\begin{array}{lll}1& 0.12& 0.12\end{array}\right]}^{\prime}$ gives ∊≈0.50. In turn, $\mathit{\lambda}={\left[\begin{array}{lll}1& 0& 0\end{array}\right]}^{\prime}$gives ∊=1/3≈0.33. Pseudo-random realizations of the error matrix, E, were generated and tests were calculated. The observed mean power values for the four UNIREP tests were calculated and tabulated for 500,000 replications per condition.
For the conditions described above, additional pseudo-random realizations of the error matrix were generated using an estimating study with sample size, N _{est}, of 10 and rank of X, rank (X _{est}), of 1 with 500,000 replications per condition for all four UNIREP tests. Corresponding estimated covariance matrices were calculated, as well as lower and upper bounds for power. Both one- and two-sided confidence intervals were evaluated with target coverages of 90% and 95%. The number of replications gave a standard error of estimated coverage probability less than or equal to 0.0003 for 1−α=0.95, and 0.0004 for 1−α=0.90, nearly guaranteeing 3 digits of accuracy. Only coverage of observed mean power values, and not predicted, was tabulated. The accuracy of the predicted power values, with respect to the observed, made it essentially redundant to consider both.
Target 95% CI (two-sided) estimated coverage (×100) of simulated population power for N = 10
Test | ∊ | Power | Lower tail | Coverage | Upper tail |
---|---|---|---|---|---|
0.282 | 0.123 | 1.1 | 97.8 | 1.1 | |
0.535 | 1.9 | 97.0 | 1.1 | ||
0.930 | 1.7 | 97.3 | 1.0 | ||
0.505 | 0.054 | 0.1 | 97.3 | 2.6 | |
0.266 | 0.5 | 97.0 | 2.5 | ||
Box | 0.690 | 1.1 | 97.0 | 1.9 | |
0.720 | 0.052 | 0.4 | 94.1 | 5.5 | |
0.227 | 0.6 | 96.8 | 2.6 | ||
0.569 | 1.4 | 97.0 | 1.6 | ||
1 | 0.023 | 0.6 | 85.1 | 14.3 | |
0.117 | 0.5 | 96.0 | 3.5 | ||
0.350 | 0.8 | 97.8 | 1.4 | ||
0.282 | 0.155 | 3.1 | 94.7 | 2.2 | |
0.585 | 2.6 | 95.6 | 1.8 | ||
0.942 | 1.8 | 96.6 | 1.6 | ||
0.505 | 0.162 | 5.4 | 87.7 | 6.9 | |
0.520 | 3.8 | 90.6 | 5.6 | ||
GG | 0.870 | 2.6 | 92.4 | 5.0 | |
0.720 | 0.203 | 2.4 | 92.3 | 5.3 | |
0.539 | 2.6 | 94.1 | 3.3 | ||
0.856 | 3.3 | 94.2 | 2.5 | ||
1 | 0.161 | 0.7 | 95.6 | 3.7 | |
0.438 | 1.4 | 97.0 | 1.6 | ||
0.751 | 2.7 | 96.2 | 1.1 | ||
0.282 | 0.166 | 3.8 | 93.5 | 2.7 | |
0.602 | 2.8 | 95.2 | 2.0 | ||
0.946 | 1.9 | 96.3 | 1.8 | ||
0.505 | 0.210 | 8.2 | 82.9 | 8.9 | |
0.592 | 4.7 | 88.5 | 6.8 | ||
HF | 0.902 | 2.9 | 90.9 | 6.2 | |
0.720 | 0.271 | 3.6 | 90.9 | 5.5 | |
0.631 | 3.4 | 93.3 | 3.3 | ||
0.904 | 4.0 | 93.6 | 2.4 | ||
1 | 0.224 | 0.8 | 96.7 | 2.5 | |
0.531 | 1.8 | 97.1 | 1.1 | ||
0.821 | 3.2 | 95.9 | 0.9 |
Table 2 also contains coverage results for the Geisser-Greenhouse and the rank-adjusted Huynh-Feldt tests. The target 95% estimated coverage is consistently reached for extreme sphericity values for both tests. For midrange sphericity values, the coverage fell below the target coverage from 0.8% to 7.3% for the Geisser-Greenhouse, and 1.4% to 12.1% for the rank-adjusted Huynh-Feldt. Coverage accuracy improved as the estimated power increased. In practice, lower power values are of little concern. For target power of 0.80 for the Geisser-Greenhouse test, the largest deviation from the target 95% estimated coverage was 2.6% for the Geisser-Greenhouse test and 4.1% for the rank-adjusted Huynh-Feldt test. Both occurred for the population sphericity value of 0.505.
Target 95% CI (Two-sided) estimated coverage (×100) of simulated population power for the uncorrected test (∊ = 1.00)
N | Population power | Lower tail | Coverage | Upper tail |
---|---|---|---|---|
10 | 0.238 | 0.5 | 97.5 | 2.0 |
0.551 | 1.5 | 97.6 | 0.9 | |
0.835 | 3.2 | 96.1 | 0.7 | |
20 | 0.215 | 0.8 | 97.3 | 1.9 |
0.520 | 1.6 | 97.6 | 0.8 | |
0.814 | 3.1 | 96.2 | 0.7 | |
40 | 0.207 | 0.9 | 97.2 | 1.9 |
0.509 | 1.5 | 97.7 | 0.8 | |
0.806 | 3.0 | 96.3 | 0.7 |
Although not presented here, in general, the accuracy of the coverage improved directly with increasing sample size, for all tests and conditions. The accuracy of the approximate confidence bounds for all four UNIREP tests also improved as the population sphericity increased.
Simulation 2 with rank (X)>1
All of the simulations in the second example considered the condition of rank of X greater than 1. The cases used p=5 repeated measures, N∈{16,32,48}, and q=rank(X)∈{2,4,8,16}, corresponding to a three-, five-, nine-, and seventeen-group comparison, respectively. Appropriate fixed matrices of regression coefficients, B, contrast matrices, C and U, and Θ _{0} were chosen to test a within-subject interaction for a test size, α, of 0.05. The matrices were also chosen to ensure approximate target predicted power values for the rank-adjusted Huynh-Feldt test of 0.20, 0.50, and 0.80. Specific design matrices, X, were defined. Population covariance matrices were chosen to provide specific population sphericity values, ∊∈{0.282,0.505,0.720,1.00}. Observed mean power values were simulated and tabulated in a similar manner to that described in section ‘Simulation 1 with rank (X)=1 (one-group repeated measures ANOVA)’.
Pseudo-random realizations of the error matrix were generated using an estimating study with sample size, N _{est}, of 16 and rank of X, rank (X _{est}), of 4 with 500,000 replications per condition for all four UNIREP tests. Corresponding estimated covariance matrices were calculated, as well as lower and upper bounds for power using the methods presented in section ‘Approximate UNIREP power confidence intervals UNIREP power confidence intervals’. Approximate confidence interval coverage was defined as the proportion of the 500,000 simulated bound realizations that successfully covered the observed mean power values for each condition described above. Only coverage of observed mean power values, and not predicted, were tabulated. The accuracy of the predicted power values, with respect to the observed, made it essentially redundant to consider both. Both one- and two-sided confidence intervals were evaluated with target coverages of 90% and 95%.
In practical biomedical research, low power values are of little concern. Rarely will one have a power targeted below 0.70. Therefore, only the results for target power values of 0.80 will be presented and discussed. Power confidence interval coverage converged to the target coverage as sample size increased. Only the worst case results for two-sided 95% confidence intervals are presented here. The worst cases occurred with the smallest sample size for the target study, for a variety of population sphericity values and estimated population powers.
Simulated population power for target power = 0.80, N = 16 and rank ( X ) = q
N | Simulated population power | |||||||
---|---|---|---|---|---|---|---|---|
∊=0.282 | ∊=0.505 | |||||||
q | Box | GG | HF | Box | GG | HF | ||
16 | 2 | 0.779 | 0.811 | 0.817 | 0.561 | 0.778 | 0.809 | |
4 | 0.763 | 0.797 | 0.805 | 0.510 | 0.762 | 0.802 | ||
8 | 0.753 | 0.787 | 0.799 | 0.455 | 0.736 | 0.796 | ||
∊=0.720 | ∊=1.000 | |||||||
q | Box | GG | HF | Box | GG | HF | UN | |
16 | 2 | 0.457 | 0.760 | 0.805 | 0.399 | 0.748 | 0.790 | 0.799 |
4 | 0.355 | 0.740 | 0.801 | 0.255 | 0.724 | 0.787 | 0.801 | |
8 | 0.267 | 0.695 | 0.795 | 0.138 | 0.655 | 0.775 | 0.800 | |
∊=0.282 | ∊=0.505 | |||||||
q | Box | GG | HF | Box | GG | HF | ||
48 | 2 | 0.803 | 0.843 | 0.845 | 0.586 | 0.802 | 0.813 | |
4 | 0.773 | 0.812 | 0.815 | 0.552 | 0.794 | 0.805 | ||
8 | 0.766 | 0.800 | 0.803 | 0.544 | 0.787 | 0.799 | ||
16 | 0.762 | 0.796 | 0.799 | 0.522 | 0.780 | 0.795 | ||
∊=0.720 | ∊=1.000 | |||||||
q | Box | GG | HF | Box | GG | HF | UN | |
48 | 2 | 0.500 | 0.792 | 0.806 | 0.455 | 0.785 | 0.797 | 0.800 |
4 | 0.427 | 0.787 | 0.802 | 0.346 | 0.781 | 0.796 | 0.800 | |
8 | 0.402 | 0.781 | 0.799 | 0.280 | 0.778 | 0.797 | 0.801 | |
16 | 0.359 | 0.775 | 0.799 | 0.221 | 0.770 | 0.795 | 0.801 |
Target 95% CI (two-sided) estimated coverage (×100) of simulated population power for target power = 0.80, N = 16, rank (X) = q, N _{ est } = 16, and rank ( X _{ est } ) = 4
∊ = 0.282 | ∊ = 0.505 | |||||||
---|---|---|---|---|---|---|---|---|
q | Box | GG | HF | Box | GG | HF | ||
2 | 97.8 | 97.2 | 97.0 | 97.5 | 93.4 | 92.3 | ||
4 | 93.7 | 92.0 | 91.6 | 95.6 | 86.8 | 85.0 | ||
8 | 90.9 | 87.9 | 87.2 | 94.8 | 81.4 | 79.0 | ||
∊=0.720 | ∊=1.000 | |||||||
Box | GG | HF | Box | GG | HF | UN | ||
2 | 97.6 | 95.4 | 94.9 | 97.4 | 95.3 | 95.5 | 95.8 | |
4 | 97.5 | 93.6 | 92.9 | 97.6 | 96.8 | 97.0 | 97.4 | |
8 | 96.9 | 90.6 | 89.8 | 97.0 | 96.1 | 96.9 | 97.4 |
Although not presented here, in general, as sample size increased the conservative coverage values observed for the Box conservative and the uncorrected tests slowly converged to the target coverage value. This trend was observed for the conservative coverage values with the extreme population sphericity values for the Geisser-Greenhouse and the rank-adjusted Huynh-Feldt tests as well. The same is true of the liberal coverage values observed for the midrange population sphericity values for the Geisser-Greenhouse and the rank-adjusted Huynh-Feldt tests. Similar results were obtained for the target 90% two-sided confidence interval coverage as well as the 95% and 90% one-sided confidence intervals coverage.
Target 95% CI (two-sided) estimated coverage (×100) of simulated population power for ∊=0.505 , target power = 0.80, N= 48 and rank (X) = q
Box coverage | GG coverage | HF coverage | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
rank ( X _{ est } ) | rank ( X _{ est } ) | rank ( X _{ est } ) | ||||||||
N _{ est } | q | 2 | 4 | 8 | 2 | 4 | 8 | 2 | 4 | 8 |
16 | 2 | 97.5 | 97.2 | 97.4 | 94.1 | 94.3 | 94.9 | 92.2 | 92.1 | 93.6 |
4 | 94.8 | 94.8 | 95.3 | 87.3 | 87.5 | 88.9 | 85.9 | 86.2 | 87.4 | |
8 | 92.6 | 92.7 | 93.4 | 83.2 | 83.4 | 86.0 | 81.4 | 82.0 | 84.3 | |
16 | 92.0 | 92.3 | 93.7 | 82.3 | 82.5 | 85.9 | 80.5 | 80.7 | 83.2 | |
32 | 2 | 97.3 | 97.3 | 97.3 | 93.3 | 93.3 | 93.4 | 92.4 | 92.5 | 92.6 |
4 | 93.8 | 94.1 | 94.3 | 85.7 | 85.2 | 85.8 | 85.0 | 84.8 | 84.5 | |
8 | 91.6 | 91.8 | 91.4 | 81.3 | 81.5 | 82.4 | 79.5 | 80.0 | 80.6 | |
16 | 91.5 | 91.7 | 91.7 | 79.4 | 78.9 | 80.0 | 79.2 | 79.1 | 79.5 | |
64 | 2 | 97.2 | 97.2 | 97.4 | 93.6 | 93.7 | 93.5 | 92.6 | 92.5 | 92.8 |
4 | 94.4 | 94.6 | 94.8 | 84.5 | 85.0 | 84.7 | 84.4 | 85.0 | 85.2 | |
8 | 91.7 | 91.5 | 91.8 | 79.6 | 80.1 | 80.4 | 78.9 | 79.2 | 79.6 | |
16 | 90.9 | 90.9 | 91.0 | 78.5 | 78.4 | 78.7 | 78.9 | 78.4 | 78.7 |
In general, for population sphericity values of 0.282 and 0.505, the approximate power confidence interval coverage for the Box conservative test converged to the target coverage value as rank of X _{est} increased, and thus ν _{est} decreased. Coverage decreased as rank of X from the target study increased. For larger rank of X, the approximate power confidence interval coverage fell short of the target coverage in several instances. No clear trend was apparent as N _{est} increased. The Box conservative test would not be used for larger population sphericity values. However, the realization that the target coverage was reached in nearly every case considered for the larger population sphericity values is worth mentioning.
The approximate power confidence interval coverages for both the Geisser-Greenhouse and rank-adjusted Huynh-Feldt tests seem to have converged to the target coverage value as rank of X _{est} increased, and thus ν _{est} decreased, except in cases of sphericity. Such cases have little practical importance since exact results are available if sphericity is valid. Coverage decreased as rank of X from the target study increased. As observed in previous simulations, the approximate power confidence interval coverages for both the Geisser-Greenhouse and rank-adjusted Huynh-Feldt tests fell short of the target coverage to varying degrees in nearly every case considered for midrange population sphericity values. This outcome was also observed for larger rank of X from the target study for population sphericity of 0.282. The approximate power confidence interval coverage for the uncorrected tests reached the target coverage value in every case except for large ν _{est} and small rank of X from the target study. The approximate power confidence interval coverage increased as the ranks of X for both the target and estimating studies increased and as N _{est} decreased.
The slow convergence of the approximate power confidence interval coverage to the target coverage may be due, in part, to use of ${\stackrel{~}{\u220a}}_{n}$ and ${\stackrel{~}{\u220a}}_{r}$ in the approximate power confidence interval equation. These estimators of the sphericity parameter are ratios of unbiased estimators for the non-null and null cases, respectively. The variances of these estimators are much larger than the variances for ${\hat{\u220a}}_{n}$ and ${\hat{\u220a}}_{d}$. The larger variances may account for the slow convergence to the population power as the target and estimating study sample sizes and degrees of freedom increase. Further simulations may be needed to confirm this reasoning.
Alternate approximations
In attempts to develop even better confidence bound estimates, additional approximations were developed and evaluated. One approach approximated the distribution of ${\stackrel{~}{\lambda}}_{\ast 1}$ with an F. Using the methods presented in Kim et al. [22], the numerator of ${\stackrel{~}{\lambda}}_{\ast 1}$ was approximated with a weighted noncentral chi-square, while the denominator was approximated with a weighted central chi-square. Two concerns arose. First, the denominator is not necessarily a central quadratic. The 2tr(Δ/a) component makes the denominator more of a shifted central quadratic. Second, the Kim et al. [22] result requires that the components of the numerator and denominator be mutually independent, which does not hold. Simulations demonstrated that the approximation was inaccurate in small samples.
Alternative approximations were developed and evaluated. The alternatives matched only the numerator to a weighted noncentral chi-square or to a weighted central chi-square with the denominator a constant equal to E$\left[\text{tr}\right({\hat{\mathbf{\Sigma}}}_{\ast})+2\text{tr}(\mathbf{\Delta}/a\left)\right]$. All were less accurate than the approximation presented here.
Discussion
Even for small sample sizes, the proposed power confidence intervals attain very accurate coverage probabilities for the Box conservative test in all cases and for the uncorrected test with ∊=1 (the only case for which it should be used). The result is also true for the extreme population sphericity values for the Geisser-Greenhouse and rank-adjusted Huynh-Feldt tests. For midrange population sphericity values, the coverage probabilities of the approximate power confidence intervals for the latter two tests often fell somewhat short of the various target coverage values considered. Coverage probabilities improve as sample size increases. Accuracy is better for higher target power values than for lower, which makes the results useful in practice. One-sided confidence intervals are recommended for lower bounds on power.
The techniques also allow plotting power confidence regions around an estimated power curve (Figure 1). The resulting plots have been well received by researchers.
Conclusions
Good statistical practice requires associating a credible measure of uncertainty with any parameter estimate. We described and evaluated new methods to meet the need for UNIREP power estimates based on an estimated covariance with fixed means. Across a large range of conditions, the methods provide reasonably accurate coverage for all four UNIREP tests.
Appendix
Appendix A: Additional notation
They used the parameters (assumed known) to approximate the UNIREP test statistic with a noncentral F distribution, as presented in equation 6.
Appendix B: Supporting lemmas and proofs
Lemma B.1
A.1-A.5 imply $\left\{{S}_{t1},{S}_{t2},{S}_{t3},{S}_{t4}\right\}=\phantom{\rule{2pt}{0ex}}\left\{\text{tr}\left({\mathbf{\Sigma}}_{\ast}\right),\right.\left(\right)close="\}">\text{tr}\left({\mathbf{\Sigma}}_{\ast}^{2}\right),\text{tr}\left(\mathbf{\Delta}\right),\text{tr}\left({\mathbf{\Sigma}}_{\ast}\mathbf{\Delta}\right)$.
Proof of lemma B.1
Lemma B.2
The first moments of $\text{tr}\left({\hat{\mathbf{\Sigma}}}_{\ast}\right)$, $\text{tr}\left({\hat{\mathbf{\Sigma}}}_{\ast}^{2}\right)$, ${\text{tr}}^{2}\left({\hat{\mathbf{\Sigma}}}_{\ast}\right)$, and $\text{tr}\left(\mathbf{\Delta}{\hat{\mathbf{\Sigma}}}_{\ast}\right)$ are known.
Proof of lemma B.2
Proof of lemma 1
Pr{(y _{∗1}/ν _{∗1})/(y _{∗2}/ν _{∗2})≤t λ _{∗2} a b ν _{∗2}/(λ _{∗1} ν _{∗1} b ν _{ e })}=F _{ F }(t;ν _{∗1},ν _{∗2},ω _{∗}).
Proof of lemma 2
In the null case Δ=0 and ${\stackrel{~}{\u220a}}_{n}$ reduces to the rank-adjusted sphericity estimator, ${\stackrel{~}{\u220a}}_{r}=\left[\left({\nu}_{e}\phantom{\rule{0.3em}{0ex}}+\phantom{\rule{0.3em}{0ex}}1\right)\right.\left(\right)close="]">\phantom{\rule{0.3em}{0ex}}b\hat{\u220a}-2$ $\left[b\left({\nu}_{e}-b\hat{\u220a}\right)\right]={\stackrel{~}{\u220a}}_{n}|\mathbf{\Delta}=0$.
Appendix C: Simulation details
Covariance conditions
Thus, ∊∈{0.28,0.51,0.72,1.00}. Given Σ _{∗}=Dg(λ _{ j }), it follows that Σ=UΣ _{∗} U ^{′}.
CLAHE mammography example
With L the linear and Q the quadratic trends for interaction components being tested, ${\mathit{U}}_{\text{cr}}=\left[\begin{array}{llll}{\mathit{u}}_{\text{LL}}& {\mathit{u}}_{\text{LQ}}& {\mathit{u}}_{\text{QL}}& {\mathit{u}}_{\text{QQ}}\end{array}\right]$.
All four covariance patterns were factorially combined with N∈{10,20,40}. The multivariate test considered ${\mathit{\Theta}}_{\text{cr}}={\beta}_{P}\xb7\left[\begin{array}{llll}0.5& 1.0& -1.0& 0.5\end{array}\right]$ with α=0.05, and β _{ P } the scaling factor for B corresponding to approximate target power P∈{0.20,0.50,0.80} for the Geisser-Greenhouse approximation using methods in Muller et al. [5]. The conditions in the example were used in section ‘Simulation 1 with rank (X)=1 (one-group repeated measures ANOVA)’. More details of the example are in Muller et al. [5].
Test of interaction with rank (X)>1Example
Each row of the between-subject contrast, C, a (q−1×q) orthonormal matrix, contained one of the (q−1) trends. The contrasts define a test of interaction of between- and within-subject trends. Without loss of generality, we assumed Θ _{0}=0. A test size, α, of 0.05 was used.
Computational methods
All power computations were conducted in SAS/IML (SAS 9.1, SAS Institute, 2003). Free software LINMOD 3.4 was used for all data analysis and includes new methods. Free software POWERLIB 2.1 in Johnson et al. [25] was used for all power analysis and includes the new methods. Both are available at http://health-outcomes-policy.ufl.edu/faculty-directory/-muller-keith/list-of-software/. UNIREP power is also available in GLIMMPSE, a free web-browser based program with a graphical user interface aimed at health scientists ( http://www.SampleSizeShop.org). The next version of GLIMMPSE is expected to implement the confidence interval methods.
Declarations
Acknowledgements
We thank the reviewers Drs. Warren Comulada, Peter Lachenbruch, and Russell Lenth for their timely review and beneficial comments. This work was supported in part by a UF CTSI core grant to Chi and Muller (NCATS UL1TR000064). Chi’s support included NIDCR R01-DE020832, NIDDK R01-DK072398, NIDA R01-DA031017, and NICHD P01-HD065647. Muller’s support included NIDCR R01-DE020832-01A1, NIDDK R01-DK072398, NIDCR U54-DE019261, NCRR K30-RR022258, NHLBI R01-HL091005, NIAAA R01-AA013458-01, and NIDA R01-DA031017.
Authors’ Affiliations
References
- Kraemer HC, Mintz J, Noda A, Tinklenberg J, Jerome A, Yesavage: Caution regarding the use of pilot studies to guide power calculations for study proposals. Arch Gen Psychiatry. 2006, 63: 484-489. 10.1001/archpsyc.63.5.484.View ArticlePubMedGoogle Scholar
- Maxwell SE: The persistence of underpowered studies in psychological research:causes, consequences, and remedies. Psychol Methods. 2004, 9: 147-163.View ArticlePubMedGoogle Scholar
- Taylor DJ, Muller KE: Bias in linear model power and sample size calculation due to estimating noncentrality. Commun Stat Theory Methods. 1996, 25: 1595-1610. 10.1080/03610929608831787.View ArticleGoogle Scholar
- Muller KE, Pasour VB: Bias in linear model power and sample size due to estimating variance. Commun Stat Theory Methods. 1997, 26: 839-851. 10.1080/03610929708831953.View ArticleGoogle Scholar
- Muller KE, Edwards LJ, Simpson SL, Taylor DJ: Statistical tests with accurate size and power for balance linear mixed models. Stat Med. 2007, 26: 3639-3660. 10.1002/sim.2827.View ArticlePubMedGoogle Scholar
- Muller KE, Stewart PW: Linear Model, Theory: Univariate, Multivariate and Mixed Models. 2006, New York: WileyView ArticleGoogle Scholar
- Box GEP: Some theorems on quadratic forms applied in the study of analysis of variance problems: I. Effects of inequality of variance in a one-way classification. Ann Math Stat. 1954, 25: 290-302. 10.1214/aoms/1177728786.View ArticleGoogle Scholar
- Box GEP: Some theorems on quadratic forms applied in the study of analysis of variance problems: II. Effects of inequality of variance and of correlation between errors in the two-way classification. Ann Math Stat. 1954, 25: 484-498. 10.1214/aoms/1177728717.View ArticleGoogle Scholar
- Geisser S, Greenhouse SW: An extension of Box’s results on the use of the F distribution in multivariate analysis. Ann Math Stat. 1958, 29: 885-891. 10.1214/aoms/1177706545.View ArticleGoogle Scholar
- Greenhouse SW, Geisser S: On methods in the analysis of profile data. Psychometrika. 1959, 24: 95-112. 10.1007/BF02289823.View ArticleGoogle Scholar
- Huynh H, Feldt LS: Estimation of the Box corrections for degrees of freedom from sample data in randomized block and split-plot designs. J Edu Stat. 1976, 1: 69-82.Google Scholar
- Davies RB: Distribution of a linear combination of chi-square random variables. Appl Stat. 1980, 29: 323-333. 10.2307/2346911.View ArticleGoogle Scholar
- Muller KE, Barton CN: Approximate power for repeated measures ANOVA lacking sphericity. J Am Stat Assoc. 1989, 84: 549-555. 10.1080/01621459.1989.10478802.View ArticleGoogle Scholar
- Muller KE, Barton CN: Correction to Approximate power for repeated-measures ANOVA lacking sphericity. J Am Stat Assoc. 1991, 86: 255-256.Google Scholar
- Browne RH: On the use of a pilot sample for sample size determination. Stat Med. 1995, 14: 1933-1940. 10.1002/sim.4780141709.View ArticlePubMedGoogle Scholar
- Taylor DJ, Muller KE: Computing confidence bounds for power and sample size of the general linear univariate model. Am Stat. 1995, 49: 43-47.PubMedPubMed CentralGoogle Scholar
- Lecoutre B: A correction for the $\stackrel{~}{\u220a}$, approximate test with repeated measures design with two or more independent groups. J Educ Stat. 1991, 16: 371-372.View ArticleGoogle Scholar
- Gribbin MJ: Better power methods for the univariate approach to repeated measures. PhD Dissertation. 2007, University of North Carolina at Chapel Hill, Department of BiostatisticsGoogle Scholar
- Chi YY, Gribbin MJ, Lamers Y, Gregory JF, Muller KE: Global hypothesis testing for high-dimensional repeated measures outcomes. Stat Med. 2012, 31: 724-742. 10.1002/sim.4435.View ArticlePubMedGoogle Scholar
- Muller KE, Fetterman BA: Regression and, ANOVA: An Integrated Approach Using SAS®; Software Chapter 17. 2002, Cary: SAS InstituteGoogle Scholar
- Morrison DF: Multivariate Statistical, Methods. 1990, New York: McGraw-HillGoogle Scholar
- Kim HY, Gribbin MJ, Muller KE, Taylor DJ: Analytic, computational and approximate forms for ratios of noncentral and central Gaussian quadratic forms. J Comput Graphical Stat. 2006, 15: 443-459. 10.1198/106186006X112954.View ArticleGoogle Scholar
- Wishart J: The generalized product moment distribution in samples from a normal multivariate population. Biometrika. 1928, 20A: 32-52.View ArticleGoogle Scholar
- Coffey CS, Muller KE: Properties of internal pilots with the univariate approach to repeated measures. Stat Med. 2003, 22: 2469-2485. 10.1002/sim.1466.View ArticlePubMedGoogle Scholar
- Johnson JL, Muller KE, Slaughter JC, Gurka MJ, Gribbin MJ, and Simpson SL: POWERLIB: SAS/IML software for computing power in multivariate linear models. J Stat Softw. 2009, 30: 1-27. [http://www.jstatsoft.org/v30/i05/paper]View ArticleGoogle Scholar
- The pre-publication history for this paper can be accessed here:http://www.biomedcentral.com/1471-2288/13/57/prepub
Pre-publication history
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.