
Statistical power as a function of Cronbach alpha of instrument questionnaire items

BMC Medical Research Methodology 2015, 15:86

https://doi.org/10.1186/s12874-015-0070-6

Received: 18 April 2015

Accepted: 18 September 2015

Published: 14 October 2015

Abstract

Background

In countless clinical trials, measurement of outcomes relies on instrument questionnaire items, which often suffer from measurement error that in turn affects the statistical power of study designs. The Cronbach alpha, or coefficient alpha, here denoted by C α , can be used as a measure of the internal consistency of parallel instrument items developed to measure a target unidimensional outcome construct. The scale score for the target construct is often represented by the sum of the item scores. However, power functions based on C α have been lacking for various study designs.

Methods

We formulate a statistical model for parallel items to derive power functions as a function of C α under several study designs. To this end, we adopt a fixed true score variance assumption, as opposed to the usual fixed total variance assumption. This assumption is critical and practically relevant: it implies that smaller measurement errors accompany higher inter-item correlations, and thus that greater C α is associated with greater statistical power. We compare the derived theoretical statistical power with empirical power obtained through Monte Carlo simulations for the following comparisons: one-sample comparison of pre- and post-treatment mean differences, two-sample comparison of pre-post mean differences between groups, and two-sample comparison of mean differences between groups.

Results

It is shown that C α is the same as the test-retest correlation of the scale scores of parallel items, which enables testing the significance of C α . Closed-form power functions and sample size determination formulas are derived in terms of C α for all of the aforementioned comparisons. The power functions are shown to be increasing functions of C α , regardless of the comparison of interest. The derived power functions are well validated by simulation studies, which show that the magnitudes of the theoretical power are virtually identical to those of the empirical power.

Conclusion

Regardless of research design or setting, the development and use of instruments with greater C α , or equivalently with greater inter-item correlations, is crucial for increasing the statistical power of trials that intend to use questionnaire items for measuring research outcomes.

Discussion

Further development of the power functions for binary or ordinal item scores, and under more general item correlation structures reflecting real-world situations, would be a valuable future study.

Keywords

Cronbach alpha; Coefficient alpha; Test-retest correlation; Internal consistency; Reliability; Statistical power; Effect size

Background

Use of instrument questionnaire items is essential for measuring outcomes of interest in innumerable clinical trials. Many trials use well-established instruments; for example, major depressive disorders are often evaluated by scores on the Hamilton Rating Scale of Depression (HRSD) [1] in psychiatry trials. However, it is far more often the case that instruments germane to a research outcome are not available. In such cases, questionnaire items need to be developed to measure the outcome, and their psychometric properties should be evaluated for construct validity, internal consistency, and reliability, among others [2, 3]. The internal consistency of instrument items quantifies how similarly, in an interrelated fashion, the items represent the outcome construct that the instrument aims to measure [4], whereas reliability is defined as the squared correlation between the true score and the observed score [3].

Cronbach alpha, also known as coefficient alpha [5] and hereafter denoted by C α , has been very widely used to quantify the internal consistency and reliability of items in clinical research and beyond [6], although internal consistency and reliability are not exchangeable psychometric concepts in general. For this reason, some argue that C α should not be used for quantifying either concept (e.g., [7, 8]). On the other hand, in the special case where the items under study are parallel, i.e., designed as replicates to measure a unidimensional construct or attribute, C α can quantify both internal consistency and reliability [2], although in general C α is not necessarily a measure of unidimensionality or homogeneity [4, 8]. In this paper, we consider parallel items; for example, items within the same factor could be considered parallel for a unidimensional construct. In this sense, the items of the HRSD are not parallel, since it measures depression, a multidimensional construct with many factors.

The Cronbach alpha by mathematical definition is an adjusted proportion of total variance of the item scores explained by the sum of covariances between item scores, and thus ranges between 0 and 1 if all covariance elements are non-negative. Specifically, for an instrument with k items with a general covariance matrix Σ among the item scores, C α is defined as
$$C_{\alpha}=\frac{k}{k-1}\left(\frac{\mathbf{1}^{T}\boldsymbol{\Sigma}\mathbf{1}-trace\left(\boldsymbol{\Sigma}\right)}{\mathbf{1}^{T}\boldsymbol{\Sigma}\mathbf{1}}\right)=\frac{k}{k-1}\left(1-\frac{trace\left(\boldsymbol{\Sigma}\right)}{\mathbf{1}^{T}\boldsymbol{\Sigma}\mathbf{1}}\right),$$
(1)

where trace(·) is the sum of the diagonal elements of a square matrix, 1 is a column vector of k unit elements, and 1 T is the transpose of 1. This quantification is based on the notion that the relative magnitudes of the covariances between item scores, compared with those of the corresponding variances, serve as a measure of the similarity of the items. Consequently, items with higher C α are preferred for measuring the target outcome. However, C α is a lower bound for reliability and is not equal to reliability unless the items are parallel or essentially τ-equivalent [3, 8]. The sum of the instrument items serves as a scale for the outcome and is used for statistical inference, including testing statistical hypotheses. At the design stage of clinical trials, information about the magnitude of reliability or internal consistency of the developed parallel items is crucial for power analysis and sample size determination. Nonetheless, power functions based on C α have been lacking for various study designs.
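For readers who wish to verify equation (1) numerically, the following Python sketch (hypothetical values; not code from the paper) computes C α directly from an item-score covariance matrix:

```python
import numpy as np

def cronbach_alpha(cov: np.ndarray) -> float:
    """Cronbach alpha from a k-by-k item-score covariance matrix, per equation (1)."""
    k = cov.shape[0]
    total = float(np.ones(k) @ cov @ np.ones(k))   # 1^T Sigma 1, the scale-score variance
    return k / (k - 1) * (1.0 - np.trace(cov) / total)

# Compound-symmetry example: k = 5 items, unit variances, inter-item correlation 0.4
k, rho = 5, 0.4
Sigma = (1 - rho) * np.eye(k) + rho * np.ones((k, k))
alpha = cronbach_alpha(Sigma)   # equals k*rho / (1 + rho*(k-1)) = 2.0/2.6 here
```

For this compound-symmetry matrix the result matches the closed form kρ/{1 + ρ(k − 1)} derived later for parallel items.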

In this paper, to derive closed-form power functions, we formulate a statistical model for parallel items that relates the item scores to a measurement error problem. Under this model, C α (1) is explicitly expressed in terms of an inter-item correlation. We examine the relationship among C α , a test-retest correlation, and the reliability of scale scores, which enables testing the significance of C α through the Fisher z-transformation. We explicitly express statistical power as a function of C α for the following comparisons: one-sample comparison of pre- and post-treatment mean differences, two-sample comparison of pre-post mean differences between groups, and two-sample comparison of mean differences between groups. Simulation study results compare the derived theoretical power with empirical power; discussion and conclusions follow.

Methods

Statistical model

We consider the following model for the item score Y ij on the j-th parallel item for the i-th subject:
$$Y_{ij}=\mu_{i}+e_{ij}$$
(2)

The parameter μ i represents the “true score” of the target (outcome) construct for the i-th subject. At the population level, its expectation and variance are assumed to be \(E(\mu_i)=\mu\) and \(Var(\mu_i)=\sigma_{\mu}^{2}\), which we call the true score variance. The error term \(e_{ij}\) represents the deviation of the item score Y ij from the true score μ i , i.e., \(e_{ij}\) is the measurement error of Y ij . The expectation of \(e_{ij}\) is assumed to be \(E(e_{ij})=0\) for all subjects, an unbiasedness assumption, so that E j (Y ij ) = μ i and E i E j (Y ij ) = E(μ i ) = μ, where E j denotes the expectation over j. It is also assumed that \(Var(e_{ij})=\sigma_{e}^{2}\), which we call the measurement error variance. We further assume the following: μ i and e ij are mutually independent, i.e., μ i ⊥ e ij ; and the e ij are independent within a given subject, i.e., conditional independence, e ij ⊥ e ij′ | μ i for j ≠ j′. Note that this conditional independence does not imply marginal independence between Y ij and Y ij′ . In short, model (2) is a mixed-effects linear model for data with a two-level structure, in which repeated item scores are nested within individuals.

Under these assumptions, we have \(Var(Y_{ij})\equiv\sigma^{2}=\sigma_{\mu}^{2}+\sigma_{e}^{2}\); that is, the total variance of the item scores is the sum of the true score variance and the measurement error variance. The inter-item (score) covariance is \(Cov(Y_{ij},Y_{ij'})=\sigma_{\mu}^{2}\) for j ≠ j′. Therefore, the diagonal elements of the covariance matrix Σ under model (2) are identical, and so are the off-diagonal elements. This compound symmetry covariance structure, also known as essential τ-equivalence, is the covariance matrix of parallel items, each of which targets the underlying true score of a unidimensional construct. Furthermore, the compound symmetry structure can be regarded as the covariance matrix of “standardized” item scores that originally had unequal variances and covariances. The inter-item (score) correlation, denoted here by ρ, accordingly follows as
$$Corr\left(Y_{ij},Y_{ij'}\right)\equiv\rho=\frac{\sigma_{\mu}^{2}}{\sigma^{2}}=\frac{\sigma_{\mu}^{2}}{\sigma_{\mu}^{2}+\sigma_{e}^{2}}.$$
(3)

Although item scores are correlated within subjects, they are independent between subjects. Note that this inter-item correlation is not necessarily equal to item-score reliability that quantifies a correlation between true and observed scores.
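As a sanity check on model (2) and equation (3), a small Monte Carlo sketch in Python (hypothetical parameter values, not from the paper) generates parallel item scores and recovers the implied inter-item correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_mu, sigma_e, k, n = 1.0, 1.0, 4, 200_000   # hypothetical values

mu_i = 5.0 + sigma_mu * rng.standard_normal((n, 1))   # true scores, one per subject
Y = mu_i + sigma_e * rng.standard_normal((n, k))      # item scores under model (2)

rho_hat = float(np.corrcoef(Y[:, 0], Y[:, 1])[0, 1])
rho_theory = sigma_mu**2 / (sigma_mu**2 + sigma_e**2)  # equation (3): 0.5 here
```

With equal true score and error variances, equation (3) gives ρ = 0.5, and the sample correlation between any two items reproduces it up to Monte Carlo error.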

In this paper, we assume that the true score variance \(\sigma_{\mu}^{2}\), rather than the total variance σ2, is fixed at the population level and does not depend on the item scores of the subjects. Stated differently, the total variance σ2 varies only through \(\sigma_{e}^{2}\), which depends on the item scores, and thus σ2 is assumed to be an increasing function of the measurement errors of the item scores alone. We call this the fixed true score variance assumption; it is crucial and reasonable from the perspective of measurement error theory. It is crucial because it makes the total variance a function of only the measurement error variance, and it is reasonable because, at the population level, the true score variance should not vary, whereas the magnitude of the measurement error variance depends on the reliability of the items. Consequently, the true score variance \(\sigma_{\mu}^{2}\) is not a function of the inter-item correlation ρ, but the measurement error variance \(\sigma_{e}^{2}\) is a decreasing function of ρ, since from equation (3) we have
$$\sigma_{e}^{2}=\left(1-\rho\right)\sigma^{2}=\left(1/\rho-1\right)\sigma_{\mu}^{2}.$$
(4)
It follows that, as the item scores within subjects become closer or more similar to one another, the measurement errors become smaller; hence the total variance is also a decreasing function of ρ, since
$$\sigma^{2}=\sigma_{\mu}^{2}+\sigma_{e}^{2}=\sigma_{\mu}^{2}/\rho.$$
(5)

We assume that the magnitudes of both \(\sigma_{e}^{2}\) and \(\sigma_{\mu}^{2}\), and hence that of σ2, are known, for the purpose of deriving power functions based on normal distributions rather than t-distributions; replacement by t-distributions is straightforward and makes little difference in the results for sizable sample sizes.

Cronbach alpha, scale score and its variance

We assume that there are k items in an instrument, i.e., j =1, 2, …, k. The C α (1) of k items under model (2) and aforementioned assumptions can be expressed as
$$C_{\alpha}=\frac{k\sigma_{\mu}^{2}}{\sigma_{e}^{2}+k\sigma_{\mu}^{2}}=\frac{k\rho}{1+\rho\left(k-1\right)}.$$
(6)
This follows from the fact that \(\boldsymbol{\Sigma}=\sigma_{e}^{2}\mathbf{I}+\sigma_{\mu}^{2}\mathbf{1}\mathbf{1}^{T}\) under model (2), where I is the k-by-k identity matrix. C α in equation (6) is an increasing function of both ρ and k, as depicted in Fig. 1. Therefore, the number of items must be held fixed when comparing the C α of several candidate sets of items. It follows that, for a fixed number of items, higher C α is associated with smaller measurement error of the items through a higher inter-item correlation ρ. From equation (6), ρ can be expressed in terms of C α as follows:
Fig. 1

Relationship between Cronbach alpha (C α ) and inter-item correlation (ρ) over varying number of items (k)

$$\rho=\frac{C_{\alpha}}{k-C_{\alpha}\left(k-1\right)}.$$
(7)

Of note, the corresponding correlation matrix is \(\mathbf{P}=\left(1-\rho\right)\mathbf{I}+\rho\mathbf{1}\mathbf{1}^{T}\), an equi-correlation matrix.
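Equations (6) and (7) are inverses of each other for a fixed k; a brief Python sketch (hypothetical values) makes the conversion explicit and illustrates the monotonicity in k shown in Fig. 1:

```python
def alpha_from_rho(rho: float, k: int) -> float:
    """Equation (6): Cronbach alpha of k parallel items with inter-item correlation rho."""
    return k * rho / (1 + rho * (k - 1))

def rho_from_alpha(alpha: float, k: int) -> float:
    """Equation (7): inter-item correlation implied by Cronbach alpha for k items."""
    return alpha / (k - alpha * (k - 1))

a10 = alpha_from_rho(0.3, k=10)   # 3/3.7, about 0.81
a20 = alpha_from_rho(0.3, k=20)   # larger: alpha increases with k at fixed rho
```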

The k correlated items are often summed into a scale intended to measure the target construct. The scale score is denoted here by
$$S_{i}=\sum_{j=1}^{k}Y_{ij},$$
which can be viewed as an observed summary score for the i-th subject. Suppressing the subscript i in S i , its mean and variance can be obtained as follows:
$${E}_j(S)=k{\mu}_i,$$
(8)
and
$$Var(S)=k{\sigma}^2\left\{1+\rho \left(k-1\right)\right\}.$$
(9)
With respect to the mean (8), the average scale score S i /k, when used as the observed score, is an unbiased estimate of the true score μ i for the i-th subject. The reliability, denoted here by R and defined as the squared correlation between the true and observed scores, can be obtained as follows:
$$R=Corr^{2}\left(S_{i}/k,\mu_{i}\right)=\frac{k\rho}{1+\rho\left(k-1\right)}=C_{\alpha}.$$
(10)

This equation supports Theorem 3.1 of Novick and Lewis [9], namely that R = C α if and only if the items are parallel. Since statistical analysis results do not depend on whether S i /k or S i is used, we use the sum S i in what follows.
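The identity R = C α in equation (10) can likewise be checked by simulation; in this Python sketch (hypothetical values), the squared correlation between the average scale score and the true score recovers C α :

```python
import numpy as np

rng = np.random.default_rng(1)
k, rho, n, sigma_mu = 6, 0.4, 300_000, 1.0       # hypothetical values
sigma_e = np.sqrt(1 / rho - 1) * sigma_mu        # equation (4)

mu_i = sigma_mu * rng.standard_normal((n, 1))    # true scores
S = (mu_i + sigma_e * rng.standard_normal((n, k))).sum(axis=1)   # scale scores

R_hat = float(np.corrcoef(S / k, mu_i.ravel())[0, 1]) ** 2
alpha = k * rho / (1 + rho * (k - 1))            # equation (6): 0.8 here
```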

With respect to the variance (9), if the total variance, rather than the true score variance, is assumed to be fixed, then Var(S) is an increasing function of ρ, which conforms to the elementary result that the variance of a sum of correlated variables increases with increasing correlation. On the contrary, under the fixed true score variance assumption, Var(S) is a decreasing function of ρ, since equation (9) can be re-expressed in terms of \(\sigma_{\mu}^{2}\) via equation (5) as follows:
$$Var\left(S\right)=k\sigma_{\mu}^{2}\left(1/\rho+k-1\right)=k^{2}\sigma_{\mu}^{2}/C_{\alpha}.$$
(11)

The last equality follows from equation (7). It follows that Var(S) is also a decreasing function of C α . In sum, an increase in ρ decreases the magnitude of σ2, which in turn decreases the magnitude of Var(S); this indirect decreasing effect of ρ on Var(S) outweighs the direct increasing effect of ρ on Var(S) in equation (9).
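Equation (11) can be tabulated directly; the short Python sketch below (hypothetical values) shows Var(S) shrinking as ρ grows under the fixed true score variance assumption:

```python
def var_scale(rho: float, k: int, sigma_mu2: float = 1.0) -> float:
    """Equation (11): Var(S) under the fixed true score variance assumption."""
    return k * sigma_mu2 * (1 / rho + k - 1)

# For k = 5 and unit true score variance, Var(S) decreases as rho increases
vals = [var_scale(r, k=5) for r in (0.2, 0.4, 0.6, 0.8)]
```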

Cronbach alpha and test-retest correlation

The reliability R of instruments is sometimes evaluated by a test-retest correlation [3]. Based on model (2), the test and retest item scores can be specified as \(Y_{ij}^{test}=\mu_{i}+e_{ij}\) and \(Y_{ij}^{retest}=\mu_{i}+e'_{ij}\), respectively, where the retest errors \(e'_{ij}\) are independent copies of the \(e_{ij}\) and the true score μ i is common to the test and retest scores of each subject, i = 1, 2,…, N. The test-retest correlation can then be measured by the correlation, denoted by Corr(S test , S retest ), between the scale scores \(S_{test}=\sum_{j=1}^{k}Y_{ij}^{test}\) and \(S_{retest}=\sum_{j=1}^{k}Y_{ij}^{retest}\). Under the aforementioned assumptions for model (2), it can be shown that
$$Cov\left({S}_{test},\kern0.5em {S}_{retest}\right)={k}^2\rho {\sigma}^2,$$
(12)
and from equation (9)
$$Var\left({S}_{test}\right)=Var\left({S}_{retest}\right)=k{\sigma}^2\left\{1+\rho \left(k-1\right)\right\}.$$
(13)
It follows that:
$$Corr\left(S_{test},S_{retest}\right)=\frac{Cov\left(S_{test},S_{retest}\right)}{\sqrt{Var\left(S_{test}\right)}\sqrt{Var\left(S_{retest}\right)}}=\frac{k\rho}{1+\rho\left(k-1\right)}=R=C_{\alpha}.$$
(14)

This equation shows that the test-retest correlation is the same as both C α and R, by equations (6) and (10), which provides another interpretation of C α . This property is especially useful when only a single item is available, in which case estimation of C α or ρ is impossible by definition. However, the test and retest scores can be regarded as two correlated parallel item scores, and thus their correlation can serve as the C α of the single item. This is particularly fitting since ρ = C α  = R by equation (6), (7), or (14) when k = 1.

Taken together, the power \(\varphi_{C_{\alpha}}\) of testing the significance of C α against a null value of zero should be equivalent to that of testing the significance of a correlation using Fisher's z-transformation, as long as the items are parallel, that is,
$$\varphi_{C_{\alpha}}=1-\Phi\left[\Phi^{-1}\left(1-\alpha/2\right)-\sqrt{N-3}\left(\frac{1}{2}\ln\left(\frac{1+C_{\alpha}}{1-C_{\alpha}}\right)+\frac{C_{\alpha}}{2\left(N-1\right)}\right)\right]$$
for a two-tailed significance level α, where Φ is the cumulative distribution function of the standard normal distribution and Φ−1 is its inverse, i.e., Φ(Φ−1(x)) = Φ−1(Φ(x)) = x. We note that, although the probability under the opposite rejection region should be added for strict unbiasedness of the test under the null hypothesis, it will be ignored for all test statistics considered herein. For general covariance structures with non-parallel items, however, many other tests for the significance of reliability and C α have been developed [10–17].
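The displayed power function can be implemented in a few lines; this Python sketch uses only the standard library and treats the sample size and C α as hypothetical inputs:

```python
from math import log, sqrt
from statistics import NormalDist

def power_cronbach(c_alpha: float, N: int, sig: float = 0.05) -> float:
    """Power of testing C_alpha against zero via Fisher's z-transformation,
    per the displayed formula; the opposite rejection region is ignored."""
    nd = NormalDist()
    shift = sqrt(N - 3) * (0.5 * log((1 + c_alpha) / (1 - c_alpha))
                           + c_alpha / (2 * (N - 1)))
    return 1 - nd.cdf(nd.inv_cdf(1 - sig / 2) - shift)
```

As expected, the power grows with both N and C α .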

Pre-post comparison

We consider application of a paired t-test to the comparison of within-group means of scale scores between pre- and post-intervention. Based on model (2), the pre- and post-intervention item scores can be specified as \(Y_{ij}^{pre}=\mu_{i}+e_{ij}\) and \(Y_{ij}^{post}=\mu_{i}+\delta_{PP}+e'_{ij}\), respectively, where the post-intervention errors \(e'_{ij}\) are independent copies of the \(e_{ij}\); the mean of the post-intervention item scores is shifted by δ PP , the intervention effect. Consequently, we have
$$E\left(S_{post}\right)-E\left(S_{pre}\right)=k\delta_{PP},$$
(15)
where \(S_{pre}=\sum_{j=1}^{k}Y_{ij}^{pre}\) and \(S_{post}=\sum_{j=1}^{k}Y_{ij}^{post}\) are the pre- and post-intervention scale scores, respectively. A moment estimate of δ PP from (15) is
$$\widehat{\delta}_{PP}=\left(\overline{S}_{post}-\overline{S}_{pre}\right)/k,$$
(16)
where \(\overline{S}=\sum_{i=1}^{N}\sum_{j=1}^{k}Y_{ij}/N\) and N is the total number of subjects. Its variance is
$$Var\left(\widehat{\delta}_{PP}\right)=\frac{2\left(1-\rho\right)\sigma^{2}}{kN}=\frac{2\left(1/\rho-1\right)\sigma_{\mu}^{2}}{kN}.$$
(17)
This is because, from equations (12) and (13), we have
$$\begin{aligned}Var\left(\overline{S}_{post}-\overline{S}_{pre}\right)&=Var\left(\overline{S}_{post}\right)+Var\left(\overline{S}_{pre}\right)-2Cov\left(\overline{S}_{post},\overline{S}_{pre}\right)\\&=k\sigma^{2}\left\{1+\rho\left(k-1\right)\right\}/N+k\sigma^{2}\left\{1+\rho\left(k-1\right)\right\}/N-2k^{2}\rho\sigma^{2}/N\\&=2k\sigma^{2}\left(1-\rho\right)/N=2k\sigma_{\mu}^{2}\left(1/\rho-1\right)/N.\end{aligned}$$
The following test statistic can then be used for testing H0: δ PP  = 0:
$$T_{PP}=\frac{\widehat{\delta}_{PP}}{\sqrt{Var\left(\widehat{\delta}_{PP}\right)}}=\frac{\sqrt{kN}\,\widehat{\delta}_{PP}}{\sigma_{\mu}\sqrt{2\left(1/\rho-1\right)}}=\frac{\sqrt{N}\left(\overline{S}_{post}-\overline{S}_{pre}\right)}{\sigma_{\mu}\sqrt{2k\left(1/\rho-1\right)}}.$$
(18)
Now, the statistical power φ PP of T PP for detecting a non-zero δ PP can be expressed as follows:
$$\varphi_{PP}=\Phi\left\{\left|\delta_{PP}/\sigma_{\mu}\right|\sqrt{\frac{kN}{2\left(1/\rho-1\right)}}-\Phi^{-1}\left(1-\alpha/2\right)\right\}.$$
(19)
This statistical power is an increasing function of ρ for a fixed σ μ , which we assume. It follows that the power is also an increasing function of C α , as shown next. When δ PP is standardized by σ μ and ρ is replaced using equation (7), equation (19) can further be expressed in terms of \(\Delta_{PP}=\delta_{PP}/\sigma_{\mu}\) and C α as follows:
$$\varphi_{PP}=\Phi\left\{\left|\Delta_{PP}\right|\sqrt{\frac{N}{2\left(1/C_{\alpha}-1\right)}}-\Phi^{-1}\left(1-\alpha/2\right)\right\}.$$
(20)

This power function is seen to be independent of k, the number of items. Stated differently, the power will be the same for two instruments with different numbers of items as long as their C α ’s are the same, even though the inter-item correlation must be higher for the instrument with fewer items.
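Equation (20) translates directly into code; note that the function below (a Python sketch with hypothetical inputs) takes no k argument, reflecting that the power depends on the items only through C α :

```python
from math import sqrt
from statistics import NormalDist

def power_prepost(delta_std: float, N: int, c_alpha: float, sig: float = 0.05) -> float:
    """Equation (20): power of the pre-post comparison in terms of the
    standardized effect Delta_PP = delta_PP / sigma_mu and Cronbach alpha."""
    nd = NormalDist()
    return nd.cdf(abs(delta_std) * sqrt(N / (2 * (1 / c_alpha - 1)))
                  - nd.inv_cdf(1 - sig / 2))

p = power_prepost(delta_std=0.3, N=50, c_alpha=0.8)   # about 0.85
```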

When sample size determination is needed for a study using an instrument with any number of items and a known C α , N can be determined for a desired statistical power φ, typically 80 %, from equation (20) as follows:
$$N=\frac{2\left(1/C_{\alpha}-1\right)z_{\alpha,\varphi}^{2}}{\Delta_{PP}^{2}},$$
(21)
where
$$z_{\alpha,\varphi}=\Phi^{-1}\left(1-\alpha/2\right)+\Phi^{-1}\left(\varphi\right).$$
(22)
The sample size (21) is a decreasing function of both C α and Δ PP . In the possibly rare case in which the number of items with known inter-item correlation needs to be determined for the development of an instrument, it must be obtained from equation (19), rather than equation (20), as follows:
$$k=\frac{2\left(1/\rho-1\right)z_{\alpha,\varphi}^{2}}{N\Delta_{PP}^{2}}.$$
(23)
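The sample size formula (21)-(22) is equally simple to compute; this Python sketch (hypothetical effect size and alpha values) illustrates how a more reliable instrument lowers the required N:

```python
from math import ceil
from statistics import NormalDist

def n_prepost(delta_std: float, c_alpha: float, power: float = 0.80,
              sig: float = 0.05) -> int:
    """Equations (21)-(22): subjects needed for the pre-post comparison."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - sig / 2) + nd.inv_cdf(power)           # equation (22)
    return ceil(2 * (1 / c_alpha - 1) * z**2 / delta_std**2)  # equation (21)

n_lo = n_prepost(0.3, c_alpha=0.7)   # 75 subjects
n_hi = n_prepost(0.3, c_alpha=0.9)   # 20 subjects
```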

Comparison of within-group effects between groups

In clinical trials, it is often of interest to compare within-group changes between groups. For instance, a trial can be designed to compare the pre-post effect of an experimental treatment between treatment and control groups, that is, an interaction effect between group and time point. Based on model (2), the pre- and post-intervention item scores can be specified as \(Y_{ij}^{pre(0)}=\mu_{i}^{(0)}+e_{ij}\) and \(Y_{ij}^{post(0)}=\mu_{i}^{(0)}+\delta_{0}+e'_{ij}\) for the control group, and \(Y_{ij}^{pre(1)}=\mu_{i}^{(1)}+e_{ij}\) and \(Y_{ij}^{post(1)}=\mu_{i}^{(1)}+\delta_{1}+e'_{ij}\) for the treatment group, where the post-intervention errors \(e'_{ij}\) are independent copies of the \(e_{ij}\). The primary interest is testing H0: δ BW  = δ 1 − δ 0 = 0, i.e., whether or not the pre-post differences are the same in the two groups. Consequently, we have
$$E\left\{{D}_{trt}(S)\right\}-E\left\{{D}_{control}(S)\right\}=k{\delta}_{BW},$$
(24)
where \(D_{trt}(S)=S_{post(1)}-S_{pre(1)}=\sum_{j=1}^{k}Y_{ij}^{post(1)}-\sum_{j=1}^{k}Y_{ij}^{pre(1)}\) and \(D_{control}(S)\) is defined similarly. A moment estimate of δ BW from (24) is
$$\widehat{\delta}_{BW}=\left(\overline{D}_{trt}-\overline{D}_{control}\right)/k,$$
(25)
where N is the number of subjects per group, \({\overline{D}}_{trt}\equiv {\overline{D}}_{trt}\left(S\right)={\overline{S}}_{post\left(1\right)}-{\overline{S}}_{pre\left(1\right)}={\sum}_{i=1}^{N}{\sum}_{j=1}^{k}{Y}_{ij}^{post\left(1\right)}/N-{\sum}_{i=1}^{N}{\sum}_{j=1}^{k}{Y}_{ij}^{pre\left(1\right)}/N\), and \({\overline{D}}_{control}\) can similarly be defined. The variance of \({\widehat{\delta}}_{BW}\) is
$$\mathit{\mathsf{V}}\mathit{\mathsf{a}}\mathit{\mathsf{r}}\left({\widehat{\delta}}_{\mathit{\mathsf{B}}\mathit{\mathsf{W}}}\right)=\frac{\mathsf{4}\left(\mathsf{1}-\rho \right){\sigma}^{\mathsf{2}}}{\mathit{\mathsf{k}}\mathit{\mathsf{N}}}=\frac{\mathsf{4}\left(\mathsf{1}/\rho -\mathsf{1}\right){\sigma}_{\mu}^{\mathsf{2}}}{\mathit{\mathsf{k}}\mathit{\mathsf{N}}}.$$
(26)
Therefore, the following test statistic can be used for testing the null hypothesis Ho: δ BW  = 0,
$${\mathit{\mathsf{T}}}_{\mathit{\mathsf{B}}\mathit{\mathsf{W}}}=\frac{{\widehat{\delta}}_{\mathit{\mathsf{B}}\mathit{\mathsf{W}}}}{\sqrt{\mathit{\mathsf{V}}\mathit{\mathsf{a}}\mathit{\mathsf{r}}\left({\widehat{\delta}}_{\mathit{\mathsf{B}}\mathit{\mathsf{W}}}\right)}}=\frac{\sqrt{\mathit{\mathsf{k}}\mathit{\mathsf{N}}}{\widehat{\delta}}_{\mathit{\mathsf{B}}\mathit{\mathsf{W}}}}{\mathsf{2}{\sigma}_{\mu}\sqrt{\left(\mathsf{1}/\rho -\mathsf{1}\right)}}=\frac{\sqrt{\mathit{\mathsf{N}}}\left({\overline{\mathit{\mathsf{D}}}}_{\mathit{\mathsf{t}}\mathit{\mathsf{r}}\mathit{\mathsf{t}}}-{\overline{\mathit{\mathsf{D}}}}_{\mathit{\mathsf{c}}\mathit{\mathsf{o}}n\mathit{\mathsf{t}\mathsf{rol}}}\right)}{\mathsf{2}{\sigma}_{\mu}\sqrt{\mathit{\mathsf{k}}\left(\mathsf{1}/\rho -\mathsf{1}\right)}}.$$
(27)
The statistical power φ BW of T BW for detecting non-zero δ BW can thus be expressed as follows:
$${\varphi}_{\mathit{\mathsf{B}}\mathit{\mathsf{W}}}=\varPhi \left\{\left|{\delta}_{\mathit{\mathsf{B}}\mathit{\mathsf{W}}}/{\sigma}_{\mu}\right|\sqrt{\frac{\mathit{\mathsf{k}}\mathit{\mathsf{N}}}{\mathsf{4}\left(\mathsf{1}/\rho -\mathsf{1}\right)}}-{\varPhi}^{-\mathsf{1}}\left(\mathsf{1}-\alpha /\mathsf{2}\right)\right\}.$$
(28)
Again, this statistical power is an increasing function of ρ, and of C α as well, as seen next. When δ BW is standardized by σ μ and ρ is replaced using equation (7), equation (28) can further be expressed in terms of \({\varDelta}_{BW}={\delta}_{BW}/{\sigma}_{\mu }\) and C α as follows:
$${\varphi}_{\mathit{\mathsf{B}}\mathit{\mathsf{W}}}=\varPhi \left\{\left|{\varDelta}_{\mathit{\mathsf{B}}\mathit{\mathsf{W}}}\right|\sqrt{\frac{\mathit{\mathsf{N}}}{\mathsf{4}\left(\mathsf{1}/{\mathit{\mathsf{C}}}_{\alpha }-\mathsf{1}\right)}}-{\varPhi}^{-\mathsf{1}}\left(\mathsf{1}-\alpha /\mathsf{2}\right)\right\}.$$
(29)

Again, this power function is seen to be independent of k, the number of items.
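As a numerical illustration, the power function (29) can be evaluated directly. The sketch below is in Python (our choice for illustration; the simulations reported later used SAS) and reproduces, for example, the theoretical power for N = 30 per group, Δ BW  = 0.4, and C α  = 0.5:

```python
from math import sqrt
from statistics import NormalDist

def power_bw(delta_bw, n_per_group, cron_alpha, alpha=0.05):
    """Power of the between-group within-group test, equation (29).

    delta_bw: standardized effect size Delta_BW = delta_BW / sigma_mu
    n_per_group: number of subjects per group (N)
    cron_alpha: Cronbach alpha (C_alpha)
    """
    z = NormalDist()  # standard normal distribution
    crit = z.inv_cdf(1 - alpha / 2)  # Phi^{-1}(1 - alpha/2)
    return z.cdf(abs(delta_bw) * sqrt(n_per_group / (4 * (1 / cron_alpha - 1))) - crit)

print(round(power_bw(0.4, 30, 0.5), 3))  # 0.194
```

Note that k does not appear in the function's arguments, mirroring the observation below that the power is independent of the number of items for a given C α .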

Sample size for a desired statistical power φ can be determined from (29) as follows:
$$\mathit{\mathsf{N}}=\frac{\mathsf{4}\left(\mathsf{1}/{\mathit{\mathsf{C}}}_{\alpha }-\mathsf{1}\right){\mathit{\mathsf{z}}}_{\alpha, \varphi}^{\mathsf{2}}}{\varDelta_{\mathit{\mathsf{B}}\mathit{\mathsf{W}}}^{\mathsf{2}}}.$$
(30)
Again, the sample size (30) is seen to be a decreasing function of both C α and Δ BW . When the number of items needed for the development of an instrument is of interest, it can be determined from equation (28) as follows:
$$k=\frac{4\left(1/\rho -1\right){z}_{\alpha, \varphi}^{2}}{N{\varDelta}_{BW}^{2}}.$$
(31)
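The sample size formula (30), and the item count obtained by solving the power function (28) for k, can be sketched as follows (Python, our illustration). Here z α,φ is taken as z 1−α/2  + z φ , the usual normal-approximation convention, and ceilings are applied since N and k are integers:

```python
from math import ceil
from statistics import NormalDist

def _z_alpha_phi(power, alpha):
    z = NormalDist()
    return z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)  # z_{alpha, phi}

def n_per_group_bw(delta_bw, cron_alpha, power=0.80, alpha=0.05):
    """Per-group sample size for the between-within test, equation (30)."""
    z_ap = _z_alpha_phi(power, alpha)
    return ceil(4 * (1 / cron_alpha - 1) * z_ap**2 / delta_bw**2)

def items_bw(delta_bw, rho, n_per_group, power=0.80, alpha=0.05):
    """Number of items k, obtained by solving the power function (28) for k."""
    z_ap = _z_alpha_phi(power, alpha)
    return ceil(4 * (1 / rho - 1) * z_ap**2 / (n_per_group * delta_bw**2))

print(n_per_group_bw(0.4, 0.5))  # 197 subjects per group
```

As a consistency check, with ρ = 1/6 (which corresponds to C α  = 0.5 for k = 5) and N = 197, `items_bw` returns 5 items.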

Two-sample between-group comparison

Comparison of means between groups measured with an instrument is widely performed in clinical trials. Based on model (2), the item scores from the control and treatment groups can be specified as \({Y}_{ij}^{\left(0\right)}={\mu}_{i}+{e}_{ij}\) and \({Y}_{ij}^{\left(1\right)}={\mu}_{i}+{\delta}_{TS}+{e}_{ij}\), respectively. The primary interest will be testing Ho: δ TS  = 0, i.e., whether or not the means are the same between the two groups. Under this formulation, we have
$$E\left({S}_{trt}\right)=E\left({S}_{control}\right)+k{\delta}_{TS},$$
(32)
where \({S}_{trt}={\sum}_{j=1}^{k}{Y}_{ij}^{\left(1\right)}\) and \({S}_{control}={\sum}_{j=1}^{k}{Y}_{ij}^{\left(0\right)}\) represent the scale scores under the treatment and control groups, respectively. A moment estimate of δ TS can be obtained from (32) as
$${\widehat{\delta}}_{\mathit{\mathsf{T}}\mathit{\mathsf{S}}}=\left({\overline{\mathit{\mathsf{S}}}}_{\mathit{\mathsf{t}}\mathit{\mathsf{r}}\mathit{\mathsf{t}}}-{\overline{\mathit{\mathsf{S}}}}_{\mathit{\mathsf{c}}\mathit{\mathsf{o}}n\mathit{\mathsf{t}\mathsf{rol}}}\right)/\mathit{\mathsf{k}},$$
(33)
where \({\overline{\mathit{\mathsf{S}}}}_{\mathit{\mathsf{t}}\mathit{\mathsf{r}}\mathit{\mathsf{t}}}={\displaystyle {\sum}_{\mathit{\mathsf{i}}=\mathsf{1}}^{\mathit{\mathsf{N}}}{\displaystyle {\sum}_{\mathit{\mathsf{j}}=\mathsf{1}}^{\mathit{\mathsf{k}}}{\mathit{\mathsf{Y}}}_{\mathit{\mathsf{i}}\mathit{\mathsf{j}}}^{\left(\mathsf{1}\right)}}}/\mathit{\mathsf{N}}\), \({\overline{\mathit{\mathsf{S}}}}_{\mathit{\mathsf{c}}\mathit{\mathsf{o}}n\mathit{\mathsf{trol}}}={\displaystyle {\sum}_{\mathit{\mathsf{i}}=\mathsf{1}}^{\mathit{\mathsf{N}}}{\displaystyle {\sum}_{\mathit{\mathsf{j}}=\mathsf{1}}^{\mathit{\mathsf{k}}}{\mathit{\mathsf{Y}}}_{\mathit{\mathsf{i}}\mathit{\mathsf{j}}}^{\left(\mathsf{0}\right)}}}/\mathit{\mathsf{N}}\) and N is the number of participants per group. The variance of \({\widehat{\delta}}_{\mathit{\mathsf{T}}\mathit{\mathsf{S}}}\) can be obtained as
$$\mathit{\mathsf{V}}\mathit{\mathsf{a}}\mathit{\mathsf{r}}\left({\widehat{\delta}}_{\mathit{\mathsf{T}}\mathit{\mathsf{S}}}\right)=\frac{\mathsf{2}\left\{\mathsf{1}+\rho \left(\mathit{\mathsf{k}}-\mathsf{1}\right)\right\}{\sigma}^{\mathsf{2}}}{\mathit{\mathsf{k}}\mathit{\mathsf{N}}}=\frac{\mathsf{2}\left\{\mathsf{1}/\rho +\mathit{\mathsf{k}}-\mathsf{1}\right\}{\sigma}_{\mu}^{\mathsf{2}}}{\mathit{\mathsf{k}}\mathit{\mathsf{N}}}.$$
(34)
The corresponding test statistic T TS can be built as
$${\mathit{\mathsf{T}}}_{\mathit{\mathsf{T}}\mathit{\mathsf{S}}}=\frac{{\widehat{\delta}}_{\mathit{\mathsf{T}}\mathit{\mathsf{S}}}}{\sqrt{\mathit{\mathsf{V}}\mathit{\mathsf{a}}\mathit{\mathsf{r}}\left({\widehat{\delta}}_{\mathit{\mathsf{T}}\mathit{\mathsf{S}}}\right)}}=\frac{\sqrt{\mathit{\mathsf{k}}\mathit{\mathsf{N}}}{\widehat{\delta}}_{\mathit{\mathsf{T}}\mathit{\mathsf{S}}}}{\sigma_{\mu}\sqrt{\mathsf{2}\left(\mathsf{1}/\rho +\mathit{\mathsf{k}}-\mathsf{1}\right)}}=\frac{\sqrt{\mathit{\mathsf{N}}}\left({\overline{\mathit{\mathsf{S}}}}_{\mathit{\mathsf{t}}\mathit{\mathsf{r}}\mathit{\mathsf{t}}}-{\overline{\mathit{\mathsf{S}}}}_{\mathit{\mathsf{c}}\mathit{\mathsf{o}}n\mathit{\mathsf{t}\mathsf{rol}}}\right)}{\sigma_{\mu}\sqrt{\mathsf{2}\mathit{\mathsf{k}}\left(\mathsf{1}/\rho +\mathit{\mathsf{k}}-\mathsf{1}\right)}}.$$
(35)
The power function φ TS of T TS can then be expressed as
$${\varphi}_{\mathit{\mathsf{T}}\mathit{\mathsf{S}}}=\varPhi \left\{\left|{\delta}_{\mathit{\mathsf{T}}\mathit{\mathsf{S}}}/{\sigma}_{\mu}\right|\sqrt{\frac{\mathit{\mathsf{k}}\mathit{\mathsf{N}}}{\mathsf{2}\left(\mathsf{1}/\rho +\mathit{\mathsf{k}}-\mathsf{1}\right)}}-{\varPhi}^{-\mathsf{1}}\left(\mathsf{1}-\alpha /\mathsf{2}\right)\right\}.$$
(36)
It should be noted that this statistical power (36) is also an increasing function of ρ, in contrast to situations in which a fixed total variance assumption is more reasonable, where both \({\sigma}_e^{\mathsf{2}}\) and \({\sigma}_{\mu}^{\mathsf{2}}\) are functions of ρ but σ2 is not. For example, observations from clusters, even without measurement errors, are often assumed to be correlated, and the power of between-group tests using such correlated observations is a decreasing function of ρ [18]. Again, when δ TS is standardized by σ μ and ρ is replaced using equation (7), equation (36) can further be expressed in terms of \({\varDelta}_{TS}={\delta}_{TS}/{\sigma}_{\mu }\) and C α as follows:
$${\varphi}_{\mathit{\mathsf{T}}\mathit{\mathsf{S}}}=\varPhi \left\{\left|{\varDelta}_{\mathit{\mathsf{T}}\mathit{\mathsf{S}}}\right|\sqrt{{\mathit{\mathsf{C}}}_{\alpha}\mathit{\mathsf{N}}/\mathsf{2}}-{\varPhi}^{-\mathsf{1}}\left(\mathsf{1}-\alpha /\mathsf{2}\right)\right\}.$$
(37)

Again, this power function is seen to be independent of k, the number of items.
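The two-sample power function (37) is equally simple to evaluate; the sketch below (Python, our illustration) reproduces, for example, the theoretical power for N = 50 per group, Δ TS  = 0.7, and C α  = 0.5:

```python
from math import sqrt
from statistics import NormalDist

def power_ts(delta_ts, n_per_group, cron_alpha, alpha=0.05):
    """Power of the two-sample between-group test, equation (37).

    delta_ts: standardized effect size Delta_TS = delta_TS / sigma_mu
    n_per_group: number of subjects per group (N)
    cron_alpha: Cronbach alpha (C_alpha)
    """
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(abs(delta_ts) * sqrt(cron_alpha * n_per_group / 2) - crit)

print(round(power_ts(0.7, 50, 0.5), 3))  # 0.697
```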

Sample size for a desired statistical power φ can be determined from (37) as follows:
$$N=\frac{2{z}_{\alpha, \varphi}^{2}}{{C}_{\alpha }{\varDelta}_{TS}^{2}}.$$
(38)
Again, the sample size (38) is seen to be a decreasing function of both C α and Δ TS . When the number of items needed for the development of an instrument is of interest, it can be determined from equation (36) as follows:
$$\mathit{\mathsf{k}}=\frac{\mathsf{2}\left(\mathsf{1}/\rho -\mathsf{1}\right){\mathit{\mathsf{z}}}_{\alpha, \varphi}^{\mathsf{2}}/{\varDelta}_{\mathit{\mathsf{T}}\mathit{\mathsf{S}}}^{\mathsf{2}}}{\mathit{\mathsf{N}}-\mathsf{2}{\mathit{\mathsf{z}}}_{\alpha, \varphi}^{\mathsf{2}}/{\varDelta}_{\mathit{\mathsf{T}}\mathit{\mathsf{S}}}^{\mathsf{2}}}.$$
(39)
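The determinations (38) and (39) can be sketched for the two-sample design as follows (Python, our illustration; the numeric example values are illustrative assumptions, and equation (39) shows that no finite k suffices when N is below 2z²α,φ/Δ² TS ):

```python
from math import ceil
from statistics import NormalDist

def _z_alpha_phi(power, alpha):
    z = NormalDist()
    return z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)  # z_{alpha, phi}

def n_per_group_ts(delta_ts, cron_alpha, power=0.80, alpha=0.05):
    """Per-group sample size for the two-sample test, equation (38)."""
    z_ap = _z_alpha_phi(power, alpha)
    return ceil(2 * z_ap**2 / (cron_alpha * delta_ts**2))

def items_ts(delta_ts, rho, n_per_group, power=0.80, alpha=0.05):
    """Number of items k for the two-sample test, equation (39)."""
    z_ap = _z_alpha_phi(power, alpha)
    core = 2 * z_ap**2 / delta_ts**2
    denom = n_per_group - core
    if denom <= 0:
        raise ValueError("N too small: no finite k attains this power")
    return ceil((1 / rho - 1) * core / denom)

print(n_per_group_ts(0.5, 0.7))  # 90 per group
```

For instance, with Δ TS  = 0.5, ρ = 0.3, and N = 100 per group, equation (39) gives k = 4 items for 80 % power.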

Results

To validate equation (14) and the power functions (20), (29), and (37), we conduct a simulation study for each test. For the simulations, the random item scores are generated based on model (2) assuming that both μ i and e ij are normally distributed, although this assumption is not required in general. Under this normality assumption, however, it can be shown that all the moment estimates herein are the maximum likelihood estimates [19]. We then compute scale scores by summing the item scores for each individual.

We fix a two-tailed significance level of α = 0.05 and \({\sigma}_{\mu}^{\mathsf{2}}\) = 1 without loss of generality for all simulations, and determine \({\sigma}_e^{\mathsf{2}}\) and σ2 through the ρ determined by the given k and C α . We randomly generate 1000 data sets for each combination of design parameters, which include the effect size Δ, number of items k, and sample size N. We then compute the empirical power \(\tilde{\varphi}\) by counting the data sets whose two-tailed p-values are smaller than 0.05; that is, \(\tilde{\varphi}={\sum}_{s=1}^{1000}1\left({p}_{s}<\alpha \right)/1000\), where p s represents the two-sided p-value from the s-th simulated data set. For the testing, we applied the corresponding t-tests assuming the variances of the moment estimates are unknown, which is practically reasonable. We used SAS v9.3 for the simulations.
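The simulation procedure can be sketched in a few lines. The version below is in Python rather than the SAS actually used, and substitutes a normal-approximation two-sample test for the t-test, both our simplifications; it checks the two-sample case against equation (37). The inter-item correlation ρ is recovered from C α and k by inverting the relation in equation (7), i.e., ρ = C α /(k − C α (k − 1)):

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance

def simulate_power_ts(delta, cron_alpha, k=5, n_per_group=50, nsim=1000, seed=1):
    """Empirical power of the two-sample between-group comparison under model (2).

    Item scores are Y_ij = mu_i (+ delta for the treatment group) + e_ij with
    sigma_mu^2 fixed at 1, so sigma_e^2 = 1/rho - 1, where the inter-item
    correlation rho is recovered from C_alpha and k.
    """
    rng = random.Random(seed)
    rho = cron_alpha / (k - cron_alpha * (k - 1))  # invert equation (7)
    sigma_e = sqrt(1 / rho - 1)                    # sigma_mu^2 = 1
    crit = NormalDist().inv_cdf(0.975)             # two-tailed alpha = 0.05
    hits = 0
    for _ in range(nsim):
        groups = []
        for shift in (0.0, delta):  # control, then treatment
            scores = []
            for _ in range(n_per_group):
                mu = rng.gauss(0.0, 1.0) + shift  # subject's true score
                scores.append(sum(mu + rng.gauss(0.0, sigma_e) for _ in range(k)))
            groups.append(scores)
        se = sqrt(variance(groups[0]) / n_per_group + variance(groups[1]) / n_per_group)
        if abs(mean(groups[1]) - mean(groups[0])) / se > crit:
            hits += 1
    return hits / nsim

emp = simulate_power_ts(0.7, 0.5)
print(emp)  # should fall near the theoretical value 0.697 from equation (37)
```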

Test-retest correlation

The results are presented in Table 1, which shows that the empirically estimated test-retest correlations (i.e., the average of 1000 estimated Pearson correlations for each set of design parameter specifications) are approximately the same as the pre-assigned C α , regardless of the number of items k and the sample size N, even when N is as small as 30. Therefore, the equality between C α and the test-retest correlation (14) is well validated.
Table 1

Empirical simulation-based estimates of test-retest correlation Corr(S test , S retest ) in equation (14)

| C α | Total N = 30, k = 5 | Total N = 30, k = 10 | Total N = 50, k = 5 | Total N = 50, k = 10 |
|-----|------|------|------|------|
| 0.1 | 0.10 | 0.10 | 0.10 | 0.10 |
| 0.2 | 0.20 | 0.20 | 0.20 | 0.20 |
| 0.3 | 0.30 | 0.29 | 0.30 | 0.30 |
| 0.4 | 0.39 | 0.39 | 0.40 | 0.39 |
| 0.5 | 0.49 | 0.50 | 0.49 | 0.50 |
| 0.6 | 0.59 | 0.59 | 0.60 | 0.60 |
| 0.7 | 0.69 | 0.69 | 0.70 | 0.70 |
| 0.8 | 0.79 | 0.80 | 0.80 | 0.79 |
| 0.9 | 0.90 | 0.90 | 0.90 | 0.90 |

Note: Total N: total number of subjects; C α : Cronbach alpha; k: number of items

Pre-post intervention comparison

Table 2 shows that the theoretical power φ PP (20) is very close to the empirical power \(\tilde{\varphi}_{PP}\) obtained through the simulations. The results validate that the power φ PP increases with increasing C α (or, equivalently, with increasing correlation ρ for a given k) in the pre-post test setting, regardless of sample size N and number of items k. Furthermore, the statistical power does not depend on k for a given C α even though the correlation ρ does.
Table 2

Statistical power of the pre-post test T PP (18): σ μ  = 1

| Total N | Δ PP | C α | φ PP (k = 5) | \(\tilde{\varphi}_{PP}\) (k = 5) | φ PP (k = 10) | \(\tilde{\varphi}_{PP}\) (k = 10) |
|----|-----|-----|-------|-------|-------|-------|
| 30 | 0.4 | 0.5 | 0.341 | 0.337 | 0.341 | 0.310 |
|    |     | 0.6 | 0.475 | 0.459 | 0.475 | 0.458 |
|    |     | 0.7 | 0.658 | 0.626 | 0.658 | 0.649 |
|    |     | 0.8 | 0.873 | 0.849 | 0.873 | 0.830 |
|    |     | 0.9 | 0.996 | 0.997 | 0.996 | 0.995 |
| 50 | 0.3 | 0.5 | 0.323 | 0.309 | 0.323 | 0.296 |
|    |     | 0.6 | 0.451 | 0.424 | 0.451 | 0.433 |
|    |     | 0.7 | 0.630 | 0.633 | 0.630 | 0.614 |
|    |     | 0.8 | 0.851 | 0.849 | 0.851 | 0.844 |
|    |     | 0.9 | 0.994 | 0.995 | 0.994 | 0.992 |

Note: Total N: total number of subjects; k: number of items; \({\varDelta}_{PP}={\delta}_{PP}/{\sigma}_{\mu }\); C α : Cronbach alpha; φ PP : theoretical power (20); \(\tilde{\varphi}_{PP}\): simulation-based empirical power
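The theoretical entries of Table 2 can be reproduced from the pre-post power function (20). Since that equation appears earlier in the paper, the form used below is our reconstruction from the Discussion's variance expression Var(\(\widehat{\delta}_{PP}\)) = 2σ e ²/(kN), which yields φ PP  = Φ{|Δ PP |√(N/(2(1/C α  − 1))) − Φ −1 (1 − α/2)}:

```python
from math import sqrt
from statistics import NormalDist

def power_pp(delta_pp, n_total, cron_alpha, alpha=0.05):
    """Power of the pre-post test, reconstructed from Var = 2*sigma_e^2/(kN)."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(abs(delta_pp) * sqrt(n_total / (2 * (1 / cron_alpha - 1))) - crit)

print(round(power_pp(0.4, 30, 0.5), 3))  # 0.341, the first theoretical entry
```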

Between-group within-group comparison

Table 3 shows that the theoretical power φ BW (29) is very close to the empirical power \(\tilde{\varphi}_{BW}\) obtained through the simulations. Therefore, the results validate that the statistical power φ BW increases with increasing C α for testing hypotheses concerning between-group effects on within-group changes, regardless of N, the sample size per group, and k. Again, the statistical power does not depend on k for a given C α even though the correlation ρ does.
Table 3

Statistical power of the between-group within-group test T BW (27): σ μ  = 1

| N per group | Δ BW | C α | φ BW (k = 5) | \(\tilde{\varphi}_{BW}\) (k = 5) | φ BW (k = 10) | \(\tilde{\varphi}_{BW}\) (k = 10) |
|----|-----|-----|-------|-------|-------|-------|
| 30 | 0.4 | 0.5 | 0.194 | 0.179 | 0.183 | 0.194 |
|    |     | 0.6 | 0.268 | 0.264 | 0.254 | 0.268 |
|    |     | 0.7 | 0.387 | 0.375 | 0.359 | 0.387 |
|    |     | 0.8 | 0.591 | 0.618 | 0.594 | 0.591 |
|    |     | 0.9 | 0.908 | 0.884 | 0.901 | 0.908 |
| 50 | 0.3 | 0.5 | 0.164 | 0.184 | 0.214 | 0.184 |
|    |     | 0.6 | 0.242 | 0.254 | 0.261 | 0.254 |
|    |     | 0.7 | 0.387 | 0.367 | 0.365 | 0.367 |
|    |     | 0.8 | 0.511 | 0.564 | 0.591 | 0.564 |
|    |     | 0.9 | 0.893 | 0.889 | 0.893 | 0.889 |

Note: N per group: number of subjects per group; k: number of items; \({\varDelta}_{BW}={\delta}_{BW}/{\sigma}_{\mu }\); C α : Cronbach alpha; φ BW : theoretical power (29); \(\tilde{\varphi}_{BW}\): simulation-based empirical power

Two-sample between-group comparison

Table 4 shows again that the theoretical power φ TS (37) is very close to the empirical power \(\tilde{\varphi}_{TS}\) obtained through the simulations. The results validate that the statistical power increases with increasing Cronbach α even for two-sample testing in cross-sectional settings that do not involve within-group effects. Again, the statistical power does not depend on k for a given C α even though the correlation ρ does.
Table 4

Statistical power of the two-sample between-group test T TS (35): σ μ  = 1

| N per group | Δ TS | C α | φ TS (k = 5) | \(\tilde{\varphi}_{TS}\) (k = 5) | φ TS (k = 10) | \(\tilde{\varphi}_{TS}\) (k = 10) |
|-----|-----|-----|-------|-------|-------|-------|
| 50  | 0.7 | 0.5 | 0.697 | 0.676 | 0.697 | 0.697 |
|     |     | 0.6 | 0.774 | 0.758 | 0.774 | 0.760 |
|     |     | 0.7 | 0.834 | 0.812 | 0.834 | 0.813 |
|     |     | 0.8 | 0.879 | 0.872 | 0.879 | 0.882 |
|     |     | 0.9 | 0.913 | 0.901 | 0.913 | 0.895 |
| 100 | 0.5 | 0.5 | 0.705 | 0.682 | 0.705 | 0.679 |
|     |     | 0.6 | 0.782 | 0.791 | 0.782 | 0.769 |
|     |     | 0.7 | 0.841 | 0.820 | 0.841 | 0.832 |
|     |     | 0.8 | 0.885 | 0.879 | 0.885 | 0.908 |
|     |     | 0.9 | 0.918 | 0.929 | 0.918 | 0.912 |

Note: N per group: number of subjects per group; k: number of items; \({\varDelta}_{TS}={\delta}_{TS}/{\sigma}_{\mu }\); C α : Cronbach alpha; φ TS : theoretical power (37); \(\tilde{\varphi}_{TS}\): simulation-based empirical power

Discussion

We demonstrate, by deriving explicit power functions, that higher internal consistency or reliability of unidimensional parallel instrument items, as measured by Cronbach alpha C α , results in greater statistical power for several tests, regardless of whether comparisons are made within or between groups. In addition, the test-retest reliability correlation of such items is shown to be the same as Cronbach alpha C α . Due to this property, testing the significance of C α is equivalent to testing that of a correlation through the Fisher z-transformation. Furthermore, all of the power functions derived herein can even be applied to trials using single-item instruments with measurement error, since each power function depends only on C α , which can be estimated via a test-retest correlation for a single-item instrument as mentioned earlier. The demonstrations are made theoretically, and validated through simulation studies showing that the theoretical and the empirical powers are very close to each other. Therefore, the sample size determination formulas (21), (30), and (38) are valid, and so are the determinations of the number of items (22), (31), and (39) in the different settings.
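The Fisher z-transformation test mentioned above can be sketched as follows (Python, our illustration; the numbers are purely hypothetical: a test-retest correlation of 0.7 estimated from N = 50 subjects, tested against a null value of 0.5):

```python
from math import atanh, sqrt
from statistics import NormalDist

def fisher_z_test(r, n, r0=0.0):
    """Two-sided p-value for H0: correlation = r0, via the Fisher z-transformation.

    r: observed correlation (here, C_alpha estimated as a test-retest correlation)
    n: number of subjects; atanh(r) has approximate SE = 1/sqrt(n - 3)
    """
    z_stat = (atanh(r) - atanh(r0)) * sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))

print(round(fisher_z_test(0.7, 50, r0=0.5), 3))  # 0.029
```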

In fact, for longitudinal studies aiming to compare within-group effects using tests such as T PP (18) and T BW (27), the fixed true score variance assumption is not critical since the true scores μ i in model (2) cancel when the differences of Y between pre- and post-intervention are taken, which makes the variance of the pre-post differences depend only on the measurement error variance \({\sigma}_e^{\mathsf{2}}\). For example, the variance equations (17) and (26) can be expressed in terms of \({\sigma}_e^{\mathsf{2}}\) alone, a decreasing function of ρ, through equation (4) as follows: \(Var\left({\widehat{\delta}}_{PP}\right)=2{\sigma}_e^{2}/\left(kN\right)\) and \(Var\left({\widehat{\delta}}_{BW}\right)=4{\sigma}_e^{2}/\left(kN\right)\). In other words, both the power functions φ PP (20) and φ BW (29) are increasing functions of C α or ρ regardless of whether the total variance or the true score variance is assumed fixed.

In contrast, however, for cross-sectional studies aiming to compare between-group effects using T TS (35), the fixed true score variance assumption is critical since the variance equation (34) cannot be expressed in terms of \({\sigma}_e^{\mathsf{2}}\) alone; furthermore, it can be shown that under a fixed total variance assumption \(Var\left({\widehat{\delta}}_{TS}\right)\) (34) is an increasing function of ρ (see equation (10)), making the power a decreasing function of ρ. In sum, the fixed true score variance assumption enables all of the power functions to be increasing functions of C α or ρ in a unified fashion. For example, Leon et al. [20] used a real data set of HRSD ratings to empirically demonstrate that the statistical power of a two-sample between-group test increases with increased C α , although they increased C α by increasing the number of items k, not necessarily by increasing ρ for a fixed number of items.

In most cases, item scores are designed to be binary or ordinal scores on a Likert scale. Therefore, the applicability of the derived power functions and sample size formulas to such cases could be in question since the scores are not normally distributed. Furthermore, it is not easy to build a model like (2) for non-normal scores, particularly because the measurement error variances depend on the true construct value; for example, the variance of a binary score is a function of its mean. Construction of marginal models in the sense of generalized estimating equations [21] could perhaps be considered for the derivation of power functions, even if this approach is beyond the scope of the present study. Nevertheless, we believe that our results should be applicable to non-normal scores by virtue of the central limit theorem. Another prominent limitation of our study is the very strong assumption of essentially τ-equivalent parallel items, which may not be realistic [8], albeit conceivable for a unidimensional construct. Therefore, further development of power functions under relaxed conditions reflecting more real-world situations would be a valuable future study.

Conclusion

Instruments with greater Cronbach alpha should be preferred since they have smaller measurement error and thus yield greater statistical power in any research setting, cross-sectional or longitudinal. However, when items are parallel, targeting a unidimensional construct, the Cronbach alpha of an instrument should be enhanced by developing a set of highly correlated items, not by unduly increasing the number of items with inadequate inter-item correlations.

Abbreviations

HRSD: 

Hamilton Rating Scale of Depression

Declarations

Acknowledgements

We are grateful to the late Dr. Andrew C. Leon for initial discussion of the problems under study.

Funding

This work was in part supported by the NIH grants P30MH068638, UL1 TR001073, and the Albert Einstein College of Medicine funds.

Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Authors’ Affiliations

(1)
Department of Epidemiology and Population Health, Albert Einstein College of Medicine
(2)
Department of Radiology, Albert Einstein College of Medicine
(3)
Department of Nutrition, Gillings School of Public Health, University of North Carolina—Chapel Hill

References

  1. Hamilton M. A rating scale for depression. J Neurol Neurosurg Psychiatry. 1960;23:56–62.
  2. Nunnally JC, Bernstein IH. Psychometric Theory. 3rd ed. New York: McGraw-Hill; 1994.
  3. Lord FM, Novick MR. Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley; 1968.
  4. Schmitt N. Uses and abuses of coefficient alpha. Psychol Assess. 1996;8(4):350–3.
  5. Cronbach L. Coefficient alpha and the internal structure of tests. Psychometrika. 1951;16:297–334.
  6. Bland JM, Altman DG. Cronbach’s alpha. Br Med J. 1997;314(7080):572.
  7. Cortina JM. What is coefficient alpha? An examination of theory and applications. J Appl Psychol. 1993;78(1):98–104.
  8. Sijtsma K. On the use, the misuse, and the very limited usefulness of Cronbach’s alpha. Psychometrika. 2009;74(1):107–20.
  9. Novick MR, Lewis C. Coefficient alpha and the reliability of composite measurements. Psychometrika. 1967;32(1):1–13.
  10. Charter RA. Statistical approaches to achieving sufficiently high test score reliabilities for research purposes. J Gen Psychol. 2008;135(3):241–51.
  11. Feldt LS, Charter RA. Estimating the reliability of a test split into two parts of equal or unequal length. Psychol Methods. 2003;8(1):102–9.
  12. Feldt LS, Ankenmann RD. Determining sample size for a test of the equality of alpha coefficients when the number of part-tests is small. Psychol Methods. 1999;4(4):366–77.
  13. Feldt LS, Ankenmann RD. Appropriate sample size for comparing alpha reliabilities. Appl Psychol Meas. 1998;22(2):170–8.
  14. Padilla MA, Divers J, Newton M. Coefficient alpha bootstrap confidence interval under nonnormality. Appl Psychol Meas. 2012;36(5):331–48.
  15. Bonett DG, Wright TA. Cronbach’s alpha reliability: interval estimation, hypothesis testing, and sample size planning. J Organ Behav. 2015;36(1):3–15.
  16. Bonett DG. Sample size requirements for testing and estimating coefficient alpha. J Educ Behav Stat. 2002;27(4):335–40.
  17. Bonett DG. Sample size requirements for comparing two alpha coefficients. Appl Psychol Meas. 2003;27(1):72–4.
  18. Donner A, Birkett N, Buck C. Randomization by cluster. Sample size requirements and analysis. Am J Epidemiol. 1981;114(6):906–14.
  19. Goldstein H. Multilevel Statistical Models. 2nd ed. New York: Wiley & Sons; 1996.
  20. Leon AC, Marzuk PM, Portera L. More reliable outcome measures can reduce sample size requirements. Arch Gen Psychiatry. 1995;52(10):867–71.
  21. Zeger SL, Liang KY, Albert PS. Models for longitudinal data: a generalized estimating equation approach. Biometrics. 1988;44(4):1049–60.

Copyright

© Heo et al. 2015
