Sample size calculations for cluster randomised controlled trials with a fixed number of clusters
© BioMed Central Ltd 2011
Received: 15 February 2011
Accepted: 30 June 2011
Published: 30 June 2011
Cluster randomised controlled trials (CRCTs) are frequently used in health service evaluation. Assuming an average cluster size, required sample sizes are readily computed for both binary and continuous outcomes, by estimating a design effect or inflation factor. However, where the number of clusters is fixed in advance, but where it is possible to increase the number of individuals within each cluster, as is frequently the case in health service evaluation, sample size formulae have been less well studied.
We systematically outline sample size formulae (including required number of randomisation units, detectable difference and power) for CRCTs with a fixed number of clusters, to provide a concise summary for both binary and continuous outcomes. Extensions to the case of unequal cluster sizes are provided.
For trials with a fixed number of equal sized clusters (k), the trial will be feasible provided the number of clusters is greater than the product of the number of individuals required under individual randomisation ($n_I$) and the estimated intra-cluster correlation (ρ). So, a simple rule is that the number of clusters (k) will be sufficient provided $k > n_I \rho$.
Where this is not the case, investigators can determine the maximum available power to detect the pre-specified difference, or the minimum detectable difference under the pre-specified value for power.
Designing a CRCT with a fixed number of clusters might mean that the study will not be feasible, leading to the notion of a minimum detectable difference (or a maximum achievable power), irrespective of how many individuals are included within each cluster.
Cluster randomised controlled trials (CRCTs), in which clusters of individuals are randomised to intervention groups, are frequently used in the evaluation of service delivery interventions, primarily to avoid contamination but also for logistic and economic reasons [1–3]. Whilst a well conducted individually Randomised Controlled Trial (RCT) is the gold standard for assessing the effectiveness of pharmacological treatments, the evaluation of many health care service delivery interventions is difficult or impossible without recourse to cluster trials. Standard sample size formulae for CRCTs require the investigator to pre-specify an average cluster size, to determine the number of clusters required. In so doing, these sample size formulae implicitly assume that the number of clusters can be increased as required [1, 3–5].
However, when evaluating health care service delivery interventions the number of clusters might be limited to a fixed number even though the sample size within each cluster can be increased. In a real example, evaluating lay pregnancy support workers, clusters consisted of groups of pregnant women under the care of different midwifery teams [6, 7]. The available number of clusters was restricted to the midwifery teams within a particular geographical region. Yet within each midwifery team it was possible to recruit any reasonable number of individuals by extending the recruitment period. In another real example, a CRCT to evaluate the effectiveness of a combined polypill (statin, aspirin and blood pressure lowering drugs) in Iran was limited to a fixed number of villages participating in an existing cohort study. Other such examples of designs in which a limited number of clusters were available include trials of community based diabetes educational programs and general practice based interventions to reduce primary care prescribing errors, both of which were limited to the number of general practices which agreed to participate.
The existing literature on sample size formulae for CRCTs focuses largely on the case where there is no limit on the number of available clusters [3–5, 11, 12]. Whilst it is well known that the statistical power that can be achieved by additional recruitment within clusters is limited, and that this depends on the intra-cluster correlation [11–13], little attention has been paid to the limitations imposed when the number of clusters is fixed in advance. This paper aims to fill this gap by exploring the range of effect sizes, and differences between proportions, that can be detected when the number of clusters is fixed. We describe a simple check to determine whether it is feasible to detect a specified effect size (or difference between proportions) when the number of clusters is fixed in advance; and for those cases in which it is infeasible, we determine the minimum detectable difference possible under the required power and the maximum achievable power to detect the required difference. We illustrate these ideas by considering the design of a CRCT to detect an increase in breastfeeding rates where the number of clusters is fixed.
For completeness we outline formulae for simpler designs for which the sample size formulae are relatively well known, or easily derived, as an important prelude. In so doing, the simple relationships between the formulae are clear and this allows progressive development to the less simple situation (that of binary detectable difference or power). It is hoped that by developing the formulae in this way the material will be accessible to applied statisticians and more mathematically minded health care researchers. We also provide a set of guidelines useful for investigators when designing trials of this nature.
Generally, suppose a trial is to be designed to test the null hypothesis $H_0: \mu_0 = \mu_1$, where $\mu_0$ and $\mu_1$ represent the means of some variable in the control and intervention arms respectively, and where the variance of that variable is assumed to be the same, $\sigma^2$, in both arms. Suppose further that an equal number of individuals is to be randomised to each arm. Let n denote the number of individuals per arm, d the difference to be detected (such that $d = \mu_0 - \mu_1$), $1 - \beta$ the power and $\alpha$ the significance level. We limit our consideration to trials with two equal sized parallel arms, a common standard deviation and a two-sided test; we assume normality of outcomes and approximate the variance of the difference of two proportions. The subscript I (for Individual randomisation) is used throughout to highlight any quantities which are specific to individual randomisation; likewise the subscript C (for Cluster randomisation) is used throughout to highlight any quantities which are specific to cluster randomisation. No subscripts are used to distinguish cluster from individual randomisation for variables which are pre-specified by the user.
where $z_{\alpha/2}$ denotes the upper $100\alpha/2$ standard normal centile.
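As a minimal sketch of the individually randomised case for a continuous outcome, the following assumes the standard normal-approximation formulae, $n_I = 2\sigma^2 (z_{\alpha/2} + z_\beta)^2 / d^2$ per arm and its inverse for the detectable difference. The function names are hypothetical and are not taken from the authors' software.

```python
from math import ceil, sqrt
from statistics import NormalDist


def z_upper(p):
    """Upper 100*p centile of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - p)


def n_individual(d, sigma, alpha=0.05, power=0.80):
    """Individuals per arm needed to detect a mean difference d (two-sided test)."""
    z = z_upper(alpha / 2) + z_upper(1 - power)
    return ceil(2 * sigma**2 * z**2 / d**2)


def detectable_difference(n, sigma, alpha=0.05, power=0.80):
    """Difference detectable with n individuals per arm at the stated power."""
    z = z_upper(alpha / 2) + z_upper(1 - power)
    return sqrt(2 * sigma**2 / n) * z
```

The two functions are inverses of one another up to the rounding of n, so either can be used to cross-check the other.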
where Φ is the cumulative standardised Normal distribution.
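The power for a fixed number of individuals per arm can be sketched in the same way. This assumes the usual approximation $1 - \beta = \Phi(\sqrt{n d^2 / (2\sigma^2)} - z_{\alpha/2})$, which ignores the negligible probability of rejecting in the wrong direction; `power_individual` is a hypothetical name.

```python
from math import sqrt
from statistics import NormalDist


def power_individual(n, d, sigma, alpha=0.05):
    """Approximate power of a two-sided test with n individuals per arm."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    # Standardised effect size scaled by the precision available from n per arm.
    return nd.cdf(sqrt(n * d**2 / (2 * sigma**2)) - z_alpha)
```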
for testing the two sided hypothesis $H_0: \pi_1 = \pi_2$.
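For binary outcomes, a sketch under the assumption that the variance of the difference is approximated by the sum of the two binomial variances (no pooling and no continuity correction) is given below. Under that assumption the calculation gives 385 and 515 individuals per arm for 40% vs 50% at 80% and 90% power, the values used in the worked example later in the paper. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_individual_binary(p1, p2, alpha=0.05, power=0.80):
    """Individuals per arm to detect p1 vs p2 with a two-sided test."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)
    # Sum of the binomial variances in the two arms.
    var_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(z**2 * var_sum / (p1 - p2) ** 2)
```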
again with rounding up to the average cluster size.
Where a CRCT is to be designed with a completely fixed size, that is with a fixed number of clusters, each of a fixed size (although this size may vary between clusters), then it is possible to evaluate both the detectable difference and the power, as would be the case in a design using individual randomisation. CRCTs of fixed size might not be the commonest of designs, but the formulae presented below are an important prelude to later formulae; might be useful for retrospectively computing power once a trial has commenced (and thus the size has been determined); and will also be useful in the limited number of studies for which the trial sample size is indeed completely fixed (for example within a cohort study) [9, 10].
where $d_I$ is the detectable difference using individual randomisation and VIF might be either of those presented at equations 5 and 6. So the detectable difference in a CRCT can be thought of as the detectable difference in a trial using individual randomisation, inflated by the square-root of the variance inflation factor.
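The inflation described here can be sketched as follows. The variance inflation factors are assumed to be the commonly used forms $1 + (m - 1)\rho$ for equal cluster sizes and $1 + ((cv^2 + 1)m - 1)\rho$ for unequal cluster sizes with mean size m; both the formulae and the function names are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def vif(m, rho, cv=0.0):
    """Variance inflation factor; cv = 0 gives the equal cluster size case."""
    return 1 + ((cv**2 + 1) * m - 1) * rho


def detectable_difference_cluster(k, m, rho, sigma, cv=0.0, alpha=0.05, power=0.80):
    """Detectable difference with k clusters per arm of (mean) size m."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)
    n = k * m  # individuals per arm
    d_individual = sqrt(2 * sigma**2 / n) * z
    # Inflate the individually randomised value by sqrt(VIF).
    return d_individual * sqrt(vif(m, rho, cv))
```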
where again, VIF might be either of those presented at equations 5 and 6. So, power in a CRCT can be thought of as the power available under individual randomisation for a standardised effect size which is deflated by the square-root of the variance inflation factor.
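Written out under the same assumptions (n = km individuals per arm, and the normal approximation for power under individual randomisation), the deflation of the standardised effect size takes the following form. This is a reconstruction from the verbal description rather than a quotation of the original equation.

```latex
\[
1 - \beta_C \;=\; \Phi\!\left( \sqrt{\frac{n\,d^2}{2\sigma^2\,\mathrm{VIF}}} \;-\; z_{\alpha/2} \right),
\qquad n = km .
\]
```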
Standard sample size formulae for CRCTs, by assuming knowledge of the cluster size (m) and determining the required number of clusters (k), implicitly assume that the number of clusters can be increased as required. However, in the design of health service interventions, it is often the case that the number of clusters will be limited by the number of cluster units willing or able to participate. So for example, in two general practice based CRCTs (one to evaluate lay education in diabetes and the other to evaluate a general practice-based intervention to reduce primary care prescribing errors), the number of clusters was limited to the number of primary care practices that agreed to participate in the study. From an estimate of the number of clusters available, it is relatively straightforward to determine the required cluster size for each of the clusters. However, due to the limited increase in precision available by increasing cluster sizes, it might not always be feasible to detect the required difference at required power under a design with a fixed number of k clusters. These issues are explored below.
where $n_I$ is the sample size required under individual randomisation. This increase in sample size, over that required under individual randomisation, is no longer a simple inflation, as the inflation required is now dependent on the sample size required under individual randomisation.
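One way to see why the inflation depends on $n_I$ is to substitute $m = n_C / k$ into the usual relation $n_C = n_I\,[1 + (m - 1)\rho]$ and solve for $n_C$. The derivation below is a sketch assuming equal cluster sizes and that standard variance inflation factor.

```latex
\begin{align*}
n_C &= n_I\,[1 + (m-1)\rho], \qquad m = n_C / k \\
n_C &= n_I\,(1-\rho) + \frac{n_I\,\rho}{k}\,n_C \\
n_C \left(1 - \frac{n_I\,\rho}{k}\right) &= n_I\,(1-\rho) \\
n_C &= \frac{n_I\,k\,(1-\rho)}{k - n_I\,\rho}
\end{align*}
```

The multiplier $k(1-\rho)/(k - n_I\rho)$ contains $n_I$ itself, and it is positive and finite only when $k > n_I\rho$.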
this time rounding up the total sample size to a multiple of the number of clusters (k) available (using the ceiling function).
again rounding up to a multiple of the number of clusters (k) available.
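A minimal sketch of the cluster size calculation with the rounding described above follows. For unequal cluster sizes it assumes the variance inflation factor $1 + ((cv^2 + 1)m - 1)\rho$, which leads to $n_C = n_I k (1 - \rho) / (k - n_I \rho (1 + cv^2))$; `cluster_design` is a hypothetical helper name.

```python
from math import ceil


def cluster_design(n_i, k, rho, cv=0.0):
    """Return (m, n_c): (mean) cluster size and individuals per arm.

    Returns None when no cluster size can achieve the required precision.
    """
    denom = k - n_i * rho * (1 + cv**2)
    if denom <= 0:
        return None
    n_c = n_i * k * (1 - rho) / denom
    # Round the total up to a multiple of the number of clusters.
    m = ceil(n_c / k)
    return m, m * k
```

With $n_I$ = 385, k = 20 and ρ = 0.005 this gives 22 individuals per cluster (440 per arm), and with $n_I$ = 515 it gives 30 per cluster (600 per arm).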
for unequal cluster sizes. Here, $n_I$ is the required sample size under individual randomisation, k is the available number of clusters, ρ is the estimated intra-cluster correlation coefficient, and cv represents the coefficient of variation of cluster sizes. When this inequality does not hold, it will be necessary to re-evaluate the specifications of this sample size calculation. This might consist of a re-evaluation of the power and significance level of the trial, or it might consist of a re-evaluation of the detectable difference. Bounds, imposed as a result of the limited precision, on the detectable difference and power are derived below.
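The feasibility check itself is a single comparison. The sketch below assumes the condition $k > n_I \rho$ for equal cluster sizes and $k > n_I \rho (1 + cv^2)$ for unequal cluster sizes, the latter being derived from the assumed variance inflation factor $1 + ((cv^2 + 1)m - 1)\rho$. The function name is hypothetical.

```python
def is_feasible(n_i, k, rho, cv=0.0):
    """True if k clusters per arm can deliver the required precision.

    n_i: individuals per arm under individual randomisation
    k:   clusters available per arm
    rho: intra-cluster correlation
    cv:  coefficient of variation of cluster sizes (0 for equal sizes)
    """
    return k > n_i * rho * (1 + cv**2)
```

For instance, with $n_I$ = 385 and k = 20 the design is feasible at ρ = 0.005 but not at ρ = 0.07, since 385 × 0.07 = 26.95 exceeds 20.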
which follows naturally from the formula for detectable difference (equation 1) and the bound on precision (equation 18). This therefore gives a bound on the detectable difference achievable in a trial with a fixed number of clusters.
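The bound can be seen by letting the cluster size grow without limit in the detectable difference for a CRCT. The following is a sketch for equal cluster sizes, assuming the variance inflation factor $1 + (m-1)\rho$; the symbol $d_{\min}$ is introduced here for illustration only.

```latex
\begin{align*}
d_C &= \sqrt{\frac{2\sigma^2\,[1 + (m-1)\rho]}{km}}\;(z_{\alpha/2} + z_\beta) \\
\lim_{m \to \infty} \frac{1 + (m-1)\rho}{km} &= \frac{\rho}{k} \\
d_{\min} &= \sqrt{\frac{2\sigma^2 \rho}{k}}\;(z_{\alpha/2} + z_\beta)
\end{align*}
```

However large the clusters are made, differences smaller than $d_{\min}$ cannot be detected at the stated power.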
Each of these two solutions to this quadratic will provide the limit on $\pi_2$ for two sided tests.
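As a hedged illustration of the quadratic, the sketch below sets the feasibility condition to equality, $k(\pi_2 - \pi_1)^2 = \rho (z_{\alpha/2} + z_\beta)^2 [\pi_1(1-\pi_1) + \pi_2(1-\pi_2)]$, and solves for $\pi_2$. This assumes equal cluster sizes and the unpooled binomial variance; it is a reconstruction and may differ in form from the original equation. `mdd_binary_limits` is a hypothetical name.

```python
from math import sqrt
from statistics import NormalDist


def mdd_binary_limits(p1, k, rho, alpha=0.05, power=0.80):
    """Return (lower, upper) limits on p2 for k clusters per arm."""
    nd = NormalDist()
    c = rho * (nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)) ** 2 / k
    # Quadratic a*p2^2 + b*p2 + const = 0 obtained by expanding the equality.
    a = 1 + c
    b = -(2 * p1 + c)
    const = p1**2 - c * p1 * (1 - p1)
    root = sqrt(b * b - 4 * a * const)
    return (-b - root) / (2 * a), (-b + root) / (2 * a)
```

With a baseline rate of 0.40, k = 20, ρ = 0.07 and 80% power, the upper root corresponds to a difference of roughly 0.116, which is 0.12 to two decimal places.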
This therefore provides an upper limit on the power available under a design with a fixed number of clusters k.
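A sketch of the limiting power follows, obtained by replacing the per-arm variance with its large-cluster limit (proportional to ρ/k). Equal cluster sizes and ρ > 0 are assumed, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def max_power_continuous(d, sigma, k, rho, alpha=0.05):
    """Upper limit on power for a continuous outcome with k clusters per arm."""
    nd = NormalDist()
    return nd.cdf(sqrt(k * d**2 / (2 * sigma**2 * rho)) - nd.inv_cdf(1 - alpha / 2))


def max_power_binary(p1, p2, k, rho, alpha=0.05):
    """Upper limit on power for a binary outcome with k clusters per arm."""
    nd = NormalDist()
    var_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return nd.cdf(sqrt(k * (p1 - p2) ** 2 / (rho * var_sum)) - nd.inv_cdf(1 - alpha / 2))
```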
Determine the required number of individuals per arm in a trial using individual randomisation ($n_I$).
where cv is the coefficient of variation of cluster sizes.
Where the design is not feasible and cluster sizes are unequal, determine whether the design becomes feasible with equal cluster sizes (i.e. if $k > n_I \rho$).
Either: the power must be reset at a value lower than the maximum available power (equation 28),
Or: the detectable difference must be set greater than the minimum detectable difference (equations 23 (continuous outcomes) and 26 (binary outcomes)),
Or: both power and detectable difference are adjusted in combination.
Once a feasible design is found, determine the required number of individuals per cluster from equations 14 (for equal cluster sizes) and 16 (for varying cluster sizes).
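The steps above can be strung together for a binary outcome roughly as follows. This is an illustrative sketch only: `plan_binary_crct` is a hypothetical helper, the formulae are the assumed normal-approximation forms (unpooled binomial variance, variance inflation factor $1 + ((cv^2 + 1)m - 1)\rho$), and the bounds for unequal cluster sizes are obtained by treating $k / (1 + cv^2)$ as an effective number of clusters, which is an assumption of this sketch.

```python
from math import ceil, sqrt
from statistics import NormalDist

ND = NormalDist()


def plan_binary_crct(p1, p2, k, rho, cv=0.0, alpha=0.05, power=0.80):
    """Walk through the design steps for k clusters per arm."""
    z_a = ND.inv_cdf(1 - alpha / 2)
    z = z_a + ND.inv_cdf(power)
    var_sum = p1 * (1 - p1) + p2 * (1 - p2)

    # Step 1: individuals per arm under individual randomisation.
    n_i = ceil(z**2 * var_sum / (p1 - p2) ** 2)

    # Step 2: feasibility check, allowing for varying cluster sizes.
    inflate = 1 + cv**2
    if k > n_i * rho * inflate:
        # Final step: (mean) cluster size, rounded up to a whole number.
        n_c = n_i * k * (1 - rho) / (k - n_i * rho * inflate)
        m = ceil(n_c / k)
        return {"feasible": True, "n_i": n_i, "m": m, "n_c": m * k}

    # Infeasible: report whether equal cluster sizes would help, together
    # with the maximum power and the minimum detectable difference.
    k_eff = k / inflate
    max_power = ND.cdf(sqrt(k_eff * (p1 - p2) ** 2 / (rho * var_sum)) - z_a)
    c = rho * z**2 / k_eff
    root = sqrt(c**2 + 4 * c * (2 + c) * p1 * (1 - p1))
    sign = 1 if p2 > p1 else -1
    p2_limit = (2 * p1 + c + sign * root) / (2 * (1 + c))
    return {
        "feasible": False,
        "n_i": n_i,
        "feasible_if_equal": k > n_i * rho,
        "max_power": max_power,
        "mdd": abs(p2_limit - p1),
    }
```

When the design is reported as infeasible, the investigator would then lower the power below `max_power`, raise the target difference above `mdd`, or both, and rerun the calculation.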
In a real example, a CRCT is to be designed to evaluate the effectiveness of lay support workers to promote breastfeeding initiation and sustainability until 6 weeks postpartum. Due to fears of contamination, whereby new mothers inadvertently gain access and support from the lay workers, the intervention is to be randomised over cluster units. Cluster randomisation will also ensure that the trial is logistically simpler to run, as randomisation will be carried out at a single point in time, and midwives will have the benefit of remaining in either the intervention or control arm for the duration of the trial. The cluster units to be used are midwifery teams, which are teams of midwives who visit a set number of primary care general practices to deliver antenatal and postnatal care. The trial is to be carried out within a single primary care trust within the West Midlands. The nature of this design therefore means that the number of clusters available is fixed at the number of midwifery teams delivering care within the region.
At the time of designing the trial, current breastfeeding rates, at 6 weeks postpartum, in the region were around 40%. National targets had been set to encourage all regions to increase rates to around 50%. It was known that 40 clusters are available (i.e. there are 40 midwifery teams within the region), so that the number of clusters per arm was fixed to k = 20. Estimates of ICC range from 0.005 to 0.07 in similar trials [6, 7].
Estimates of the Minimum Detectable Difference (MDD) for a trial with 20 clusters per arm, to detect an increase in an event rate from 40%

                    Power = 80%     Power = 90%
ICC = 0.005
  Comparison        40% vs 50%      40% vs 50%
  $n_I$             385             515
  $n_C$             440             600
  m                 22              30
ICC = 0.07
  MDD               12%             14%
  $n_I$             267             262
  $n_C$             3,780           2,920
  m                 189             146
Secondly, the feasibility check is evaluated to determine whether the 20 available clusters per arm are sufficient to detect the 10 percentage point change assuming the higher estimated ICC (0.07). In this case 385 × 0.07 = 26.95 > 20, so the condition is not met at the 80% power level (nor, therefore, at the 90% power level). Therefore, 20 clusters per arm is not a sufficient number of clusters, however many individuals are included within each cluster, to detect the required effect size at the pre-specified power and significance.
Since this latter design is not feasible, formulae at equation 25 allow determination of the minimum detectable difference (or maximum achievable power from equation 27). For a cluster trial with 80% power, and assuming a baseline event rate of $\pi_1 = 0.40$, the minimum detectable difference is 0.12 (to 2 d.p.). That is, a change from 40% to 52%. To detect a change from 40% to 52% with 80% power, 189 individuals would be required per cluster. For a trial with 90% power, the minimum detectable difference is 0.14 (i.e. a change from 40% to 54%). To detect a change from 40% to 54% with 90% power, 146 individuals would be required per cluster.
In health care service evaluation, cluster RCTs in which the number of available clusters is pre-specified are frequently used. That is, trials are designed based on a limited number of cluster units (e.g. GP practices) willing or able to participate [6, 7, 9, 10]. In contrast, sample size methods are almost exclusively based on pre-specified average cluster sizes, as opposed to the number of clusters available [1, 4]. Whilst mapping sample size formulae from one method to the other is straightforward, a limit on the precision of estimates in such designs leads to a maximum available power (that is, a limit on the power available irrespective of how large the clusters are) and minimum detectable differences (that is, a limit on the difference detectable irrespective of how large the clusters are).
For example, with just 15 clusters available per arm and an ICC of 0.05, power achievable for a trial aiming to detect an increase from 40% to 50% is limited to about 62%, irrespective of how large the clusters are made. Cluster trials with just 15 clusters available per arm are not uncommon, and a 10 percentage point change is not an unrealistic goal in many settings. However, power levels as low as 60% are clearly sub-optimal, and might not be regarded as sufficiently high to warrant the costs of a clinical trial. Formulae provided here for minimum detectable differences show that to retain a power level in the region of 80%, triallists would have to be content with detecting a difference above a twelve percentage point change. Re-formulation of the problem in terms of minimum detectable difference can thus be used to compare the difference which is statistically detectable (at acceptable power levels) to that which is clinically, or managerially, important.
Should the situation arise in which the postulated ICC suggests that it is not possible to detect the required difference (at pre-specified power), it might be tempting to lower the estimated ICC. Such an approach should be strongly discouraged, since loss of power will most likely result, potentially leading to a non-significant finding. Rather, formulae here allow sensitivity of the design to be explored in light of possible variations in the ICC. However, other avenues to increase available power might reasonably be considered. For example, it may be plausible to consider relaxing alpha and even to set alpha and beta equivalent. Alternatively, incorporating prior information in a Bayesian framework may lead to increases in power. It might further be argued that studies of limited power are of importance as they contribute to the evidence framework by ultimately becoming part of future systematic reviews, and the methods presented here thus allow for the achievable power to be computed. Before-and-after type studies offer a further avenue of exploration, as by their very nature they induce smaller intra-cluster correlations.
Methodological limitations of the work presented here include the assumption of equal sized arms; equal standard deviations; Normality assumptions (which might not be tenable for small numbers of clusters as well as small numbers of individuals); and lack of continuity correction for binary variables. Furthermore, CRCTs with a small number of clusters are controversial, primarily because the small number of units randomised opens results to the possibility of bias and approximations to Normality become questionable. However, despite this, CRCTs with a small number of clusters are frequently reported. The Medical Research Council, for instance, has issued guidelines that cluster trials with fewer than 5 clusters per arm are inadvisable. Others have considered some of the issues involved in community based intervention trials with a small number of clusters, but have focused on issues of restricted randomisation and whether the analysis should be at the individual or cluster level.
For infeasible designs to retain acceptable levels of power, the detectable difference might not be as small as desired, leading to the notion of a minimum detectable difference. Useful aides-mémoire are that the detectable difference in a CRCT is that of an individual RCT inflated by the square root of the variance inflation factor; and the power is that under individual randomisation with the standardised effect size deflated by the square root of the variance inflation factor. A Stata function, clusterSampleSize.ado, allows practical implementation of all formulae discussed here and is available from the author.
K. Hemming, R. J. Lilford and A. Girling were funded by the Engineering and Physical Sciences Research Council of the UK through the MATCH programme (grant GR/S29874/01) and by a National Institute of Health Research grant for Collaborations for Leadership in Applied Health Research and Care (CLAHRC), for the duration of this work. The views expressed in this publication are not necessarily those of the NIHR or the Department of Health. The authors would like to express their gratitude to Monica Taljaard and Sandra Eldridge for review comments which helped to develop the material.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.