Undue reliance on I² in assessing heterogeneity may mislead
© Rücker et al. 2008
Received: 15 August 2008
Accepted: 27 November 2008
Published: 27 November 2008
The heterogeneity statistic I², interpreted as the percentage of variability due to heterogeneity between studies rather than sampling error, depends on precision, that is, the size of the studies included.
Based on a real meta-analysis, we simulate artificially 'inflating' the sample size under the random effects model. For a given inflation factor M = 1, 2, 3,... and for each trial i, we create an M-inflated trial by drawing a treatment effect estimate from the random effects model, using σ_i²/M as within-trial sampling variance, where σ_i² is the within-trial sampling variance of the original trial i.
As precision increases, while estimates of the heterogeneity variance τ² remain unchanged on average, estimates of I² increase rapidly to nearly 100%. A similar phenomenon is apparent in a sample of 157 meta-analyses.
When deciding whether or not to pool treatment estimates in a meta-analysis, the yardstick should be the clinical relevance of any heterogeneity present. τ², rather than I², is the appropriate measure for this purpose.
In meta-analysis, three principal sources of heterogeneity can be distinguished. These are (i) clinical baseline heterogeneity between patients from different studies, measured, e.g., in patient baseline characteristics and not necessarily reflected on the outcome measurement scale; (ii) statistical heterogeneity, quantified on the outcome measurement scale, that may or may not be clinically relevant and may or may not be statistically significant, and (iii) heterogeneity from other sources, e.g. design-related heterogeneity. In this article, we only deal with statistical heterogeneity. References [1–7] give an introduction to the large literature in this area. We do not discuss how to assess clinical baseline heterogeneity.
In this paper, we show that I² increases with the number of patients included in the studies in a meta-analysis. In the light of this, we argue that I² is in general of limited use in assessing clinically relevant heterogeneity.
The article is structured as follows. After introducing existing measures of heterogeneity in meta-analysis and discussing their properties, we illustrate the problem of interpreting the measure I² using an example from the literature. We then present a simulation study which explores the effect of sample size inflation on I², and finally conclude with a discussion.
Properties of measures of heterogeneity (table): whether each measure, including H, H² and R, R², depends on the number of studies in the meta-analysis and on precision (size of studies).
Q, which follows a χ² distribution with k - 1 degrees of freedom under the null hypothesis of homogeneity (H₀), is the weighted sum of squared differences between the study means and the fixed effect estimate. Its expected value increases with the number of studies, k, in the meta-analysis.
In contrast to Q, the statistic I² was introduced by Higgins and Thompson as a measure independent of k, the number of studies in the meta-analysis. I² is interpreted as the percentage of variability in the treatment estimates which is attributable to heterogeneity between studies rather than to sampling error.
τ² describes the underlying between-study variability. Its square root, τ, is measured in the same units as the outcome. Its estimates do not systematically increase with either the number, or size, of studies in a meta-analysis.
H² is a test statistic. It describes the relative difference between the observed Q and its expected value in the absence of heterogeneity. Thus it does not systematically increase with the number of studies. H corresponds to the residual standard deviation in a radial (Galbraith) plot. H = 1 indicates perfect homogeneity.
R² is the square of a statistic R which describes the inflation of the random effects confidence interval compared to that from the fixed effect model. It does not increase with k. R² = 1 indicates perfect homogeneity.
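The definitions above can be made concrete with a short sketch. It is not the authors' code: the function name and return format are illustrative assumptions, the formulas are the usual inverse-variance ones (Q as the weighted sum of squares, H² = Q/(k - 1), I² = (Q - (k - 1))/Q truncated at 0, and the DerSimonian-Laird moment estimate of τ²), and R is omitted for brevity.

```python
import math

def heterogeneity_measures(y, v):
    """Compute Q, H, I^2 and the DerSimonian-Laird estimate of tau^2.

    y: treatment effect estimates (e.g. log odds ratios), one per study
    v: within-study sampling variances, same length as y (requires k >= 2)
    The name and the returned dict are illustrative, not from the paper.
    """
    k = len(y)
    w = [1.0 / vi for vi in v]                          # inverse-variance weights
    sw = sum(w)
    mu_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw  # fixed effect estimate
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, y))
    df = k - 1
    h = math.sqrt(q / df)                               # not truncated at 1 here
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0       # truncated at 0
    c = sw - sum(wi ** 2 for wi in w) / sw              # DerSimonian-Laird scaling
    tau2 = max(0.0, (q - df) / c)                       # truncated at 0
    return {"Q": q, "H": h, "I2": i2, "tau2": tau2}
```

The within-study variances v enter Q, H and I² directly through the weights, whereas the τ² estimate divides the excess Q - (k - 1) by a factor that grows with those same weights; this is why only the former group responds systematically to precision.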
Notice that, in contrast to τ², the measures Q, I², H and R all depend on the precision, which is proportional to study size. Thus, given an underlying model, if the study sizes are enlarged, the confidence intervals become narrower and the heterogeneity, measured (say) using I², increases. This is reflected in the interpretation: as I² is the percentage of variability that is due to between-study heterogeneity, 1 - I² is the percentage of variability that is due to sampling error. When the studies become very large, the sampling error tends to 0 and I² tends to 1 (that is, 100%). Such heterogeneity may not be clinically relevant.
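The limiting behaviour can be illustrated with the representation I² = τ²/(τ² + σ²), where σ² is a 'typical' within-study sampling variance. The sketch below assumes that all studies share one such variance and that inflating study size by a factor M divides it by M; the function name and the value σ² = 0.1 are hypothetical, and τ² = 0.018 is used only as a plausible order of magnitude.

```python
def i2_typical(tau2, sigma2, m=1.0):
    """I^2 = tau^2 / (tau^2 + sigma^2 / M) for one 'typical' within-study variance.

    tau2:   between-study variance (held fixed)
    sigma2: typical within-study sampling variance at M = 1 (hypothetical value)
    m:      inflation factor applied to study size
    """
    return tau2 / (tau2 + sigma2 / m)

if __name__ == "__main__":
    # tau^2 stays fixed while the sampling variance shrinks with M.
    for m in (1, 4, 16, 64):
        print(m, round(100 * i2_typical(0.018, 0.1, m), 1))
```

With τ² held fixed, the ratio is driven entirely by σ²/M, so I² approaches 1 as M grows even though the between-study variability has not changed.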
We now explore this further using simulation. Note first that simply looking at the effect of scaling up all sample sizes by a common factor (leaving their treatment effects unchanged) is not appropriate. This is because if study sizes were truly to increase, estimates would approach the true value for each study and not be fixed at the original observed value. Instead, we simulate under the random effects model. Under this model, μ and τ² are assumed constant, and the total variance in study i is σ_i² + τ², which decreases with increasing study sample size, eventually tending to τ².
We generate an illustrative meta-analysis for each inflation factor. For each trial in each meta-analysis, we generate a random M-inflated trial by drawing a treatment effect estimate x_{M,i} from this model, using σ_i²/M as the within-trial sampling variance and the DerSimonian-Laird estimate for the heterogeneity parameter τ².
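The simulation scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the within-trial variances in SIGMA2 are made up (the 70-trial data set is not reproduced here), MU is assumed to be the log of the random effects odds ratio 0.732, TAU2 is the reported DerSimonian-Laird estimate 0.018, and averaging over replications is added here to show the behaviour 'on average' rather than for one illustrative draw. All function names are hypothetical.

```python
import math
import random

def dl_measures(y, v):
    """Return (I2, tau2): I^2 and the DerSimonian-Laird estimate of tau^2."""
    w = [1.0 / vi for vi in v]
    sw = sum(w)
    mu_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, y))
    df = len(y) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    c = sw - sum(wi ** 2 for wi in w) / sw
    return i2, max(0.0, (q - df) / c)

def inflated_meta_analysis(mu, tau2, sigma2, m, rng):
    """Draw one M-inflated meta-analysis under the random effects model."""
    v = [s2 / m for s2 in sigma2]                          # sigma_i^2 / M
    y = [rng.gauss(mu, math.sqrt(vi + tau2)) for vi in v]  # x_{M,i}
    return y, v

def average_measures(mu, tau2, sigma2, m, n_rep=200, seed=1):
    """Average I^2 and estimated tau^2 over n_rep simulated meta-analyses."""
    rng = random.Random(seed)
    i2_sum = tau2_sum = 0.0
    for _ in range(n_rep):
        y, v = inflated_meta_analysis(mu, tau2, sigma2, m, rng)
        i2, t2 = dl_measures(y, v)
        i2_sum += i2
        tau2_sum += t2
    return i2_sum / n_rep, tau2_sum / n_rep

# Illustrative inputs; SIGMA2 is hypothetical, not the trial data.
MU = math.log(0.732)                          # assumed pooled log odds ratio
TAU2 = 0.018                                  # reported DerSimonian-Laird estimate
SIGMA2 = [0.02, 0.05, 0.1, 0.2, 0.4] * 14     # 70 made-up within-trial variances

if __name__ == "__main__":
    for m in (1, 4, 16, 64):
        i2, t2 = average_measures(MU, TAU2, SIGMA2, m)
        print(f"M = {m:2d}: mean I2 = {100 * i2:.1f}%, mean tau2 = {t2:.4f}")
```

Under these assumptions the averaged τ² estimates should stay of the same order across M, while the averaged I² climbs towards 100%.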
We use data from a large meta-analysis (of 70 trials) that estimated the effect of thrombolytic therapy in acute myocardial infarction. The original analysis using the fixed effect model (Mantel-Haenszel method) gives an odds ratio of 0.747 with a 95% confidence interval (95% CI) of [0.705; 0.792]. Using the random effects model, the odds ratio is 0.732, 95% CI [0.664; 0.808]. The DerSimonian-Laird estimate of τ² is 0.018 (H = 1.11, 95% CI [1; 1.29], I² = 18.6%, 95% CI [0%; 40.1%]). With Q = 85 (p = 0.0953), there is no statistically significant evidence of heterogeneity.
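As a quick consistency check, H and I² can be recomputed from the reported Q and the number of trials, using H² = Q/(k - 1) and I² = (Q - (k - 1))/Q. The function name is an illustrative assumption, and Q = 85 is the rounded value given in the text.

```python
import math

def h_and_i2_from_q(q, k):
    """Return (H, I2) from the heterogeneity statistic Q and k studies."""
    df = k - 1
    h = math.sqrt(q / df)
    i2 = max(0.0, (q - df) / q)
    return h, i2

h, i2 = h_and_i2_from_q(85.0, 70)
print(f"H = {h:.2f}, I2 = {100 * i2:.1f}%")   # prints "H = 1.11, I2 = 18.8%"
```

H agrees with the reported 1.11. The small gap between 18.8% and the reported 18.6% is presumably due to Q being rounded to 85 in the text.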
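Effect of increasing within-trial precision (factor M) on heterogeneity measures. Each row gives I² and H with their 95% confidence intervals; the first row corresponds to the original analysis.
I² = 18.6% [0%; 40.1%]; H = 1.11 [1; 1.29]
I² = 29.2% [4.5%; 47.6%]; H = 1.19 [1.02; 1.38]
I² = 84.8% [81.4%; 87.5%]; H = 2.56 [2.32; 2.83]
I² = 96.0% [95.4%; 96.5%]; H = 4.98 [4.65; 5.32]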
In order to examine the behavior and the order of magnitude of I 2 empirically, we further looked at a sample of 157 meta-analyses with binary endpoints. This data set was kindly provided by Peter Jüni . We calculated τ 2 and I 2 for each meta-analysis. Further, for each meta-analysis, we calculated the median study size of the contributing studies, denoted n i , i = 1,..., 157. After excluding all meta-analyses with both τ 2 = I 2 = 0 (n = 58), we fitted a linear model to the remaining 99 meta-analyses with I 2 as outcome and and log n i as covariates (thus implicitly assuming a log-normal distribution for study size).
Light, grey and black dots and regression lines correspond to the first, second and third tercile of the distribution of τ². Within each class of meta-analyses, I² is increasing with median study size.
The main advantage of the statistic I² is that it does not depend on the number of studies in a meta-analysis. Thus, using I² instead of Q, it is possible to compare the statistical heterogeneity of meta-analyses with different numbers of studies. Also, I² is easily interpreted by clinicians as the percentage of variability in the treatment estimates which is attributable to heterogeneity between studies rather than to sampling error.
However, an immediate (but often overlooked) consequence of this interpretation is that I² increases with the number of patients included in the studies in a meta-analysis. In a recent simulation using continuous outcomes, others found empirically that I² increased with increasing numbers of patients per trial though τ² was kept fixed. Unfortunately, as demonstrated by a recent empirical study, reviewers seem to be unaware of this when they use I² to decide whether to pool studies in a meta-analysis. Some authors also seem to be reluctant to call I² a statistic, using instead words such as metric, index, or even point estimate [17, 18, 20]. On the other hand, the term 'statistical test' is used in connection with I² in one of these references (p. 915). In another reference, the authors proposed an algorithm for a sensitivity analysis that successively excludes 'outlying' trials until I² falls below a prespecified level. In response to this, Higgins showed that the exclusion of a large trial with its effect close to the pooled estimate can be the most efficient way to reduce I².
Our simulation highlights the problem of interpreting heterogeneity measured by I² as clinical heterogeneity. This is analogous to interpreting statistically significant effects (P < 0.05) as clinically relevant. In our view the decision on whether or not to pool studies in a meta-analysis should not be based solely on I². Instead, studies with relatively large I² may usefully be pooled when the clinically relevant heterogeneity (in efficacy and covariates) is acceptably small.
Further, as τ is measured on the same scale as the outcome, it can be directly used to quantify variability. Indeed, clinically meaningful heterogeneity on the outcome scale could be pre-specified. Thus, in advance a reviewer may decide that three studies with odds ratios of 0.8, 1 and 1.25 cannot be pooled; in other words the relative effect ratios of 0.8 = 1/1.25 are too great. This corresponds to a standard deviation τ_0 = - log 0.8 = log 1.25 ≈ 0.22 on the log scale and thus a threshold of τ_0² ≈ 0.05 for the heterogeneity variance τ².
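The arithmetic behind this threshold can be checked in a few lines. The sketch assumes that the standard deviation meant here is the sample standard deviation (denominator n - 1) of the three log odds ratios, which reproduces the stated value; the variable names are illustrative.

```python
import math
import statistics

# The three odds ratios a reviewer might consider too different to pool.
odds_ratios = [0.8, 1.0, 1.25]
log_or = [math.log(x) for x in odds_ratios]

# Sample standard deviation (n - 1 denominator) on the log scale.
tau0 = statistics.stdev(log_or)
tau0_sq = tau0 ** 2

print(round(tau0, 2), round(tau0_sq, 2))   # prints "0.22 0.05"
```

Because the log odds ratios are symmetric around 0, the sample standard deviation equals log 1.25 exactly, and its square gives the variance threshold of about 0.05.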
Ranges for interpretation of I² following the Cochrane Handbook for Systematic Reviews of Interventions (Version 5.0.1).
0% to 40%: might not be important
30% to 60%: may represent moderate heterogeneity
50% to 90%: may represent substantial heterogeneity
75% to 100%: considerable heterogeneity
We believe the interpretation issues stem from the concept of I² as 'the proportion of variance (un)explained', referred to as 'widely familiar' to clinicians by Higgins and Thompson (Section 4). However, there is a fundamental difference between the interpretation of the coefficient of determination in regression analysis, which is subconsciously invoked by this phrase, and that of I². On the one hand, the coefficient of determination R² (that is, the square of the correlation coefficient) is a measure of the association between the dependent and the independent variable, which homes in on the true value as the sample size increases. On the other hand, I² tends to 100% as the number of patients increases. Although one may argue that the 'unit' corresponding to the 'observation' in a regression is the study, not the patient, this link is only strictly valid if the sample sizes of new studies are distributed similarly to those of existing studies. This is not universally true. Often small trials are followed by larger ones. Thus I² will tend to increase artificially as evidence accumulates.
To address this, more weight should be given to often overlooked comments by Higgins and Thompson (p. 1545), who state 'Note that we do not propose that our measure should be independent of the precisions of estimates observed in the studies. Thus sets of studies with identical heterogeneity τ², but with different degrees of sampling error σ², will produce different measures.... Describing the underlying between-study variability ... can best be achieved simply by estimating the between-study variance, τ².'
When deciding whether or not to pool treatment estimates in a meta-analysis, the yardstick should be the clinical relevance of any heterogeneity present. τ², rather than I², is the appropriate measure for this purpose.
GR and JC are funded by Deutsche Forschungsgemeinschaft (FOR 534 Schw 821/2-2). The authors wish to thank Peter Jüni for providing data and all reviewers and Douglas G Altman for helpful discussion.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.