### Tests for heterogeneity

Several tests have been developed to assess heterogeneity. The so-called Cochran's Q (or Cochran's χ^{2} test) weights the observed variation in treatment effects by the inverse of the variation in each study [5]. A large value of Q indicates large differences between studies, and hence the effects from the included studies can be considered heterogeneous [2]. A modification of Cochran's Q is the measure I^{2}, the ratio of the variation that exceeds chance variation to the total variation in the treatment effects. Possible values for I^{2} range from 0% to 100%, with a high value indicating much heterogeneity. Both Q and I^{2} are standardized measures, meaning that they do not depend on the metric of the effect size. A third measure of heterogeneity, indicating the variance of the true effect sizes, is T^{2}, where (similar to Q and I^{2}) large values of T^{2} indicate heterogeneity. This method of estimating the variance between studies (T^{2}) is also known as the method of moments, or the DerSimonian and Laird method [6]. A fourth measure is the prediction interval, which indicates the distribution of true effect sizes and is based on T^{2} [2]. Cochran's Q is sensitive to the number of studies: especially when the number of studies included in a meta-analysis is small, Cochran's Q too often leads to false-positive conclusions (too large a type I error) [7]. The modification I^{2} takes the number of included studies into account and has a correct type I error probability [3]. The measure T^{2} is likewise insensitive to the number of studies, but sensitive to the metric of the effect size [2].
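These quantities can be computed directly from study-level effect estimates and their within-study variances. A minimal sketch in Python, using hypothetical log risk ratios and variances (all numbers are illustrative, not from any real meta-analysis):

```python
# Cochran's Q, I^2, and the DerSimonian-Laird estimate of T^2,
# computed from hypothetical study-level data.
effects = [0.4, -0.3, 0.1, -0.5, 0.2]       # e.g., log risk ratios
variances = [0.04, 0.05, 0.03, 0.06, 0.04]  # within-study variances

weights = [1 / v for v in variances]        # inverse-variance weights
pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)

# Q: observed variation weighted by the inverse of each study's variance
Q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
df = len(effects) - 1

# I^2: share of total variation exceeding chance variation (floored at 0)
I2 = max(0.0, (Q - df) / Q) if Q > 0 else 0.0

# DerSimonian-Laird (method of moments) estimate of T^2
C = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
T2 = max(0.0, (Q - df) / C)

print(f"Q = {Q:.2f}, I^2 = {I2:.0%}, T^2 = {T2:.4f}")
```

Under the null hypothesis of homogeneity, Q follows approximately a χ^{2} distribution with k − 1 degrees of freedom, which is how the p-value for the test is obtained.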

Currently, I^{2} appears to be used routinely in most published meta-analyses. Interestingly, the observed amount of heterogeneity depends on the effect measure that is considered in a meta-analysis: little heterogeneity when considering odds ratios implies large heterogeneity when considering risk differences and vice versa [8]. The reason for this is analogous to effect measure modification in a single study: if odds ratios are the same between strata (e.g., age categories) of a single study, risk differences are likely to differ between strata.
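The dependence on the effect measure can be illustrated with a single stratified calculation: two strata sharing the same odds ratio yield different risk differences at different baseline risks (the numbers below are purely illustrative):

```python
# Identical odds ratios at different baseline risks imply
# different risk differences (illustrative numbers).
odds_ratio = 2.0
risk_differences = []
for baseline_risk in (0.10, 0.50):
    baseline_odds = baseline_risk / (1 - baseline_risk)
    treated_odds = odds_ratio * baseline_odds
    treated_risk = treated_odds / (1 + treated_odds)
    risk_differences.append(treated_risk - baseline_risk)

print([round(rd, 3) for rd in risk_differences])
```

A constant odds ratio of 2.0 thus corresponds to a risk difference of about 0.08 at a 10% baseline risk but about 0.17 at a 50% baseline risk, so a set of studies homogeneous on one scale can be heterogeneous on the other.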

### Consequences of heterogeneity

Tests for heterogeneity indicate whether the variation in observed effects is large or small. When heterogeneity is low (non-significant) for the chosen effect measure, variation between effects from different studies is (relatively) small. Thus, a fixed effects model can be used to synthesize the data, since the assumption underlying a fixed effects model is that the treatment effect is the same in each study and variation between studies is due to sampling (i.e., chance) [3, 7]. If variation in the effects found in the different studies is (relatively) large, the effects could be considered as sampled from a distribution of effects; i.e., the true treatment effect that is estimated in the different studies is not a single value, but rather a distribution of effects. In that case, a random effects model has been recommended [3, 7]. It has also been suggested that heterogeneity is inevitable in meta-analysis [9], and that random effects models are therefore obligatory. If, however, heterogeneity is (very) large, one could even consider not pooling the results from different studies at all, since the studies are likely to be (very) different [2]. Furthermore, if there is a cause for heterogeneity, for example a subgroup effect, neither fixed nor random effects models take such relations between the effect size and subgroups into account. Another explanation for heterogeneity (other than differential treatment effects) could be systematic error in the included studies, for example error related to the proportion of women included, or to differences in methodology (e.g., differences in outcome ascertainment) between the included studies [10].
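The difference between the two models can be sketched as a re-weighting: a fixed effects model weights each study by its inverse variance, while a random effects model adds the between-study variance T^{2} to every study's variance before weighting. A minimal sketch with hypothetical inputs, computing the DerSimonian-Laird T^{2} inline:

```python
# Fixed effects vs random effects pooling (hypothetical data).
effects = [0.4, -0.3, 0.1, -0.5, 0.2]
variances = [0.04, 0.05, 0.03, 0.06, 0.04]

# Fixed effects: weight = 1 / within-study variance
fw = [1 / v for v in variances]
fixed_pooled = sum(w * y for w, y in zip(fw, effects)) / sum(fw)

# DerSimonian-Laird estimate of the between-study variance T^2
Q = sum(w * (y - fixed_pooled) ** 2 for w, y in zip(fw, effects))
C = sum(fw) - sum(w ** 2 for w in fw) / sum(fw)
T2 = max(0.0, (Q - (len(effects) - 1)) / C)

# Random effects: weight = 1 / (within-study variance + T^2)
rw = [1 / (v + T2) for v in variances]
random_pooled = sum(w * y for w, y in zip(rw, effects)) / sum(rw)

print(f"fixed = {fixed_pooled:.4f}, random = {random_pooled:.4f}")
```

Adding T^{2} makes the weights more nearly equal, so smaller studies gain influence relative to the fixed effects analysis, and the confidence interval of the pooled estimate widens.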

### Relevant subgroup effects

Tests for heterogeneity do not indicate possible causes of heterogeneity. In fact, testing for heterogeneity in two meta-analyses - one with a clear cause for heterogeneity (e.g., a subgroup effect), and the other without - can lead to the same conclusions with respect to heterogeneity. For example, consider a hypothetical meta-analysis of five randomized trials on the effects of some treatment. Each trial consisted of 200 subjects, randomized to either treatment or placebo, and the baseline risk for the outcome was 50%. The effects and their 95% confidence intervals are shown in Figure 1. Testing for heterogeneity indicated that these effects could not be considered heterogeneous (Q = 3.2, p = 0.52; I^{2} = 0%; T^{2} = 0). A closer look at the individual trials revealed that the proportion of women included in the studies differed considerably: from 0% to 100%. Rearranging the order of the effects by the proportion of women included in each study resulted in Figure 2. The tests for heterogeneity reached exactly the same conclusions, since the ordering of the observed treatment effects is not taken into account when testing for heterogeneity. Clearly, in the modified forest plot (Figure 2) the data show a certain pattern, which may indicate a differential treatment effect among men and women, i.e., modification of the treatment effect by sex. In fact, the treatment was effective in women (RR = 0.7) but not in men (RR = 1.0), and when analyzing the individual patient data (i.e., fitting a regression model to the individual patient data rather than the aggregated data, including a factor to account for differences between trials), a statistically significant subgroup effect was indeed found (p-value for interaction 0.011).
Hence, in the aggregated data the differential treatment effect by sex was not indicated by tests for heterogeneity, but only suggested by the modified forest plot, whereas in the individual patient data this differential effect was clearly observed and statistically significant. Whether this subgroup effect is clinically relevant is rather subjective, but we can conclude that tests for heterogeneity on aggregated data do not tell the whole story about heterogeneity at the individual patient level.
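The insensitivity to ordering can be verified directly: Cochran's Q returns the same value for any permutation of the studies, because it only compares each effect with the pooled effect. A small sketch with hypothetical log risk ratios following the pattern described above (RR running from 1.0 to 0.7):

```python
def cochran_q(effects, variances):
    """Cochran's Q for study-level effects and within-study variances."""
    w = [1 / v for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    return sum(wi * (yi - pooled) ** 2 for wi, yi in zip(w, effects))

# Hypothetical log RRs, ordered by proportion of women (0% to 100%)
effects = [0.0, -0.078, -0.163, -0.255, -0.357]
variances = [0.02] * 5

q_original = cochran_q(effects, variances)
q_reordered = cochran_q(effects[::-1], variances[::-1])
print(q_original, q_reordered)  # identical up to floating-point rounding
```

Any statistic that discards the pairing between effect size and the subgrouping variable behaves the same way, which is why the pattern only becomes visible once the studies are plotted against that variable.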

Note that a regular forest plot (Figure 1) contains only a horizontal axis (indicating the effect size), whereas the modified forest plot (Figure 2) contains two axes. In both plots the horizontal axis indicates the effect size. The additional vertical axis in the modified forest plot indicates the proportion of a certain subgroup variable in the included studies. Importantly, the vertical axis does not simply indicate the order of the subgrouping variable, but also scales it.

A commonly used quantitative approach to investigate the association between a certain subgroup characteristic and the size of the treatment effect is meta-regression analysis [11]. Such analyses, however, rely on several assumptions, e.g., linearity of the association, and can be hard to interpret because of their quantitative nature. Furthermore, in ordinary meta-regression analysis the treatment effects from the included studies are treated as true values rather than estimates, which can result in bias when using least squares regression [4]. In addition, aggregated data meta-(regression) analyses are inappropriate for estimating unbiased treatment effects in patient subgroups, since such comparisons are observational by nature. As a result, the observed subgroup effect may be attributable to variables other than the subgrouping variable [12]. Furthermore, as indicated before, neither fixed nor random effects models address the cause of heterogeneity. Individual patient data meta-analysis can be a valid alternative to study subgroup effects [12]. In conclusion, the modified forest plot is a qualitative, visual alternative for assessing the potential for a clinically relevant subgroup effect.
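In its simplest form, such a meta-regression is a weighted least squares regression of the study-level effect sizes on a study-level covariate, with weights equal to each study's inverse variance. A minimal sketch with hypothetical data (the caveat above applies: the effects enter as if they were known values, and the covariate comparison is observational):

```python
# Weighted least squares meta-regression (simple sketch, hypothetical data).
x = [0.0, 0.25, 0.50, 0.75, 1.0]            # proportion of women per study
y = [0.0, -0.078, -0.163, -0.255, -0.357]   # log risk ratios
w = [1 / 0.02] * 5                          # inverse-variance weights

sw = sum(w)
xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
slope = sxy / sxx                 # change in log RR per unit of covariate
intercept = ybar - slope * xbar

print(f"slope = {slope:.4f}, intercept = {intercept:.4f}")
```

A negative slope here would suggest a stronger treatment effect in studies with more women, but, as noted above, such a study-level association cannot be read as an unbiased patient-level subgroup effect.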