Dealing with substantial heterogeneity in Cochrane reviews. Cross-sectional study
© Schroll et al; licensee BioMed Central Ltd. 2011
Received: 18 August 2010
Accepted: 24 February 2011
Published: 24 February 2011
Dealing with heterogeneity in meta-analyses is often tricky, and there is only limited advice for authors on what to do. We investigated how authors addressed different degrees of heterogeneity, in particular whether they used a fixed effect model, which assumes that all the included studies are estimating the same true effect, or a random effects model where this is not assumed.
We randomly sampled 60 Cochrane reviews from 2008 that presented a result with substantial heterogeneity in their first meta-analysis (I2 greater than 50%, i.e. more than 50% of the variation is due to heterogeneity rather than chance). We extracted information on choice of statistical model, how the authors had handled the heterogeneity, and assessed the methodological quality of the reviews in relation to this.
The distribution of heterogeneity was rather uniform in the whole I2 interval, 50-100%. A fixed effect model was used in 33 reviews (55%), but there was no correlation between I2 and choice of model (P = 0.79). We considered that 20 reviews (33%), 16 of which had used a fixed effect model, had major problems. The most common problems were: use of a fixed effect model and lack of rationale for choice of that model, lack of comment on even severe heterogeneity and of reservations and explanations of its likely causes. The problematic reviews had significantly fewer included trials than other reviews (4.3 vs. 8.0, P = 0.024). The problems became less pronounced with time, as those reviews that were most recently updated more often used a random effects model.
One-third of Cochrane reviews with substantial heterogeneity had major problems in relation to their handling of heterogeneity. More attention is needed to this issue, as the problems we identified can be essential for the conclusions of the reviews.
Variability among individual study results in systematic reviews virtually always occurs. This is caused partly by random error (chance) and partly by systematic differences between the trials. The variation in the true effects is called heterogeneity. Its impact on meta-analyses can be assessed by I2, which describes the percentage of the variability that is due to heterogeneity [1, 2]. Values greater than 50% are - rather arbitrarily - considered substantial heterogeneity.
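As a minimal sketch of how I2 can be obtained, following the definition in [1]: Cochran's Q is the weighted sum of squared deviations of the study estimates from the fixed effect pooled estimate, and I2 = max(0, (Q - df)/Q) x 100%, where df is the number of studies minus one. The effect estimates and standard errors below are hypothetical and only illustrate the arithmetic.

```python
def cochran_q(effects, std_errors):
    """Return Cochran's Q and its degrees of freedom (k - 1)."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    return q, len(effects) - 1


def i_squared(effects, std_errors):
    """Return I2 as a percentage, truncated at zero."""
    q, df = cochran_q(effects, std_errors)
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


# Hypothetical log odds ratios and standard errors from four trials.
effects = [-0.60, -0.10, 0.25, -0.85]
std_errors = [0.20, 0.25, 0.30, 0.35]
print(round(i_squared(effects, std_errors), 1))
```

With these made-up numbers I2 lies above the 50% threshold, so the meta-analysis would count as substantially heterogeneous in the sense used here.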
Strategies for addressing heterogeneity in systematic reviews include checking that the data extracted from the trial reports are correct, which may often not be the case; omitting meta-analysis; conducting subgroup analysis or meta-regression; choosing a fixed effect or a random effects model; changing the statistical metric, e.g. from a risk difference to a relative risk [4, 5]; and excluding studies.
The fixed effect model assumes that all the included studies are estimating the same true effect. The variation in findings among studies is therefore due to chance. Each study will be assigned a weight depending on the study's precision (within-trial variance) and an overall estimate can be calculated. Small studies will contribute relatively little to the outcome because they have less precision.
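The weighting described above can be sketched as inverse-variance pooling, in which each study's weight is the reciprocal of its within-trial variance. The function name and the data are hypothetical; the data were chosen so that one large, precise trial sits next to two small ones.

```python
import math


def fixed_effect(effects, std_errors, z=1.96):
    """Inverse-variance fixed effect pooled estimate with a 95% CI."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / total
    se_pooled = math.sqrt(1.0 / total)
    shares = [w / total for w in weights]  # relative weight of each study
    return pooled, (pooled - z * se_pooled, pooled + z * se_pooled), shares


# Hypothetical data: one large, precise trial and two small ones.
effects = [-0.30, -0.80, 0.10]
std_errors = [0.10, 0.40, 0.50]
pooled, ci, shares = fixed_effect(effects, std_errors)
print(round(pooled, 3), [round(s, 3) for s in shares])
```

In this example the large trial receives roughly nine-tenths of the total weight, so the pooled estimate stays close to its result.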
The random effects model assumes that the effects being estimated in the different studies are not identical, but follow a distribution. The confidence interval takes account of the additional uncertainty in the location of the mean of the systematically different effects in the different studies (this between-trial variance is added to the within-trial variance). Small studies will therefore contribute more to the average than in a fixed effect analysis, which is reasonable because the studies represent different true effects. Thus, when heterogeneity is present, the confidence interval around a random effects pooled estimate is wider than a confidence interval around a fixed effect pooled estimate.
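A minimal sketch of a random effects analysis follows. The text does not name an estimator for the between-trial variance; the DerSimonian-Laird moment estimator is assumed here as one common choice, and the data are hypothetical.

```python
import math


def random_effects_dl(effects, std_errors, z=1.96):
    """DerSimonian-Laird random effects estimate, 95% CI, tau^2 and weight shares."""
    w = [1.0 / se ** 2 for se in std_errors]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)  # between-trial variance, truncated at zero
    # The between-trial variance is added to each within-trial variance.
    w_star = [1.0 / (se ** 2 + tau2) for se in std_errors]
    sw_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sw_star
    se_pooled = math.sqrt(1.0 / sw_star)
    shares = [wi / sw_star for wi in w_star]
    return pooled, (pooled - z * se_pooled, pooled + z * se_pooled), tau2, shares


# Hypothetical data: one precise trial and two small trials with diverging results.
effects = [-0.10, -0.90, 0.60]
std_errors = [0.10, 0.30, 0.35]
pooled, ci, tau2, shares = random_effects_dl(effects, std_errors)
print(round(pooled, 3), [round(v, 3) for v in ci], [round(s, 3) for s in shares])
```

Because the same between-trial variance is added to every study, the weights become more equal: the two small trials carry far more of the total weight than they would in a fixed effect analysis, and the confidence interval widens.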
Dealing with heterogeneity is often tricky, and there is only limited advice for authors on what to do, e.g. on when a particular model should be chosen over the other, or when the heterogeneity becomes too large for a meaningful meta-analysis.
An additional complexity is that the test for detecting heterogeneity has low power when the sample sizes are small or when few trials are included. For example, 11 trials give just 10 degrees of freedom, like a t-test on two groups of 6 people each does. There is also variation in practice as to which P-value demonstrates significant heterogeneity, but as the power of the test is so low, it is common to choose P = 0.10. It is important to be aware, however, that the choice of statistical model should not be based on the outcome of a test of heterogeneity.
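The low power with few trials can be illustrated by simulation. The sketch below makes several simplifying assumptions that are not taken from the text: all trials have the same within-trial standard error, true effects are normally distributed with a between-trial standard deviation `tau`, and the critical value at the 0.10 level is estimated empirically from simulations under homogeneity rather than read from chi-square tables.

```python
import random


def cochran_q(effects, std_errors):
    w = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    return sum(wi * (yi - pooled) ** 2 for wi, yi in zip(w, effects))


def simulate_q(k, tau, se, n_sim, rng):
    """Simulate Q for k trials with between-trial SD tau and within-trial SE se."""
    total_sd = (tau ** 2 + se ** 2) ** 0.5
    return [
        cochran_q([rng.gauss(0.0, total_sd) for _ in range(k)], [se] * k)
        for _ in range(n_sim)
    ]


def q_test_power(k, tau, se=0.3, alpha=0.10, n_sim=4000, seed=1):
    """Share of heterogeneous meta-analyses flagged by the Q test at level alpha."""
    rng = random.Random(seed)
    null_q = sorted(simulate_q(k, 0.0, se, n_sim, rng))
    critical = null_q[int((1.0 - alpha) * n_sim)]  # empirical critical value
    alt_q = simulate_q(k, tau, se, n_sim, rng)
    return sum(q > critical for q in alt_q) / n_sim


for k in (5, 11, 30):
    print(k, q_test_power(k, tau=0.2))
```

Under these assumptions the same amount of true heterogeneity is detected much less often with 5 trials than with 30, which is why a non-significant test result says little when few trials are included.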
The aim of our study was to investigate how authors address different degrees of substantial heterogeneity in meta-analyses in Cochrane reviews.
We listed all Cochrane reviews from the Cochrane Database of Systematic Reviews 2008, Issue 1, which had at least one meta-analysis and where the first outcome in the first comparison involved all studies ('total'), and not only subgroups of studies ('subtotals'). We assumed that in most cases the first outcome in the first comparison would be the primary outcome and that in the remainder, it would still be important for the review.
Because of the relatively smooth distribution, we randomly selected 60 reviews with an I2 of more than 50% for our study, using the random numbers generator in Excel. After having assessed the 60 reviews, it was clear that we had enough information to elucidate how authors address different degrees of substantial heterogeneity.
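We used Excel for the random selection; an equivalent draw can be sketched in Python as sampling without replacement from the list of eligible reviews. The identifiers, the number of eligible reviews and the seed below are hypothetical placeholders.

```python
import random

# Hypothetical identifiers for eligible reviews (I2 above 50%);
# the real list is not reproduced here.
eligible_reviews = [f"review_{i:04d}" for i in range(1, 501)]

rng = random.Random(2008)  # fixed seed so the draw can be reproduced
selected = rng.sample(eligible_reviews, 60)  # sampling without replacement
print(len(selected), len(set(selected)))
```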
For every review, one observer (JBS) copied the relevant data into an Excel spreadsheet and a second observer (PCG) checked the data. Disagreements were few and were resolved by discussion. The extracted data were: i) The selected statistical model (random or fixed); ii) Any rationale for choosing the model; iii) The critical value for considering heterogeneity statistically significant; iv) Reservations about the results in relation to choice of model and comments on the heterogeneity; v) Attempts at explaining the heterogeneity narratively, e.g. different doses, populations, length of follow-up or quality of the included studies; vi) Attempts at addressing the heterogeneity statistically, e.g. by division of studies in subgroups, test for interaction, sensitivity analysis with omission of some studies, or meta-regression; this information was extracted from the Results section and in some cases directly from the graphs; vii) Point estimate and its P-value; the point estimate was also calculated with the alternative effect model, using the built-in facility for this in The Cochrane Library; viii) The P-value for the chi-square test for heterogeneity.
We assessed the overall methodological quality of the review based on whether the above points were addressed at all, focusing on whether there were major problems in the handling and interpretation of heterogeneity. We decided a priori that using a random effects model was a reasonable way of addressing substantial heterogeneity (unless there were special circumstances, as discussed below), and our assessments therefore focused mostly on those reviews where the authors had used a fixed effect model or where only one of the two models yielded a statistically significant estimate. We strove to be conservative in our judgments. If, for example, the authors had used a fixed effect model and gave the result of the heterogeneity test in the Results section, we interpreted this as a reservation about the result in relation to choice of model even if the authors provided no comments. Similarly, when a random effects model was chosen we interpreted this as a reservation.
We investigated whether the choice of model depended on the degree of heterogeneity, or on the P-value for the heterogeneity test. We did this because the reviews were produced at different times. Before the I2 was developed, authors often relied on the P-value to identify heterogeneity.
Choice of model in relation to degree of heterogeneity
Choice of model in relation to the P-value for the heterogeneity test.
Significant effects in relation to choice of model
Results using the authors' model and the alternative model we applied
For 2 of the 6 reviews [9, 10] where a significant result became non-significant when we used a random effects model, the authors were cautious about their heterogeneous result and did not base their conclusion on the significant finding they had obtained with a fixed effect model, which we consider a correct approach. One review was explicit about this: "Substantial heterogeneity was also detected (p = 0.03, I2 = 79%). Because of this, the result of this analysis should be interpreted with caution and not be considered a definitive statement".
The authors of the other 4 reviews were less cautious. One review calculated mean differences instead of standardized mean differences, although the outcomes were measured on very different scales. Because of this error, both the means and the standard deviations differed by a factor of 10. This resulted in extreme heterogeneity (I2 = 93%, P = 0.0002) despite very low power, as only two studies were included. In the methods section, the authors promised to use a random effects model in case of heterogeneity, but this was not done (and would not have solved the other problem).
In another review, the authors calculated the standardized mean difference both with a fixed effect model (1.07, 95% confidence interval 0.43 to 1.70) and a random effects model (1.74, -0.71 to 4.19). In the methods section, they stated they would use a random effects model if heterogeneity was present, which it was (I2 = 89%, P = 0.002). With a random effects model, the result was not significant. They wrote that no definite conclusion could be made but added that there was reasonable evidence that cognitive therapy was beneficial in treating depression. We find this conclusion doubtful, given the data and their declared methods. In this example, the effect estimates calculated by the two models differ substantially because of one small outlying study. Hence, the choice of model should have been considered and explained in detail.
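The pattern in this example, where a small outlying study shifts the random effects estimate and widens its confidence interval until it includes zero, can be reproduced with made-up numbers. The data below are hypothetical and are not the trial data from the review; the between-trial variance is again estimated with the DerSimonian-Laird method, which is an assumption.

```python
import math


def pool(effects, std_errors, tau2=0.0, z=1.96):
    """Inverse-variance pooled estimate and 95% CI; tau2 = 0 gives the fixed effect model."""
    w = [1.0 / (se ** 2 + tau2) for se in std_errors]
    est = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    se = math.sqrt(1.0 / sum(w))
    return est, est - z * se, est + z * se


def dl_tau2(effects, std_errors):
    """DerSimonian-Laird estimate of the between-trial variance."""
    w = [1.0 / se ** 2 for se in std_errors]
    fixed, _, _ = pool(effects, std_errors)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    return max(0.0, (q - (len(effects) - 1)) / c)


# Hypothetical standardized mean differences: a larger trial and a small outlying trial.
effects = [0.80, 3.20]
std_errors = [0.35, 0.90]

fixed = pool(effects, std_errors)
random_ = pool(effects, std_errors, tau2=dl_tau2(effects, std_errors))
print("fixed  ", [round(v, 2) for v in fixed])
print("random ", [round(v, 2) for v in random_])
```

With these numbers the fixed effect interval excludes zero, whereas the random effects estimate is larger but its interval includes zero, which is the same qualitative discrepancy as in the review discussed above.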
Another review used Peto's odds ratio (0.28, 0.11 to 0.73). Significant heterogeneity was present (I2 = 59%, P = 0.05), and when using the ordinary odds ratio and a random effects model, the result became 0.28 (0.05 to 1.55). The authors concluded in the abstract that albumin showed a clear benefit at preventing severe ovarian hyperstimulation syndrome, although they were much more cautious in the main text.
The authors of the last review reported that the P-value for heterogeneity was not significant even though it was 0.07 and the power of the heterogeneity test was very low, as there were only 5 studies. They reported less mortality in the intervention group, relative risk 0.86 (0.74 to 1.00). With a random effects model the relative risk became 0.82 (0.57 to 1.18), with P = 0.30. The authors mentioned that heterogeneity was present and noted that one outlying trial had a very low mortality in the control group. The meta-analysis was driven by a big trial, which comprised 69% of the deaths and showed the same result as the pooled result, 0.86 (0.75 to 1.00). Even so, we find it pretty bold that the authors believed in a result with borderline significance (P = 0.05), and only when using the fixed effect model, with so much heterogeneity, and with unexplained discrepancies between the results of the trials.
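As a rough check on such figures, a two-sided P-value can be approximated from a relative risk and its 95% confidence interval by working on the log scale and assuming normality. This is only a sketch: the function name is hypothetical, and because the published limits are rounded, the recovered values only approximate the reported P = 0.05 and P = 0.30.

```python
import math


def p_from_ratio_ci(estimate, lower, upper, z_crit=1.96):
    """Approximate two-sided P-value for a ratio measure from its 95% CI (log scale)."""
    se = (math.log(upper) - math.log(lower)) / (2.0 * z_crit)
    z = math.log(estimate) / se
    return math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area


print(round(p_from_ratio_ci(0.86, 0.74, 1.00), 3))  # fixed effect result quoted above
print(round(p_from_ratio_ci(0.82, 0.57, 1.18), 3))  # random effects result quoted above
```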
Cautions about the heterogeneity
Our assessments of the 60 reviews in relation to their handling of heterogeneity. Items assessed: overall acceptable methodological quality; rationale given for choice of model; valid reservations against results; explanation of causes of heterogeneity; heterogeneity addressed statistically in the analysis.
Five other reviews used Peto's odds ratio [23, 24, 13, 18, 19], and three reviews did not follow the analysis plan that was set out in the methods section [12, 25, 26], which included using a random effects model or omitting meta-analysis in case of heterogeneity, and there was no explanation why. Three other reviews paid no attention to the heterogeneity and did not discuss it, even though the P-value was between 0.05 and 0.10 [27, 14, 28]. An additional review described the heterogeneity (I2 = 71%, P = 0.06) but ignored it due to "lack of stability of the known tests", which is not a valid reason for ignoring heterogeneity. In another review, the authors divided the analysis into subgroups because they had found heterogeneity, but although the consequence was that the chi-square test for heterogeneity was no longer significant due to loss of power, the I2 actually increased, which the authors failed to comment on. In yet another review, which was discussed above, the authors pooled two risk scores measured on different scales that varied by a factor of ten. The complete list of included reviews can be found on the web: http://sites.google.com/site/dealingwithheterogeneity/.
It is also important to consider that the fixed effect model only allows an inference about the studies included in the meta-analysis, whereas the random effects model allows an inference about the mean effect in a hypothetical population of studies if we can assume that the studies included in the meta-analysis constitute a random selection of studies from this hypothetical population.
The random effects model is more conservative than the fixed effect model in the sense that the confidence interval is broader, but sometimes the point estimate is farther from the null and the P value for the pooled effect smaller than with a fixed effect model.
When using a random effects model, the between-study variance needs to be calculated, but if there are few studies, this cannot be calculated with any precision, and a fixed effect model is therefore sometimes used in this situation.
It was surprising that we did not find a relation between the degree of heterogeneity and the choice of model. Some Cochrane groups instruct their authors to routinely use a fixed effect model, although few statisticians would find such blanket recommendations reasonable. Furthermore, in all types of research, authors should change their planned analysis, and explain why, if following it would not be sensible.
Although our sample consisted only of reviews with substantial heterogeneity, about a third of the authors had not paid any attention to it. This omission was quite uniform over the spectrum of I2 values, and it might therefore partly reflect the well-known lack of statistical skills among authors of medical research papers [33–35]. However, as authors are recommended to routinely assess whether the results are consistent across studies, and what the likely causes are if they are not, they could do better even without having access to statistical expertise. Cochrane review groups could also do better, as they are required to have access to statistical expertise. Recently, summary of findings tables were introduced in Cochrane reviews as part of GRADEprofiler, where the authors are asked to assess the quality of the body of evidence. This includes assessing the likelihood that the pooled estimate for each outcome is free from bias, and a judgment related to the degree of heterogeneity.
Reviews that were devoid of major problems had included more trials than those with problems. The likely reason for this is that authors are usually too influenced by whether or not a P-value is significant and often do not take into account, or do not know, that P-values depend on the number of trials. When fewer trials are included, it is harder to identify heterogeneity using a chi-square test. This test is therefore not the recommended way to investigate heterogeneity. I2 is more sensitive, but with few included trials there is a small risk of false positives.
Our sampling method precludes us from drawing general conclusions about the quality of Cochrane reviews in relation to heterogeneity. As we sampled meta-analyses, we did not assess how often the authors had abstained from pooling the results because of heterogeneity, which would have been an arduous task, given our total sample of 3,385 reviews.
The most important assessment - whether a review was devoid of major problems related to heterogeneity - was not as thoroughly specified in our protocol as we would have wished. It would not have been possible to specify in advance rigid rules because of the great diversity in handling and reporting heterogeneity. We have compensated for this limitation by describing the problematic reviews we encountered. More strict criteria could be used in future studies based on our findings.
In a few reviews, our outcome was not a primary one, which could be the reason that the heterogeneity was not addressed. On the other hand, these reviews tended to not address heterogeneity at all, for any outcomes.
We specified in our protocol that we wanted to investigate to what extent the point estimates and the confidence intervals varied when a different model was chosen, but decided to focus on reviews where the result changed from significant to nonsignificant and vice versa.
Some of our analyses were exploratory. During data extraction, we decided to investigate if there was a relation between the choice of model and the P-value for heterogeneity, and we could not help noticing that the reviews we judged to be most problematic also tended to be those that had included the fewest trials.
It is known that I2 increases when the sizes of the included studies increase and alternative measures of heterogeneity have been suggested. However, the problematic reviews identified in our study included very few trials and relatively few participants. When there are only few included trials there is a small risk of I2 above 50% even though no heterogeneity is present.
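The size of this risk can be gauged with a small simulation in which all trials estimate exactly the same true effect. The sketch assumes equal within-trial standard errors and normally distributed estimates; the function names and settings are hypothetical.

```python
import random


def i_squared(effects, std_errors):
    w = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - pooled) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    return max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0


def share_above_50(k, se=0.3, n_sim=5000, seed=1):
    """Fraction of simulated homogeneous meta-analyses of k trials with I2 > 50%."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        effects = [rng.gauss(0.0, se) for _ in range(k)]  # same true effect everywhere
        hits += i_squared(effects, [se] * k) > 50.0
    return hits / n_sim


for k in (3, 5, 10, 20):
    print(k, share_above_50(k))
```

Under these assumptions, I2 exceeds 50% in a noticeable minority of homogeneous meta-analyses with three trials, and the share shrinks as the number of trials grows.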
Other studies of heterogeneity
In the early years of the Cochrane Collaboration, randomly selected Cochrane reviews were assessed by two different observers, and 29% were judged to have major problems, but these concerned other issues than heterogeneity. In another study of Cochrane reviews, heterogeneity, defined as P < 0.10, was identified in 34 out of 86 meta-analyses, and in 12 of the 34 meta-analyses, heterogeneity was not addressed. In 2002, Higgins et al. investigated the newest Cochrane reviews and tested if heterogeneity was present, and collected information about choice of model and subgroup analyses. The study compared the protocol to the review and identified problems concerning choice of statistical model and problems with conducting subgroup analyses, as there were often too few included trials.
One-third of Cochrane reviews with substantial heterogeneity in the first reported outcome had major problems in relation to their handling of heterogeneity. These consisted mainly of the use of a fixed effect model without an explicit rationale for choice of that model, and lack of reservations and explanations of the likely causes of the heterogeneity. These problems became less pronounced with time, as those reviews that were most recently updated much more often used a random effects model. More attention is needed to this issue, as the problems we identified can be essential for the conclusions of the reviews.
We thank statistician Julian Higgins for comments on the manuscript.
- Higgins JPT, Thompson SG, Deeks JJ, Altman DG: Measuring inconsistency in meta-analyses. BMJ. 2003, 327: 557-60. 10.1136/bmj.327.7414.557.
- Higgins JPT, Green S: Cochrane Handbook for Systematic Reviews of Interventions. 2008, Chichester: John Wiley & Sons Ltd
- Gøtzsche PC, Hrobjartsson A, Maric K, Tendal B: Data extraction errors in meta-analyses that use standardized mean differences. JAMA. 2007, 298: 430-7.
- Engels EA, Schmid CH, Terrin N, Olkin I, Lau J: Heterogeneity and statistical significance in meta-analysis: an empirical study of 125 meta-analyses. Stat Med. 2000, 19: 1707-28. 10.1002/1097-0258(20000715)19:13<1707::AID-SIM491>3.0.CO;2-P.
- Deeks JJ: Issues in the selection of summary statistic for meta-analysis of clinical trials with binary outcomes. Stat Med. 2002, 21: 1575-1600. 10.1002/sim.1188.
- Borenstein M, Hedges L, Rothstein H: Meta-Analysis: Fixed effect vs. random effects. downloaded 18th June 2008, [http://www.meta-analysis.com]
- Higgins J, Thompson S, Deeks J, Altman D: Statistical heterogeneity in systematic reviews of clinical trials: a critical appraisal of guidelines and practice. J Health Serv Res Policy. 2002, 7: 51-61. 10.1258/1355819021927674.
- Review Manager (RevMan) [Computer program]. Version 5.0. 2008, Copenhagen: The Nordic Cochrane Centre, The Cochrane Collaboration
- Wilcken N, Hornbuckle J, Ghersi D: Chemotherapy alone versus endocrine therapy alone for metastatic breast cancer. Cochrane Database of Systematic Reviews. 2003, CD002747-2
- Sasse EC, Sasse AD, Brandalise SR, Clark OAC, Richards S: Colony stimulating factors for prevention of myelosupressive therapy induced febrile neutropenia in children with acute lymphoblastic leukaemia. Cochrane Database of Systematic Reviews. 2005, CD004139-3
- Edwards AGK, Evans R, Dundon J, Haigh S, Hood K, Elwyn GJ: Personalised risk communication for informed decision making about taking screening tests. Cochrane Database of Systematic Reviews. 2006, CD001865-4
- Thomas PW, Thomas S, Hillier C, Galvin K, Baker R: Psychological interventions for multiple sclerosis. Cochrane Database of Systematic Reviews. 2006, CD004431-1
- Aboulghar M, Evers JH, Al-Inany H: Intra-venous albumin for preventing severe ovarian hyperstimulation syndrome. Cochrane Database of Systematic Reviews. 2002, CD001302-2
- Henderson-Smart DJ, Wilkinson A, Raynes-Greenow CH: Mechanical ventilation for newborn infants with respiratory failure due to pulmonary disease. Cochrane Database of Systematic Reviews. 2002, CD002770-4
- Wiysonge CS, Shey MS, Sterne JAC, Brocklehurst P: Vitamin A supplementation for reducing the risk of mother-to-child transmission of HIV infection. Cochrane Database of Systematic Reviews. 2005, 4: CD003648
- Wilkinson D, Ramjee G, Tholandi M, Rutherford G: Nonoxynol-9 for preventing vaginal acquisition of sexually transmitted infections by women from men. Cochrane Database of Systematic Reviews. 2002, CD003939-1
- Villar J, Widmer M, Lydon-Rochelle MT, Gülmezoglu AM, Roganti A: Duration of treatment for asymptomatic bacteriuria during pregnancy. Cochrane Database of Systematic Reviews. 2000, CD000491-2
- Engelter S, Lyrer P: Antiplatelet therapy for preventing stroke and other vascular events after carotid endarterectomy. Cochrane Database of Systematic Reviews. 2003, CD001458-3
- Cook LA, Pun A, van Vliet H, Gallo MF, Lopez LM: Scalpel versus no-scalpel incision for vasectomy. Cochrane Database of Systematic Reviews. 2007, CD004112-2
- Harrison JE, O'Brien KD, Worthington HV: Orthodontic treatment for prominent upper front teeth in children. Cochrane Database of Systematic Reviews. 2007, CD003452-3
- Young GL, Jewell D: Antihistamines versus aspirin for itching in late pregnancy. Cochrane Database of Systematic Reviews. 1997, CD000027-1
- Kuschel CA, Harding JE: Multicomponent fortified human milk for promoting growth in preterm infants. Cochrane Database of Systematic Reviews. 2004, CD000343-1
- Ebrahim S, Beswick A, Burke M, Davey Smith G: Multiple risk factor interventions for primary prevention of coronary heart disease. Cochrane Database of Systematic Reviews. 2006, 4
- Martin-Hirsch P, Jarvis G, Kitchener H, Lilford R: Collection devices for obtaining cervical cytology samples. Cochrane Database of Systematic Reviews. 2000, CD001036-3
- Mochtar MH, Van der Veen F, Ziech M, van Wely M: Recombinant Luteinizing Hormone (rLH) for controlled ovarian hyperstimulation in assisted reproductive cycles. Cochrane Database of Systematic Reviews. 2007, CD005070-2
- Jørgensen H, Wetterslev J, Møiniche S, Dahl JB: Epidural local anaesthetics versus opioid-based analgesic regimens for postoperative gastrointestinal paralysis, PONV and pain after abdominal surgery. Cochrane Database of Systematic Reviews. 2001, CD001893-1
- Huertas-Ceballos A, Logan S, Bennett C, Macarthur C: Dietary interventions for recurrent abdominal pain (RAP) and irritable bowel syndrome (IBS) in childhood. Cochrane Database of Systematic Reviews. 2008, CD003019-1
- Askie LM, Henderson-Smart DJ: Restricted versus liberal oxygen exposure for preventing morbidity and mortality in preterm or low birth weight infants. Cochrane Database of Systematic Reviews. 2001, CD001077-4
- Barden J, Edwards J, Moore RA, McQuay HJ: Single dose oral diclofenac for postoperative pain. Cochrane Database of Systematic Reviews. 2004, CD004768-2
- Fidelix TSA, Soares BGDO, Trevisani VFM: Diacerein for osteoarthritis. Cochrane Database of Systematic Reviews. 2006, CD005117-1
- Poole C, Greenland S: Random-effects meta-analyses are not always conservative. American Journal of Epidemiology. 1999, 150: 469-75.
- He FJ, MacGregor GA: Effect of longer-term modest salt reduction on blood pressure. Cochrane Database of Systematic Reviews. 2004, CD004937-1
- Wulff HR, Andersen B, Brandenhoff P, Güttler F: What do doctors know about statistics?. Stat Med. 1987, 6: 3-10. 10.1002/sim.4780060103.
- Scheutz F, Andersen B, Wulff HR: What do dentists know about statistics?. Scand J Dent Res. 1988, 96: 281-7.
- Windish DM: Medicine residents' understanding of the biostatistics and results in the medical literature. JAMA. 2007, 298: 1010-22. 10.1001/jama.298.9.1010.
- Rücker G, Schwarzer G, Carpenter J, Schumacher M: Undue reliance on I2 in assessing heterogeneity may mislead. BMC Medical Research Methodology. 2008, 8: 79.
- Olsen O, et al: Quality of Cochrane reviews: assessment of sample from 1998. BMJ. 2001, 323: 829-32. 10.1136/bmj.323.7317.829.
- Hahn S, Garner P, Williamson P: Are systematic reviews taking heterogeneity into account? An analysis from the Infectious Diseases Module of the Cochrane Library (research letter). J Eval Clin Pract. 2000, 6: 231-3. 10.1046/j.1365-2753.2000.00230.x.
- The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/11/22/prepub
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.