
Similarities, reliability and gaps in assessing the quality of conduct of systematic reviews using AMSTAR-2 and ROBIS: systematic survey of nutrition reviews



AMSTAR-2 (‘A Measurement Tool to Assess Systematic Reviews, version 2’) and ROBIS (‘Risk of Bias in Systematic Reviews’) are independent instruments used to assess the quality of conduct of systematic reviews/meta-analyses (SR/MAs). The degree of overlap in methodological constructs together with the reliability and any methodological gaps have not been systematically assessed and summarized in the field of nutrition.


We performed a systematic survey of MEDLINE, EMBASE, and the Cochrane Library for SR/MAs published between January 2010 and November 2018 that examined the effects of any nutritional intervention/exposure for cancer prevention. We followed a systematic review approach including two independent reviewers at each step of the process. For AMSTAR-2 (16 items) and ROBIS (21 items), we assessed the similarities, the inter-rater reliability (IRR) and any methodological limitations of the instruments. Our protocol for the survey was registered in PROSPERO (CRD42019121116).


We found 4 similar domain constructs based on 11 comparisons from a total of 12 AMSTAR-2 and 14 ROBIS items. Ten comparisons were considered fully overlapping. Based on Gwet’s agreement coefficients, six comparisons provided almost perfect (> 0.8), three substantial (> 0.6), and one a moderate level of agreement (> 0.4). While there is considerable overlap in constructs, AMSTAR-2 uniquely addresses explaining the selection of study designs for inclusion, reporting on excluded studies with justification, sources of funding of primary studies, and reviewers’ conflict of interest. By contrast, ROBIS uniquely addresses appropriateness and restrictions within eligibility criteria, reducing risk of error in risk of bias (RoB) assessments, completeness of data extracted for analyses, the inclusion of all necessary studies for analyses, and adherence to predefined analysis plan.


Among the questions on AMSTAR-2 and ROBIS, 70.3% (26/37 items) address the same or similar methodological constructs. While the IRR of these constructs was moderate to almost perfect, there are unique methodological constructs that each instrument independently addresses. Notably, neither instrument addresses the reporting of absolute estimates of effect or the overall certainty of the evidence, items that are crucial for users wishing to interpret the importance of SR/MA results.



With the ever-growing amount of published data, systematic reviews (SRs) and meta-analyses (MAs) have become recognised methods for summarising the evidence in support of evidence-based decision-making in healthcare [1,2,3]. High quality systematic reviews/meta-analyses (SR/MAs) are considered acceptable and important for decision-makers [4, 5]. However, with the increasing number of SR/MAs there are often issues of reliability, particularly when SR/MAs have conflicting results and suffer from extensive methodological shortcomings [1, 6, 7]. In the context of these findings, users of the literature must distinguish lower from higher quality SR/MAs to support healthcare decision-making. Instruments to assess the quality of conduct of SR/MAs have been designed and validated.

Currently, two instruments, namely AMSTAR-2 (‘A Measurement Tool to Assess Systematic Reviews, version 2’) and ROBIS (‘Risk of Bias in Systematic Reviews’), are commonly used to formally assess the quality of conduct of SR/MAs. Both instruments provide a structured approach for readers to perform rapid and reproducible assessments of quality, including a detailed evaluation of conduct and methodological rigour; however, their original constructs and specific details differ [8, 9]. AMSTAR-2 was developed as a critical appraisal tool for SR/MAs that include randomised or non-randomised studies of health care interventions and is an updated version of the widely accepted original AMSTAR, which has been in use for over a decade [10]. AMSTAR-2 comprises 16 items, of which seven were determined to be critically important to the validity of a review, while the other nine are considered not critically important. Users of AMSTAR-2 are asked to make an overall judgment of ‘high’, ‘moderate’, ‘low’, or ‘critically low’ confidence in the results of the SR/MA based on the assessment of critical and non-critical items [11].

ROBIS focuses intrinsically on the risk of bias (RoB) in the SR/MA and comprises three phases: assessment of relevance (optional), identification of concerns within the review process that put the SR/MA at RoB, and judgement of RoB. The second of the aforementioned phases is composed of four domains with 21 items highlighting specific issues that need to be considered. In the third phase a judgement of ‘low’, ‘high’, or ‘unclear’ RoB is assigned after consideration of assessments performed in the second phase [12].

Upon applying both instruments, users can determine that they are similar in their general approach; however, differences do exist. A number of studies have investigated the similarity of assessments between original AMSTAR and ROBIS tools [13,14,15]. Nevertheless, so far, only one study has investigated the comparability of both instruments in terms of their domains and corresponding items, demonstrating a satisfactory correlation between the overall ratings of AMSTAR-2 and ROBIS while highlighting the differences in the conceptual frameworks of both tools [16].

There has been a profusion of SR/MAs in the health sciences literature [1], with several studies having already investigated their quality [7, 17, 18]. Nutritional epidemiology is an area of scientific interest to the public, and while the quality of SR/MAs in the field has recently been shown to be sub-optimal [7], the related and burgeoning field of SR/MAs assessing nutrition for cancer prevention has not been systematically evaluated. In this study, performed within the context of the systematic survey addressing trustworthiness of SR/MAs assessing nutrition for cancer prevention, we aimed to compare the similarities, the inter-rater reliability (IRR) and any methodological gaps of instruments for assessing the quality of conduct of those SR/MAs.


The protocol for the systematic survey was prepared a priori and registered in PROSPERO with an identification number CRD42019121116.

Searches, eligibility, and sample selection

We systematically searched MEDLINE, Embase, and the Cochrane Library for SR/MAs published between January 2010 and November 2018 that examined the effects of any nutritional intervention/exposure for cancer prevention in the general population or in people at higher risk for cancer. Search strategies are provided in the Supplementary file. We accepted studies labelled as SR/MAs in the title, abstract, or full text that included, according to their eligibility criteria, primary studies comprising a comparator group (i.e., interventional studies with a control group, such as randomised or non-randomised controlled trials, or observational studies with participants categorized by intake or exposure level, e.g. lower versus upper quartiles). The methods have been described in detail in the companion paper [19].

Screening and data extraction

Following a calibration exercise, pairs of independent reviewers performed study selection, data extraction, and both AMSTAR-2 and ROBIS assessments, with conflicts resolved by discussion or consultation with a third reviewer. Each step was preceded by calibration exercises to ensure a common understanding of the inclusion criteria and to discuss any ambiguities. With respect to quality assessments, a number of authors have considerable experience in conducting SRs and assessing their methodological quality (MJS, DS, JZ, MK, BCJ, MMB), while the remaining authors (PT, WS, MG, AS, AW, KK, JBC) underwent training. AMSTAR-2 and ROBIS assessments were piloted on a set of three studies.

Quality of conduct and risk of bias assessment instruments

AMSTAR-2 consists of 16 items for which ‘yes (Y)’ or ‘no (N)’ judgments can be applied. For five items (2, 4, 7, 8, 9), ‘partial yes (PY)’ can be selected in addition to the ‘Y’ or ‘N’ responses. Items 11, 12, and 15 are not considered if a meta-analysis was not undertaken. Among the 16 items, seven are considered to be critical: ‘development of the study protocol’ (item 2); ‘comprehensiveness of the literature search strategy’ (item 4); ‘providing a list of excluded studies with reasons’ (item 7); ‘appropriate assessment of the RoB of individual included studies’ (item 9); ‘use of appropriate meta-analytical methods’ (item 11); ‘consideration of RoB when interpreting and discussing the results’ (item 13); and ‘assessment of the presence of publication bias and discussion of its impact on the results’ (item 15). The remaining nine items are considered non-critical. After judging the 16 items, investigators can make an overall judgment of ‘high’, ‘moderate’, ‘low’, or ‘critically low’ confidence in the results of the target SR/MA, as follows [11]:

  • High: no major flaws in critical items and ≤ 1 flaw in non-critical items;

  • Moderate: no major flaws in critical items and > 1 flaw in non-critical items;

  • Low: one major flaw in critical items with or without non-critical items;

  • Critically low: > 1 major flaw in critical items with or without non-critical items.
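The decision rule in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `amstar2_overall` is hypothetical, and an item is treated as a flaw when it was judged ‘N’; how ‘PY’ responses and items not applicable without a meta-analysis are handled is left to the assessor.

```python
# Critical AMSTAR-2 items as listed in the text above.
CRITICAL_ITEMS = {2, 4, 7, 9, 11, 13, 15}


def amstar2_overall(flawed_items):
    """Map the set of flawed item numbers (1-16) to an overall confidence rating.

    Assumption: the caller has already decided which items count as flaws.
    """
    flawed = set(flawed_items)
    if not flawed <= set(range(1, 17)):
        raise ValueError("AMSTAR-2 item numbers must be between 1 and 16")

    critical = len(flawed & CRITICAL_ITEMS)      # flaws in critical items
    non_critical = len(flawed - CRITICAL_ITEMS)  # flaws in the other nine items

    if critical > 1:
        return "critically low"
    if critical == 1:
        return "low"
    return "high" if non_critical <= 1 else "moderate"


if __name__ == "__main__":
    # A review with no protocol (item 2) and no list of excluded studies (item 7).
    print(amstar2_overall([2, 7]))
```

Because the rating depends only on counts of critical and non-critical flaws, two reviews with quite different weaknesses can receive the same overall rating.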

ROBIS consists of 21 items assigned to four domains (study eligibility criteria; identification and selection of studies; data collection and study appraisal; synthesis and findings), for which respondents can answer ‘yes (Y)’, ‘partial yes (PY)’, ‘partial no (PN)’, ‘no (N)’, or ‘no information (NI)’. The overall concerns associated with each of the four domains are then judged as ‘low’, ‘high’ or ‘unclear’. On the basis of the domain assessments, supported by consideration of the correctness of the SR/MA’s interpretation of findings, the relevance of included studies to the SR/MA’s question, and the fairness and thoroughness of the presentation of results, a final judgement is made on whether the SR/MA as a whole is at ‘low’, ‘high’, or ‘unclear’ risk of bias [12].

Domain matching

For data collection and analyses we used Microsoft Excel (version 2016). After reviewing all items of each instrument, we grouped the items by conceptual similarity under the four main domains of the ROBIS instrument:

  • Domain 1: Study eligibility criteria;

  • Domain 2: Identification and selection of studies;

  • Domain 3: Data collection and study appraisal;

  • Domain 4: Synthesis and findings.

After assessing the concept, approach and definitions for each item, we matched items from each instrument to produce 11 comparisons including 12 AMSTAR-2 and 14 ROBIS items. In some cases, two or more items from one of the instruments were combined within a single comparison (e.g. AMSTAR-2 item 4 was compared with ROBIS items 2.1, 2.2, 2.3, and 2.4). For 10 comparisons we judged items of both instruments as satisfactorily comparable with respect to concept, approach and definitions, while in the case of one comparison (examination of publication bias/robustness of the results) we judged the items from the instruments as only partially overlapping (i.e. robustness of SR/MA results includes an assessment of publication bias as well as other considerations). There were four items on AMSTAR-2 and seven items on ROBIS that did not sufficiently overlap in concept, approach, and description. Table 1 provides a summary of the overlapping and non-overlapping items.
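The item counts in this paragraph can be checked with simple arithmetic. The snippet below is a minimal sketch; the function name `overlap_summary` is hypothetical, and the only inputs are the item totals and matched-item counts stated in the text.

```python
def overlap_summary(amstar2_total=16, robis_total=21,
                    amstar2_matched=12, robis_matched=14):
    """Summarise how many items of the two instruments entered a comparison."""
    total_items = amstar2_total + robis_total
    matched_items = amstar2_matched + robis_matched
    return {
        "total_items": total_items,
        "matched_items": matched_items,
        # Share of all items that address the same or similar constructs.
        "matched_percent": round(100 * matched_items / total_items, 1),
        # Items that could not be paired with an item of the other instrument.
        "unique_amstar2": amstar2_total - amstar2_matched,
        "unique_robis": robis_total - robis_matched,
    }


if __name__ == "__main__":
    for key, value in overlap_summary().items():
        print(f"{key}: {value}")
```

With the counts above, 26 of 37 items (70.3%) are matched, leaving four items unique to AMSTAR-2 and seven unique to ROBIS.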

Table 1 Comparison of matched AMSTAR-2 and ROBIS items


To compare the similar items across instruments, we calculated the reliability between raters using Gwet’s AC1 statistic (Gwet’s first-order agreement coefficient) [20, 21]. To do so, pairs of reviewers independently assessed each SR/MA using AMSTAR-2 and ROBIS. When we found ambiguities in our assessments, we discussed them, and if we could not reach consensus a third senior reviewer was consulted. Subsequently, items with consensus appraisals for each study were used to calculate the IRR. Assumptions for each comparison are provided in the footnotes of Table 1. Based on established guidance, we classified agreement as poor (≤ 0.00), slight (0.01–0.20), fair (0.21–0.40), moderate (0.41–0.60), substantial (0.61–0.80), and almost perfect (0.81–1.00) [16, 22].
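For readers unfamiliar with the statistic, the following is a minimal sketch of Gwet’s AC1 for two sets of categorical ratings of the same reviews. It is not the computation used in this survey: the function name `gwet_ac1` and the toy ratings are hypothetical, and confidence intervals are not computed. With observed agreement p_a, q categories, and π_k the average of the two marginal proportions for category k, chance agreement is p_e = Σ π_k(1 − π_k) / (q − 1) and AC1 = (p_a − p_e) / (1 − p_e).

```python
from collections import Counter


def gwet_ac1(ratings_a, ratings_b, categories):
    """Gwet's first-order agreement coefficient for two rating sequences."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    q = len(categories)
    if q < 2:
        raise ValueError("need at least two categories")
    n = len(ratings_a)

    # Observed agreement: share of subjects given the same category twice.
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Average marginal proportion of each category across both sequences.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    pi = [(counts_a[c] + counts_b[c]) / (2 * n) for c in categories]

    # Chance agreement as defined by Gwet (it never exceeds 1/q).
    p_e = sum(p * (1 - p) for p in pi) / (q - 1)
    return (p_a - p_e) / (1 - p_e)


if __name__ == "__main__":
    # Hypothetical dichotomised judgements for ten reviews.
    first = ["Y"] * 9 + ["N"]
    second = ["Y"] * 8 + ["N", "N"]
    print(round(gwet_ac1(first, second, ["Y", "N"]), 2))
```

Unlike Cohen’s kappa, AC1 stays high when almost all ratings fall into one category, which matters in samples where most reviews receive the same judgement.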


We identified 24,739 records, of which 20,413 were screened after duplicates were removed. Based on the eligibility criteria, we included 737 studies, of which a random sample of 101 articles was selected and analysed. The study flow is presented in Fig. 1 [23].

Fig. 1 PRISMA 2020 flow diagram. PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses

The 11 comparisons produced varying levels of agreement coefficients, presented below.

Domain 1: study eligibility criteria

Two comparisons were created within this domain. The comparisons addressed the comprehensiveness of eligibility criteria and the prospective publication of review methods (protocol), with an almost perfect agreement: 0.87 (95% CI, 0.78 to 0.96) and 0.99 (95% CI, 0.97 to 1), respectively.

Domain 2: identification and selection of studies

Two comparisons were discerned within this domain. One addressed the comprehensiveness of search strategies with a substantial level of agreement: 0.79 (95% CI, 0.74 to 0.85), and the other investigated duplicate study selection with an almost perfect level of agreement: 0.87 (95% CI, 0.77 to 0.96).

Domain 3: data collection and study appraisal

Three comparisons were formed within this domain. One addressed duplicate data extraction with an almost perfect level of agreement: 0.88 (95% CI, 0.79 to 0.98). A second comparison explored the comparability of items regarding the adequate description of characteristics of studies included in the review showing a moderate level of agreement: 0.6 (95% CI, 0.44 to 0.76). A third comparison addressed the use of appropriate RoB assessment methods showing an almost perfect level of agreement: 0.88 (95% CI, 0.79 to 0.98).

Domain 4: synthesis and findings

Four comparisons were created within this domain. Three were considered fully overlapping, while one was partially overlapping. One comparison, concerning an appropriate statistical combination of results, showed an almost perfect level of agreement: 0.81 (95% CI, 0.69 to 0.92). Two comparisons, one regarding assessment and interpretation of biases in included studies, and one concerning appropriate consideration of heterogeneity within the results, both showed substantial levels of agreement: 0.77 (95% CI, 0.64 to 0.89) and 0.73 (95% CI, 0.59 to 0.86), respectively. The fourth comparison, addressing publication bias and robustness of the results (e.g. funnel plot or sensitivity analyses), was considered partially overlapping and showed a slight level of agreement: 0.18 (95% CI, −0.03 to 0.38).
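The eleven coefficients reported across the four domains can be tallied against the agreement bands quoted in the Methods. The sketch below is illustrative only: the names `BANDS`, `classify` and `REPORTED` are hypothetical, and each band is represented by its upper bound, which assumes coefficients are rounded to two decimals as reported.

```python
from collections import Counter

# (upper bound, label) pairs following the bands quoted in the Methods.
BANDS = [
    (0.00, "poor"),
    (0.20, "slight"),
    (0.40, "fair"),
    (0.60, "moderate"),
    (0.80, "substantial"),
    (1.00, "almost perfect"),
]


def classify(coefficient):
    """Return the agreement label for a coefficient rounded to two decimals."""
    for upper, label in BANDS:
        if coefficient <= upper:
            return label
    raise ValueError("agreement coefficients cannot exceed 1")


# Point estimates reported above, in order of Domains 1 to 4.
REPORTED = [0.87, 0.99, 0.79, 0.87, 0.88, 0.60, 0.88, 0.81, 0.77, 0.73, 0.18]

if __name__ == "__main__":
    tally = Counter(classify(c) for c in REPORTED)
    for label, count in tally.most_common():
        print(f"{label}: {count}")
```

The tally gives six almost perfect, three substantial, one moderate and one slight comparison, the slight one being the partially overlapping comparison on publication bias and robustness.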

Methodological gaps

In addition to documenting the similarities and IRR between instruments, we also noted major methodological gaps in both tools. Both instruments could be improved with respect to guidance and assessment of subgroup analysis, ideally based on an a priori publicly available study protocol detailing the planned assessment of effect modification [24]. We also noted that neither instrument considers the presentation of results using absolute estimates of effect (e.g. risk difference for all dichotomous outcomes) [25], nor does either have an item on the overall certainty of evidence (e.g. assessed using the Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach) for each outcome [26].


Our study aimed to compare the similarity and reliability of the AMSTAR-2 and ROBIS instruments based on 101 SR/MAs assessing nutritional interventions/exposures for cancer prevention. AMSTAR-2 comprises 16 items while ROBIS has 21 items, of which 12 and 14, respectively, were combined into 11 comparisons based on their conceptual similarities. Overall, we found that 70.3% (26/37) of items assess the same or similar methodological constructs. Ten comparisons were judged to fully overlap in concept and definitions, and one comparison was partially overlapping. A number of items (four in AMSTAR-2 and seven in ROBIS) were unique to each instrument and were not amenable to paired comparisons due to non-overlapping concepts, approaches, and descriptions. Neither instrument addresses the reporting of absolute estimates of effect or the overall certainty of the evidence.

The study by Pieper et al. was the first to compare both instruments in terms of validity, reliability, and applicability [16]. The authors matched relevant AMSTAR-2 and ROBIS items into 12 comparisons, of which 10 were considered fully overlapping and two partially overlapping (appropriateness of restriction of eligibility criteria and publication bias/robustness of the results). Our approach was similar; however, we dismissed the partially overlapping comparison between AMSTAR-2 item 3 ‘Did the review authors explain their selection of the study designs for inclusion in the review?’ and ROBIS item 1.4 ‘Were all restrictions in eligibility criteria based on study characteristics appropriate?’ as we believe these items are different constructs and are not similar enough based on underlying definitions and assessment guidance. Furthermore, while for data extraction we compared AMSTAR-2 item 6 ‘Did the review authors perform data extraction in duplicate?’ with ROBIS item 3.1 ‘Were efforts made to minimize error in data collection?’, Pieper et al. additionally considered ROBIS item 3.5 ‘Were efforts made to minimize error in risk of bias assessment?’ within this comparison. We did not include ROBIS item 3.5 in this comparison as we believe duplicate RoB assessment and duplicate data extraction should be assessed separately.

Before AMSTAR-2 was published, researchers attempted to compare the reliability of ROBIS and the original AMSTAR tool. The correlation coefficients ranged from moderate to substantial [13]. Generally, apart from one comparison of AMSTAR-2 item 8 with ROBIS item 3.2, our calculations resulted in higher coefficient values than those reported by Pieper et al. [16]. Their calculated agreement levels concerning similar methodological constructs were reported to be perfect for one comparison, substantial in six comparisons, moderate in two comparisons, fair in one comparison, and slight in one comparison. By contrast, our calculations yielded almost perfect agreement for six comparisons, substantial for three, moderate for one, and slight for one. One possible explanation for these discrepancies could be the quality of included studies. In our sample of 101 articles published within the field of nutrition for cancer prevention, only 1% of SR/MAs were of high quality according to AMSTAR-2, and 3% were at low RoB according to ROBIS. This indicates mostly low scores on the majority of items of both instruments, which might result in high agreement coefficient values. Alternatively, our coefficients may have been higher because, unlike in Pieper et al., pairs of reviewers participated in calibration and consensus procedures, which ensured that differences in assessments were discussed, thus reducing the number of outlying assessments that might have occurred. In Pieper et al., no consensus procedure between reviewers was introduced and the final judgement on items within each comparison was based on the judgments of most of the raters, resulting in the possibility of higher variation of assessments, and thus lower agreement scores.

After performing assessments using both instruments, we were surprised that neither instrument had items devoted to the assessment of the magnitude of the effects based on absolute estimates (e.g. risk difference) for dichotomous outcomes, or the certainty of evidence for each outcome. Providing information on these items is supported by the GRADE guidance, the Cochrane Handbook, and the Joanna Briggs Institute Manual [27,28,29,30]. Rating the certainty of the evidence for each assessed health outcome improves the interpretation of SR/MA results and should be considered a vital characteristic of quality in reviews. Regarding the magnitude of the effects, authors commonly report effects as relative estimates such as risk ratios or hazard ratios while underreporting absolute measures such as the risk difference or number needed to treat [25]. Evidence suggests that reporting both relative and absolute estimates and their corresponding certainty of evidence allows for optimal interpretation of review findings [7, 25, 31,32,33]. Future updates of the ROBIS and AMSTAR-2 instruments should consider adding these items, and in the interim users of the instruments might consider these items, particularly in nutrition for long term health outcomes, where the absolute effects may be small and uncertain [34,35,36].

We followed Cochrane guidance on systematic review methods, including calibration exercises and duplicate screening, abstraction, and quality assessment, which strengthens the validity of our findings. Furthermore, our methods followed an a priori study protocol and included a random sample of 101 nutrition studies, a large sample in the same healthcare field. With regard to weaknesses, first, many items in the AMSTAR-2 and ROBIS instruments were dissimilar and did not always allow for reliability comparisons, so our coefficients could mislead readers into thinking that the instruments are the same or close to the same for assessing SR/MA quality. That is, while there were many overlapping conceptual items (70.3%), there were a substantial number of dissimilar items (11/37), and so applying each instrument to the same study could result in materially different judgements of the quality of conduct of an SR/MA. Second, while our team reported that ROBIS took longer than AMSTAR-2 assessments, we did not formally measure the time it took for reviewers to complete the assessments for each instrument. Comparisons have previously been reported indicating varying results, ranging from AMSTAR assessment taking slightly longer than ROBIS, to ROBIS assessment taking substantially longer than AMSTAR [13, 16, 37]. Third, we chose a random subsample of 101 of 737 identified studies, as completing assessments of all identified studies was not deemed feasible due to time constraints. Fourth, since the quality of the majority of included studies assessed with AMSTAR-2 and ROBIS was critically low, the agreement coefficients between instruments in other fields of health care might differ from ours, particularly if there is more variability in the quality of SR/MAs or if higher quality SR/MAs are included.


AMSTAR-2 and ROBIS are instruments designed to facilitate the assessment of SR/MA quality. Across the two instruments, 70.3% of items address the same or similar methodological constructs. While the IRR of these items was moderate to almost perfect in fully overlapping comparisons, and slight in the partially overlapping comparison, there are unique methodological items that each instrument independently addresses. Further investigation based on samples of SR/MAs from different fields of medicine and health science might further elucidate similarities and discrepancies between the two tools. Notably, AMSTAR-2 and ROBIS do not address the reporting of absolute estimates of effect or the overall certainty of the evidence, both of which are important for the optimal interpretation of SR/MA findings. The choice to use one or both of the instruments should depend on the aim of the investigators or users of the SR/MAs (i.e. overall methodological quality versus RoB assessment only) and other factors such as experience with the instrument or time constraints. It has previously been suggested that both instruments have areas for improvement [16, 37], findings that our systematic survey corroborates. One pragmatic instrument that fully considers RoB together with other methodological quality items, such as the presentation of both relative and absolute estimates and the certainty of these estimates, would optimally help users of SR/MAs better assess and interpret a review’s overall quality and the importance of reported results.

Availability of data and materials

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.



AMSTAR-2: A Measurement Tool to Assess systematic Reviews, version 2

GRADE: Grading of Recommendations Assessment, Development, and Evaluation

IRR: Inter-rater reliability

Non-randomised studies of interventions/exposures

PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses

RCT: Randomized controlled trial

RoB: Risk of bias

ROBIS: Risk of Bias in Systematic reviews

SR: Systematic review

95% CI: 95% confidence interval


  1. Ioannidis JP. The mass production of redundant, misleading, and conflicted systematic reviews and Meta-analyses. Milbank Q. 2016;94(3):485–514

    Article  Google Scholar 

  2. In: Graham R, Mancher M, Miller Wolman D, Greenfield S, Steinberg E, editors. Clinical Practice Guidelines We Can Trust. Washington (DC)2011.

  3. Bastian H, Glasziou P, Chalmers I. Seventy-five trials and eleven systematic reviews a day: how will we ever keep up? PLoS Med. 2010;7(9):e1000326

    Article  Google Scholar 

  4. Mulrow CD. Rationale for systematic reviews. BMJ. 1994;309(6954):597–9

    Article  CAS  Google Scholar 

  5. Lunny C, Ramasubbu C, Gerrish S, Liu T, Salzwedel DM, Puil L, et al. Impact and use of reviews and 'overviews of reviews' to inform clinical practice guideline recommendations: protocol for a methods study. BMJ Open. 2020;10(1):e031442

    Article  Google Scholar 

  6. Pussegoda K, Turner L, Garritty C, Mayhew A, Skidmore B, Stevens A, et al. Systematic review adherence to methodological or reporting quality. Syst Rev. 2017;6(1):131

    Article  Google Scholar 

  7. Zeraatkar D, Bhasin A, Morassut RE, Churchill I, Gupta A, Lawson DO, et al. Characteristics and quality of systematic reviews and meta-analyses of observational nutritional epidemiology: a cross-sectional study. Am J Clin Nutr. 2021;113(6):1578–92

    Article  Google Scholar 

  8. Pollock M, Fernandes RM, Becker LA, Featherstone R, Hartling L. What guidance is available for researchers conducting overviews of reviews of healthcare interventions? A scoping review and qualitative metasummary. Syst Rev. 2016;5(1):190

    Article  Google Scholar 

  9. Lunny C, Ramasubbu C, Puil L, Liu T, Gerrish S, Salzwedel DM, et al. Over half of clinical practice guidelines use non-systematic methods to inform recommendations: a methods study. PLoS One. 2021;16(4):e0250356

    Article  CAS  Google Scholar 

  10. Shea BJ, Grimshaw JM, Wells GA, Boers M, Andersson N, Hamel C, et al. Development of AMSTAR: a measurement tool to assess the methodological quality of systematic reviews. BMC Med Res Methodol. 2007;7:10

    Article  Google Scholar 

  11. Shea BJ, Reeves BC, Wells G, Thuku M, Hamel C, Moran J, et al. AMSTAR 2: a critical appraisal tool for systematic reviews that include randomised or non-randomised studies of healthcare interventions, or both. BMJ. 2017;358:j4008

    Article  Google Scholar 

  12. Whiting P, Savovic J, Higgins JP, Caldwell DM, Reeves BC, Shea B, et al. ROBIS: a new tool to assess risk of bias in systematic reviews was developed. J Clin Epidemiol. 2016;69:225–34

    Article  Google Scholar 

  13. Banzi R, Cinquini M, Gonzalez-Lorenzo M, Pecoraro V, Capobussi M, Minozzi S. Quality assessment versus risk of bias in systematic reviews: AMSTAR and ROBIS had similar reliability but differed in their construct and applicability. J Clin Epidemiol. 2018;99:24–32

    Article  Google Scholar 

  14. Buhn S, Mathes T, Prengel P, Wegewitz U, Ostermann T, Robens S, et al. The risk of bias in systematic reviews tool showed fair reliability and good construct validity. J Clin Epidemiol. 2017;91:121–8

    Article  Google Scholar 

  15. Lorenz RC, Matthias K, Pieper D, Wegewitz U, Morche J, Nocon M, et al. A psychometric study found AMSTAR 2 to be a valid and moderately reliable appraisal tool. J Clin Epidemiol. 2019;114:133–40

    Article  Google Scholar 

  16. Pieper D, Puljak L, Gonzalez-Lorenzo M, Minozzi S. Minor differences were found between AMSTAR 2 and ROBIS in the assessment of systematic reviews including both randomized and nonrandomized studies. J Clin Epidemiol. 2019;108:26–33

    Article  Google Scholar 

  17. Wiseman MJ. Nutrition and cancer: prevention and survival. Br J Nutr. 2019;122(5):481–7

    Article  CAS  Google Scholar 

  18. Salam RA, Welch V, Bhutta ZA. Systematic reviews on selected nutrition interventions: descriptive assessment of conduct and methodological challenges. BMC Nutr. 2015;1(9):1–12

    Google Scholar 

  19. Zajac J, Storman D, Swierz MJ, Koperny M, Tobola P, Staskiewicz W, et al. Are articles published as systematic reviews addressing nutritional exposures for cancer prevention trustworthy? A systematic survey of quality and risk of bias. Nutrition Reviews 2021:Accepted for publication.

  20. Gwet KL. Computing inter-rater reliability and its variance in the presence of high agreement. Br J Math Stat Psychol. 2008;61(Pt 1):29–48

    Article  Google Scholar 

  21. Gwet KL. Handbook of inter-rater reliability: the definitive guide to measuring the extent of agreement among raters. 4th ed. Gaithersburg: Advanced Analytics; 2014.

    Google Scholar 

  22. Landis JRKG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–74.

    Article  CAS  Google Scholar 

  23. Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71

    Article  Google Scholar 

  24. Schandelmaier S, Briel M, Varadhan R, Schmid CH, Devasenapathy N, Hayward RA, et al. Development of the instrument to assess the credibility of effect modification analyses (ICEMAN) in randomized controlled trials and meta-analyses. CMAJ. 2020;192(32):E901–6

    Article  Google Scholar 

  25. Alonso-Coello P, Carrasco-Labra A, Brignardello-Petersen R, Neumann I, Akl EA, Vernooij RW, et al. Systematic reviews experience major limitations in reporting absolute effects. J Clin Epidemiol. 2016;72:16–26

    Article  Google Scholar 

  26. Guyatt GH, Oxman AD, Vist GE, Kunz R, Falck-Ytter Y, Alonso-Coello P, et al. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ. 2008;336(7650):924–6

    Article  Google Scholar 

  27. Schünemann HJ, Higgins JPT, Vist GE, Glasziou P, Akl EA, Skoetz N, et al. chapter 14: completing ‘summary of findings’ tables and grading the certainty of the evidence. In: Higgins JPT, Thomas J, Chandler J, Cumpston M, li T, Page MJ, Welch VA (editors) Cochrane handbook for systematic reviews of interventions version 62 (updated February 2021) Cochrane, 2021. Available from

  28. Santesso N, Glenton C, Dahm P, Garner P, Akl EA, Alper B, et al. GRADE guidelines 26: informative statements to communicate the findings of systematic reviews of interventions. J Clin Epidemiol. 2020;119:126–35

    Article  Google Scholar 

  29. Langendam MW, Akl EA, Dahm P, Glasziou P, Guyatt G, Schunemann HJ. Assessing and presenting summaries of evidence in Cochrane reviews. Syst Rev. 2013;2:81

    Article  Google Scholar 

  30. Aromataris E. Munn Z, (editors). JBI: JBI Manual for Evidence Synthesis; 2020. Available from

    Google Scholar 

  31. Agarwal A, Johnston BC, Vernooij RW, Carrasco-Labra A, Brignardello-Petersen R, Neumann I, et al. Authors seldom report the most patient-important outcomes and absolute effect measures in systematic review abstracts. J Clin Epidemiol. 2017;81:3–12

  32. Akl EA, Oxman AD, Herrin J, Vist GE, Terrenato I, Sperati F, et al. Using alternative statistical formats for presenting risks and risk reductions. Cochrane Database Syst Rev. 2011;3:CD006776

  33. Neumann I, Alonso-Coello P, Vandvik PO, Agoritsas T, Mas G, Akl EA, et al. Do clinicians want recommendations? A multicenter study comparing evidence summaries with and without GRADE recommendations. J Clin Epidemiol. 2018;99:33–40

  34. Han MA, Zeraatkar D, Guyatt GH, Vernooij RWM, El Dib R, Zhang Y, et al. Reduction of red and processed meat intake and cancer mortality and incidence: a systematic review and meta-analysis of cohort studies. Ann Intern Med. 2019;171(10):711–20

  35. Vernooij R, Guyatt GH, Zeraatkar D, Han MA, Valli C, El Dib R, et al. Reconciling contrasting guideline recommendations on red and processed meat for health outcomes. J Clin Epidemiol. 2021.

  36. Johnston BC, Zeraatkar D, Han MA, Vernooij RWM, Valli C, El Dib R, et al. Unprocessed red meat and processed meat consumption: dietary guideline recommendations from the nutritional recommendations (NutriRECS) consortium. Ann Intern Med. 2019;171(10):756–64

  37. Gates M, Gates A, Duarte G, Cary M, Becker M, Prediger B, et al. Quality and risk of bias appraisals of systematic reviews are inconsistent across reviewers and centers. J Clin Epidemiol. 2020;125:9–15


Acknowledgements

We would like to thank Dr. Ning Liang from the Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, Beijing, China, for help in retrieving and screening articles in Chinese, and Anna Witkowska from the Chair of Epidemiology and Preventive Medicine, Jagiellonian University Medical College, Krakow, for help in retrieving full texts of the articles and for creating the PRISMA flow diagram graphics.

Some preliminary results of the study were presented at the 1st Evidence-Based Research Conference, “Increasing the Value of Research” (online conference, 16–17 November 2020). They were also submitted and accepted for presentation at the Cochrane Colloquium planned for 4–7 October 2020 in Toronto, Canada. The Colloquium was cancelled, but the research abstracts were published in the supplement.

Authors’ information

Not applicable.


Funding

This project was funded by the National Science Centre (grant No. UMO-2017/25/B/NZ7/01276). The funding body had no role in the design of the study; the collection, analysis, and interpretation of data; or the writing of the manuscript.

Author information

Authors and Affiliations



Contributions

All authors contributed significantly to the work as follows: concept (MMB), design (MJS, DS, JZ, MK, BCJ, MMB), data acquisition (all authors), data interpretation and analysis (MJS, DS, JZ, MK, BCJ, MMB), drafting the manuscript (MJS, DS, BCJ, MMB), revising the manuscript critically (all authors), final approval of the version to be submitted (all authors).

Corresponding author

Correspondence to Malgorzata M. Bala.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

MMB received funding for the research from the National Science Centre (grant number UMO-2017/25/B/NZ7/01276). BCJ received funds from Texas A&M AgriLife Research (2019) to support investigator-initiated research related to saturated and polyunsaturated fats for a separate research project. These funds came from interest and investment earnings, not from a sponsoring organization, industry or company. MMB and BCJ are GRADE working group members. Otherwise, the authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Swierz, M.J., Storman, D., Zajac, J. et al. Similarities, reliability and gaps in assessing the quality of conduct of systematic reviews using AMSTAR-2 and ROBIS: systematic survey of nutrition reviews. BMC Med Res Methodol 21, 261 (2021).
