Despite the presence of many guidelines and checklists concerning the methodology of SRs/MAs [13, 14, 23], the quality of published SR/MAs is variable and lacks consistency [24, 25, 26, 27]. Our study showed considerable variability in the methods that authors adopt when conducting and reporting SR/MAs. Many authors still do not apply recommended methods: for example, they fail to conduct manual searches for additional studies, fail to update searches, have data extracted by only one reviewer, and report the same data twice in subgroup MAs.
In our survey, the five most commonly searched information sources were PubMed, EMBASE, the Cochrane Library, WoS, and CINAHL. Three of these five (EMBASE, the Cochrane Library, CINAHL) correspond to the most frequently searched databases in reviews of physiotherapy [28]. However, authors in our survey preferred databases that are freely available (PubMed) over those requiring a paid subscription (EMBASE). Besides being free, PubMed has added several features that facilitate more comprehensive searching [28]. Conducting a comprehensive literature search is central to reducing selection and publication bias in SRs [29].
The number of databases searched and the quality of the search strategy are crucial for an effective literature search. In recent years, there has been increasing reliance on a range of databases, or on combinations of different databases, including MEDLINE and EMBASE, allied health databases (e.g., CINAHL and PsycINFO), and web-based searching to locate grey literature [30]. Some of the "databases" mentioned by respondents are not, strictly speaking, databases, which indicates that not all SR authors have sufficient knowledge of search engines and information sources. Less than 50% of authors use manual searches, or "hand-searches", based on the reference lists of published papers and, possibly, relevant conference proceedings or specific journal issues [1, 31]. This is unsatisfactory, as such additional searches are important for retrieving reports missed by electronic database searches and for compensating for inadequate search strategies. Methods of manual searching vary: authors are not routinely expected to hand-search journal contents, but scrutinizing reference lists is recommended. One might expect significant correlations between grey literature or manual searching and respondent characteristics indicating experience, such as having conducted SR/MAs for more than 5 years or having more than 14 SR/MA publications. A previous study concluded that grey literature searching, adjustment of search terms, and author-reported searching in SRs were suboptimal and need to be improved [32]. The same study suggested that librarian involvement contributes to comprehensive and reproducible search strategies for study identification and helps to produce high-quality SR/MAs [32]. Another study stated that searching for grey literature is easier with the help of a librarian [33]. The involvement of other experts, including statisticians, can also affect quality.
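As an aside on reproducibility, the electronic part of a search can itself be scripted so that it can be rerun and reported exactly. Below is a minimal sketch using Biopython's interface to NCBI's E-utilities; the query string and email address are placeholders for illustration, not a real SR search strategy.

```python
from Bio import Entrez  # Biopython wrapper around NCBI's E-utilities

Entrez.email = "reviewer@example.org"  # NCBI requires a contact address

# A placeholder query; a real SR strategy would be far more elaborate.
query = '("low back pain"[MeSH Terms]) AND (exercise[Title/Abstract])'

handle = Entrez.esearch(db="pubmed", term=query, retmax=200)
record = Entrez.read(handle)
handle.close()

print(record["Count"])   # total number of matching records
print(record["IdList"])  # PMIDs retrieved (up to retmax)
```

Archiving such a script alongside the review makes the date, query, and yield of each search fully reproducible.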
The tool used for quality assessment should cover all methodological criteria relevant to the validity and interpretation of the included papers, taking into consideration the design of the included studies [34]. Several domains for detecting and controlling the risk and sources of bias should be evaluated. In many SR/MAs of clinical trials, absence of allocation concealment and inadequate randomization and blinding were associated with overestimation of the effect. Pildal et al, who replicated a MA of 70 studies, found that more than two-thirds of the papers with an overall effect estimate favouring certain interventions showed no significant effect estimate after papers with inappropriate allocation concealment were excluded [35]. Among the available instruments, Cochrane proposed a robust tool for assessing the risk of bias of included clinical trials [1, 36]. Although about 75.0% of respondents used the Cochrane tool, we did not assess the usage of other instruments in detail [25, 26, 37]. Most of the other tools mentioned by the authors are not risk of bias assessment tools but tools for assessing the quality of reporting; using them for risk of bias assessment is therefore inappropriate.
Data extraction and handling is a fundamental step, and one that strongly determines the reliability of a SR/MA. Although a third (30.2%) of our respondents considered that only one reviewer was needed to extract the data of interest, they used other reviewers to check the extracted data to avoid potential bias, a procedure considered acceptable by AMSTAR [38]. Jones et al analyzed the data extraction methods in 34 Cochrane reviews and reported that extraction was carried out by two extractors independently in 30 reviews and by a single extractor in two, with the remaining two reviews not stating the number of extractors [39]. Recently, software packages such as Plot Digitizer and GetData Graph Digitizer [40] have become available to extract data that are reported only in graphs. Using such software to extract data from figures was faster and provided higher interrater reliability [41]. However, these tools have not yet been incorporated into methodological guidelines, so it was unsurprising to find that only 20.5% of authors used them.
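For readers unfamiliar with what such digitizers do, the core operation is a linear calibration from pixel coordinates to data coordinates based on two reference points per axis. The following is a minimal sketch of that calibration, not the actual code of either tool; log-scaled axes would require calibrating in log space.

```python
def calibrate(p1, p2, v1, v2):
    """Return a function mapping a pixel coordinate to a data value,
    given two reference pixels (p1, p2) with known data values (v1, v2)."""
    scale = (v2 - v1) / (p2 - p1)
    return lambda p: v1 + (p - p1) * scale

# Example: x-axis ticks 0 and 10 sit at pixels 80 and 480;
# y-axis ticks 0 and 50 sit at pixels 400 and 40 (pixel y grows downward).
to_x = calibrate(80, 480, 0.0, 10.0)
to_y = calibrate(400, 40, 0.0, 50.0)

# A data point clicked at pixel (280, 220) corresponds to:
print(to_x(280), to_y(220))  # -> 5.0 25.0
```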
Turning now to the methods of conducting MA, a widespread barrier to computing and calculating effect sizes (when extracting data from studies) is that crucial data, such as variances, standard deviations, and standard errors, may not be available from the study [42]. To cope with this, a large diversity of conversions and alternative formulations of effect sizes are available, many offered as computer packages [43]. Lajeunesse et al outlined a few simple imputation approaches that can be used to fill gaps in missing SDs when conversions are not possible [44]. These approaches include relying on resampling to fill gaps and estimating the coefficient of variation from the (complete) observed data, and they should only be applied after data extraction from all the studies has been completed. These SD imputation tools include metagear, which provides two variations on Rubin & Schenker's (1991) 'hot deck' imputation approach and imputes only SDs that are nearest neighbours relative to their means (i.e., it imputes SDs from data with means of a similar scale) [45]. Another option is Bracken's (1992) method, which fills in the missing information using the coefficient of variation from all studies with complete information [46].
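To make Bracken's coefficient-of-variation approach concrete, the following is a minimal Python sketch; metagear itself is an R package, and the function below is our own illustration rather than its API. The pooled CV (SD divided by mean) from studies with complete data is multiplied by the mean of each study with a missing SD:

```python
def impute_sd_bracken(means, sds):
    """Fill missing SDs (None) using the average coefficient of
    variation (CV = SD/mean) from studies with complete data."""
    complete = [(m, s) for m, s in zip(means, sds) if s is not None]
    cv = sum(s / m for m, s in complete) / len(complete)
    return [s if s is not None else m * cv for m, s in zip(means, sds)]

# Example: the third study did not report its SD.
means = [10.0, 15.0, 20.0]
sds = [2.0, 3.0, None]
print(impute_sd_bracken(means, sds))  # -> [2.0, 3.0, 4.0]
```

In practice, metagear offers ready-made implementations of these imputation methods in R, and any imputed SDs should be flagged and probed in sensitivity analyses.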
Handling and analyzing pre/post continuous data remains a point of debate, because what should be pooled from each study is the effect estimate of the change, not the post-intervention data alone, and the pre/post correlation is usually not reported. On this point, the authors' responses were distributed similarly across the proposed solutions. Few respondents chose to contact the authors of the original reports to obtain the data; contacting authors is infrequent in reviews because of low and delayed response rates [47]. However, 28% of the surveyed authors did not know what the correlation means, which may reflect their never having faced this issue before.
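To make explicit what the correlation is needed for: the SD of a pre/post change score depends on the baseline SD, the final SD, and the pre/post correlation r, via the Cochrane Handbook formula SD_change = sqrt(SD_pre^2 + SD_post^2 - 2*r*SD_pre*SD_post). A short sketch showing how sensitive the result is to an imputed r (the value 0.5 below is a commonly used default, not a rule):

```python
import math

def sd_change(sd_pre, sd_post, r):
    """SD of the pre/post change score, given the pre/post
    correlation r (Cochrane Handbook formula)."""
    return math.sqrt(sd_pre**2 + sd_post**2 - 2 * r * sd_pre * sd_post)

# Example: identical pre/post SDs of 8; the assumed r matters a great deal.
print(sd_change(8, 8, 0.5))  # -> 8.0
print(sd_change(8, 8, 0.8))  # -> ~5.06
```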
In MA, pooling already-analyzed or estimated data is not recommended and may be misleading; in our survey, only 6.8% used analyzed results. How to deal with adjusted versus unadjusted data in MA is an issue that needs to be highlighted and further investigated: 91.1% of authors analyzed only adjusted data, only unadjusted data, or both. Like meta-regression, subgroup MA is a method for testing the effect of covariates on the overall effect estimate. However, a common mistake is to repeat the control group data in a subgroup MA when the subgrouping relates to the nature of the intervention group (such as different doses), which inflates the weight of the control group in the overall effect size; 21.4% of our respondents indicated that they do this. When the primary studies report a correlation, the pooled effect size is the correlation coefficient, so it is satisfactory to find that two thirds of our respondents chose to pool Pearson and Spearman correlations in separate MAs.
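One remedy suggested in the Cochrane Handbook for such multi-arm (e.g., multi-dose) studies is to split the shared control arm among the subgroups so that each control participant is counted only once; for continuous outcomes, only the sample size is divided while the mean and SD are retained. A minimal sketch (the function name is ours, for illustration only):

```python
def split_control(n_control, mean, sd, k):
    """Split a shared control arm into k pseudo-arms so that each
    control participant enters the pooled analysis only once.
    Mean and SD are kept; only the sample size is divided."""
    base, remainder = divmod(n_control, k)
    return [{"n": base + (1 if i < remainder else 0),
             "mean": mean, "sd": sd} for i in range(k)]

# Example: one control arm (n = 100) shared by three dose subgroups.
print(split_control(100, 12.3, 4.1, 3))
# -> three pseudo-control arms with n = 34, 33, 33
```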
A limitation of our study is the low response rate. However, because we contacted many potential participants, we still obtained responses from more than three hundred SR/MA authors. Online surveys traditionally experience low response rates, which may result in selection bias and lower generalizability of results. Further, although SR/MA methodology has advanced in the years since Cochrane reviews first began to be published, we did not limit our search to reviews published since then.
Although we piloted the questionnaire, some of the questions may still have been unclear to respondents. Furthermore, our questions did not cover some practices, such as changing the inclusion and exclusion criteria, or even the main study question, after the search revealed an inadequate number of studies (sample size); some Cochrane SRs have been published with zero included studies. Moreover, we did not ask about the composition of the SR/MA team, for example whether or not it included specialists such as librarians and statisticians; we focused more on how each step was done. Similarly, we did not include a question about participation in Cochrane reviews that would have allowed us to compare Cochrane with non-Cochrane authors. The absence of a content analysis of published SRs is a further limitation, as such an analysis might provide more information than our survey. Finally, as noted above, our PubMed search to identify SR/MAs was not limited by publication date, which is a limitation because the methodology of SRs keeps evolving.