The semi-automation of title and abstract screening: a retrospective exploration of ways to leverage Abstrackr’s relevance predictions in systematic and rapid reviews

Background: We investigated the feasibility of using a machine learning tool's relevance predictions to expedite title and abstract screening.

Methods: We subjected 11 systematic reviews and six rapid reviews to four retrospective screening simulations (automated and semi-automated approaches to single-reviewer and dual independent screening) in Abstrackr, a freely available machine learning tool. We calculated the proportion missed, workload savings, and time savings compared with single-reviewer and dual independent screening by human reviewers. We performed cited reference searches to determine whether missed studies would be identified via reference list scanning.

Results: For systematic reviews, the semi-automated, dual independent screening approach provided the best balance of time savings (median (range) 20 (3–82) hours) and reliability (median (range) proportion of missed records: 1 (0–14)%). The cited references search identified 59% (n = 10/17) of the missed records. For the rapid reviews, the fully and semi-automated approaches saved time (median (range) 9 (2–18) hours and 3 (1–10) hours, respectively), but less so than for the systematic reviews. The median (range) proportion of missed records for both approaches was 6 (0–22)%.

Conclusion: Using Abstrackr to assist one of two reviewers in systematic reviews saves time with little risk of missing relevant records. Many missed records would be identified via other means.


Background
Systematic evidence syntheses provide the foundation of informed decision-making; however, the large and growing body of primary studies makes it difficult to complete them efficiently and keep them up to date [1]. To avoid missing relevant studies, rigorously conducted evidence syntheses typically include comprehensive searches of multiple sources [2]. Often, two reviewers screen the retrieved records, first by title and abstract and then by full text, to identify those that are relevant. The process requires substantial effort and time to return a relatively small body of relevant studies [1]. Machine learning (ML) tools offer the potential to expedite title and abstract screening by predicting and prioritizing the relevance of candidate records.
At the time of writing, the SR Tool Box, an online repository of software tools that support and/or expedite evidence synthesis processes, referenced 37 tools aimed at supporting title and abstract screening [3]. Freely available, off-the-shelf tools like Abstrackr, RobotAnalyst, and Rayyan allow review teams without ML expertise or with limited resources to create efficiencies during title and abstract screening. By prioritizing relevant records, such tools give reviewers the opportunity to identify relevant studies earlier and move forward with subsequent review tasks (e.g., data extraction, risk of bias appraisal) sooner [4]. The relevance predictions produced by ML tools can also be leveraged by review teams to semi-automate title and abstract screening by eliminating records predicted to be irrelevant [5].
Mounting interest in the use of ML tools to expedite title and abstract screening has been accompanied by skepticism and distrust among review teams and end users of reviews, and adoption has been slow [6]. A fundamental concern associated with automatically or semi-automatically eliminating candidate records is that important studies may be missed, compromising the comprehensiveness of the review and potentially the validity of its conclusions. Evidence of reliable ways to leverage ML tools' relevance predictions in real-world evidence synthesis projects is one step toward garnering trust and promoting adoption. In the present study, our objective was to explore the benefits (workload and estimated time savings) and risks (proportion of studies missed) of leveraging an ML tool's predictions to expedite citation screening via four retrospective screening simulations.

Protocol
In advance of the study, the research team developed a protocol, available via the Open Science Framework (https://osf.io/2ph78/, doi: https://doi.org/10.17605/OSF.IO/2PH78). We made the following changes to the protocol during the conduct of the study: (1) added an additional systematic review to the sample; and (2) added a post-hoc analysis to determine if missed studies would have been located by scanning reference lists. We added the additional systematic review prior to data analysis, as it had recently been completed at our centre and allowed for a larger sample of reviews. The post-hoc analysis was added recognizing that electronic database searching is just one of the means by which relevant studies are typically sought in systematic reviews.

Abstrackr
Abstrackr is a freely available ML tool (http://abstrackr.cebm.brown.edu) that aims to enhance the efficiency of title and abstract screening [7]. To screen in Abstrackr, all citations retrieved via the electronic searches must first be uploaded to the software. The reviewer is then prompted to select review settings, including how many reviewers will screen each title and abstract (one or two), and the order in which the records will be viewed (in random order, or by predicted relevance). Once the review is set up, records are presented to reviewers one at a time on the user interface, including the title, authors, abstract, and keywords. As records appear on the screen, the reviewer is prompted to label each as relevant, irrelevant, or borderline, after which the next record appears.
While reviewers screen in Abstrackr, the ML model learns to predict the relevance of the remaining (unscreened) records via active learning and dual supervision [7]. In active learning, the reviewer(s) must first screen a "training set" to teach the model to distinguish between relevant and irrelevant records based on common features (e.g., words or combinations of words that are indicative of relevance or irrelevance). In dual supervision, the reviewers can impart their knowledge of the review task to the model in the form of labeled terms. When setting up the review, reviewers can tag terms that are indicative of relevance or irrelevance. For example, the terms "systematic review" or "review" may be tagged as irrelevant in systematic reviews that seek to include only primary research. The relevance terms are exploited by the model, along with the reviewers' screening decisions, when developing predictions [7].
After screening a training set, the reviewers can view and download Abstrackr's relevance predictions for the records that have not yet been screened. The predictions are typically available within 24 h of screening an adequate training set (i.e., upon server reload). The predictions are presented to reviewers in two ways: a numeric value representing the probability of relevance (0 to 1), and a binary relevance prediction (i.e., the "hard" screening prediction, true or false). Review teams may choose to leverage these predictions to prioritize relevant records, or to automatically eliminate records that are less likely to be relevant. Although many ML tools aimed at expediting title and abstract screening exist, we chose Abstrackr for this study because: (1) its development is well documented; (2) empirical evaluations of its performance exist [8][9][10][11]; (3) experiences at our centre showed that it was more reliable and user-friendly than other available tools [11]; and (4) it is freely available, so more review teams are likely to benefit from practical evaluations of its capabilities.
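As a toy illustration of these two uses of the downloaded predictions (the record structure and values below are hypothetical, not Abstrackr's export format):

```python
# Illustrative sketch: each unscreened record carries a numeric probability of
# relevance (0 to 1) and a binary "hard" screening prediction (True or False).
records = [
    {"id": 101, "prob": 0.92, "hard": True},
    {"id": 102, "prob": 0.41, "hard": False},
    {"id": 103, "prob": 0.75, "hard": True},
    {"id": 104, "prob": 0.08, "hard": False},
]

# Use 1: prioritization -- screen the most-likely-relevant records first.
prioritized = sorted(records, key=lambda r: r["prob"], reverse=True)

# Use 2: automatic elimination -- drop records the model predicts are irrelevant,
# leaving only the predicted-relevant records for manual screening.
to_screen = [r for r in records if r["hard"]]
excluded = [r for r in records if not r["hard"]]
```

Prioritization leaves the final decision entirely with the reviewers; automatic elimination trades some reliability for workload savings, which is the trade-off the simulations below quantify.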

Sample of reviews
We selected a convenience sample of 11 systematic reviews and 6 rapid reviews completed at our centre. The reviews were heterogeneous with respect to the type of research question, included study designs, screening workload, and search precision (Table 1).
The median (range) screening workload for the systematic reviews was 2928 (651 to 12,156) records.
Across systematic reviews, 8 (2 to 16)% of the records retrieved by the searches were included at the title and abstract screening stage, and 1 (0.01 to 3)% following scrutiny by full text. The median (range) number of included records was 40 (1 to 137). The median (range) screening workload for the rapid reviews was 1250 (451 to 2413) records. Across rapid reviews, 14 (5 to 26)% of the records retrieved by the searches were included at the title and abstract screening stage, and 5 (0.04 to 8)% following scrutiny by full text. The median (range) number of included records was 33 (1 to 179).
Although there can be several differences in conduct between systematic and rapid reviews, for the purpose of this study we defined the review types based solely on the method of study selection. For the systematic reviews, two reviewers independently screened all records at the title and abstract stage, and any record marked as relevant by either reviewer was scrutinized by full text. The two reviewers reached agreement on the full texts included in each review; all disagreements were resolved through discussion or the involvement of a third reviewer. In all cases, the two reviewers included: (a) a senior reviewer (typically the researcher involved in planning and overseeing the conduct of the review, and the reviewer with the most systematic review and/or content experience), and (b) one or more junior reviewers (i.e., second reviewers), who were typically research assistants involved in screening and sometimes (but not always) other aspects of the systematic review (e.g., data extraction, risk of bias appraisal). For the rapid reviews, a single, experienced (i.e., senior) reviewer selected the relevant studies at both the title and abstract and full text stages. Compared with dual independent screening, the risk of missing relevant studies is increased when study selection is performed by a single reviewer; however, the approach is likely appropriate for rapid reviews [12]. We selected this approach, in consultation with the commissioners and/or end users of each review, to create efficiencies while maintaining an acceptable level of methodological rigour.

Screening procedure
For each review, we uploaded the records identified via the electronic database searches to Abstrackr and selected the single screener mode and random citation order setting. Abstrackr's ability to learn and accurately predict the relevance of candidate records depends on the correct identification and labeling of relevant and irrelevant records in the training set. Thus, members of the research team (AG, MG, MS, SG) retrospectively replicated the senior reviewer's (i.e., the reviewer we presumed would have made the most accurate screening decisions) original screening decisions, based on the screening records maintained for each review, for a 200-record training set. Although the ideal training set size is not known, similar tools suggest a training set containing at least 40 excluded and 10 included records, up to a maximum of 300 records [13].
For systematic reviews conducted at our centre, any record marked as "include" or "unsure" by either of two independent reviewers is eligible for scrutiny by full text (i.e., the responses are deemed equivalent). Thus, our screening records include one of two decisions per record: include/unsure or exclude. It was impossible to retrospectively determine whether the "include/unsure" decisions were truly includes or unsures, so we considered all to be includes.
After screening the training sets, we waited for Abstrackr's relevance predictions. When predictions were not available within 48 h, we continued to screen in batches of 100 records until they were. Once available, we downloaded the predictions. We used the "hard" screening predictions (true or false, i.e., relevant or irrelevant) rather than deciding on custom eligibility thresholds based on Abstrackr's relevance probabilities. As the ideal threshold is not known, using the hard screening predictions likely better approximated real-world use of the tool.

Retrospective simulations
We tested four ways to leverage Abstrackr's predictions to expedite screening:

1. In the context of single reviewer screening (often used in rapid reviews):

   a. Fully automated, single screener approach: after screening a training set of 200 records, the senior reviewer downloads the predictions, excludes all records predicted to be irrelevant, and moves the records predicted to be relevant forward to full text screening; or

   b. Semi-automated, single screener approach: after screening a training set of 200 records, the senior reviewer downloads the predictions and excludes all records predicted to be irrelevant. To reduce the full text screening workload, the reviewer screens the records predicted to be relevant. Of these, those that the reviewer agrees are relevant move forward to full text screening.
2. In the context of dual independent screening (often used in systematic reviews):

   a. Fully automated, dual independent screening approach: after screening a training set of 200 records, the senior reviewer downloads the predictions. The second reviewer screens all of the records as per usual. Abstrackr's predictions and the second reviewer's decisions are compared, and any record marked as relevant by either the second reviewer or the senior reviewer/Abstrackr moves forward to full text screening; or

   b. Semi-automated, dual independent screening approach: after screening a training set of 200 records, the senior reviewer downloads the predictions and excludes all records predicted to be irrelevant. To reduce the full text screening workload, the senior reviewer screens the records predicted to be relevant. The second reviewer screens all the records as per usual. Abstrackr's predictions and the second reviewer's decisions are compared, and any record marked as relevant by either the second reviewer or the senior reviewer/Abstrackr moves forward to full text screening.
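The consensus rule behind each approach reduces to simple boolean logic; a minimal sketch in Python, with function and parameter names of our own choosing (not from the study):

```python
def single_fully_automated(prediction: bool) -> bool:
    """1a. A record moves to full text iff Abstrackr predicts it relevant."""
    return prediction

def single_semi_automated(prediction: bool, senior_agrees: bool) -> bool:
    """1b. Predicted-relevant records are re-screened by the senior reviewer."""
    return prediction and senior_agrees

def dual_fully_automated(prediction: bool, second_reviewer: bool) -> bool:
    """2a. Either the prediction or the second reviewer can advance a record."""
    return prediction or second_reviewer

def dual_semi_automated(prediction: bool, senior_agrees: bool,
                        second_reviewer: bool) -> bool:
    """2b. The senior reviewer verifies predicted-relevant records;
    the second reviewer screens everything as usual."""
    return (prediction and senior_agrees) or second_reviewer
```

Written this way, it is easy to see why the dual approaches are safer: the second reviewer's decision can rescue any record the model wrongly excludes, whereas in the single reviewer approaches a false prediction eliminates the record outright.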
Appendix A includes a visual representation of each screening approach. To test the feasibility of the approaches, we downloaded Abstrackr's relevance predictions for each review. In Excel (v. 2016, Microsoft Corporation, Redmond, Washington), we created a workbook for each review, including a row for each record and a column for each of: the title and abstract screening decisions (retrospective); the full text consensus decisions (retrospective); and Abstrackr's relevance predictions. We then determined the title and abstract consensus decisions that would have resulted via each approach. Two researchers tabulated the results of each simulation, and compared their results to minimize the risk of error.
Comprehensive search strategies include not only searching bibliographic databases but scanning reference lists, searching trial registries and grey literature, and contacting experts [2,14]. To determine whether the records missed by each approach would have been located via other means, one researcher (MG) performed a cited references search in Scopus and Google Scholar (for records not indexed in Scopus) to simulate scanning the reference lists of the included studies.

Analysis
We exported all data to SPSS Statistics (v. 25, IBM Corporation, Armonk, New York) for analyses. Using data from 2 × 2 cross-tabulations, we calculated performance metrics for each approach using standard formulae [4]:

1. Proportion of records missed (i.e., error): of the records included in the final report, the proportion that were excluded during title and abstract screening.

2. Workload savings (i.e., absolute screening reduction): of the records that need to be screened by title and abstract, the proportion that would not need to be screened manually.

3. Estimated time savings: the time saved by not screening the records manually. We assumed a screening rate of 0.5 min per record [15] and an 8-h work day.
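A minimal sketch of these three calculations (function names and example figures are ours, assuming the 0.5 min per record screening rate stated above):

```python
def proportion_missed(included_final: int, missed: int) -> float:
    """Of the records included in the final report, the proportion
    that were excluded during title and abstract screening."""
    return missed / included_final

def workload_savings(total_records: int, not_screened: int) -> float:
    """Of the records needing title/abstract screening, the proportion
    that would not need to be screened manually."""
    return not_screened / total_records

def time_savings_hours(not_screened: int,
                       minutes_per_record: float = 0.5) -> float:
    """Hours saved by not screening the records manually."""
    return not_screened * minutes_per_record / 60
```

For example, auto-excluding 2409 of 2928 records at 0.5 min per record would save roughly 20 h of screening, or about 2.5 eight-hour work days.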
These performance metrics were selected because they (a) have been reported in previous published evaluations [8,11], allowing for comparisons to other studies, and (b) are relevant to review teams and end users of reviews who are considering the balance of benefits and risks of adopting ML-assisted screening approaches. Appendix B shows the 2 × 2 tables and calculation of the performance metrics for one systematic review (Activity and pregnancy) and one rapid review (Community gardening).

Screening characteristics and Abstrackr's predictions
The predictions became available after the 200-record training set for all reviews, except Visual Acuity, for which we needed to screen 300 records (likely due to the small proportion of included studies). Table 2 shows the characteristics of the training sets and Abstrackr's predictions for each review. The median (range) proportion of included records in the training sets was 7 (1 to 13)% for the systematic reviews and 25 (4 to 38)% for the rapid reviews. Abstrackr predicted that a respective median (range) 30 (12 to 67)% and 48 (10 to 65)% of the remaining records in the systematic and rapid reviews were relevant.

Single reviewer simulations
Table 3 shows the performance metrics for the single reviewer approaches. For the fully automated approach, the median (range) proportion missed across the systematic reviews was 11 (0 to 38)%, or 7 (0 to 35) records in the final reports. The proportion missed for the semi-automated approach was 20 (0 to 44)%, or 9 (0 to 37) included records. Across the rapid reviews, the proportion missed was 6 (0 to 22)% for both the fully and semi-automated simulations, or 2 (0 to 25) included records. In all but two systematic reviews, the semi-automated and fully automated approaches resulted in more missed records than independent screening by a single reviewer (i.e., the second reviewer).
For the fully automated approach, the median (range) workload savings across systematic reviews was 97 (85 to 99)%, or 5656 (1102 to 24,112) records that would not need to be screened manually. For the semi-automated simulation, the workload savings was 83 (65 to 93)%, or 5337 (991 to 21,995) records. Across the rapid reviews, the median (range) workload savings for the fully automated approach was 83 (56 to 92)%, or 1050 (251 to 2213) records. For the semi-automated approach, the workload savings was 39 (30 to 78)%, or 418 (161 to 1197) records.
For the fully automated approach, the median (range) estimated time savings across systematic reviews was 47 (9 to 201) hours, or 6 (1 to 25) days. For the semi-automated approach, the time savings was 44 (8 to 183) hours, or 7 (1 to 23) days. For the rapid reviews, the time savings for the fully automated simulation was 9 (2 to 18) hours, or 1 (< 1 to 2) days. For the semi-automated simulation, the time savings was 3 (1 to 10) hours, or < 1 (< 1 to 1) day.

Dual independent screening simulations
Table 4 shows the performance metrics for the dual independent screening approaches (relevant only to the systematic reviews). Across systematic reviews, the median (range) proportion missed was 0 (0 to 14)% for the fully automated approach, or 0 (0 to 3) records in the final reports. For the semi-automated simulation, the proportion missed was 1 (0 to 14)%, or 1 (0 to 6) included records. For six (55%) of the systematic reviews, fewer records were missed via the fully automated approach than via independent screening by a single reviewer (i.e., the second reviewer). For the semi-automated simulation, the same was true for five (45%) of the systematic reviews.
The median (range) workload savings was 47 (35 to 49)% for the fully automated simulation and 33 (15 to 43)% for the semi-automated simulation, accounting for a respective 2728 (451 to 11,956) and 2409 (340 to 9839) records that would not need to be screened manually. The median (range) estimated time savings was 23 (4 to 100) hours for the fully automated simulation and 20 (3 to 82) hours for the semi-automated simulation, equivalent to a respective 4 (< 1 to 12) and 3 (< 1 to 10) days.

Cited references search
The dual independent screening, semi-automated approach provided the best balance of benefits and risks (i.e., relatively large workload savings and few missed records). We identified 10 (59%) of the 17 studies erroneously excluded across systematic reviews via the cited references search. This resulted in a reduction in the proportion missed among five (83%) of the six systematic reviews in which studies were missed. In the Biomarkers, VBAC, and Experiences of bronchiolitis reviews, the number of studies missed was reduced from 6 (13%) to 2 (4%), 3 (14%) to 2 (10%), and 3 (11%) to 1 (4%), respectively. In the Antipsychotics and Treatments for Bronchiolitis reviews, where a respective 3 (2%) and 1 (1%) studies were missed, all were successfully identified via the cited references search. Across systematic reviews, the median (range) proportion missed diminished to 0 (0 to 10)%, accounting for 0 (0 to 2) of the studies in the final reports.

Discussion
We evaluated the risks and benefits of four approaches to leveraging an ML tool's relevance predictions to expedite title and abstract screening in systematic and rapid reviews. Although the potential for workload and time savings was greatest in the single reviewer approaches, up to more than 40% of relevant studies were missed. We did not evaluate the impact of the missed studies on the reviews' conclusions, but given the inherent risk it is unlikely that review teams would readily adopt the single reviewer approaches. Conversely, the dual independent screening approaches both resulted in few missed studies, and the potential time savings remained considerable, especially in reviews with larger search yields (e.g., up to an estimated 100 h in the Antipsychotics review). Balanced against the relatively small risk of missing relevant studies, the dual independent screening, semi-automated approach (which reduces the full text screening volume compared with the fully automated approach) may be trustworthy enough for review teams to implement in practice. The gains in efficiency afforded by the automated and semi-automated approaches were less apparent among the rapid reviews than among the systematic reviews. One means of expediting review processes in the rapid reviews was to limit their scope, and thus the search yield and number of records to screen. This, in addition to the fact that records were screened by a single reviewer, considerably limited the potential for gains in efficiency.
Although limitations on scope and modifications to screening procedures are common [16] and well accepted [17] in rapid reviews, the potential for ML-assisted screening to expedite their completion should not be discounted. The slow adoption of ML has largely been influenced by review teams' and end users' distrust in a machine's ability to perform at the level of a human reviewer [6]. Since end users of rapid reviews are sometimes more willing to compromise methodological rigour to obtain information that supports decision-making sooner, rapid reviews may be an appealing medium for early adopters of ML-assisted screening. Our findings support methods whereby an ML tool's predictions are used to complement the work of a human reviewer. Although the proposed approaches are admittedly (albeit slightly) less trustworthy than dual independent screening, to fully appreciate their potential the findings must be interpreted in context. In rigorously conducted systematic reviews, electronic database searches are supplemented with additional search methods (e.g., contacting experts, hand-searching grey literature), so the limited risk of missing relevant records would be further diminished. As we have demonstrated, most missed records are likely to be identified via reference list scanning alone. We also speculate that any large, well-conducted study that would change the findings of a review would be identified by conscientious review teams at some point during the evidence synthesis process.

Strengths and limitations
Building on earlier studies that evaluated Abstrackr [8][9][10][11], we used a heterogeneous sample of reviews to compare and contrast the benefits and risks for four approaches to leveraging its relevance predictions. We used cited references searches to determine if missed studies would have been located via other means, simulating real-world evidence synthesis methodology. Although human reviewer judgement is imperfect, in this study it provided a realistic reference standard against which to compare the automated and semi-automated screening approaches.
Although the training set was sufficient in most cases to generate predictions, it is possible that another training set size would have produced different findings. Research at our centre showed that modest increases in the training set size (i.e., 500 records) did not improve the reliability of the predictions [11]. Whether the missed studies would affect the conclusions of reviews is an important concern for review teams; however, we did not evaluate this outcome. So few studies were missed via the dual independent screening approaches that substantial changes to review findings are highly unlikely.
The retrospective nature of this study did not allow for precise estimates of time savings. Potential gains in efficiency were estimated from a standard screening rate of two records per minute, as reported in an earlier study [15]. Although the selected screening rate was ambitious, it provided for conservative estimates of time savings for the purpose of this study.

Conclusions
Using Abstrackr's relevance predictions to assist one of two reviewers saves time while posing only a small risk of missing relevant studies in systematic reviews. In many cases, the approach was advantageous compared with screening by a single reviewer (i.e., fewer studies were missed). It is likely that missed studies would be identified via other means in the context of a comprehensive search. When screening is undertaken by a single reviewer (i.e., in rapid reviews), the time savings of the fully and semi-automated approaches were considerable; however, adoption is unlikely given the larger risk of missing relevant records.