Skip to main content

Methodological insights into ChatGPT’s screening performance in systematic reviews



The screening process for systematic reviews and meta-analyses in medical research is a labor-intensive and time-consuming task. While machine learning and deep learning have been applied to facilitate this process, these methods often require training data and user annotation. This study aims to assess the efficacy of ChatGPT, a large language model based on the Generative Pretrained Transformers (GPT) architecture, in automating the screening process for systematic reviews in radiology without the need for training data.


A prospective simulation study was conducted between May 2nd and 24th, 2023, comparing ChatGPT’s performance in screening abstracts against that of general physicians (GPs). A total of 1198 abstracts across three subfields of radiology were evaluated. Metrics such as sensitivity, specificity, positive and negative predictive values (PPV and NPV), workload saving, and others were employed. Statistical analyses included the Kappa coefficient for inter-rater agreement, ROC curve plotting, AUC calculation, and bootstrapping for p-values and confidence intervals.


ChatGPT completed the screening process within an hour, while GPs took an average of 7–10 days. The AI model achieved a sensitivity of 95% and an NPV of 99%, slightly outperforming the GPs’ sensitive consensus (i.e., including records if at least one person includes them). It also exhibited remarkably low false negative counts and high workload savings, ranging from 40 to 83%. However, ChatGPT had lower specificity and PPV compared to human raters. The average Kappa agreement between ChatGPT and other raters was 0.27.


ChatGPT shows promise in automating the article screening phase of systematic reviews, achieving high sensitivity and workload savings. While not entirely replacing human expertise, it could serve as an efficient first-line screening tool, particularly in reducing the burden on human resources. Further studies are needed to fine-tune its capabilities and validate its utility across different medical subfields.

Peer Review reports


The domain of deep learning has witnessed significant development over the past decade, dramatically transforming numerous fields, including medicine [1]. Machine translation, an application of deep learning that employs computer algorithms to automatically translate text or speech from one language to another, has achieved remarkable advancements in recent years. The advent of Attention mechanisms and the subsequent introduction of the Transformer architecture, proposing the self-attention concept, has revolutionized this field [2, 3].

The Transformer architecture forms the backbone of many large language models (LLMs). It uses a certain type of neural network that comprises two components—an encoder and a decoder. The encoder analyzes the input, which is a sequence of words, helping the network understand the overall context. While the decoder generates an output sequence conditioned on the established context. However, not all neural network models use both components. Some only utilize the encoder part (e.g., Bidirectional Encoder Representations from Transformers [BERT]), some the decoder part (e.g., Generative Pretrained Transformer [GPT]), and some both (e.g., Text-to-Text Transfer Transformer [T5]) [4,5,6].

GPT models, which utilize the decoder portion of the Transformer architecture, have achieved considerable success in text completion tasks [5, 7]. The recent development of ChatGPT, an LLM primarily based on GPT-3.5 and reinforced with human feedback (RLHF), extends its capabilities beyond text completion [8]. It can answer questions, maintain human-like dialogue, provide assistance, devise plans, and write performant code [9].

In the medical literature, systematic reviews and meta-analyses occupy the apex of the evidence pyramid [10]. They collect, critically appraise, and synthesize the results of multiple studies within a specific field. The production of these types of studies demands considerable effort due to the numerous steps required to ensure fair and comprehensive results. One of the earliest steps, the article screening phase, can be particularly labor-intensive and time-consuming. However, since it controls what studies are fed into the process, it is of utmost importance. It is vital that the systematic review’s conclusions are drawn based on the best available evidence, free from bias, and relevant to the research question. Errors at this stage can severely degrade the review’s validity and its utility in guiding practice and policy.

Historically, several studies have sought to apply machine learning or deep learning methods to assist in this process [11,12,13,14]. Despite these efforts, they usually require some form of annotation input by the user and mostly have evaluated their performance on retrospective data. Our study seeks to automate the process without the need for training. We hypothesize that delegating a portion of manual labor to ChatGPT can reduce missed potential articles and increase efficiency while conserving human resources. To scrutinize our hypothesis, we set up this study to assess the efficacy of ChatGPT concerning the screening process and compared its performance to human raters.

Materials and methods

Study design

We undertook a prospective simulation study from May 2nd to 24th 2023, designed to assess the accuracy and speed of ChatGPT in screening abstracts for systematic reviews. Our main objective was to measure how effective ChatGPT is in reliably excluding abstracts collected from the primary screened results of a systematic review. We also compared ChatGPT against a group of researchers, specifically general physicians (GPs), who are typically involved in the abstract screening process.

We used metrics such as sensitivity, specificity, precision or positive predictive value (PPV), negative predictive value (NPV), positive likelihood ratio (PLR), negative likelihood ratio (NLR), false negative rate, proportion missed, and workload saving to evaluate performance. False negative rate (FNR) is the proportion of actual positive cases that were incorrectly classified as negative. The proportion missed is similar to the FNR but is often expressed in a different context. It is the number of relevant studies that the rater has failed to identify, out of those it predicted to be irrelevant. This is essentially a measure of how many relevant studies were missed by the rater. Workload saving is the proportion of citations that were correctly identified as irrelevant, thereby reducing the workload for human reviewers. It is the proportion of citations predicted irrelevant out of the total number of citations. Below is the mathematical description for them:

$$FNR=\frac{FN}{TP+FN}, \,\,Proportion \,Missed= \frac{FN}{FN+TN}, \,\,Workload \,Saving= \frac{TN}{TN+FN+TP+FP}$$

FN: False negative, TP: True positive, TN: True negative, FP: False positive.

Data collection

We surveyed three extensive fields of radiology: diagnostic, interventional, and nuclear medicine. Six synthetic broad topics were proportionally proposed based on the distribution of corresponding PubMed search results frequency, then we designed a PICOS (Population, Intervention, Comparison, Outcomes, Study design) for each topic (Additional file 1: Table S1). It is noteworthy that the topics and their corresponding PICOS were conceived by our group of experts focusing on broad and diverse subjects. Subsequently, we systematically searched PubMed, Embase, and Web of Science from inception until April 29th, 2023 (Additional file 1: Table S2) using the queries that were carefully designed by the author and verified by the experts. We aimed for broader rather than specific terms when choosing the keywords as one would for real scenarios. Thereafter, we eliminated duplicates and citations that were missing abstracts using the software package EndNote X9 [15]. Next, due to the constraints of time and human resources, a random subset of 200 articles from each topic was selected using a Python script.

Three general physicians, who had experience in medical research synthesis and systematic review composition for over a year, were independently given the topics, corresponding PICOS, titles, and abstracts to screen. They were unaware that they were being compared to other raters and AI. Moreover, raters were not allowed to access the full texts of the articles. Their task was to determine whether to include or exclude the citation based solely on the provided PICOS, the title, and the abstract. Each individual marked 1198 citations (~ 200 articles for each of the six topics) as either included or excluded. Two experts, including a physician with over twenty years of research experience in radiology and a faculty member radiologist with more than five years of research experience in the field, both having previously published systematic reviews, were assigned the same task. The study employed a fair and thorough process for resolving any disagreements between the two experts. In such cases, a third expert—a physician with over two decades of research experience and published systematic reviews in the field—was consulted. The third expert was unaware of the identities of the previous experts and made the final decision in such situations. The final verdicts of the expert group were considered the study’s ground truth.

Overall, three types of consensuses were employed for each of the GP and expert groups: sensitive consensus included studies if at least one of the raters included them, specific consensus included studies if all of the raters included them, and voting consensus included studies if the majority of raters included them (in the case of the experts’ group, denoting the verdict of the third expert). Thus, in total, we had 6 outputs from the GPs group and 3 outputs from the experts group.

Finally, we interfaced ChatGPT via a custom Python script and OpenAI’s application programming interface (API), prompting it to rate the citations on a scale of 1 (least relevance) to 5 (most relevance) based on the provided PICOS. We only presented ChatGPT with titles, abstracts, reference types, publish dates and PICOS. This study utilized the May 3rd release of ChatGPT (specifically, “gpt-3.5-turbo”) with the parameter “temperature” set to 0.0 for a more deterministic behavior or “greedy search”.

Statistical analysis

Data preprocessing, cleaning, and analysis were accomplished with Python version 3.9.13, supported by various libraries such as pandas (for data-frame manipulation), numpy (for math and random number generation), random (for random sampling of the articles), scikit-learn (for statistical tests and metrics), and matplotlib and seaborn (for plotting) [16,17,18,19,20]. Our analysis included the computation of the Kappa (κ) coefficient for inter-rater agreement, plotting receiver operating characteristic curve (ROC), calculation of the area under the curve (AUC) or c-statistic, Youden’s index and threshold, and various other metrics, including sensitivity, specificity, PPV, NPV, PLR, NLR, FNR, proportion missed, workload saving, Jaccard index or Intersection over Union (IoU), and balanced accuracy [21,22,23,24]. Balanced accuracy was used to account for the imbalance present in our data. Confidence intervals (CIs) were calculated with a 95% threshold, and p-values below 0.05 were considered statistically significant. For the calculation of p-values, confidence intervals, and comparison between the metrics, bootstrapping with 1000 samples was employed [25]. Numbers in square brackets ([]) denote confidence interval.


Overall, the study involved the review of 1,198 abstracts and titles (Fig. 1, Table 1). The topics were chosen conforming to the following distribution: 3/6, 2/6, and 1/6 concerning diagnostic radiology, nuclear medicine, and interventional radiology respectively. Further details regarding the systematic searches for each topic are presented in Additional file 1: Table S2.

Fig. 1
figure 1

Article identification and sampling process. WoS: Web of Science

Table 1 Selected topics and corresponding sampling details

Three general physicians and two experts independently assessed the citations for inclusion. The review process took the GPs 7, 8, and 10 days (averaging ~ 2–3 h of work per day) (Additional file 1: Table S3), while it took the experts 3 and 5 days to rate and one day for the third expert to reach final verdicts (55 disagreements were addressed in total, Additional file 1: Table S4). In contrast, ChatGPT completed the process in less than one hour (~ 3 s per citation or ~ 1 h in total).

For the sake of brevity, we only considered the voting consensus (the final verdict reached by the third expert) as the gold standard in this study. More detailed results including alternative gold standards are presented in Additional file 1: Figures S2 to S9.

Inter-rater agreement

The average inter-rater agreement, as measured by the kappa statistic [κ], was moderate among the GPs at 0.45, and substantial between the two experts at 0.79. However, the agreement between ChatGPT, at threshold ≥ 3, and the other raters was lower with a mean kappa of 0.27. Details are provided in Table 2.

Table 2 Inter-rater agreements

Screening performance

Comparing GP consensuses—sensitive, specific, and voting—with our gold standard, they achieved sensitivities of 90%, 32%, and 62%, respectively (Table 3). As evident in the table, sensitive consensus performs better in almost every aspect compared to each individual.

Table 3 Comparing human raters against ChatGPT at threshold ≥ 3, across the whole dataset

ChatGPT was asked to rate the citations on a scale from 1 to 5, in alignment with the provided PICOS. The ROC curve derived from this rating process resulted in an AUC of 0.86 [0.83–0.89] (Fig. 2). Based on Youden’s index, the optimal rating threshold for ChatGPT was determined to be ≥ 3 (including ratings 3, 4, and 5 while excluding ratings 1 and 2). Hereafter, all of the reported results are obtained using this threshold unless otherwise noted. ChatGPT achieved a sensitivity of 95% and an NPV of 99%, slightly exceeding the GPs’ sensitive consensus, albeit not statistically significant. However, it did not perform as well in terms of specificity and PPV (Table 3). On the other hand, the AI exhibited remarkably low false negative counts, with only 7 and 8 at thresholds ≥ 2 and ≥ 3, respectively. These are lower than any other rater as shown in Table 4.

Fig. 2
figure 2

ChatGPT ratings ROC curve. Voting Consensus refers to the final verdict of the expert panel. ROC receiver operating characteristics, AUC area under the curve, CI confidence interval

Table 4 Classification details at different cut-offs over the whole dataset

ChatGPT in general had better performance in terms of false negative rates and proportions missed compared to other raters (both consensuses and individuals) as shown in Fig. 3. Workload savings were especially high, ranging from 40 to 83%, and overall exceeding 50% as depicted in Fig. 4. In addition, it was on average ~ 21 times faster than the physicians’ group.

Fig. 3
figure 3

Comparing precision, false negative rate, and proportion missed between raters. Error bars indicate 95% confidence intervals

Fig. 4
figure 4

ChatGPT workload savings across topics. Error bars indicate 95% confidence intervals


Three GPs and two experts independently reviewed 1,198 records and categorized them as included or excluded, which took several days. In contrast, it took ChatGPT less than one hour to do the same. ChatGPT exhibited remarkable sensitivity and NPV, both exceeding 95%. Additionally, it had the lowest false negative rates among the raters. On average, the proposed method achieved over 50% workload savings while being an order of magnitude faster.

This study’s primary objective was to assess the efficacy of ChatGPT as an AI adjunct in the task of abstract screening, a time-consuming initial phase of developing systematic reviews and meta-analyses. Rather than complete replacement, the goal is to augment the procedure of human evaluation by reducing workload and increasing efficiency. This can potentially help reduce biases and oversights encountered in the screening process by eliminating subjective inconsistencies and judgmental errors.

While a high AUC suggests a decent concordance between ChatGPT’s ratings and the gold standard, this alone does not translate into lower false negative rates or proportions missed. The count of false negatives, reflected in the aforementioned metrics, is of higher priority in the context of screening. Under specific thresholds (≥ 2 and ≥ 3), ChatGPT exhibited superior performance, achieving very low false negative rates (5%) and proportions missed (1%), compared to its human counterparts. This suggests that citations with low ratings can be confidently eliminated from screening processes, which make up more than half (on average 57%) of the citations. Capitalizing on the speed and scalability of the AI model, the screening process can be split into distinct stages. The first stage allows for the majority of the articles to be excluded automatically. The second stage, characterized by a higher model threshold (e.g., ≥ 4 or = 5), emphasizes the inclusion of only highly relevant articles. Human raters can then evaluate the indeterminate articles (e.g., articles with ratings between 2 and 4) with greater attentiveness. By adopting this hybrid approach, the burden on human raters can be reduced significantly, leading to greater time efficiency and increased accuracy, potentially comparable to those of experts.

We noticed a surprisingly low agreement score between ChatGPT and the other raters (mean κ = 0.27). This might point to the AI model’s fundamentally different “thinking” process, the subjective nature of this process, or simply be due to the selected threshold. This matter requires further investigations, however, since the LLM is basically an average of its training data –as are most statistical models, this could introduce some levels of objectivity into the field. It is imperative to state that ChatGPT has been “aligned” with human preferences using RLHF. While this has greatly increased the usability of the model for everyday tasks, it does not directly translate to better performance in other domains such as medical fields. Thus, we encourage more research in this less-explored area, either by developing medicine-centered language models or by scrutinizing existing models.

Our study incorporated various consensus types: sensitive, specific, and voting. The purpose of this decision was to evaluate the overall performance of the GPs, with each type emphasizing different aspects. Regarding each consensus type, ChatGPT outperformed all with respect to sensitivity (thresholds ≥ 2 and ≥ 3). However, human raters demonstrated superior specificity and precision (PPV). Making use of each rater’s strengths (ChatGPT being more sensitive and humans being more specific), these findings highlight the need for a hybrid approach that incorporates both humans and AI. If ChatGPT alone is to be used, many records will be included unnecessarily leading to a lengthy process of screening the articles in full text.

There are currently several tools, such as Rayyan, Abstrackr, and Colandr, that share a common objective [26,27,28,29]. However, they typically employ machine learning (ML) algorithms to rate the articles and need the users to annotate some citations as relevant, unsure, or irrelevant [30]. Some studies have attempted to evaluate the above tools’ performance but mostly relied on retrospective data from earlier systematic reviews [31, 32]. Furthermore, they generally did not have experts as ground truth and solely compared the algorithms’ performance based on the ratings from a single reviewer group [31, 32]. Due to the above reasons and the fact that our approach does not require users to provide annotations, their results may not fully be in alignment with ours.

Gates, A., Guitard, S., Pillay, J. et al. in their review of three ML tools designed for this purpose, employed two approaches: automated approach, delegating all of the screening process to the tools after a 200-record training and semi-automated approach, complementing the work of a single reviewer [33]. They reported by using Abstrackr, DistillerSR, and RobotAnalyst, respectively, the median proportion missed was 5%, 97%, and 70% for the automated simulation and 1%, 2%, and 2% for the semi-automated simulation. Without the need for prior training, ChatGPT with a 1% proportion missed outperforms their automated and is on par with their semi-automated approach (Fig. 3, Table 3).

LLMs are inherently more robust than ML models since they operate excellently even without needing to be pretrained or fine-tuned on a specific dataset (zero-shot performance), as is the case with a lot of ML models [30]. Our research provides a preliminary exploration of the application of LLMs, specifically ChatGPT, in medical research synthesis. The potential for AI models, especially LLMs, could extend to data extraction, objective quality assessment, questionnaire design, and criticism of methods, among other facets of research.


Despite our best efforts to simulate real-world topics, certain constraints limit the broader applicability of our findings.

While our study mainly focuses on the field of radiology, we still had limited resources regarding the number of chosen topics and the diversity of the topics. Limited by time and human resources, we decided to choose 6 titles, each encompassing ~ 200 articles. To cover a representative range of radiology topics, we searched and analyzed the volume of literature in each field to ensure a reasonable distribution. Although we attempted to provide general and representative topics for each field, it is imperative to note that the proposed topics may not entirely reflect real-world issues.

While the PICOS framework remains widely utilized as a prominent approach for defining research scope, alternative frameworks such as SPIDER, SPICE, and ECLIPSE may lead to different outcomes [34,35,36]. Considering the potential variations in results, further investigations are necessary to comprehensively assess the efficacy and performance of each framework.

Our team of general physicians consisted of three individuals under the age of 30, each with prior experience in article screening and writing review-type pieces. However, they have not had received any specialized training in radiology before the study. To reduce bias, we ensured that they did not communicate with one another and were unaware that they were being compared. It is worth noting that our selection of GPs may not perfectly represent those active in the medical research field, so the generalizability of our results regarding this matter should be verified. Our group of experts, on the other hand, consisted of one young board-certified radiologist and two physicians with over 20 years of experience as radiology researchers with numerous published systematic reviews. To form an even more qualified expert group, we recommend including more experienced radiologists with diverse backgrounds and a higher number of experts.

Although the raters were proficient in English, it was not their native language, thus their performance might have been suboptimal. In contrast, ChatGPT, being an LLM trained on massive English corpora, had an inherent advantage.

While ChatGPT was prompted to rate on a scale of 1 to 5, human raters were tasked with simple inclusion or exclusion. Using an ordinal scale allows us to establish clear cutoffs and prioritize various metrics more effectively. This mismatch could be addressed by asking human raters to provide similar scale-based ratings. However, this would introduce additional complexity and may not mirror real-world practices.

Even though our selection of a prompt template was done through trial and error, we only used a single template in the end. There may exist prompt templates with better performance, hence this is an active area of research. A handful of techniques are available for prompting LLMs such as Chain of Thought [37]. These techniques make use of the auto-regressive nature of these models to achieve more accurate responses through “reasoning”. Auto-regressive language models work by predicting the next tokens (could be words or sub-words) in a sequence based on the previous tokens they have already generated or are presented to them (“prompts”) [8]. In this study, we attempted to exploit this feature, similar to the Chain of Thought technique, by having ChatGPT explain its decision before outputting a rating, allowing for more “contemplation” and potentially more reliable responses (Fig. 5).

Fig. 5
figure 5

ChatGPT prompting template and rating process. Role: defines the role of AI, or in other words how the AI will behave. Task: the specific task assigned to the AI. Article: provided title, abstract, reference type, and published year. PICOS: determined PICOS based on the specified topic. API: application programming interface, the interface used to communicate with ChatGPT. A temperature of 0.0 was used. Response: the retrieved response from ChatGPT. JSON Parser: a utility tool to parse JSON (JavaScript Object Notation) formatted text. Rating: the rating extracted from the JSON object

The optimal threshold determined in this study may not be universally applicable, particularly in cases where the field of study differs. Therefore, the findings of this study call for external validation by other research groups. Moreover, the knowledge gained from our experience can be replicated across diverse medical domains and alternative research methodologies, such as clinical trials, cohort studies, and case–control studies. Evaluating the outcomes in these different fields has the potential to provide a more comprehensive understanding of the overall effectiveness of the approach in this particular context.

Even though we utilized ChatGPT for screening only radiology-centered publications, we believe that the results may very well extrapolate to other fields of medicine, especially fields with more publication volumes such as cardiology and neurology, since these fields most likely constitute a bigger portion of the model’s training data.

For our study, we utilized a particular version of ChatGPT. It is important to note that as the model is continuously updated, other attempts to replicate our study may yield different results. Additionally, analyzing false positive and negative results can inform strategies to further enhance ChatGPT’s efficacy in this regard.


Our study demonstrates ChatGPT’s potential as a valuable tool in the initial screening phase of systematic reviews confidently excluding more than 50% of irrelevant citations. It showed superior false negative rates and proportions missed within specific thresholds but lagged in specificity and precision (PPV) compared to human raters. A hybrid approach combining AI and human raters could optimize efficiency and accuracy. Further research is necessary to validate findings across fields and explore broader applications of large language models in medical research.

Declaration of Generative AI and AI-assisted technologies in the writing process

We utilized AI tools, specifically Grammarly GO and ChatGPT, to aid in rephrasing portions of this article. The purpose was to enhance clarity and readability only. Nevertheless, all AI-generated content was meticulously reviewed and verified by the authors.

Availability of data and materials

The data that support the findings of this study and the screening script are available online. Project name: ChatGPT Screener; Project home page (script):; Archived version (containing supporting data):; Operating system(s): Platform independent. Programming language: Python. Other requirements: Python installation, required libraries as specified in the repository. License: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. Any restrictions to use by non-academics: same as license.



Artificial intelligence


Application programming interface


Area under the curve


Confidence interval


False negative rate


General physician


Generative pretrained transformers


Intersection over union


Large language model


Machine learning


Negative likelihood ratio


Negative predictive value


Population, intervention, comparison, outcomes, study design


Positive likelihood ratio


Positive predictive value


Reinforcement learning with human feedback


Receiver-operator characteristics


  1. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44.

    Article  CAS  PubMed  Google Scholar 

  2. Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv pre-print server. 2016.

  3. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, eds. 2017.

  4. Devlin J, Chang M-W, Lee K, Toutanova K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805. 2018.

  5. Alec R, Karthik N. Improving language understanding by generative pre-training. 2018.

  6. Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res. 2020;21(1).

  7. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al. Language Models are few-shot learners. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, eds. 2020. p. 1877–901.

  8. Alec R, Jeff W, Rewon C, David L, Dario A, Ilya S. Language models are unsupervised multitask learners. 2019.

  9. Bubeck S, Chandrasekaran V, Eldan R, Gehrke J, Horvitz E, Kamar E, et al. Sparks of Artificial General Intelligence: Early experiments with GPT-4. 2023.

  10. Alessandro L, Douglas GA, Jennifer Marie T, Cynthia DM, Peter Christian G, John PAI, et al. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration. PLoS Med. 2009;6.

  11. Byron CW, Kevin S, Carla EB, Joseph L, Thomas AT. Deploying an interactive machine learning system in an evidence-based practice center: abstrackr. 2012.

  12. Ana Helena Salles dos R, Ana Luiza Miranda de O, Carolina F, James Z, Paulo F, Janaine Cunha P. Usefulness of machine learning softwares to screen titles of systematic reviews: a methodological study. Syst Rev. 2023;12.

  13. Kevin EKC, Robin LJL, Daniel FG, Leo N. Research Screener: a machine learning tool to semi-automate abstract screening for systematic reviews. Syst Rev. 2021;10.

  14. Amir V, Mana M, Amin N-A, Seyed Hossein Hosseini A, Mehrnush Saghab T, Reyhaneh A, et al. Abstract screening using the automated tool Rayyan: results of effectiveness in three diagnostic test accuracy systematic reviews. BMC Med Res Methodol. 2022;22.

  15. The EndNote Team. EndNote. EndNote X9 ed. Philadelphia, PA: Clarivate; 2013.

  16. McKinney W. Data Structures for Statistical Computing in Python2010. 56–61 p.

  17. Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, et al. Array programming with NumPy. Nature. 2020;585(7825):357–62.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  18. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in P ython. J Mach Learn Res. 2011;12:2825–30.

    Google Scholar 

  19. Hunter JD. Matplotlib: a 2D graphics environment. Comput Sci Eng. 2007;9:90–5.

    Article  Google Scholar 

  20. Waskom M. seaborn: statistical data visualization. J Open Source Softw. 2021;6:3021.

    Article  Google Scholar 

  21. Viera AJ, Garrett JM. Understanding interobserver agreement: the kappa statistic. Fam Med. 2005;37(5):360–3.

    PubMed  Google Scholar 

  22. Fawcett T. An introduction to ROC analysis. Pattern Recogn Lett. 2006;27(8):861–74.

    Article  Google Scholar 

  23. Schisterman EF, Perkins NJ, Liu A, Bondell H. Optimal cut-point and its corresponding Youden Index to discriminate individuals using pooled blood samples. Epidemiology. 2005;16(1):73–81.

    Article  PubMed  Google Scholar 

  24. Jaccard index: Wikipedia; 2023. updated 2023, May 21. Available from:

  25. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. 1st ed. New York: Chapman and Hall/CRC; 1994.

    Book  Google Scholar 

  26. Rayyan—a web and mobile app for systematic reviews. Syst Rev. 2016;5(1):210.

  27. Wallace BC, Small K, Brodley CE, Lau J, Trikalinos TA, editors. Deploying an interactive machine learning system in an evidence-based practice center: abstrackr. International Health Informatics Symposium; 2012.

  28. Kahili-Heede MK, Hillgren KJ. Colandr. J Med Library Assoc. 2021;109:523–5.

    Google Scholar 

  29. dos Reis AHS, de Oliveira ALM, Fritsch C, Zouch J, Ferreira P, Polese JC. Usefulness of machine learning softwares to screen titles of systematic reviews: a methodological study. Syst Rev. 2023;12.

  30. Bannach-Brown A, Przybyła P, Thomas J, Rice ASC, Ananiadou S, Liao J, Macleod MR. Machine learning algorithms for systematic review: reducing workload in a preclinical review of animal studies and reducing human screening error. Syst Rev. 2019;8(1):23.

    Article  PubMed  PubMed Central  Google Scholar 

  31. Gates A, Johnson C, Hartling L. Technology-assisted title and abstract screening for systematic reviews: a retrospective evaluation of the Abstrackr machine learning tool. Syst Rev. 2018;7.

  32. Rathbone J, Hoffmann TC, Glasziou PP. Faster title and abstract screening? Evaluating Abstrackr, a semi-automated online screening program for systematic reviewers. Syst Rev. 2015;4.

  33. Gates A, Guitard S, Pillay J, Elliott SA, Dyson MP, Newton AS, Hartling L. Performance and usability of machine learning for screening in systematic reviews: a comparative evaluation of three tools. Syst Rev. 2019;8(1):278.

    Article  PubMed  PubMed Central  Google Scholar 

  34. Methley A, Campbell SM, Chew‐Graham CA, McNally R, Cheraghi-Sohi S. PICO, PICOS and SPIDER: a comparison study of specificity and sensitivity in three search tools for qualitative systematic reviews. BMC Health Serv Res. 2014;14.

  35. Booth A. Clear and present questions: formulating questions for evidence based practice. Library Hi Tech. 2006;24(3):355–68.

    Article  Google Scholar 

  36. Wildridge V, Bell L. How CLIP became ECLIPSE: a mnemonic to assist in searching for health policy/management information. Health Info Libr J. 2002;19(2):113–5.

    Article  PubMed  Google Scholar 

  37. Wang B, Deng X, Sun H, editors. Iteratively Prompt Pre-trained Language Models for Chain of Thought2022 December; Abu Dhabi, United Arab Emirates: Association for Computational Linguistics.

Download references


Not applicable.



Author information

Authors and Affiliations



The study was designed and drafted by KF and MI, who also supervised the process. MI analyzed the data and wrote the scripts. HG provided the necessary materials and assets. ShK, MSh, and AHJ formed the expert group and assisted with the study’s draft. DZ, SK, and MAA were the general physicians.

Corresponding author

Correspondence to Kavous Firouznia.

Ethics declarations

Ethics approval and consent to participate

Since no human subjects were involved in the intervention and no sensitive data was used, this study did not require ethical approval.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Further details extending the results and methods of the manuscript.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Issaiy, M., Ghanaati, H., Kolahi, S. et al. Methodological insights into ChatGPT’s screening performance in systematic reviews. BMC Med Res Methodol 24, 78 (2024).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: