Table 3 Comparing human raters against ChatGPT at threshold ≥ 3, across the whole dataset

From: Methodological insights into ChatGPT’s screening performance in systematic reviews

| Human raters | Evaluation metric | Human rater value [95% CI] | ChatGPT [95% CI] | P-value (two-tailed) |
|---|---|---|---|---|
| GP 1 | Sensitivity | 0.55 [0.48,0.63] | 0.95 [0.91,0.98] | < 0.001 |
| | Specificity | 0.94 [0.93,0.96] | 0.65 [0.62,0.68] | < 0.001 |
| | Precision (PPV) | 0.58 [0.50,0.66] | 0.28 [0.24,0.32] | < 0.001 |
| | Negative Predictive Value | 0.94 [0.92,0.95] | 0.99 [0.98,1.00] | < 0.001 |
| | Positive Likelihood Ratio | 9.70 [7.48,13.34] | 2.71 [2.46,3.01] | < 0.001 |
| | Negative Likelihood Ratio | 0.47 [0.39,0.56] | 0.08 [0.03,0.14] | < 0.001 |
| | Balanced Accuracy | 0.75 [0.71,0.79] | 0.80 [0.77,0.82] | 0.016 |
| | Jaccard Index | 0.39 [0.33,0.47] | 0.27 [0.23,0.31] | < 0.001 |
| | False Negative Rate | 0.45 [0.37,0.52] | 0.05 [0.02,0.09] | < 0.001 |
| | Proportion Missed | 0.06 [0.05,0.08] | 0.01 [0.00,0.02] | < 0.001 |
| GP 2 | Sensitivity | 0.55 [0.46,0.63] | 0.95 [0.91,0.98] | < 0.001 |
| | Specificity | 0.99 [0.98,0.99] | 0.65 [0.62,0.68] | < 0.001 |
| | Precision (PPV) | 0.86 [0.79,0.93] | 0.28 [0.24,0.32] | < 0.001 |
| | Negative Predictive Value | 0.94 [0.92,0.95] | 0.99 [0.98,1.00] | < 0.001 |
| | Positive Likelihood Ratio | 44.20 [26.11,90.48] | 2.71 [2.46,3.01] | < 0.001 |
| | Negative Likelihood Ratio | 0.46 [0.38,0.54] | 0.08 [0.03,0.14] | < 0.001 |
| | Balanced Accuracy | 0.77 [0.72,0.81] | 0.80 [0.77,0.82] | 0.16 |
| | Jaccard Index | 0.50 [0.42,0.58] | 0.27 [0.23,0.31] | < 0.001 |
| | False Negative Rate | 0.45 [0.37,0.54] | 0.05 [0.02,0.09] | < 0.001 |
| | Proportion Missed | 0.06 [0.05,0.08] | 0.01 [0.00,0.02] | < 0.001 |
| GP 3 | Sensitivity | 0.74 [0.66,0.80] | 0.95 [0.91,0.98] | < 0.001 |
| | Specificity | 0.94 [0.92,0.95] | 0.65 [0.62,0.68] | < 0.001 |
| | Precision (PPV) | 0.62 [0.55,0.68] | 0.28 [0.24,0.32] | < 0.001 |
| | Negative Predictive Value | 0.96 [0.95,0.97] | 0.99 [0.98,1.00] | < 0.001 |
| | Positive Likelihood Ratio | 11.37 [9.00,14.60] | 2.71 [2.46,3.01] | < 0.001 |
| | Negative Likelihood Ratio | 0.28 [0.21,0.36] | 0.08 [0.03,0.14] | < 0.001 |
| | Balanced Accuracy | 0.84 [0.80,0.87] | 0.80 [0.77,0.82] | 0.076 |
| | Jaccard Index | 0.50 [0.44,0.57] | 0.27 [0.23,0.31] | < 0.001 |
| | False Negative Rate | 0.26 [0.20,0.34] | 0.05 [0.02,0.09] | < 0.001 |
| | Proportion Missed | 0.04 [0.03,0.05] | 0.01 [0.00,0.02] | < 0.001 |
| Voting Consensus (GPs) | Sensitivity | 0.62 [0.54,0.70] | 0.95 [0.91,0.98] | < 0.001 |
| | Specificity | 0.98 [0.97,0.99] | 0.65 [0.62,0.68] | < 0.001 |
| | Precision (PPV) | 0.83 [0.75,0.89] | 0.28 [0.24,0.32] | < 0.001 |
| | Negative Predictive Value | 0.95 [0.93,0.96] | 0.99 [0.98,1.00] | < 0.001 |
| | Positive Likelihood Ratio | 34.35 [22.94,58.37] | 2.71 [2.46,3.01] | < 0.001 |
| | Negative Likelihood Ratio | 0.39 [0.31,0.46] | 0.08 [0.03,0.14] | < 0.001 |
| | Balanced Accuracy | 0.80 [0.76,0.84] | 0.80 [0.77,0.82] | 0.906 |
| | Jaccard Index | 0.55 [0.48,0.62] | 0.27 [0.23,0.31] | < 0.001 |
| | False Negative Rate | 0.38 [0.30,0.46] | 0.05 [0.02,0.09] | < 0.001 |
| | Proportion Missed | 0.05 [0.04,0.07] | 0.01 [0.00,0.02] | < 0.001 |
| Specific Consensus (GPs) | Sensitivity | 0.32 [0.25,0.39] | 0.95 [0.91,0.98] | < 0.001 |
| | Specificity | 1.00 [0.99,1.00] | 0.65 [0.62,0.68] | < 0.001 |
| | Precision (PPV) | 0.94 [0.87,1.00] | 0.28 [0.24,0.32] | < 0.001 |
| | Negative Predictive Value | 0.91 [0.89,0.93] | 0.99 [0.98,1.00] | < 0.001 |
| | Positive Likelihood Ratio | 111.15 [46.58,∞] | 2.71 [2.46,3.01] | < 0.001 |
| | Negative Likelihood Ratio | 0.68 [0.61,0.76] | 0.08 [0.03,0.14] | < 0.001 |
| | Balanced Accuracy | 0.66 [0.62,0.69] | 0.80 [0.77,0.82] | < 0.001 |
| | Jaccard Index | 0.31 [0.24,0.38] | 0.27 [0.23,0.31] | 0.392 |
| | False Negative Rate | 0.68 [0.61,0.75] | 0.05 [0.02,0.09] | < 0.001 |
| | Proportion Missed | 0.09 [0.07,0.11] | 0.01 [0.00,0.02] | < 0.001 |
| Sensitive Consensus (GPs) | Sensitivity | 0.90 [0.85,0.95] | 0.95 [0.91,0.98] | 0.074 |
| | Specificity | 0.89 [0.87,0.91] | 0.65 [0.62,0.68] | < 0.001 |
| | Precision (PPV) | 0.53 [0.47,0.59] | 0.28 [0.24,0.32] | < 0.001 |
| | Negative Predictive Value | 0.98 [0.98,0.99] | 0.99 [0.98,1.00] | 0.364 |
| | Positive Likelihood Ratio | 7.93 [6.72,9.64] | 2.71 [2.46,3.01] | < 0.001 |
| | Negative Likelihood Ratio | 0.11 [0.06,0.17] | 0.08 [0.03,0.14] | 0.364 |
| | Balanced Accuracy | 0.89 [0.87,0.92] | 0.80 [0.77,0.82] | < 0.001 |
| | Jaccard Index | 0.50 [0.44,0.56] | 0.27 [0.23,0.31] | < 0.001 |
| | False Negative Rate | 0.10 [0.05,0.15] | 0.05 [0.02,0.09] | 0.074 |
| | Proportion Missed | 0.02 [0.01,0.02] | 0.01 [0.00,0.02] | 0.364 |
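
For readers who want to see how the ten metrics in this table relate to one another, the following is a minimal sketch that computes them from a 2×2 confusion matrix using the standard textbook definitions. The names `Confusion` and `screening_metrics` are hypothetical, and the counts in the usage example are invented to approximate ChatGPT's rounded sensitivity (0.95) and specificity (0.65); they are not the study's actual counts. "Proportion Missed" is assumed here to mean false negatives divided by all screened records, which the table itself does not define.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts of a binary include/exclude screening decision vs. the reference."""
    tp: int  # relevant records correctly included
    fp: int  # irrelevant records wrongly included
    fn: int  # relevant records wrongly excluded
    tn: int  # irrelevant records correctly excluded


def screening_metrics(c: Confusion) -> dict:
    """Return the ten point estimates listed in the table (no confidence intervals)."""
    total = c.tp + c.fp + c.fn + c.tn
    sens = c.tp / (c.tp + c.fn)
    spec = c.tn / (c.tn + c.fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
        # LR+ is unbounded when there are no false positives.
        "lr_pos": sens / (1 - spec) if spec < 1 else float("inf"),
        "lr_neg": (1 - sens) / spec,
        "balanced_accuracy": (sens + spec) / 2,
        "jaccard": c.tp / (c.tp + c.fp + c.fn),
        "fnr": 1 - sens,
        # Assumption: missed relevant records as a share of all records.
        "proportion_missed": c.fn / total,
    }


if __name__ == "__main__":
    # Hypothetical counts (not from the study): 100 relevant, 700 irrelevant.
    example = Confusion(tp=95, fp=245, fn=5, tn=455)
    for name, value in screening_metrics(example).items():
        print(f"{name:18s} {value:.2f}")
```

With these illustrative counts, the likelihood ratios and balanced accuracy follow directly from sensitivity and specificity (for example, 0.95 / (1 − 0.65) ≈ 2.71), which is why ChatGPT's values in those rows are fixed once its sensitivity and specificity are known. The sketch does not guard against empty rows or columns in the confusion matrix, and it does not reproduce the confidence intervals or p-values reported in the table.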