From: Methodological insights into ChatGPT’s screening performance in systematic reviews
Human raters | Evaluation metric | Human rater value [95% CI] | ChatGPT [95% CI] | P-value (two-tailed) |
---|---|---|---|---|
GP 1 | Sensitivity | 0.55 [0.48,0.63] | 0.95 [0.91,0.98] | < 0.001 |
GP 1 | Specificity | 0.94 [0.93,0.96] | 0.65 [0.62,0.68] | < 0.001 |
GP 1 | Precision (PPV) | 0.58 [0.50,0.66] | 0.28 [0.24,0.32] | < 0.001 |
GP 1 | Negative Predictive Value | 0.94 [0.92,0.95] | 0.99 [0.98,1.00] | < 0.001 |
GP 1 | Positive Likelihood Ratio | 9.70 [7.48,13.34] | 2.71 [2.46,3.01] | < 0.001 |
GP 1 | Negative Likelihood Ratio | 0.47 [0.39,0.56] | 0.08 [0.03,0.14] | < 0.001 |
GP 1 | Balanced Accuracy | 0.75 [0.71,0.79] | 0.80 [0.77,0.82] | 0.016 |
GP 1 | Jaccard Index | 0.39 [0.33,0.47] | 0.27 [0.23,0.31] | < 0.001 |
GP 1 | False Negative Rate | 0.45 [0.37,0.52] | 0.05 [0.02,0.09] | < 0.001 |
GP 1 | Proportion Missed | 0.06 [0.05,0.08] | 0.01 [0.00,0.02] | < 0.001 |
GP 2 | Sensitivity | 0.55 [0.46,0.63] | 0.95 [0.91,0.98] | < 0.001 |
GP 2 | Specificity | 0.99 [0.98,0.99] | 0.65 [0.62,0.68] | < 0.001 |
GP 2 | Precision (PPV) | 0.86 [0.79,0.93] | 0.28 [0.24,0.32] | < 0.001 |
GP 2 | Negative Predictive Value | 0.94 [0.92,0.95] | 0.99 [0.98,1.00] | < 0.001 |
GP 2 | Positive Likelihood Ratio | 44.20 [26.11,90.48] | 2.71 [2.46,3.01] | < 0.001 |
GP 2 | Negative Likelihood Ratio | 0.46 [0.38,0.54] | 0.08 [0.03,0.14] | < 0.001 |
GP 2 | Balanced Accuracy | 0.77 [0.72,0.81] | 0.80 [0.77,0.82] | 0.16 |
GP 2 | Jaccard Index | 0.50 [0.42,0.58] | 0.27 [0.23,0.31] | < 0.001 |
GP 2 | False Negative Rate | 0.45 [0.37,0.54] | 0.05 [0.02,0.09] | < 0.001 |
GP 2 | Proportion Missed | 0.06 [0.05,0.08] | 0.01 [0.00,0.02] | < 0.001 |
GP 3 | Sensitivity | 0.74 [0.66,0.80] | 0.95 [0.91,0.98] | < 0.001 |
GP 3 | Specificity | 0.94 [0.92,0.95] | 0.65 [0.62,0.68] | < 0.001 |
GP 3 | Precision (PPV) | 0.62 [0.55,0.68] | 0.28 [0.24,0.32] | < 0.001 |
GP 3 | Negative Predictive Value | 0.96 [0.95,0.97] | 0.99 [0.98,1.00] | < 0.001 |
GP 3 | Positive Likelihood Ratio | 11.37 [9.00,14.60] | 2.71 [2.46,3.01] | < 0.001 |
GP 3 | Negative Likelihood Ratio | 0.28 [0.21,0.36] | 0.08 [0.03,0.14] | < 0.001 |
GP 3 | Balanced Accuracy | 0.84 [0.80,0.87] | 0.80 [0.77,0.82] | 0.076 |
GP 3 | Jaccard Index | 0.50 [0.44,0.57] | 0.27 [0.23,0.31] | < 0.001 |
GP 3 | False Negative Rate | 0.26 [0.20,0.34] | 0.05 [0.02,0.09] | < 0.001 |
GP 3 | Proportion Missed | 0.04 [0.03,0.05] | 0.01 [0.00,0.02] | < 0.001 |
Voting Consensus (GPs) | Sensitivity | 0.62 [0.54,0.70] | 0.95 [0.91,0.98] | < 0.001 |
Voting Consensus (GPs) | Specificity | 0.98 [0.97,0.99] | 0.65 [0.62,0.68] | < 0.001 |
Voting Consensus (GPs) | Precision (PPV) | 0.83 [0.75,0.89] | 0.28 [0.24,0.32] | < 0.001 |
Voting Consensus (GPs) | Negative Predictive Value | 0.95 [0.93,0.96] | 0.99 [0.98,1.00] | < 0.001 |
Voting Consensus (GPs) | Positive Likelihood Ratio | 34.35 [22.94,58.37] | 2.71 [2.46,3.01] | < 0.001 |
Voting Consensus (GPs) | Negative Likelihood Ratio | 0.39 [0.31,0.46] | 0.08 [0.03,0.14] | < 0.001 |
Voting Consensus (GPs) | Balanced Accuracy | 0.80 [0.76,0.84] | 0.80 [0.77,0.82] | 0.906 |
Voting Consensus (GPs) | Jaccard Index | 0.55 [0.48,0.62] | 0.27 [0.23,0.31] | < 0.001 |
Voting Consensus (GPs) | False Negative Rate | 0.38 [0.30,0.46] | 0.05 [0.02,0.09] | < 0.001 |
Voting Consensus (GPs) | Proportion Missed | 0.05 [0.04,0.07] | 0.01 [0.00,0.02] | < 0.001 |
Specific Consensus (GPs) | Sensitivity | 0.32 [0.25,0.39] | 0.95 [0.91,0.98] | < 0.001 |
Specific Consensus (GPs) | Specificity | 1.00 [0.99,1.00] | 0.65 [0.62,0.68] | < 0.001 |
Specific Consensus (GPs) | Precision (PPV) | 0.94 [0.87,1.00] | 0.28 [0.24,0.32] | < 0.001 |
Specific Consensus (GPs) | Negative Predictive Value | 0.91 [0.89,0.93] | 0.99 [0.98,1.00] | < 0.001 |
Specific Consensus (GPs) | Positive Likelihood Ratio | 111.15 [46.58,∞] | 2.71 [2.46,3.01] | < 0.001 |
Specific Consensus (GPs) | Negative Likelihood Ratio | 0.68 [0.61,0.76] | 0.08 [0.03,0.14] | < 0.001 |
Specific Consensus (GPs) | Balanced Accuracy | 0.66 [0.62,0.69] | 0.80 [0.77,0.82] | < 0.001 |
Specific Consensus (GPs) | Jaccard Index | 0.31 [0.24,0.38] | 0.27 [0.23,0.31] | 0.392 |
Specific Consensus (GPs) | False Negative Rate | 0.68 [0.61,0.75] | 0.05 [0.02,0.09] | < 0.001 |
Specific Consensus (GPs) | Proportion Missed | 0.09 [0.07,0.11] | 0.01 [0.00,0.02] | < 0.001 |
Sensitive Consensus (GPs) | Sensitivity | 0.90 [0.85,0.95] | 0.95 [0.91,0.98] | 0.074 |
Sensitive Consensus (GPs) | Specificity | 0.89 [0.87,0.91] | 0.65 [0.62,0.68] | < 0.001 |
Sensitive Consensus (GPs) | Precision (PPV) | 0.53 [0.47,0.59] | 0.28 [0.24,0.32] | < 0.001 |
Sensitive Consensus (GPs) | Negative Predictive Value | 0.98 [0.98,0.99] | 0.99 [0.98,1.00] | 0.364 |
Sensitive Consensus (GPs) | Positive Likelihood Ratio | 7.93 [6.72,9.64] | 2.71 [2.46,3.01] | < 0.001 |
Sensitive Consensus (GPs) | Negative Likelihood Ratio | 0.11 [0.06,0.17] | 0.08 [0.03,0.14] | 0.364 |
Sensitive Consensus (GPs) | Balanced Accuracy | 0.89 [0.87,0.92] | 0.80 [0.77,0.82] | < 0.001 |
Sensitive Consensus (GPs) | Jaccard Index | 0.50 [0.44,0.56] | 0.27 [0.23,0.31] | < 0.001 |
Sensitive Consensus (GPs) | False Negative Rate | 0.10 [0.05,0.15] | 0.05 [0.02,0.09] | 0.074 |
Sensitive Consensus (GPs) | Proportion Missed | 0.02 [0.01,0.02] | 0.01 [0.00,0.02] | 0.364 |
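
The ten metrics in the table can all be derived from a 2×2 confusion matrix of include/exclude decisions against a reference standard. The Python sketch below shows those relationships under stated assumptions: the names `Confusion` and `screening_metrics` are hypothetical, the counts in the usage example are illustrative values chosen only to land near the rounded ChatGPT column (they are not the study's data), and "Proportion Missed" is assumed to mean false negatives divided by all screened records, which the table itself does not define. Confidence intervals and p-values are not reproduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts of screening decisions versus the reference standard."""
    tp: int  # relevant records the rater included
    fp: int  # irrelevant records the rater included
    fn: int  # relevant records the rater excluded
    tn: int  # irrelevant records the rater excluded


def screening_metrics(c: Confusion) -> dict[str, float]:
    """Return the table's metrics; assumes every denominator is non-zero,
    apart from the positive likelihood ratio when specificity equals 1."""
    n = c.tp + c.fp + c.fn + c.tn
    sens = c.tp / (c.tp + c.fn)
    spec = c.tn / (c.tn + c.fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
        # LR+ is unbounded when there are no false positives.
        "lr_pos": sens / (1 - spec) if c.fp > 0 else float("inf"),
        "lr_neg": (1 - sens) / spec,
        "balanced_accuracy": (sens + spec) / 2,
        "jaccard": c.tp / (c.tp + c.fp + c.fn),
        "fnr": 1 - sens,
        # Assumption: false negatives as a share of all screened records.
        "proportion_missed": c.fn / n,
    }


# Illustrative counts only (hypothetical, not taken from the study).
example = screening_metrics(Confusion(tp=95, fp=245, fn=5, tn=455))
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

With these illustrative counts, sensitivity is 0.95 and specificity is 0.65, so the positive likelihood ratio is 0.95 / 0.35 ≈ 2.71 and the balanced accuracy is 0.80, in line with the rounded ChatGPT values reported in the table.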