From: Methodological insights into ChatGPT’s screening performance in systematic reviews
Rater | Threshold | TPs [95% CI] | TNs [95% CI] | FPs [95% CI] | FNs [95% CI] |
---|---|---|---|---|---|
ChatGPT | ≥ 2 | 141 [120–163] | 564 [531–599] | 486 [452–520] | 7 [3–12] |
ChatGPT | ≥ 3* | 140 [119–162] | 684 [650–718] | 366 [333–398] | 8 [3–14] |
ChatGPT | ≥ 4 | 107 [87–127] | 912 [884–938] | 138 [116–158] | 41 [29–53] |
ChatGPT | ≥ 5 | 15 [8–23] | 1037 [1014–1060] | 13 [6–20] | 133 [113–156] |
GP 1 | n/a | 82 [67–99] | 990 [965–1013] | 60 [46–74] | 66 [51–83] |
GP 2 | n/a | 81 [65–99] | 1037 [1013–1059] | 13 [6–21] | 67 [52–84] |
GP 3 | n/a | 109 [90–128] | 982 [955–1006] | 68 [53–83] | 39 [28–52] |
Voting consensus | n/a | 92 [75–110] | 1031 [1007–1054] | 19 [11–28] | 56 [42–71] |
Specific consensus | n/a | 47 [34–59] | 1047 [1024–1069] | 3 [0–7] | 101 [83–122] |
Sensitive consensus | n/a | 133 [113–154] | 931 [902–958] | 119 [99–139] | 15 [8–22] |
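
The counts in this table can be turned into the usual screening metrics with simple arithmetic. The sketch below is illustrative and not taken from the article: it assumes the standard definitions sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), and the names `Counts`, `sensitivity`, and `specificity` are hypothetical helpers introduced here. Only the point estimates from a few rows are used; the 95% confidence intervals are not reproduced.

```python
from typing import NamedTuple


class Counts(NamedTuple):
    """Confusion-matrix counts for one rater (hypothetical helper type)."""
    tp: int  # true positives
    tn: int  # true negatives
    fp: int  # false positives
    fn: int  # false negatives


def sensitivity(c: Counts) -> float:
    """Share of truly relevant records the rater included: TP / (TP + FN)."""
    return c.tp / (c.tp + c.fn)


def specificity(c: Counts) -> float:
    """Share of truly irrelevant records the rater excluded: TN / (TN + FP)."""
    return c.tn / (c.tn + c.fp)


# Point estimates copied from selected rows of the table above.
rows = {
    "ChatGPT (threshold >= 3)": Counts(tp=140, tn=684, fp=366, fn=8),
    "GP 3": Counts(tp=109, tn=982, fp=68, fn=39),
    "Voting consensus": Counts(tp=92, tn=1031, fp=19, fn=56),
}

for name, c in rows.items():
    print(f"{name}: sensitivity={sensitivity(c):.3f}, "
          f"specificity={specificity(c):.3f}")
```

In every row, TP + FN sums to 148 and TN + FP sums to 1050, so all raters appear to have been scored against the same set of records, and the derived rates are directly comparable across rows.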