Methodological insights into ChatGPT’s screening performance in systematic reviews

Table 4 Classification details at different cut-offs over the whole dataset

Rater	Threshold	TPs [95% CI]	TNs [95% CI]	FPs [95% CI]	FNs [95% CI]
ChatGPT	≥ 2	141 [120–163]	564 [531–599]	486 [452–520]	7 [3–12]
ChatGPT	≥ 3*	140 [119–162]	684 [650–718]	366 [333–398]	8 [3–14]
ChatGPT	≥ 4	107 [87–127]	912 [884–938]	138 [116–158]	41 [29–53]
ChatGPT	≥ 5	15 [8–23]	1037 [1014–1060]	13 [6–20]	133 [113–156]
GP 1	n/a	82 [67–99]	990 [965–1013]	60 [46–74]	66 [51–83]
GP 2	n/a	81 [65–99]	1037 [1013–1059]	13 [6–21]	67 [52–84]
GP 3	n/a	109 [90–128]	982 [955–1006]	68 [53–83]	39 [28–52]
Voting consensus	n/a	92 [75–110]	1031 [1007–1054]	19 [11–28]	56 [42–71]
Specific consensus	n/a	47 [34–59]	1047 [1024–1069]	3 [0–7]	101 [83–122]
Sensitive consensus	n/a	133 [113–154]	931 [902–958]	119 [99–139]	15 [8–22]

*Youden’s threshold
TPs: true positives; TNs: true negatives; FPs: false positives; FNs: false negatives; CI: confidence interval; n/a: not applicable
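
The cut-off marked with an asterisk can be checked against the ChatGPT counts in Table 4. The following is a minimal sketch, assuming the standard definitions sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), and Youden's J = sensitivity + specificity - 1. The names CHATGPT_COUNTS, youden_j, and best_threshold are illustrative and do not come from the original study; only the point estimates are used, not the confidence intervals.

```python
# Point estimates copied from Table 4 (ChatGPT rows only).
# Mapping: threshold -> (TP, TN, FP, FN)
CHATGPT_COUNTS = {
    2: (141, 564, 486, 7),
    3: (140, 684, 366, 8),
    4: (107, 912, 138, 41),
    5: (15, 1037, 13, 133),
}


def sensitivity(tp: int, fn: int) -> float:
    """Share of truly relevant records that were flagged."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Share of truly irrelevant records that were excluded."""
    return tn / (tn + fp)


def youden_j(tp: int, tn: int, fp: int, fn: int) -> float:
    """Youden's J statistic: sensitivity + specificity - 1."""
    return sensitivity(tp, fn) + specificity(tn, fp) - 1


def best_threshold(counts: dict) -> int:
    """Return the threshold whose counts maximise Youden's J."""
    return max(counts, key=lambda t: youden_j(*counts[t]))


if __name__ == "__main__":
    for threshold, (tp, tn, fp, fn) in CHATGPT_COUNTS.items():
        print(
            f">= {threshold}: "
            f"sensitivity={sensitivity(tp, fn):.3f}, "
            f"specificity={specificity(tn, fp):.3f}, "
            f"J={youden_j(tp, tn, fp, fn):.3f}"
        )
    print("Threshold maximising J:", best_threshold(CHATGPT_COUNTS))
```

Under these assumptions, J is about 0.597 at ≥ 3 and about 0.592 at ≥ 4, so the two cut-offs are close, with ≥ 3 slightly ahead, which agrees with the asterisk in the table.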

ISSN: 1471-2288