
Table 4 Classification details at different cut-offs over the whole dataset

From: Methodological insights into ChatGPT’s screening performance in systematic reviews

| Rater | Threshold | TPs [95% CI] | TNs [95% CI] | FPs [95% CI] | FNs [95% CI] |
|---|---|---|---|---|---|
| ChatGPT | ≥ 2 | 141 [120–163] | 564 [531–599] | 486 [452–520] | 7 [3–12] |
| ChatGPT | ≥ 3* | 140 [119–162] | 684 [650–718] | 366 [333–398] | 8 [3–14] |
| ChatGPT | ≥ 4 | 107 [87–127] | 912 [884–938] | 138 [116–158] | 41 [29–53] |
| ChatGPT | ≥ 5 | 15 [8–23] | 1037 [1014–1060] | 13 [6–20] | 133 [113–156] |
| GP 1 | n/a | 82 [67–99] | 990 [965–1013] | 60 [46–74] | 66 [51–83] |
| GP 2 | n/a | 81 [65–99] | 1037 [1013–1059] | 13 [6–21] | 67 [52–84] |
| GP 3 | n/a | 109 [90–128] | 982 [955–1006] | 68 [53–83] | 39 [28–52] |
| Voting consensus | n/a | 92 [75–110] | 1031 [1007–1054] | 19 [11–28] | 56 [42–71] |
| Specific consensus | n/a | 47 [34–59] | 1047 [1024–1069] | 3 [0–7] | 101 [83–122] |
| Sensitive consensus | n/a | 133 [113–154] | 931 [902–958] | 119 [99–139] | 15 [8–22] |

*Youden’s threshold

Abbreviations: TPs, true positives; TNs, true negatives; FPs, false positives; FNs, false negatives; CI, confidence interval; n/a, not applicable
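
As a reading aid, the sketch below recomputes sensitivity, specificity and Youden's J (sensitivity + specificity − 1) from the ChatGPT point estimates in the table. It assumes the starred threshold is the one that maximises J. The names `counts`, `youden_j` and `best_threshold` are illustrative and do not come from the article. Confidence intervals are ignored, so this is not necessarily the authors' procedure.

```python
# Point estimates copied from the ChatGPT rows of Table 4.
# Mapping: score threshold -> (TP, TN, FP, FN). Confidence intervals are omitted.
counts = {
    2: (141, 564, 486, 7),
    3: (140, 684, 366, 8),
    4: (107, 912, 138, 41),
    5: (15, 1037, 13, 133),
}


def youden_j(tp, tn, fp, fn):
    """Return Youden's J = sensitivity + specificity - 1 for one confusion matrix."""
    sensitivity = tp / (tp + fn)  # share of relevant records correctly included
    specificity = tn / (tn + fp)  # share of irrelevant records correctly excluded
    return sensitivity + specificity - 1


# J for every cut-off, then the cut-off with the largest J (illustrative assumption:
# this is how the starred threshold was selected).
j_by_threshold = {t: youden_j(*c) for t, c in counts.items()}
best_threshold = max(j_by_threshold, key=j_by_threshold.get)

for t, j in sorted(j_by_threshold.items()):
    print(f">= {t}: J = {j:.3f}")
print("Threshold with the largest J:", best_threshold)
```

On these point estimates the cut-offs ≥ 3 and ≥ 4 give very similar J values, so the choice between them is sensitive to small changes in the counts.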