Table 1 The median number of trial registrations to be screened to achieve 100% recall

From: The automation of relevant trial registration screening for systematic review updates: an evaluation study on a large dataset of ClinicalTrials.gov registrations

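| Approach | Model | Median [IQR] |
| --- | --- | --- |
| Hierarchical clustering | TF-IDF, Ward, Euclidean | 501 [43–4363] |
| Hierarchical clustering | TF-IDF, Single-linkage, Euclidean | 90,725 [31,070–132,615] |
| Hierarchical clustering | LDA, 50 topics, Ward, Euclidean | 6352 [986–86,990] |
| Hierarchical clustering | LDA, 100 topics, Ward, Euclidean | 4381 [465–91,954] |
| Hierarchical clustering | LDA, 150 topics, Ward, Euclidean | 4653 [687–77,875] |
| Hierarchical clustering | LDA, 200 topics, Ward, Euclidean | 4453 [425–70,500] |
| Hierarchical clustering | LDA, 50 topics, Single-linkage, Euclidean | 112,858 [71,482–136,676] |
| Hierarchical clustering | LDA, 100 topics, Single-linkage, Euclidean | 115,957 [67,384–137,334] |
| Hierarchical clustering | LDA, 150 topics, Single-linkage, Euclidean | 113,794 [76,429–139,497] |
| Hierarchical clustering | LDA, 200 topics, Single-linkage, Euclidean | 124,259 [89,399–146,225] |
| Hierarchical clustering | Doc2Vec, 50-dimensional vector, Ward, Euclidean | 2256 [198–23,926] |
| Hierarchical clustering | Doc2Vec, 100-dimensional vector, Ward, Euclidean | 2432 [176–26,889] |
| Hierarchical clustering | Doc2Vec, 150-dimensional vector, Ward, Euclidean | 3850 [238–78,118] |
| Hierarchical clustering | Doc2Vec, 200-dimensional vector, Ward, Euclidean | 5113 [231–70,483] |
| Hierarchical clustering | Doc2Vec, 50-dimensional vector, Single-linkage, Euclidean | 125,000 [84,772–150,519] |
| Hierarchical clustering | Doc2Vec, 100-dimensional vector, Single-linkage, Euclidean | 127,604 [91,965–150,958] |
| Hierarchical clustering | Doc2Vec, 150-dimensional vector, Single-linkage, Euclidean | 128,801 [89,896–151,288] |
| Hierarchical clustering | Doc2Vec, 200-dimensional vector, Single-linkage, Euclidean | 128,978 [89,398–151,171] |
| Document similarity | TF-IDF, Euclidean | 99 [19–491] |
| Document similarity | LDA, 50 topics, Euclidean | 1287 [271–4968] |
| Document similarity | LDA, 100 topics, Euclidean | 842 [134–3776] |
| Document similarity | LDA, 150 topics, Euclidean | 793 [123–4268] |
| Document similarity | LDA, 200 topics, Euclidean | 887 [116–5417] |
| Document similarity | Doc2Vec, 50-dimensional vector, Euclidean | 18,501 [1970–51,495] |
| Document similarity | Doc2Vec, 100-dimensional vector, Euclidean | 33,968 [7898–68,806] |
| Document similarity | Doc2Vec, 150-dimensional vector, Euclidean | 41,116 [12,036–77,218] |
| Document similarity | Doc2Vec, 200-dimensional vector, Euclidean | 43,879 [13,791–82,388] |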
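
The metric in the table caption, the number of registrations to be screened to achieve 100% recall, can be read as the rank of the last relevant registration in a model's ordering. The sketch below is only an illustration of that reading, not the authors' code: the function names, the toy registration IDs, and the choice of the "inclusive" quartile method are all assumptions, and the source does not say how its quartiles were computed.

```python
from statistics import median, quantiles


def screened_for_full_recall(ranked_ids, relevant_ids):
    """Return how many ranked items must be screened to find every relevant one."""
    remaining = set(relevant_ids)
    if not remaining:
        return 0
    for position, reg_id in enumerate(ranked_ids, start=1):
        remaining.discard(reg_id)
        if not remaining:
            # The last relevant registration was found at this rank.
            return position
    raise ValueError("ranking does not contain every relevant registration")


def median_iqr(values):
    """Return (median, first quartile, third quartile) for a list of counts."""
    # Assumption: 'inclusive' quartiles; the source does not state its method.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q1, q3


# Hypothetical toy data: one (ranking, relevant set) pair per systematic review.
rankings = {
    "review_a": (["NCT1", "NCT2", "NCT3", "NCT4"], {"NCT1", "NCT3"}),
    "review_b": (["NCT5", "NCT6", "NCT7"], {"NCT5"}),
    "review_c": (["NCT8", "NCT9", "NCT10", "NCT11", "NCT12"], {"NCT9", "NCT12"}),
}

counts = [screened_for_full_recall(r, rel) for r, rel in rankings.values()]
summary = median_iqr(counts)
print(counts, summary)
```

Under this reading, a lower median means fewer registrations need manual screening before every relevant trial has been seen, which is how the rows of the table can be compared.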