Model | Median [IQR] |
---|---|
Hierarchical clustering | |
TF-IDF, Ward, Euclidean | 501 [43–4363] |
Single-linkage, Euclidean | 90,725 [31,070–132,615] |
LDA, 50 topics, Ward, Euclidean | 6352 [986–86,990] |
100 topics | 4381 [465–91,954] |
150 topics | 4653 [687–77,875] |
200 topics | 4453 [425–70,500] |
LDA, 50 topics, Single-linkage, Euclidean | 112,858 [71,482–136,676] |
100 topics | 115,957 [67,384–137,334] |
150 topics | 113,794 [76,429–139,497] |
200 topics | 124,259 [89,399–146,225] |
Doc2Vec, 50-dimensional vector, Ward, Euclidean | 2256 [198–23,926] |
100-dimensional vector | 2432 [176–26,889] |
150-dimensional vector | 3850 [238–78,118] |
200-dimensional vector | 5113 [231–70,483] |
Doc2Vec, 50-dimensional vector, Single-linkage, Euclidean | 125,000 [84,772–150,519] |
100-dimensional vector | 127,604 [91,965–150,958] |
150-dimensional vector | 128,801 [89,896–151,288] |
200-dimensional vector | 128,978 [89,398–151,171] |
Document similarity | |
TF-IDF, Euclidean | 99 [19–491] |
LDA, 50 topics, Euclidean | 1287 [271–4968] |
100 topics | 842 [134–3776] |
150 topics | 793 [123–4268] |
200 topics | 887 [116–5417] |
Doc2Vec, 50-dimensional vector, Euclidean | 18,501 [1970–51,495] |
100-dimensional vector | 33,968 [7898–68,806] |
150-dimensional vector | 41,116 [12,036–77,218] |
200-dimensional vector | 43,879 [13,791–82,388] |