Error in statistical tests of error in statistical tests
BMC Medical Research Methodology volume 6, Article number: 45 (2006)
Background: A recent paper found that terminal digits of statistical values in Nature deviated significantly from an equiprobable distribution, indicating errors or inconsistencies in rounding. This finding, as well as the discovery that a large percentage of p values were inconsistent with the reported test statistics, caused considerable concern in the popular press and the scientific community, and ultimately led to new statistical guidelines for all Nature Research Journals.
Methods: We checked the statistical analysis behind the original paper's tests of equiprobability.
Results: The original paper tested equiprobability with the Kolmogorov-Smirnov test outside its regime of validity. Correct tests find no statistically significant deviations from equiprobability for the statistical values in Nature.
Conclusion: Statistical tests should be used correctly.
Background
A recent paper concluded that "statistical practice is generally poor, even in the most renowned scientific journals" [1]. The paper prompted significant attention in the popular press, and serious concern within the scientific community [2–7]. It led the editors of Nature Medicine to review their statistical practices, ultimately resulting in new statistical guidelines for all Nature Research Journals [3].
One of the two main results of [1] was that terminal digits of statistical values in Nature deviated significantly from an equiprobable distribution, indicating errors or inconsistencies in rounding. The authors of [1] collected random samples of test statistics and p values published in Nature, and examined the terminal digits of these numbers. Their raw data are shown in tables 1 and 2. They argued that these terminal digits should be spread evenly among the ten possible digits. Applying the Kolmogorov-Smirnov test with SPSS for Windows, they obtained Z = 2.7, p < 0.0005, for the 610 test statistics, and Z = 1.4, p = 0.043, for the 181 p values. They thus concluded that the terminal digits suffered from errors, most likely due to poor rounding procedures. We point out that the original paper's test of equiprobability was based on invalid use of the Kolmogorov-Smirnov test on categorical data, and that correct statistical testing finds no statistically significant deviations from equiprobability.
The authors of [1] also found a number of cases where p values in Nature and the British Medical Journal were reported incorrectly, based on comparison with the reported test statistics. That finding is unaffected by our analysis.
Methods
We ran tests of equiprobability on the terminal digits of the test statistics and p values in Nature, using both χ2 tests and a modification of the Kolmogorov-Smirnov test for categorical data.
Results and Discussion
The Kolmogorov-Smirnov test is normally used to test whether data follow a specified continuous distribution [8]. A simple calculation from the raw data in [1] shows that the Z and p values obtained there are based on comparing the data with a distribution uniform on the continuous interval [0, 9]. But this distribution is obviously incorrect even before any comparison with the data, since a terminal digit cannot be, for example, 2.68. A check of the documentation for SPSS for Windows confirms that the program runs the Kolmogorov-Smirnov test for a continuous uniform distribution, rather than a discrete uniform distribution.
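The miscalibration is easy to demonstrate. The sketch below (our illustration, not the paper's data: it uses hypothetical, perfectly equiprobable digits and assumes scipy is available) applies the Kolmogorov-Smirnov test against a continuous uniform distribution on [0, 9], as SPSS does, and flags even ideal digit data as a significant deviation:

```python
import numpy as np
from scipy import stats

# Hypothetical terminal digits, drawn from a perfect discrete uniform
# (stand-in for the 610 digits; the real counts are in the paper's tables).
rng = np.random.default_rng(0)
digits = rng.integers(0, 10, size=610)

# KS test against a *continuous* uniform on [0, 9].
# The empirical CDF of digit data is a step function, so it sits far from
# the continuous CDF even when the digits are exactly equiprobable:
# e.g. about 10% of the digits equal 0, but the continuous CDF at 0 is 0.
d, p = stats.kstest(digits, stats.uniform(loc=0, scale=9).cdf)
print(f"D = {d:.3f}, p = {p:.2g}")  # spuriously "significant"
```

This reproduces the qualitative error: a large D (and tiny p) arises purely from treating discrete digits as continuous data, not from any rounding problem.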
Because the terminal digits are naturally discrete, a χ2 test is appropriate [8]. χ2 tests yield χ2 = 6.5, df = 9, p = 0.69, for the 610 test statistics, and χ2 = 15, df = 9, p = 0.086, for the 181 p values. This changes the results from "significant" to "not significant": there is insufficient evidence of terminal digit errors in the statistical values reported in Nature articles.
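The appropriate procedure can be sketched as follows (the digit counts below are illustrative stand-ins summing to 610, not the paper's raw data; scipy assumed):

```python
import numpy as np
from scipy import stats

# Hypothetical terminal-digit counts for digits 0-9 (illustrative only).
counts = np.array([68, 59, 63, 55, 64, 60, 57, 66, 58, 60])
assert counts.sum() == 610

# Chi-squared goodness-of-fit against a discrete uniform:
# with no expected frequencies given, scipy assumes equal expected
# counts (610/10 = 61 per digit), which is exactly the null here.
chi2, p = stats.chisquare(counts)
print(f"chi2 = {chi2:.2f}, df = {len(counts) - 1}, p = {p:.3f}")
```

Unlike the continuous-uniform KS test, this test is calibrated for categorical data, so a large p value here genuinely indicates consistency with equiprobability.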
Because the reader may suspect that this is simply a judgment call about the most natural statistical test, rather than a bona fide mistake in [1], we also rerun the Kolmogorov-Smirnov test. While it is unusual to run this test on discrete data, it is possible (although perhaps poorly motivated in this case), so long as appropriate modifications are made. Instead of incorrectly comparing the data against the continuous density P(x) = 1/9 for 0 ≤ x ≤ 9, we use the discrete distribution P(x) = 1/10 for x ∈ {0, 1, ..., 9}. This gives Z = 1.1 for the 610 test statistics (Δ = 0.043), and Z = 0.60 for the 181 p values (Δ = 0.045). Since the textbook tables of Kolmogorov-Smirnov p values are computed for continuous distributions, we convert from Z to p values with Monte Carlo simulations (counting ties in Z as "half a hit"). This gives p = 0.094 and p = 0.57 for the two cases. Again, the terminal digit distributions could reasonably have occurred by chance, given a discrete uniform distribution.
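The modified procedure can be sketched in a few lines (again with hypothetical digits standing in for the real samples; the half-weighting of ties follows the convention described above):

```python
import numpy as np

def ks_stat_discrete(digits, n_cats=10):
    """KS statistic of observed digits against the discrete uniform CDF."""
    counts = np.bincount(digits, minlength=n_cats)
    ecdf = np.cumsum(counts) / counts.sum()
    model = np.arange(1, n_cats + 1) / n_cats  # CDF of P(x) = 1/10
    return np.max(np.abs(ecdf - model))

rng = np.random.default_rng(0)
n = 610
observed = rng.integers(0, 10, size=n)  # stand-in for the real digits
d_obs = ks_stat_discrete(observed)

# Monte Carlo null distribution: textbook KS tables assume a continuous
# model distribution, so instead simulate the statistic under the
# discrete uniform, counting ties as "half a hit".
sims = np.array([ks_stat_discrete(rng.integers(0, 10, size=n))
                 for _ in range(10000)])
p = (np.sum(sims > d_obs) + 0.5 * np.sum(sims == d_obs)) / len(sims)
print(f"D = {d_obs:.3f}, Monte Carlo p = {p:.3f}")
```

The key design point is that the null distribution of D is simulated under the same discrete model being tested, rather than read off continuous-distribution tables.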
The authors of [1] also found that 21 of the 181 p values in Nature had problems when compared with the corresponding test statistics. Our analysis does not change this finding, but it is worth remarking on the comparatively minor nature of many of the problems they found. For example, 3 of the 21 problems come from a single three-row table (table 1 of [9]), in which every entry of a column labelled P reads 0.001. This is indeed somewhat misleading, since the natural implication is that p = 0.001 for all three entries, when in fact the intended meaning of the table was (presumably) that all three results were significant at the 0.1% level (i.e. p < 0.001). But it is hard to imagine that many readers would be badly misled by this table, and in any event, such errors are minor in comparison to the error of using a demonstrably invalid test.
Conclusion
The authors of [1] concluded that statistical tests in papers need to be inspected more closely. However, one of the two main findings of their paper is invalidated by incorrect use of a statistical test. It is ironic that, despite the great attention their paper has attracted over the past two years, this error has escaped notice. While their paper still points to the need for greater scrutiny of statistics, that scrutiny would be better directed at the assumptions underlying the statistical tests, rather than at the precise p values obtained.
References
1. García-Berthou E, Alcaraz C: Incongruence between test statistics and P values in medical papers. BMC Medical Research Methodology. 2004, 4: 13. 10.1186/1471-2288-4-13.
2. Pearson H: Double check casts doubt on statistics in published papers. Nature. 2004, 429: 490. 10.1038/429490a.
3. Editorial: Statistically Significant. Nature Medicine. 2005, 11: 1. 10.1038/nm0105-1.
4. Abbasi K: Editor's choice: Do mistakes matter? BMJ. 2004, 328.
5. Sloppy stats shame science. The Economist. 5 Jun 2004, 74.
6. Coghlan A: Statistical flaws revealed in top journals' papers. NewScientist.com news service. 28 May 2004.
7. Matthews R: Errors behind fluke results. Financial Times. 9 Jul 2004.
8. Press WH, Teukolsky SA, Vetterling WT, Flannery BP: Numerical Recipes in C. 2nd edition. 1992, Cambridge University Press, 620-628.
9. Kiesecker JM, Blaustein AR, Belden LK: Complex causes of amphibian population declines. Nature. 2001, 410: 681-684. 10.1038/35070552.
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/6/45/prepub
None to declare.
The author(s) declare that they have no competing interests.
MJ is responsible for this work in its entirety.
About this article
Cite this article
Jeng, M. Error in statistical tests of error in statistical tests. BMC Med Res Methodol 6, 45 (2006). https://doi.org/10.1186/1471-2288-6-45