The concurrent increases since the 1970s in sample size and in the use of non-parametric tests over t-tests have a paradoxical quality. The usefulness of non-parametric tests as alternatives to t-tests for non-normally distributed data is most pronounced for small studies. When the sample size increases, so does the robustness of the t-test to deviations from normality. The non-parametric WMW test, on the other hand, becomes increasingly sensitive to distributional differences other than differences between means or medians, and may detect (i.e. produce a small *p*-value for) even slight differences in spread. When the difference in spread increases, the probability that a random sample from one of the distributions is less than a random sample from the other distribution also increases. With a large sample size, the WMW test has great power to detect that this probability differs from 50%. If the purpose of the study is to detect any distributional difference, using a non-parametric test is probably useful. Most studies, however, are carried out to investigate differences in means or medians, and as such, the ratio of non-parametric tests to t-tests ought to decrease when studies grow in size.
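
This behaviour is easy to reproduce. The sketch below is a minimal simulation in Python, assuming scipy is available; the two lognormal distributions are illustrative choices, not the paper's actual simulation setup. They have identical means but different spreads, so Prob(*X*<*Y*) differs from 50%: with a large sample, the WMW test rejects even though there is no mean difference for the t-test to find.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 5000  # a "large study"

# Two skewed (lognormal) distributions with identical means (both equal 1)
# but different spreads, so Prob(X < Y) != 0.5.
x = rng.lognormal(mean=-0.125, sigma=0.5, size=n)
y = rng.lognormal(mean=-0.5, sigma=1.0, size=n)

t_p = stats.ttest_ind(x, y, equal_var=False).pvalue  # compares means (equal here)
wmw_p = stats.mannwhitneyu(x, y).pvalue              # detects Prob(X < Y) != 0.5
print(f"t-test p = {t_p:.3f}, WMW p = {wmw_p:.3g}")
```

With these parameters the WMW *p*-value is typically vanishingly small despite the identical means, illustrating why a significant WMW result in a large study need not indicate a difference in means or medians.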

Why then has the use of non-parametric tests increased? We may propose several explanations. Perhaps non-parametric tests were underused in the past, and the present ratio of t-tests to non-parametric tests represents the “correct” one. If so, only the smallest of contemporary studies ought to use non-parametric tests. However, in the NEJM in 2004–2005, 27% of the studies used non-parametric tests [1], and the 25th percentiles of the sample sizes in September 2007 in the Lancet and the BMJ were 1236 and 236, respectively [3]. The smallest quartile of studies actually contains many quite large studies; thus, the use of non-parametric tests is not confined to appropriately small studies. Another explanation might be that most studies do not use non-parametric tests as an alternative to t-tests but rather to analyze ordinal variables, which is a highly reasonable practice. We do not have systematic evidence to support or reject that hypothesis, although a cursory review of articles published in the NEJM, Lancet, JAMA, and BMJ from September through November 2011 revealed several large studies that used non-parametric tests as alternatives to t-tests; for example, *n*=1721 [12], *n*=429 [13], *n*=107018 [14], *n*=44350 [15], *n*=1789 [16], and *n*=12745 [17]. The use of non-parametric tests as alternatives to t-tests may be more common in high-impact journals [18]. The NEJM, for instance, recommends in its instructions for authors that “nonparametric methods should be used to compare groups when the distribution of the dependent variable is not normal” ( , accessed March 19, 2012). That recommendation does not take the sample size into account and may force authors of large studies to use non-parametric methods needlessly. Four more explanations can be hypothesized. First, medical research authors may use a test for normality to decide whether to use a t-test or a non-parametric test. We strongly advise against that practice.
In large studies, tests for normality are very sensitive to even minor deviations from normality and are therefore unsuitable as tools for choosing the most appropriate test. Second, regardless of the size of their studies, authors may rely on recommendations and advice intended solely for the analysis of smaller studies. There might be a lack of critical thinking about such recommendations and a poor understanding of the practical implications of the central limit theorem. Third, authors may simply prefer small *p*-values and go shopping for the statistical method that gives them the smallest *p*. In the simulation study in this paper, the WMW test produced smaller *p*-values than the t-test in more than 70% of the simulations when the number of subjects in each group was 250. For 1000 subjects in each group, that proportion increased to more than 80%. Last, there is publication bias: a study with a significant *p*-value from the WMW test may be more readily accepted for publication than a study with a non-significant *p*-value from the t-test.
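
The first point, that normality tests are poor gatekeepers at large sample sizes, is easy to demonstrate. The following sketch uses simulated data (the gamma shape parameter is an arbitrary illustrative choice, and scipy is assumed): the sample is only mildly skewed, a degree of non-normality the t-test handles comfortably at this size, yet the Shapiro–Wilk test rejects normality decisively.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Mildly skewed data: gamma with shape 20 has skewness 2/sqrt(20), about 0.45.
x = rng.gamma(shape=20.0, scale=1.0, size=4000)

print(f"sample skewness = {stats.skew(x):.2f}")
w, p = stats.shapiro(x)
print(f"Shapiro-Wilk p = {p:.2e}")  # typically very small at n = 4000
```

A rule such as “use a non-parametric test if the normality test rejects” would therefore steer nearly every large study away from the t-test, regardless of whether the deviation from normality matters in practice.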

Is the WMW test a bad test? No, but it is not always an appropriate alternative to the t-test. The WMW test is most useful for the analysis of ordinal data and may also be used in smaller studies, under certain conditions, to compare means or medians [5, 11]. Furthermore, if the results from the WMW test are interpreted strictly according to the test’s null hypothesis, Prob(*X*<*Y*)=0.5, the WMW test is an efficient and useful test. For large studies, however, where the purpose is to compare the means of continuous variables, the choice of test is easy: the t-test is robust even to severely skewed data and should be used almost exclusively.
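
That strict interpretation can be read directly off the test statistic. In the sketch below (simulated, illustrative lognormal distributions with equal means but unequal spreads; scipy assumed), scipy's U statistic counts the pairs with *x* greater than *y*, plus half of any ties, so dividing it by the number of pairs estimates Prob(*X*>*Y*):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.lognormal(mean=-0.125, sigma=0.5, size=500)
y = rng.lognormal(mean=-0.5, sigma=1.0, size=500)  # same mean, larger spread

res = stats.mannwhitneyu(x, y)
# U / (n * m) estimates Prob(X > Y); 0.5 corresponds to the WMW null hypothesis.
prob_x_gt_y = res.statistic / (len(x) * len(y))
print(f"estimated Prob(X > Y) = {prob_x_gt_y:.3f}, WMW p = {res.pvalue:.3g}")
```

Reported this way, a significant WMW result is an informative statement about Prob(*X*>*Y*) rather than an unwarranted claim about means or medians.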

One further benefit of using the t-test is that it facilitates interval estimation. The t-test and its corresponding confidence interval are based on the same standard error estimate; when the t-test is robust, so is the confidence interval. Combined with linear regression analysis, the t-test and its confidence interval form a simple and unified approach for analyzing and presenting continuous outcome data, which, for large studies, is sufficient for most practical purposes.
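
The shared standard error can be made concrete. In this minimal sketch (simulated gamma data with illustrative parameters; scipy assumed), the Welch standard error is computed once and yields both the t statistic and the 95% confidence interval, and scipy's `ttest_ind` reproduces the same test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.gamma(shape=2.0, scale=1.0, size=1000)   # skewed, mean 2
b = rng.gamma(shape=2.0, scale=1.25, size=1000)  # skewed, mean 2.5

va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
se = np.sqrt(va + vb)                            # Welch standard error
diff = a.mean() - b.mean()
t_stat = diff / se
# Welch-Satterthwaite degrees of freedom:
df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
p = 2 * stats.t.sf(abs(t_stat), df)

# The 95% confidence interval is built from the same standard error:
half_width = stats.t.ppf(0.975, df) * se
ci = (diff - half_width, diff + half_width)

print(f"t = {t_stat:.2f}, p = {p:.2g}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print(stats.ttest_ind(a, b, equal_var=False))    # same t and p from scipy
```

Because test and interval stand or fall together, reporting the confidence interval alongside the t-test comes at no extra cost in robustness.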

This study has only considered smooth, skewed distributions. Medical variables do not always have a smooth distribution and may include outliers. The problem with outliers is not that the t-test fails as a test of equality of means in their presence, but that the mean itself may be a poor representation of the typical value of the distribution. One solution is to use another measure of location, for instance, the trimmed mean, which may be compared in two groups with the Yuen-Welch test [5]. The problem that the mean does not reflect the central tendency of a distribution is most pronounced in small studies, where the impact of outliers is usually greater than in large studies.
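
A brief sketch of this approach, with simulated data containing deliberate outliers (an illustrative setup, not from the paper): the trimmed mean discards the extremes before averaging, and scipy exposes Yuen's trimmed version of the Welch t-test through the `trim` argument of `ttest_ind`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# 95 ordinary observations plus 5 extreme outliers:
x = np.concatenate([rng.normal(0, 1, 95), rng.normal(0, 1, 5) + 30])
y = rng.normal(0, 1, 100)

print(f"means: {x.mean():.2f} vs {y.mean():.2f}")  # x's mean distorted by outliers
print(f"20% trimmed means: {stats.trim_mean(x, 0.2):.2f} "
      f"vs {stats.trim_mean(y, 0.2):.2f}")          # outliers trimmed away

res = stats.ttest_ind(x, y, equal_var=False, trim=0.2)  # Yuen-Welch test
print(f"Yuen-Welch p = {res.pvalue:.3f}")
```

Here the ordinary mean of the first group is pulled well away from its typical value by a handful of extreme observations, while the 20% trimmed means of the two groups remain comparable.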