A logical analysis of null hypothesis significance testing using popular terminology

Abstract

Background

Null Hypothesis Significance Testing (NHST) has been well criticised over the years yet remains a pillar of statistical inference. Although NHST is well described in terms of statistical models, most textbooks for non-statisticians present the null and alternative hypotheses (H0 and HA, respectively) in terms of differences between groups such as (μ1 = μ2) and (μ1 ≠ μ2) and HA is often stated to be the research hypothesis. Here we use propositional calculus to analyse the internal logic of NHST when couched in this popular terminology. The testable H0 is determined by analysing the scope and limits of the P-value and the test statistic’s probability distribution curve.

Results

We propose a minimum axiom set NHST in which it is taken as axiomatic that H0 is rejected if P-value < α. Using the common scenario of the comparison of the means of two sample groups as an example, the testable H0 is {(μ1 = μ2) and [(\(\overline{x}_1\) ≠ \(\overline{x}_2\)) due to chance alone]}. The H0 and HA pair should be exhaustive to avoid false dichotomies. This entails that HA is ¬{(μ1 = μ2) and [(\(\overline{x}_1\) ≠ \(\overline{x}_2\)) due to chance alone]}, rather than the research hypothesis (HT). To see the relationship between HA and HT, HA can be rewritten as the disjunction HA: ({(μ1 = μ2) ∧ [(\(\overline{x}_1\) ≠ \(\overline{x}_2\)) not due to chance alone]} ∨ {(μ1 ≠ μ2) ∧ [(\(\overline{x}_1\) ≠ \(\overline{x}_2\)) not due to (μ1 ≠ μ2) alone]} ∨ {(μ1 ≠ μ2) ∧ [(\(\overline{x}_1\) ≠ \(\overline{x}_2\)) due to (μ1 ≠ μ2) alone]}). This reveals that HT (the last disjunct) is just one possibility within HA. It is only by adding premises to NHST that HT or other conclusions can be reached.

Conclusions

Using this popular terminology for NHST, analysis shows that the definitions of H0 and HA differ from those found in textbooks. In this framework, achieving a statistically significant result only justifies the broad conclusion that the results are not due to chance alone, not that the research hypothesis is true. More transparency is needed concerning the premises added to NHST to rig particular conclusions such as HT. There are also ramifications for the interpretation of Type I and II errors, as well as power, which do not specifically refer to HT as claimed by texts.

Background

Null Hypothesis Significance Testing (NHST; see Note 1) and the Confidence Interval (CI) or estimation method are the pillars of statistical inference [1,2,3,4,5]. NHST is perhaps the more common of the two for the analysis of research questions [6]. In NHST a null hypothesis (H0) is rejected in favour of an alternative hypothesis (HA) only if the P-value, P(observed data or more extreme│H0), falls below a pre-specified α-level. The latter is the maximum probability we are prepared to tolerate of erroneously rejecting H0. If the P-value is less than α, then this is called a statistically significant result and H0 can be rejected. Some familiarity with NHST will be assumed in this paper. NHST is a combination of two different statistical theories: R. A. Fisher’s P-value significance test, and the Neyman-Pearson technique of hypothesis testing. The two groups never intended to unite the theories, with well-known antagonisms existing between them [7]. However, NHST gained traction perhaps due to its appeal as a mechanical decision tool. Parallel to its popularity is the detailed, sharp criticism it has received from several quarters. Problems raised include: the misinterpretation of the P-value as P(H0│observed data) rather than P(observed data or more extreme│H0); the artificial dichotomous nature of statistical significance; and the conflation of statistical significance with clinical importance [8]. In fact, P-values have even been temporarily banned from some journals [9]. More recently, the correct level of statistical significance (P-value or α cut-off) has again been debated [10]. However, rather than cover old ground, we will here present a new logical analysis of a popular version of NHST presented in textbooks. NHST is perhaps best explained in terms of statistical models [11]. However, in most popular textbooks for non-statisticians, NHST is frequently presented in terms of the difference between population or sample groups and framed in reference to the research hypothesis. The need for an in-depth focus on the logic of NHST when couched in these terms can be seen from the following summary.

Starting with H0, there are various definitions offered. H0 is the hypothesis of no difference or association between groups [1, 5, 12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27]. Using population means (μ) as an example, this is H0: μ1 = μ2, meaning there is no difference in the population [2, 28,29,30,31,32,33]. In addition, there is the idea that H0 is the opposite/reverse/complement/negation of the test/experimental/study/research hypothesis [1, 3, 6, 25, 27, 28]. In clinical studies, this segues to the stronger claim that the absence of a difference is due to a lack of treatment effect [3, 5, 6, 13, 20, 21, 28, 31, 34,35,36]. In contrast to the idea of “no difference” is the anticipation that chance or random variation will produce a difference between the sample means [37]. Some texts unite the two ideas about the presence and absence of difference into one H0 which states there is no difference in the population and the difference in the sample groups is due to chance [2, 38,39,40,41]. Although a symbol exists for the mean of the sample group \((\overline{x})\), there was no example of this more complex version of H0 translated into symbols in any text sampled. In fact, some texts mention this more complex H0 only to quickly drop the idea and revert to H0: μ1 = μ2 anyway [27, 42].

Moving on to the definition of HA, we find similar themes phrased in a contrary fashion. HA is the hypothesis that there is a difference or association between the groups [12, 13, 22, 23, 32]. Some specify that the groups are the populations such that HA: μ1 ≠ μ2 [2, 4, 24]. This type of difference is described as statistically significant [26] or real [2, 17, 18, 42, 43]. HA is elsewhere proposed to be: the experimental/ research/study hypothesis [3, 5, 6, 28, 36, 43]; or the hypothesis that there is a treatment effect [1, 6, 20, 33, 34, 39]; or the contradictory or complementary hypothesis to H0 [14, 34, 35, 42]. There are attempts to unite claims about the population and sample groups, namely that the difference in the sample groups is due to the difference in the population [42]. Again, in the texts sampled, the latter hypothesis was never translated into symbols or further pursued.

Another area of disagreement, apart from the content of HA, is the strength of the conclusion when rejecting H0. Some claim we accept HA as true [1, 5, 16, 20, 23] or real [18]. There are also softer versions that state HA is just “supported” or is “probably true” [6, 19]. Alternatively, conclusions can be framed in terms of the test hypothesis being true [2, 15, 16, 20, 27, 29, 33,34,35, 43, 44], or more tentatively, that we gain confidence or support for the test hypothesis [6, 25, 28, 31, 41, 42]. More bewildering still are claims suggesting there are multiple other hypotheses or explanations [1, 12, 16, 21, 34, 35, 40].

The interpretation of the phrase “statistically significant” [2, 5, 21, 34, 39, 40, 42], often abbreviated to just “significant” [21, 25, 27, 28, 30, 33,34,35], ranges from the claim that the data are not due to chance [24, 45] to the weaker claim that the data are unlikely to be due to chance [2, 18, 40].

In NHST, H0 and HA are presented as a hypothesis pair. A commonly presented pair is H0: μ1 = μ2 and HA: μ1 ≠ μ2. This hypothesis pair is mutually exclusive and exhaustive which some texts explicitly state are desirable characteristics [1, 19, 46]. Elsewhere, however, H0 and HA are frequently presented as a non-exhaustive, false dichotomy between the test hypothesis and the hypothesis that the results are due to chance [3, 6, 16, 18, 19, 24, 25, 27, 34, 38, 40, 41, 44].

From the above we see that this family of interpretations of NHST provides no consensus on many aspects. This poses a challenge to interpreting NHST when expressed in this fashion. From within the framework of this popular terminology, the purpose of the present paper is to

  • 1/ define H0, HA, power and type I and II errors,

  • 2/ define the minimum axiom set for NHST and

  • 3/ make transparent which assumptions are needed to conclude the research hypothesis is true.

Methods

Here we assume the common terminology of expressing NHST in terms of differences between populations or sample groups and in reference to the research hypothesis. The scope and limits of the P-value, the test statistic and its probability distribution curve (PDC) will be used to arbitrate on the correct form of H0 and HA within this framework. Propositional calculus will be employed to analyse NHST. We also acknowledge multi-factorial hypotheses. For example, we can hypothesise that the difference between two sample groups is due to bias, chance or an intervention. These hypotheses are independent, which entails that they can act in combination to produce the results. To disambiguate between single- and multi-factorial hypotheses, the term “alone” will be used to refer to the former. For example, “(\(\overline{x}_1\) ≠ \(\overline{x}_2\)) due to chance alone” means chance is the only factor involved in the sample group difference, as opposed to chance acting in concert with other factors to produce the results.

Results

For consistent vocabulary throughout this paper, we will use as our example the common scenario of comparing the means of two sample groups. The appropriate test statistic for this is the t-statistic which has its relevant PDC. We will commence by stating the minimum axiom set needed for a NHST to function. To this end, we accept as axiomatic that if P(observed data or more extreme│H0) < α, then reject H0 and accept HA.

The testable H0

In the introduction we saw that H0 had various definitions including H0: μ1 = μ2 or the “opposite” of the research hypothesis. Understandably, these are H0’s that we would like to test, but that does not guarantee that these candidates are testable. Here we propose a new approach: the decision concerning which is the correct H0 should be determined by the scope and limits of the actual technique that will be used to reject H0. In our example, the decision to reject H0 is based on the P-value of the t-statistic read off from its PDC. The PDC yields the probability of finding the observed t-statistic value (or more extreme) due to chance alone when there is no difference in the population means. In symbols, (something which never appeared in the texts mentioned in the introduction), the PDC gives us

$$P\left(\mathrm{observed}\ t-\mathrm{statistic}\ \mathrm{value}\ \mathrm{or}\ \mathrm{more}\ \mathrm{extreme}\vert \left\{\left({\mu}_1={\mu}_2\right)\ \mathrm{and}\ \left[\left({\overline{x}}_1\ne {\overline{x}}_2\right)\ \mathrm{due}\ \mathrm{to}\ \mathrm{chance}\ \mathrm{alone}\right]\right\}\right).$$

Given that the definition of the P-value is

$$P\left(\mathrm{observed}\ t-\mathrm{statistic}\ \mathrm{value}\ \mathrm{or}\ \mathrm{more}\ \mathrm{extreme}\vert {H}_0\right),$$

we can now see that the H0 which the P-value and PDC can actually test must be

$${H}_0:\left\{\left({\mu}_1={\mu}_2\right)\ \mathrm{and}\ \left[\left({\overline{x}}_1\ne {\overline{x}}_2\right)\ \mathrm{due}\ \mathrm{to}\ \mathrm{chance}\ \mathrm{alone}\right]\right\}.$$

In other words, it is the hypothesis that the finding in the sample groups is due to chance or random variation alone and does not reflect a difference in the underlying population.
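As an illustration of what "due to chance alone" means operationally, the following is a minimal sketch, not the t-test used as the running example. It swaps in an exact permutation test, which assumes that under the testable H0 the group labels are exchangeable, and it counts how often relabelling alone produces a difference in sample means at least as extreme as the one observed. The function names, the example data and the default α = 0.05 are assumptions made for this sketch.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group1, group2):
    """Exact two-sided permutation P-value for a difference in sample means.

    Approximates P(observed difference or more extreme | H0), where H0 says
    the populations do not differ and the sample difference is due to chance
    (random allocation of values to groups) alone.
    """
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    observed = abs(mean(group1) - mean(group2))
    indices = range(len(pooled))
    extreme = 0
    total = 0
    # Enumerate every way of relabelling n1 of the pooled values as "group 1".
    for chosen in combinations(indices, n1):
        chosen_set = set(chosen)
        first = [pooled[i] for i in indices if i in chosen_set]
        second = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(mean(first) - mean(second)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


def nhst_decision(p_value, alpha=0.05):
    """The axiom as stated in the text: reject H0 only if P-value < alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


if __name__ == "__main__":
    p = permutation_p_value([1, 2, 3, 4], [11, 12, 13, 14])
    print(round(p, 4), nhst_decision(p))
```

Note that the decision function returns only "reject H0" or "fail to reject H0"; nothing in the calculation refers to why the groups differ when H0 is rejected.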

Rejecting (μ1 = μ2)

Textbooks often claim that we can use NHST to reject (μ1 = μ2). However, this is not logically possible with the minimum axiom set NHST. To demonstrate this, we will need to transform (μ1 = μ2) to a logically equivalent proposition and use propositional calculus. The proposition (μ1 = μ2) is a proposition about the equality of the population means, but states nothing about the sample group means \((\overline{x})\). Using a truth table (Table 1), we can rewrite (μ1 = μ2) in a logically equivalent way such that the sample group means do appear in the proposition but without any claim being made about them (see Note 2). Note that P(\(\overline{x}_1\) = \(\overline{x}_2\)) = 0, so any proposition containing (\(\overline{x}_1\) = \(\overline{x}_2\)) can be eliminated from the analysis.

Table 1 Truth table for (μ1 = μ2) and its logical equivalent

From Table 1, (μ1 = μ2) ≡

$$\left(\left\{\left({\mu}_1={\mu}_2\right)\ \wedge\ \left[\left({\overline{x}}_1\ne {\overline{x}}_2\right)\ \mathrm{due}\ \mathrm{to}\ \mathrm{chance}\ \mathrm{alone}\right]\right\}\ \lor\ \left\{\left({\mu}_1={\mu}_2\right)\ \wedge\ \left[\left({\overline{x}}_1\ne {\overline{x}}_2\right)\ \mathrm{not}\ \mathrm{due}\ \mathrm{to}\ \mathrm{chance}\ \mathrm{alone}\right]\right\}\right).$$
(1)

Logical equivalence is established because whenever (μ1 = μ2) is true, (1) is true too, and whenever (μ1 = μ2) is false, (1) is also false. This transformation now allows us to see why eliminating the testable H0 does not logically imply the elimination of (μ1 = μ2). Let the first disjunct of (1) be called C, and the second disjunct E. Thus, (1) becomes the disjunction C ∨ E. We recognise C as the testable H0. The PDC can assess C, and so it may be possible to reject C depending on the P-value. However, the PDC cannot assess E. So even if we do reject C, we cannot reject E, and therefore we cannot reject the whole proposition C ∨ E. Since (1) is logically equivalent to (μ1 = μ2), we see that we cannot reject (μ1 = μ2) using the minimum axiom set NHST. In other words, (μ1 = μ2) is not rejected when we reject the testable H0: {(μ1 = μ2) ∧ [(\(\overline{x}_1\) ≠ \(\overline{x}_2\)) due to chance alone]}. To reject (μ1 = μ2), a further premise will need to be added, namely ¬{(μ1 = μ2) ∧ [(\(\overline{x}_1\) ≠ \(\overline{x}_2\)) not due to chance alone]}.
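The argument above can be checked mechanically. The sketch below is only an illustration: it treats "μ1 = μ2" and "sample difference due to chance alone" as two Boolean variables (named `g` and `j` here, an assumption of this sketch), confirms that C ∨ E has the same truth value as (μ1 = μ2) under every assignment, and lists the assignments in which C is false while (μ1 = μ2) is still true.

```python
from itertools import product

# g: "mu1 = mu2" is true; j: "sample difference is due to chance alone" is true.
ASSIGNMENTS = list(product([True, False], repeat=2))


def c(g, j):
    """C: (mu1 = mu2) and (difference due to chance alone), the testable H0."""
    return g and j


def e(g, j):
    """E: (mu1 = mu2) and (difference not due to chance alone)."""
    return g and not j


def proposition_1(g, j):
    """The disjunction C or E."""
    return c(g, j) or e(g, j)


def is_equivalent_to_g():
    """True if C or E matches (mu1 = mu2) under every truth assignment."""
    return all(proposition_1(g, j) == g for g, j in ASSIGNMENTS)


def worlds_surviving_rejection_of_c():
    """Assignments where C is false yet (mu1 = mu2) remains true."""
    return [(g, j) for g, j in ASSIGNMENTS if not c(g, j) and g]


if __name__ == "__main__":
    print(is_equivalent_to_g())
    print(worlds_surviving_rejection_of_c())
```

The single surviving assignment corresponds to E, which is why a further premise excluding E is needed before (μ1 = μ2) can be rejected.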

The real HA

We take it as axiomatic that H0 and HA are mutually exclusive: the hypotheses should not overlap in the sample space. An issue identified in the introduction was whether the hypothesis pair should also be exhaustive. There are serious consequences when the pair are made into a false dichotomy. An obvious criticism is that other possibilities are simply ignored. Furthermore, it opens a Pandora’s box of candidates for HA. Frequently the research or test hypothesis (here HT) is proposed as HA. This is the proposition that there is a difference in the population due to the study intervention or treatment and the finding in the sample groups is due to this difference alone. In symbols

$${H}_T:\{\left({\mu}_1\ne {\mu}_2\right)\ \wedge\ \left[\left({\overline{x}}_1\ne {\overline{x}}_2\right)\ \mathrm{due}\ \mathrm{to}\ \left({\mu}_1\ne {\mu}_2\right)\ \mathrm{alone}\right]\}.$$

However, if false dichotomies are allowed, what is to prevent other hypotheses being proposed as HA? Such as the hypothesis that bias or confounding produced the results, or some other hypothesis, or even combinations of hypotheses given that they are all independent propositions. In a false dichotomy the selection of HA is subject to prejudice.

The above problems are avoided by forming an exhaustive hypothesis pair. To avoid logical errors of negation, it is critical to note that HA must be the negation of the entire proposition represented by H0, not just a negation of part of H0. So HA must be ¬H0 and the real HA: ¬{(μ1 = μ2) and [(\(\overline{x}_1\) ≠ \(\overline{x}_2\)) due to chance alone]}. Therefore, the only justifiable exhaustive hypothesis pair is

$${H}_0:\left\{\left({\mu}_1={\mu}_2\right)\ \mathrm{and}\ \left[\left({\overline{x}}_1\ne {\overline{x}}_2\right)\ \mathrm{due}\ \mathrm{to}\ \mathrm{chance}\ \mathrm{alone}\right]\right\},$$
$${H}_A:\neg \left\{\left({\mu}_1={\mu}_2\right)\ \mathrm{and}\ \left[\left({\overline{x}}_1\ne {\overline{x}}_2\right)\ \mathrm{due}\ \mathrm{to}\ \mathrm{chance}\ \mathrm{alone}\right]\right\}.$$

The relationship between HA and HT

HA is a more complex proposition than HT. Once again, we can transform HA into a logically equivalent proposition which has HT as a component. Let HA be represented by ¬(G ∧ J), where G is “μ1 = μ2”, and J is “(\(\overline{x}_1\) ≠ \(\overline{x}_2\)) due to chance alone.” The truth table for ¬(G ∧ J) is shown in Table 2.

Table 2 Truth table for ¬(G ∧ J)

Table 2 shows that ¬(G ∧ J) is true (bold T in last column) when G and ¬J are true (the second row), or ¬G and J are true (the third row), or ¬G and ¬J are true (the last row). This allows us to formulate a disjunction logically equivalent to ¬(G ∧ J). Thus ¬(G ∧ J) ≡ (G ∧ ¬J) ∨ (¬G ∧ J) ∨ (¬G ∧ ¬J). Now ¬J ≡ {(\(\overline{x}_1\) = \(\overline{x}_2\)) ∨ [(\(\overline{x}_1\) ≠ \(\overline{x}_2\)) not due to chance alone]}. However, as stated previously, we can eliminate (\(\overline{x}_1\) = \(\overline{x}_2\)) making ¬J ≡ [(\(\overline{x}_1\) ≠ \(\overline{x}_2\)) not due to chance alone]. Substituting back, HA ≡

$$(\left\{\left({\mu}_1={\mu}_2\right)\ \wedge\ \left[\left({\overline{x}}_1\ne {\overline{x}}_2\right)\ \mathrm{not}\ \mathrm{due}\ \mathrm{to}\ \mathrm{chance}\ \mathrm{alone}\right]\right\}\ \lor\ \left\{\left({\mu}_1\ne {\mu}_2\right)\ \wedge\ \left[\left({\overline{x}}_1\ne {\overline{x}}_2\right)\ \mathrm{due}\ \mathrm{to}\ \mathrm{chance}\ \mathrm{alone}\right]\right\}\ \lor\ \left\{\left({\mu}_1\ne {\mu}_2\right)\ \wedge\ \left[\left({\overline{x}}_1\ne {\overline{x}}_2\right)\ \mathrm{not}\ \mathrm{due}\ \mathrm{to}\ \mathrm{chance}\ \mathrm{alone}\right]\right\}).$$

Furthermore, the second disjunct is a contradiction and can be eliminated giving

$${\it{H}}_{\mathrm{A}}:\left\{\left({\mu}_1={\mu}_2\right)\ \wedge\ \left[\left({\overline{x}}_1\ne {\overline{x}}_2\right)\ \mathrm{not}\ \mathrm{due}\ \mathrm{to}\ \mathrm{chance}\ \mathrm{alone}\right]\right\}\ \lor\ \left\{\left({\mu}_1\ne {\mu}_2\right)\ \wedge\ \left[\left({\overline{x}}_1\ne {\overline{x}}_2\right)\ \mathrm{not}\ \mathrm{due}\ \mathrm{to}\ \mathrm{chance}\ \mathrm{alone}\right]\right\}.$$
(2)
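The truth table for ¬(G ∧ J) and the three-way disjunction derived from it can be reproduced with a few lines of code. This is a sketch for checking the propositional step only: G and J are treated as plain Boolean variables, and the function names are assumptions of this sketch. The last function illustrates the negation error warned about earlier, in which only part of H0 is negated.

```python
from itertools import product


def h_a(g, j):
    """The real HA: not (G and J)."""
    return not (g and j)


def disjunctive_form(g, j):
    """(G and not J) or (not G and J) or (not G and not J)."""
    return (g and not j) or (not g and j) or (not g and not j)


def partial_negation(g, j):
    """Negates only the first conjunct of H0; shown for contrast with HA."""
    return (not g) and j


def truth_table():
    """Rows (G, J, not (G and J)) in the usual TT, TF, FT, FF order."""
    return [(g, j, h_a(g, j)) for g, j in product([True, False], repeat=2)]


if __name__ == "__main__":
    for g, j, value in truth_table():
        print("T" if g else "F", "T" if j else "F", "T" if value else "F")
```

Comparing `h_a` with `partial_negation` over all four rows shows they disagree whenever J is false, so negating part of H0 does not yield an exhaustive hypothesis pair.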

Where does HT lie in (2)? HT is contained within the last disjunct of (2), {(μ1 ≠ μ2) ∧ [(\(\overline{x}_1\) ≠ \(\overline{x}_2\)) not due to chance alone]}. The latter disjunct expresses the proposition that there is a difference found in the population and also that the sample group difference is not due to chance alone, but instead is due to some other alternative. The other alternatives include the test intervention or bias or some other unknown or even a combination of these given that the alternatives are independent hypotheses. Taking this into account we can rewrite (2) such that HA ≡

$$\left\{\left({\mu}_1={\mu}_2\right)\ \wedge\ \left[\left({\overline{x}}_1\ne {\overline{x}}_2\right)\ \mathrm{not}\ \mathrm{due}\ \mathrm{to}\ \mathrm{chance}\ \mathrm{alone}\right]\right\}\ \lor\ \left\{\left({\mu}_1\ne {\mu}_2\right)\ \wedge\ \left[\left({\overline{x}}_1\ne {\overline{x}}_2\right)\ \mathrm{not}\ \mathrm{due}\ \mathrm{to}\ \left({\mu}_1\ne {\mu}_2\right)\ \mathrm{alone}\right]\right\}\ \lor\ \left\{\left({\boldsymbol{\mu}}_{\mathbf{1}}\mathbf{\ne}{\boldsymbol{\mu}}_{\mathbf{2}}\right)\ \boldsymbol{\wedge}\ \left[\left({\overline{\boldsymbol{x}}}_{\mathbf{1}}\mathbf{\ne}{\overline{\boldsymbol{x}}}_{\mathbf{2}}\right)\ \mathbf{due}\ \mathbf{to}\ \left({\boldsymbol{\mu}}_{\mathbf{1}}\mathbf{\ne}{\boldsymbol{\mu}}_{\mathbf{2}}\right)\ \mathbf{alone}\right]\right\}.$$
(3)

The last disjunct of (3) is HT (in bold), indicating that HT is just one sub-hypothesis of HA.

Finally, the answer to the question “What do we accept when we reject H0?” is: we accept the real HA or its logical equivalent (3). Therefore, a statistically significant finding, expressed in these common terms, should be interpreted as meaning that the data are not due to chance alone. Statistical significance is not a licence to accept HT.

The effect of further premises on the minimum axiom set NHST

It is only by adding premises to NHST that we can conclude anything other than the real HA. The danger with this strategy is that of partially assuming what is being proved. Table 3 presents examples of premises that if added to NHST would rig different conclusions.

Table 3 Adding premises to NHST to conclude HT. Comparison of group means is used as an example. HT (in bold) is defined in the text

Some texts claim that all that is needed to conclude HT when H0 is rejected is the assumption that there is no bias [35, 47]. However, Table 3 illustrates exactly which premises are needed in order to conclude HT. Apart from assuming no bias, it is also necessary to assume there are no combination hypotheses in which chance plays a role. A corollary is that if NHST could lead us to conclude HT of its own accord, no further premises would be required. What would the conclusion be if indeed we only assumed that there was no bias? The middle column of Table 3 shows the conclusion. In a model which stipulates that the possible causes of the sample group difference are chance, bias or the intervention (or combinations thereof), the conclusion would be

$$\left\{\left({\boldsymbol{\mu}}_{\mathbf{1}}\mathbf{\ne}{\boldsymbol{\mu}}_{\mathbf{2}}\right)\ \boldsymbol{\wedge}\ \left[\left({\overline{\boldsymbol{x}}}_{\mathbf{1}}\mathbf{\ne}{\overline{\boldsymbol{x}}}_{\mathbf{2}}\right)\ \mathbf{due}\ \mathbf{to}\ \left({\boldsymbol{\mu}}_{\mathbf{1}}\mathbf{\ne}{\boldsymbol{\mu}}_{\mathbf{2}}\right)\ \mathbf{alone}\right]\right\}\ \lor\ \left\{\left({\mu}_1\ne {\mu}_2\right)\ \wedge\ \left[\left({\overline{x}}_1\ne {\overline{x}}_2\right)\ \mathrm{due}\ \mathrm{to}\ \left({\mu}_1\ne {\mu}_2\right)\ \mathrm{and}\ \mathrm{chance}\right]\right\}.$$

The first disjunct in bold is HT, showing that the conclusion is more complex than HT alone. The last column demonstrates that a different package of additional premises can be tailored to reach a different conclusion such as the hypothesis that bias produced the results, here represented as HB: {(μ1 = μ2) ∧ [(\(\overline{x}_1\) ≠ \(\overline{x}_2\)) due to bias alone]}. Similar to arithmetic, the process in Table 3 is commutative. The same results are achieved if we were to make the assumptions first and then do the NHST or vice versa; the order does not matter.
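The effect of added premises can be sketched as successive filtering of candidate explanations. The code below is a toy model under stated assumptions, not a reproduction of Table 3: it takes the three causes named in the text (chance, bias, intervention), represents each hypothesis as a non-empty set of causes acting together, and encodes "due to (μ1 ≠ μ2)" as the label `"intervention"`. All function names are hypothetical.

```python
from itertools import combinations

CAUSES = ("chance", "bias", "intervention")

H0 = frozenset({"chance"})        # sample difference due to chance alone
HT = frozenset({"intervention"})  # sample difference due to the intervention alone


def all_explanations():
    """Every non-empty combination of causes for the sample difference."""
    return [
        frozenset(combo)
        for size in range(1, len(CAUSES) + 1)
        for combo in combinations(CAUSES, size)
    ]


def reject_h0(explanations):
    """NHST step: a significant result removes only 'chance alone'."""
    return [e for e in explanations if e != H0]


def assume_no_bias(explanations):
    """Added premise: bias plays no role."""
    return [e for e in explanations if "bias" not in e]


def assume_no_chance_combinations(explanations):
    """Added premise: chance does not act together with another cause."""
    return [e for e in explanations if not ("chance" in e and len(e) > 1)]


if __name__ == "__main__":
    after_nhst = reject_h0(all_explanations())
    print(len(after_nhst))                         # explanations left in HA
    no_bias = assume_no_bias(after_nhst)
    print(sorted(sorted(e) for e in no_bias))      # HT plus intervention-and-chance
    print(assume_no_chance_combinations(no_bias) == [HT])
```

In this toy model, rejecting H0 leaves six explanations, assuming no bias leaves two, and only the further premise about chance isolates HT. Applying the premises before the NHST step yields the same sets, consistent with the commutativity noted above.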

Application to other statistical problems

So far we have focused on the comparison of sample group means. However, with appropriate changes in vocabulary we can define the real H0 and HA for other scenarios ― mutatis mutandis, as they say. As illustrations, H0 and HA in general form, for the comparison of sample group proportions, and for correlation are presented in Table 4.

Table 4 H0 and HA for common scenarios. HA has also been transformed into its logical equivalent to identify HT (in bold)

Failure to reject H0

What are we to conclude if we fail to reject H0? The axiom of NHST states that we reject H0 if P-value < α. This does not logically imply that if P-value ≥ α we must accept H0 ― the axiom and the claim about accepting H0 are logically distinct ideas. So if P-value ≥ α, we should merely state we have failed to reject H0 rather than we accept H0.

Power (1-β), type I (α) and type II (β) errors

Textbooks which express NHST in terms of the research hypothesis also tend to carry this over to descriptions of Type I and II errors, as well as power calculations. However, this is fraught with error as can be seen when we apply the real definitions of H0 and HA. Type I error is the probability of eliminating H0, and accepting HA, when in fact H0 is true. Using the real definitions of H0 and HA gives us type I error:

$$P\left(\mathrm{rejecting}\ \left\{\left({\mu}_1={\mu}_2\right)\ \mathrm{and}\ \left[\left({\overline{x}}_1\ne {\overline{x}}_2\right)\ \mathrm{due}\ \mathrm{to}\ \mathrm{chance}\ \mathrm{alone}\right]\right\}\vert \left\{\left({\mu}_1={\mu}_2\right)\ \mathrm{and}\ \left[\left({\overline{x}}_1\ne {\overline{x}}_2\right)\ \mathrm{due}\ \mathrm{to}\ \mathrm{chance}\ \mathrm{alone}\right]\right\}\right).$$

Importantly, type I error is not the probability of accepting HT when H0 is true. Since HA is a disjunction, there are multiple propositions that can make it true, with HT being just one of these. So P(HA) > P(HT) and P(mistakenly accepting HT) > P(mistakenly accepting HA). The conflation of HT with HA results in underestimating the probability of mistakenly accepting HT.

Similarly for type II error which is the probability of not rejecting H0, and not accepting HA, when H0 is false and should have been rejected. Namely,

$$P\left(\mathrm{not}\ \mathrm{rejecting}\ \left\{\left({\mu}_1={\mu}_2\right)\ \mathrm{and}\ \left[\left({\overline{x}}_1\ne {\overline{x}}_2\right)\ \mathrm{due}\ \mathrm{to}\ \mathrm{chance}\ \mathrm{alone}\right]\right\}\vert \mathrm{it}\ \mathrm{is}\ \mathrm{not}\ \mathrm{the}\ \mathrm{case}\ \mathrm{that}\ \left\{\left({\mu}_1={\mu}_2\right)\ \mathrm{and}\ \left[\left(\ {\overline{x}}_1\ne {\overline{x}}_2\right)\ \mathrm{due}\ \mathrm{to}\ \mathrm{chance}\ \mathrm{alone}\right]\right\}\right).$$

Type II error is not the probability of not accepting HT when H0 is false. A low probability of not accepting HA does not logically imply a low probability of not accepting HT. P(not accepting HT) > P(not accepting HA) because more propositions need to be rejected in order to accept HT. The conflation of HT with HA results in underestimating the probability of not accepting HT when H0 is false.

Power (1- β) refers to the probability of rejecting H0 and accepting HA given H0 is false. Specifically, power is

$$P\left(\mathrm{rejecting}\ \left\{\left({\mu}_1={\mu}_2\right)\ \mathrm{and}\ \ \left[\left({\overline{x}}_1\ne {\overline{x}}_2\right)\ \mathrm{due}\ \mathrm{to}\ \mathrm{chance}\ \mathrm{alone}\right]\right\}\vert \mathrm{it}\ \mathrm{is}\ \mathrm{not}\ \mathrm{the}\ \mathrm{case}\ \mathrm{that}\ \left\{\left({\mu}_1={\mu}_2\right)\ \mathrm{and}\ \ \left[\left({\overline{x}}_1\ne {\overline{x}}_2\right)\ \mathrm{due}\ \mathrm{to}\ \mathrm{chance}\ \mathrm{alone}\right]\right\}\right).$$

However, it does not refer to P(accepting HT│HT). The power to conclude HT < the power to conclude HA. The conflation of HT with HA results in overestimating the power to conclude HT because HT is just one part of HA.
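A small simulation can make concrete what the type I error rate and power actually count: the frequency with which H0 is rejected, and nothing more specific. The sketch below is an illustration under strong assumptions. It swaps in a two-sample z-test with known standard deviation (not the t-test of the running example), draws normally distributed samples with no bias, and uses arbitrary values for sample size, effect size, number of trials and random seed.

```python
import random
from statistics import NormalDist, mean


def z_test_p_value(group1, group2, sigma):
    """Two-sided P-value for a difference in means with known sigma."""
    n1, n2 = len(group1), len(group2)
    se = sigma * (1 / n1 + 1 / n2) ** 0.5
    z = (mean(group1) - mean(group2)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rate(mu1, mu2, n=20, sigma=1.0, alpha=0.05, trials=2000, seed=0):
    """Fraction of simulated studies in which H0 is rejected (P-value < alpha)."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        g1 = [rng.gauss(mu1, sigma) for _ in range(n)]
        g2 = [rng.gauss(mu2, sigma) for _ in range(n)]
        if z_test_p_value(g1, g2, sigma) < alpha:
            rejections += 1
    return rejections / trials


if __name__ == "__main__":
    # Equal population means, chance alone: estimates the type I error rate.
    print(rejection_rate(mu1=0.0, mu2=0.0))
    # Unequal population means: estimates power to reject H0.
    print(rejection_rate(mu1=1.0, mu2=0.0))
```

Both estimates are rates of rejecting H0. The simulation contains no step that identifies the cause of a rejection, which is the point made above about power and HT.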

Discussion

NHST has been well described in terms of statistical models. However, it is also commonly presented in terms of group comparisons and with reference to the research hypothesis. Despite this being a popular interpretation, there is currently no standardised approach. The variation in definitions of H0 and HA, how they should be paired and conclusions that can be drawn by eliminating H0 motivated this new logical analysis. Looking at the conditions of the P-value we can see that there can be only one testable H0. Presenting H0 and HA as a false dichotomy is common but unjustifiable. Combining these two ideas entails that HA is ¬H0. Texts should acknowledge this and also make transparent any premises added in order to reach a conclusion other than ¬H0 when H0 is rejected.

It may be thought that using the estimation or CI method can avoid the problems of expressing NHST in these terms. However, this is not true if the estimation method is used as a de facto NHST. The estimation method can be used as a NHST because the CI is mathematically related to the α-level and the P-value such that if the CI does not cross zero (or 1 for ratios), we can claim statistical significance. In the context of using CI as a NHST, the conclusions of the present paper are relevant. Consequently, when using the CI method, the correct interpretation of statistical significance would be to accept the real HA and not claim that HT is true. Of course, there are other appealing features of the CI method and the present discussion is limited only to its use as a significance test.
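The link between the CI and the significance decision can be sketched numerically. The example below assumes a z-based interval for a difference in means with known standard deviation, chosen for simplicity rather than taken from the paper, and the data and function names are invented for illustration. In this setting the (1 − α) interval excludes zero exactly when the two-sided P-value is below α.

```python
from statistics import NormalDist, mean


def z_interval_and_p(group1, group2, sigma, alpha=0.05):
    """Return ((low, high), p_value) for the difference in means, known sigma."""
    n1, n2 = len(group1), len(group2)
    diff = mean(group1) - mean(group2)
    se = sigma * (1 / n1 + 1 / n2) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se))
    return ci, p_value


def ci_excludes_zero(ci):
    """True if the interval lies entirely above or entirely below zero."""
    low, high = ci
    return low > 0 or high < 0


if __name__ == "__main__":
    ci, p = z_interval_and_p([5.1, 4.9, 5.3, 5.0], [4.0, 4.2, 3.9, 4.1], sigma=0.2)
    print(ci_excludes_zero(ci), p < 0.05)
    ci, p = z_interval_and_p([5.0, 5.2, 4.8, 5.0], [5.1, 4.9, 5.0, 5.1], sigma=0.2)
    print(ci_excludes_zero(ci), p < 0.05)
```

Because the two checks always agree in this setting, a CI used this way carries the same logical content as the significance test and supports the same conclusion.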

A limitation of the present paper is that we have not questioned the axiom of NHST that we reject H0 if the P-value < α. An analysis of this axiom deserves a paper in its own right which discusses inductive logic and defines the conditions under which the axiom is reliable. The issue in the present paper has been solely that if we are to use NHST as it is commonly presented it should at least be with justifiable definitions of H0 and HA, transparent assumptions and valid deductions from the given premises.

Conclusions

NHST is commonly expressed in terms of differences between groups and with reference to the research hypothesis. Within this framework, logical analysis reveals that the minimum axiom set NHST (for comparing sample means) is as follows:

  • H0: {(μ1 = μ2) and [(\(\overline{x}_1\) ≠ \(\overline{x}_2\)) due to chance alone]},

  • HA: ¬{(μ1 = μ2) and [(\(\overline{x}_1\) ≠ \(\overline{x}_2\)) due to chance alone]}.

  • If P-value ≥ α, then fail to reject H0.

  • If P-value < α, reject H0 and conclude HA.

At best, it can be concluded that if H0 is rejected, the data were not due to chance alone. Texts should also be transparent about which assumptions have been added to rig a conclusion such as HT. Care should also be exerted to avoid misinterpreting type I and II errors, as well as power, in terms of the research hypothesis.

Availability of data and materials

All data generated or analysed during this study are included in this published article.

Notes

  1. “NHST” is probably the most widely used abbreviation for the various names applied to hypothesis and significance tests [8].

  2. Truth tables analyse the truth of complex propositions based on giving truth values of true (T) or false (F) to their elemental components. When propositions are subject to logical analysis here, we shall use the symbols of propositional calculus: “∧” for “and”; “∨” for “or”; and “¬” for “not” used to express negation. “¬X” means “It is not the case that X.” “≡” means “is equivalent to” such that “X ≡ Y” means “proposition X is equivalent to proposition Y.”

References

  1. Daniel WW. Biostatistics : a foundation for analysis in the health sciences. 9th ed. Hoboken: Wiley; 2009.

  2. Munro BH, Page EB. Statistical methods for health care research, vol. xi. 2nd ed. Philadelphia: Lippincott; 1993. p. 403.

  3. Gallin JI, Ognibene FP, Johnson LL. Principles and practice of clinical research, vol. xvii. 4th ed. London: Academic Press; 2018. p. 80.

  4. Mann PS, Lacke CJ. Introductory statistics, vol. xx. 7th ed. Hoboken: Wiley; 2010. p. 116.

  5. Sullivan LM. Essentials of biostatistics in public health, vol. xii. 3rd ed. Burlington: Jones & Bartlett Learning; 2018. p. 376.

  6. Field AP. Discovering statistics using IBM SPSS statistics : and sex and drugs and rock 'n' roll, vol. xxxvi. 4th ed. Los Angeles: Sage; 2013. p. 915.

  7. Salsburg D. The lady tasting tea : how statistics revolutionized science in the twentieth century, vol. xi. New York: W.H. Freeman; 2001. p. 340.

  8. Nickerson RS. Null hypothesis significance testing: a review of an old and continuing controversy. Psychol Methods. 2000;5:241–301. https://doi.org/10.1037/1082-989x.5.2.241.

  9. Trafimow D, Marks M. Editorial. Basic Appl Soc Psychol. 2015;37:1–2. https://doi.org/10.1080/01973533.2015.1012991.

  10. Ioannidis JPA. The Proposal to Lower P Value Thresholds to .005. JAMA. 2018;319:1429–30. https://doi.org/10.1001/jama.2018.1536.

  11. Lehmann EL, Romano JP. Testing statistical hypotheses, vol. xiv. 3rd ed. New York: Springer; 2005. p. 784.

  12. Stewart A. Basic statistics and epidemiology : a practical guide, vol. iv. 3rd ed. Oxford: Radcliffe Pub; 2010. p. 200.

  13. Everitt B. Medical statistics from A to Z : a guide for clinicians and medical students, vol. vi. 2nd ed. Cambridge: Cambridge University Press; 2006. p. 249.

  14. Gerstman BB. Basic biostatistics : statistics for public health practice, vol. xv. 2nd ed. Burlington: Jones & Bartlett Learning; 2015. p. 644.

  15. Hickson M. Research handbook for health care professionals, vol. xiv. Chichester, U.K: Wiley-Blackwell; 2008. p. 184.

  16. Katz MH. Study design and statistical analysis : a practical guide for clinicians. Cambridge: Cambridge University Press; 2006. p. 188.

  17. Katz DL, Jekel JF. Jekel's epidemiology, biostatistics, preventive medicine, and public health, vol. xiii. 4th ed. Philadelphia, London: Saunders; 2014. p. 405.

  18. O'Brien PMS, Broughton-Pipkin F. Introduction to research methodology for specialists and trainees. 3rd ed. Cambridge, New York: Cambridge University Press; 2017.

  19. Townend J. Practical statistics for environmental and biological scientists, vol. x. Chichester; New York: Wiley; 2002. p. 276.

  20. Bland M. An introduction to medical statistics, vol. xviii. 4th ed. Oxford: Oxford University Press; 2015. p. 427.

  21. Wang D, Bakhai A. Clinical trials : a practical guide to design, analysis, and reporting, vol. xiii. London: Remedica; 2006. p. 480.

  22. Guluma K, Wilson MP, Hayden S. Doing research in emergency and acute care : making order out of chaos. Chichester, West Sussex; Hoboken: Wiley; 2015.

  23. Hulley SB. Designing clinical research. 4th ed. Philadelphia: Wolters Kluwer/Lippincott Williams & Wilkins; 2013.

  24. Peat JK, Barton B. Medical statistics : a guide to SPSS, data analysis, and critical appraisal. 2nd ed. Chichester, West Sussex ; Hoboken: John Wiley & Sons Inc.; 2014.

  25. Harris M, Taylor G. Medical statistics made easy 3, vol. xii. 3rd ed. Banbury: Scion; 2014. p. 116.

  26. Hofmann AH. Scientific writing and communication. Papers, proposals, and presentations. 3rd ed. New York: Oxford University Press; 2017.

  27. Campbell MJ, Walters SJ, Machin D. Medical statistics : a textbook for the health sciences, vol. xii. 4th ed. Chichester, Hoboken: Wiley; 2007. p. 331.

  28. Hill T, Lewicki P. Statistics : methods and applications : a comprehensive reference for science, industry, and data mining, vol. xvi. Tulsa: StatSoft; 2006. p. 832.

  29. Riegelman RK. Studying a study and testing a test : how to read the medical evidence, vol. vii. 5th ed. Philadelphia: Lippincott Williams & Wilkins; 2005. p. 403.

  30. Rees DG. Essential statistics, vol. xiii. 2nd ed. London, New York: Chapman and Hall; 1989. p. 258.

  31. Kuzma JW, Bohnenblust SE. Basic statistics for the health sciences, vol. xvii. 4th ed. Mountain View: Mayfield Pub. Co; 2001. p. 364.

  32. Peat JK, Barton B, Elliott EJ. Statistics workbook for evidence-based healthcare, vol. viii. Malden: Blackwell; 2008. p. 182.

  33. Altman DG. Practical statistics for medical research, vol. xii. Boca Raton: Chapman & Hall/CRC; 1999. p. 611.

  34. Myles PGT. Statistical methods for Anaesthesia and intensive care. Edinburgh: Butterworth-Heinemann; 2000.

  35. Rosner B. Fundamentals of biostatistics, vol. xix. 8th ed. Boston: Cengage Learning; 2016. p. 927.

  36. Petrie A, Sabin C. Medical statistics at a glance. 3rd ed. Chichester, Hoboken: Wiley-Blackwell; 2009. p. 180.

  37. Campbell MJ, Swinscow TDV. Statistics at square one, vol. iv. 11th ed. Chichester, Hoboken: Wiley-Blackwell/BMJ Books; 2009. p. 188.

  38. Argyrous G. Statistics for social and Health Research. Great Britain: Sage Publications; 2000.

  39. McCaig C, Dahlberg L. Practical research and evaluation : a start-to-finish guide for practitioners, vol. p.viii. London: SAGE; 2010. p. 263.

  40. Daly LE, Bourke GJ. Interpretation and uses of medical statistics, vol. xiii. 5th ed. Oxford: Blackwell Science; 2000. p. 568.

  41. Kirkwood BR, Sterne JAC. Essential medical statistics, vol. x. 2nd ed. Malden: Blackwell Science; 2003. p. 501.

  42. Le CT, Eberly LE. Introductory biostatistics, vol. xvii. 2nd ed. Hoboken, New Jersey: Wiley; 2016. p. 591.

  43. McKenzie S. Vital statistics: an introduction to health science statistics. Chatswood: Churchill Livingstone.

  44. Glantz SA. Primer of biostatistics. 7th ed. New York: McGraw-Hill Medical Pub. p. 2002.

  45. Gosall NaG G. The doctor's guide to critical appraisal. 4th ed. UK: Pastest.

  46. Glover T, Mitchell K. An introduction to biostatistics, vol. x. 3rd ed. Long Grove: McGraw-Hill; 2016. p. 487.

  47. Hill AB. Principles of medical statistics. 12th ed. New York: Oxford University Press; 1989.

Acknowledgements

The anonymous reviewers are thanked for many useful comments.

List of abbreviations and symbols

α: alpha-level. The pre-specified acceptable ceiling on the type I error rate. The threshold that defines the critical region of the PDC, or the threshold below which the P-value has to fall in order to reject H0.

β: the type II error rate. The probability of not rejecting H0 when H0 is false.

HA: the alternative hypothesis to H0 which is accepted only when H0 is rejected.

HB: the hypothesis that bias is solely responsible for the research finding.

H0: the null hypothesis. In NHST, it is only rejected when P-value < α.

HT: the test or research hypothesis. Sometimes cited as the candidate for HA. For example, the hypothesis that a drug is the cause of a difference between two sample groups, or there is an association between two variables.

μ: mu. The mean of the population.

NHST: null hypothesis significance test/testing. It is used here as an umbrella term for both “test” and “testing”; which is meant will be clear from the context.

P-value: P(observed data (or more extreme) │ H0).

PDC: probability distribution curve of the test statistic.

p: the sample proportion.

p̂: the population proportion.

ρ (rho): population Pearson correlation coefficient.

r: sample group Pearson correlation coefficient.

\(\overline{x}\): the mean of the sample group.

∧: and, used to express conjunction.

∨: or, used to express disjunction.

¬: not, used to express negation. "It is not the case that..."

≡: logical equivalence. E.g., “X ≡ Y” means proposition X is logically equivalent to proposition Y.
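
As a rough illustration of how the P-value, α and the rejection rule defined above fit together, the following Python sketch computes a two-sided P-value for a difference between two sample means. It assumes a z-test with a known common standard deviation, which is a simplifying assumption made here and not a model specified in the article; the function names and the numbers are hypothetical.

```python
import math


def z_statistic(mean1, mean2, sigma, n1, n2):
    """Standardised difference between two sample means (sigma assumed known)."""
    standard_error = sigma * math.sqrt(1 / n1 + 1 / n2)
    return (mean1 - mean2) / standard_error


def two_sided_p_value(z):
    """P(|Z| >= |z|) for standard normal Z: P(observed or more extreme | H0)."""
    return math.erfc(abs(z) / math.sqrt(2))


def reject_h0(p_value, alpha=0.05):
    """NHST decision rule: reject H0 only when the P-value is below alpha."""
    return p_value < alpha


if __name__ == "__main__":
    # Hypothetical sample means 10 and 8, sigma = 4, 32 subjects per group.
    z = z_statistic(10.0, 8.0, sigma=4.0, n1=32, n2=32)
    p = two_sided_p_value(z)
    print(f"z = {z:.2f}, P-value = {p:.4f}, reject H0: {reject_h0(p)}")
```

Consistent with the article's argument, a rejection here only says that the observed difference would be unlikely under H0 as modelled; by itself it does not establish the research hypothesis HT.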

Funding

N/a

Author information

Contributions

RM is the sole author. The author read and approved the final manuscript.

Corresponding author

Correspondence to Richard McNulty.

Ethics declarations

Ethics approval and consent to participate

N/a

Consent for publication

N/a

Competing interests

The author declares no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

McNulty, R. A logical analysis of null hypothesis significance testing using popular terminology. BMC Med Res Methodol 22, 244 (2022). https://doi.org/10.1186/s12874-022-01696-5

  • DOI: https://doi.org/10.1186/s12874-022-01696-5

Keywords

  • Logic
  • Null hypothesis significance test
  • Hypothesis testing
  • Statistical inference
  • Statistical significance
  • Type I error
  • Type II error
  • Power