A nomogram for P values
© Held; licensee BioMed Central Ltd. 2010
Received: 9 December 2009
Accepted: 16 March 2010
Published: 16 March 2010
P values are the most commonly used tool to measure evidence against a hypothesis. Several attempts have been made to transform P values to minimum Bayes factors and minimum posterior probabilities of the hypothesis under consideration. However, the acceptance of such calibrations in clinical fields is low due to inexperience in interpreting Bayes factors and the need to specify a prior probability to derive a lower bound on the posterior probability.
I propose a graphical approach which easily translates any prior probability and P value to minimum posterior probabilities. The approach makes it possible to inspect visually how the minimum posterior probability depends on the prior probability of the null hypothesis. Likewise, the tool can be used to read off, for fixed posterior probability, the maximum prior probability compatible with a given P value. The maximum P value compatible with a given prior and posterior probability is also available.
Use of the nomogram is illustrated based on results from a randomized trial for lung cancer patients comparing a new radiotherapy technique with conventional radiotherapy.
The graphical device proposed in this paper will enhance the understanding of P values as measures of evidence among non-specialists.
P values are the most commonly used tool to measure evidence against a hypothesis. The P value is defined as the probability, under the assumption of no effect (the null hypothesis H0), of obtaining a result equal to or more extreme than what was actually observed. The complexity of this definition has led to widespread misinterpretations and criticisms [2–5]. Indeed, P values are often misinterpreted (a) as the probability of obtaining the observed data under the assumption of no real effect, (b) as an "observed" type-I error rate, (c) as the false discovery rate, i.e. the probability that a significant finding is "false positive", and (d) as the (posterior) probability of the null hypothesis.
The latter misinterpretation has given rise to interesting work on the connection between P values and (posterior) probabilities of the null hypothesis. Within a Bayesian framework, the posterior probability is a function of the prior probability and the so-called Bayes factor, which summarizes the evidence against the null hypothesis.
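The relationship just described can be written out explicitly. The following is a sketch in which q denotes the prior probability of H0 and BF the Bayes factor, oriented so that small values count against H0; the symbol y for the observed data is introduced here purely for illustration.

```latex
\frac{\Pr(H_0 \mid y)}{1 - \Pr(H_0 \mid y)} = \mathrm{BF} \cdot \frac{q}{1 - q}
\quad\Longrightarrow\quad
\Pr(H_0 \mid y) = \left( 1 + \frac{1 - q}{q \cdot \mathrm{BF}} \right)^{-1}
```

Because the posterior probability increases with BF for fixed q, any lower bound on BF translates directly into a lower bound on the posterior probability of H0.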
Several attempts have been made to transform P values to lower bounds on the Bayes factor and the resulting posterior probability of the null hypothesis [7–11]. In this context Bayes factors are usually oriented like P values, so that smaller values provide stronger evidence against the null hypothesis. These techniques calibrate P values such that an interpretation as minimum Bayes factor or minimum posterior probability is justified. Although the different approaches do not result in identical calibration scales, a universal finding is that the evidence against a simple null hypothesis is far weaker than the P value might suggest.
However, the acceptance of calibrated P values in clinical fields is low. Minimum Bayes factors have the advantage that they do not depend on the prior probability of the null hypothesis, but their interpretation requires an intuitive understanding of odds, similar to likelihood ratios in diagnostic studies. Clinicians, however, prefer to think in terms of probabilities. The calculation of the minimum posterior probability, on the other hand, requires deciding on a prior probability of the null hypothesis. Fixing a prior probability may be difficult for the clinician, who would perhaps prefer to investigate, for a given P value, how the (minimum) posterior probability of the null hypothesis depends on the prior probability.
In this paper I propose a graphical approach which easily translates any prior probability and P value into minimum posterior probabilities. Likewise, the tool can be used to derive, for fixed posterior probability, the maximum prior probability compatible with a given P value. The maximum P value in accordance with a given prior and posterior probability can also be read off. The approach is inspired by the Fagan nomogram used to derive the post-test probability in diagnostic tests. It will enhance the understanding and facilitate the interpretation of P values as measures of evidence against the null hypothesis among non-specialists.
Calibration of P values
In a seminal paper, Edwards, Lindman and Savage (ELS) studied the relationship between P values and minimum Bayes factors in several settings. Of particular interest is the case where a test statistic is normally distributed with unknown mean μ. A simple null hypothesis H0 corresponds to a particular mean value μ = μ0. Calculation of the Bayes factor requires fixing a prior density for μ under the alternative hypothesis H1: μ ≠ μ0.
This scenario reflects, at least approximately, many of the statistical procedures found in medical journals.
In this setting the Bayes factor BF is bounded from below by exp(−z²/2), where z is the z-value, i.e. the test statistic which has given rise to the observed P value. This lower bound can be derived using the fact that the Bayes factor is minimized if the alternative hypothesis places all its prior density at the single value of μ best supported by the data (the maximum likelihood estimate). Because this point is always on one side of the null hypothesis, ELS suggested using a z-value based on a one-tailed rather than a two-tailed significance test. A two-tailed test, which leads to slightly larger values of z and to slightly smaller values of BF, has also been suggested.
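As a minimal sketch of this calibration, the code below converts a P value into a z-value and evaluates the ELS bound exp(−z²/2). The function name and the `two_sided` switch are illustrative choices, not part of the original paper; the normal quantile comes from Python's standard library.

```python
from math import exp
from statistics import NormalDist


def els_min_bayes_factor(p_value: float, two_sided: bool = False) -> float:
    """ELS lower bound exp(-z^2 / 2) on the Bayes factor for a given P value.

    The name and signature are hypothetical; only the formula is standard.
    """
    if not 0.0 < p_value < 1.0:
        raise ValueError("p_value must lie strictly between 0 and 1")
    # Tail area used to turn the P value into a z-value:
    # the full P value for a one-tailed test, half of it for a two-tailed test.
    tail = p_value / 2.0 if two_sided else p_value
    z = NormalDist().inv_cdf(1.0 - tail)
    return exp(-0.5 * z * z)


if __name__ == "__main__":
    for p in (0.05, 0.01, 0.001):
        one = els_min_bayes_factor(p)
        two = els_min_bayes_factor(p, two_sided=True)
        print(f"P = {p}: one-tailed bound {one:.4f}, two-tailed bound {two:.4f}")
```

For any P value below 0.5 the two-tailed z-value is the larger of the two, so the two-tailed bound is the smaller one, in line with the remark above.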
Table 1: Lower bounds on the posterior probability of the null hypothesis for different P values and equal prior probabilities of null and alternative hypothesis (q = 50%). Rows, in order: Edwards, Lindman, and Savage (1963); Berger and Sellke (1987, Scenario 1); Sellke, Bayarri, and Berger (2001); Berger and Sellke (1987, Scenario 2); Berger and Sellke (1987, Scenario 3).
The ELS approach has been refined by Berger and Sellke (BS). They derived lower bounds for the Bayes factor under more realistic families of prior distributions for μ under the alternative hypothesis. In particular, they considered (1) symmetric prior distributions, (2) unimodal and symmetric prior distributions, and (3) normal prior distributions, all centered at μ0. As one would expect, the corresponding lower bounds on the posterior probability of H0 increase with increasing restrictions on the prior family for μ, as can be seen in Table 1.
Here, x is the value of the χ² test statistic, with ν degrees of freedom, which has given rise to the observed P value. It can easily be shown that BF decreases with increasing degrees of freedom. Perhaps more interestingly, BF is equal to the BS lower bound for normal priors for ν = 1, equals the SBB lower bound for ν = 2, and is equal to the ELS lower bound for ν → ∞. This illustrates that the range of lower bounds on the posterior probability given in Table 1 reflects a large variety of different tests and scenarios.
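The coincidences for ν = 1 and ν = 2 can be checked numerically. The sketch below assumes the χ² lower bound has the form (ν/x)^(−ν/2) · exp(−(x − ν)/2) for x > ν; this form is my reconstruction, not taken from the text above, and is consistent with the three limiting cases just listed. It also uses the standard closed forms of the BS normal-prior bound, √e · z · exp(−z²/2), and the SBB bound, −e · p · log(p). All function names are hypothetical.

```python
from math import e, exp, log, sqrt
from statistics import NormalDist


def chi2_min_bayes_factor(x: float, nu: float) -> float:
    """Assumed lower bound on BF for a chi-squared statistic x with nu d.f."""
    if x <= nu:
        # Statistic no larger than its expectation under H0: bound is unity.
        return 1.0
    return (nu / x) ** (-nu / 2.0) * exp(-(x - nu) / 2.0)


def bs_normal_bound(z: float) -> float:
    """Berger-Sellke bound for normal priors (applies for z > 1)."""
    return sqrt(e) * z * exp(-0.5 * z * z) if z > 1.0 else 1.0


def sbb_bound(p: float) -> float:
    """Sellke-Bayarri-Berger bound (applies for p < 1/e)."""
    return -e * p * log(p) if p < 1.0 / e else 1.0


if __name__ == "__main__":
    p = 0.01  # illustrative P value
    # nu = 1: the chi-squared statistic is the squared two-sided z-value.
    z = NormalDist().inv_cdf(1.0 - p / 2.0)
    print("nu = 1:", chi2_min_bayes_factor(z * z, 1), "BS normal:", bs_normal_bound(z))
    # nu = 2: the chi-squared tail area is exp(-x / 2), hence x = -2 log(p).
    x2 = -2.0 * log(p)
    print("nu = 2:", chi2_min_bayes_factor(x2, 2), "SBB:", sbb_bound(p))
```

Other degrees of freedom would need a χ² quantile function, which the standard library does not provide, so the sketch is limited to the two cases with closed-form quantiles.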
A nomogram for P values
The apparent complexity of the formulae presented in the previous section may be one of the reasons why the proposed calibration of P values has not entered routine scientific research. I therefore suggest adapting a graphical device, originally developed for diagnostic tests, to the setting outlined above. The original Fagan nomogram allows the post-test probability to be determined visually for a given pre-test probability and likelihood ratio in a diagnostic test framework. The likelihood ratio is a function of sensitivity, specificity and the actual result of the diagnostic test considered. The likelihood ratio is a specific form of a Bayes factor where both hypotheses under consideration (either the patient has the disease or not) are simple and no additional prior assumptions have to be made.
There are some notable differences compared with the original Fagan nomogram. First, the likelihood ratio is replaced with the P value. Secondly, only P values smaller than 1/e ≈ 0.37 are considered since BF is unity for larger P values, where there is lack of evidence against the null hypothesis. Therefore the prior probability scale on the left-hand side of the plot is not identical to the posterior probability scale on the right-hand side of the plot. This reflects the fact that P values are asymmetric measures of evidence: they quantify the evidence against the null hypothesis, but they do not quantify the evidence in favour of the null hypothesis. This is different in the Fagan nomogram, where likelihood ratios can be both larger and smaller than unity. Finally, the third axis gives not an exact value for the posterior probability of the null hypothesis but only the minimum posterior probability.
The proposed nomogram can be used in three different ways, as will be illustrated by the following example. In 1986 a new radiotherapy technique called CHART was introduced. Promising pilot studies led the UK Medical Research Council to instigate a large randomized trial for lung cancer patients. The objective of the study was to estimate the change in survival when given CHART compared with conventional radiotherapy.
Due to the relatively small prior probability, the minimum posterior probability of the null hypothesis is in this example numerically quite close to the P value. This will be different for larger prior probabilities. For example, for q = 50% we obtain a minimum posterior probability of no survival benefit of around 4.5% (red line). For q = 90% the minimum posterior probability is 29.9% (blue line).
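The readings in this example can be approximated numerically. The sketch below assumes the nomogram is based on the Sellke-Bayarri-Berger calibration, −e · p · log(p) for p < 1/e, which matches the 1/e cut-off mentioned earlier. The P value of 0.003 is an assumed input chosen for illustration and is not stated in the text above; function names are hypothetical.

```python
from math import e, log


def sbb_min_bayes_factor(p_value: float) -> float:
    """Minimum Bayes factor -e * p * log(p); unity for p >= 1/e."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must lie in (0, 1]")
    if p_value >= 1.0 / e:
        return 1.0
    return -e * p_value * log(p_value)


def min_posterior(p_value: float, prior: float) -> float:
    """Minimum posterior probability of H0 for prior probability `prior` (q)."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    prior_odds = prior / (1.0 - prior)
    # Posterior odds = Bayes factor * prior odds (Bayes' theorem in odds form).
    posterior_odds = sbb_min_bayes_factor(p_value) * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    ASSUMED_P = 0.003  # illustrative input, not taken from the text
    for q in (0.5, 0.9):
        print(f"q = {q:.0%}: minimum posterior = {min_posterior(ASSUMED_P, q):.1%}")
```

With this assumed P value the sketch gives roughly 4.5% for q = 50% and 29.9% for q = 90%, which agrees with the figures quoted above. That agreement is a consistency check only and does not confirm the trial's reported P value.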
The Fagan nomogram is widely used in the context of diagnostic tests and I hope that the proposed nomogram for P values will reach similar popularity. It visually transforms P values to minimum posterior probabilities of the null hypothesis and thus avoids complicated calculations. Sensitivity with respect to prior assumptions can be studied graphically. In addition, for fixed posterior probability, the maximum prior probability compatible with a given P value can be read off. The maximum P value compatible with a given prior and posterior probability is also available.
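The two inverse readings can also be sketched in code. As before, this assumes the Sellke-Bayarri-Berger calibration −e · p · log(p) underlies the nomogram; the function names and the bisection tolerance are illustrative choices rather than anything specified in the paper.

```python
from math import e, log


def sbb_min_bayes_factor(p_value: float) -> float:
    """Minimum Bayes factor -e * p * log(p); unity for p >= 1/e."""
    return -e * p_value * log(p_value) if 0.0 < p_value < 1.0 / e else 1.0


def max_prior(p_value: float, posterior: float) -> float:
    """Largest prior probability of H0 whose minimum posterior is <= `posterior`."""
    posterior_odds = posterior / (1.0 - posterior)
    # Invert "posterior odds = BF * prior odds" for the prior odds.
    prior_odds = posterior_odds / sbb_min_bayes_factor(p_value)
    return prior_odds / (1.0 + prior_odds)


def max_p_value(prior: float, posterior: float, tol: float = 1e-12) -> float:
    """Largest P value (at most 1/e) whose minimum posterior is <= `posterior`."""
    target_bf = (posterior / (1.0 - posterior)) / (prior / (1.0 - prior))
    if target_bf >= 1.0:
        # Posterior not below the prior: every P value on the scale qualifies.
        return 1.0 / e
    # -e * p * log(p) increases monotonically on (0, 1/e), so bisection applies.
    lo, hi = 0.0, 1.0 / e
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sbb_min_bayes_factor(mid) < target_bf:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Illustrative inputs only: P = 0.01 and a target posterior of 5%.
    print("maximum prior:", max_prior(0.01, 0.05))
    print("maximum P value for q = 0.5:", max_p_value(0.5, 0.05))
```

The maximum prior has a closed form, whereas the maximum P value needs a numerical root search because the calibration cannot be inverted algebraically.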
In this paper I have adopted a Bayesian approach to calculate a lower bound on the posterior probability of the null hypothesis, derived from a prior probability and a precise P value. Even Cox (p. 83) agrees that "conclusions expressed in terms of probability are on the face of it more powerful than those expressed indirectly via confidence intervals and P values. Further, in principle at least, they allow the inclusion of a richer pool of [prior] information." However, Cox feels that "conclusions derived from the frequentist approach are more immediately secure than those derived from most Bayesian analysis" because [prior] "information is typically more fragile or even nebulous as compared with that typically derived more directly from the data under analysis". On the other hand, Goodman [1, 3, 6, 9] argues that the misunderstanding and misuse of P values is so widespread that new tools are needed to properly convey the strength of evidence provided by research data. The nomogram proposed in this paper is such a tool and is particularly useful for studying sensitivity to the prior probability of the null hypothesis, as illustrated in Figure 2. Combined with a precise P value, it yields a range of plausible values for the posterior probability of the null hypothesis, which is far easier to interpret than the P value itself.
The graphical device proposed in this paper enhances the understanding and facilitates the interpretation of P values as measures of evidence against the null hypothesis among non-specialists. For study sizes typically encountered in clinical and epidemiological research, the posterior probability of the null hypothesis will be quite close to the lower bound provided by the nomogram. We are currently preparing a Java applet at http://www.biostat.uzh.ch/static/pnomogram which will allow interactive use of the proposed nomogram on the internet.
I am grateful to Kaspar Rufibach and two referees for helpful comments on earlier versions of this manuscript.
1. Goodman SN: P Value. Encyclopedia of Biostatistics. 2005, Chichester: Wiley, 3921-3925. 2nd edition.
2. Cohen J: The Earth is Round (p < .05). Am Psychol. 1994, 49: 997-1003. 10.1037/0003-066X.49.12.997.
3. Goodman SN: Towards Evidence-Based Medical Statistics. 1: The P Value Fallacy. Ann Int Med. 1999, 130: 995-1004.
4. Hubbard R, Bayarri MJ: Confusion over measures of evidence (p's) versus errors (α's) in classical statistical testing (with discussion). Am Stat. 2003, 57: 171-182. 10.1198/0003130031856.
5. Spiegelhalter DJ, Abrams KR, Myles JP: Bayesian Approaches to Clinical Trials and Health-Care Evaluation. 2004, New York: Wiley.
6. Goodman SN: Introduction to Bayesian methods I: measuring the strength of evidence. Clin Trials. 2005, 2: 282-290. 10.1191/1740774505cn098oa.
7. Edwards W, Lindman H, Savage LJ: Bayesian Statistical Inference in Psychological Research. Psych Rev. 1963, 70: 193-242. 10.1037/h0044139.
8. Berger JO, Sellke T: Testing a point null hypothesis: Irreconcilability of P values and evidence (with discussion). J Am Stat Assoc. 1987, 82: 112-139. 10.2307/2289131.
9. Goodman SN: Towards Evidence-Based Medical Statistics. 2: The Bayes Factor. Ann Int Med. 1999, 130: 1005-1013.
10. Sellke T, Bayarri MJ, Berger JO: Calibration of p Values for Testing Precise Null Hypotheses. Am Stat. 2001, 55: 62-71. 10.1198/000313001300339950.
11. Johnson VE: Bayes factors based on test statistics. J Roy Stat Soc B. 2005, 67: 689-701. 10.1111/j.1467-9868.2005.00521.x.
12. Deeks JJ, Altman DG: Diagnostic tests 4: likelihood ratios. Brit Med J. 2004, 329: 168-169. 10.1136/bmj.329.7458.168.
13. Fagan TJ: Letter: Nomogram for Bayes theorem. N Engl J Med. 1975, 293: 257.
14. Spiegelhalter DJ, Myles JP, Jones DR, Abrams KR: Bayesian Methods in Health Technology Assessment: A Review. Health Technol Assess. 2000, 4 (38).
15. Hooper R: The Bayesian interpretation of a P-value depends only weakly on statistical power in realistic situations. J Clin Epidemiol. 2009, 62: 1242-1247. 10.1016/j.jclinepi.2009.02.004.
16. Cox DR: Principles of Statistical Inference. 2005, Cambridge: Cambridge University Press.
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/10/21/prepub
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.