How feasible is it to abandon statistical significance? A reflection based on a short survey

Abstract

Background

The term “statistically significant” is increasingly used in the scientific literature. However, harsh criticism of this concept has motivated a recommendation to withdraw it from scientific publications. We aimed to validate the support for this recommendation, and the feasibility of adhering to it, among researchers who had declared themselves in favor of retiring statistical significance.

Methods

We surveyed the signatories of a published article that defended this recommendation, to validate their opinion and to ask how likely they were to retire the concept of statistical significance.

Results

We obtained 151 responses, which confirmed the support for the publication in aspects such as the adequate interpretation of the p-value, the degree of agreement, and the motivations to sign it. However, answers varied widely on how likely respondents were to use the concept of “statistical significance” in future publications. About 42% declared being neutral or that they would likely use it again. We describe the arguments raised by several signatories and discuss aspects to be considered in the interpretation of research results.

Conclusions

The responses obtained from a proportion of signatories validated their declared position against the use of statistical significance. However, even in this group, the full application of this recommendation does not seem feasible. The arguments related to the inappropriate use of statistical tests should promote more education among researchers and users of scientific evidence.

Background

The culture of testing the null hypothesis through the p-value has dominated the practice of statistical inference [1]. In this sense, the level of significance is defined according to the alpha error that we would be willing to accept when rejecting the hypothesis that there is no association between variables of interest [2]. This level (often set at 0.05) is used to define whether an association is “statistically significant” according to the p-value obtained from the tests [3].

However, the p-value depends on the sample size and on the magnitude of the association, and the latter varies randomly even in the absence of bias. Because the same level of significance is applied across the board, investigations on a particular subject can reach different conclusions, especially with limited sample sizes or weak associations [4,5,6]. Using lower levels of significance (e.g., 0.005 instead of 0.05) or calculating the post-test probability has been suggested to address the lack of replication of claimed associations [6,7,8,9].
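The dependence of the p-value on sample size can be illustrated with a minimal sketch. It assumes a one-sample two-sided z-test with known standard deviation; the function name `two_sided_p_value` and the numbers used (an observed effect of 0.2 standard deviations, sample sizes of 25, 100, and 400) are hypothetical and are not taken from the survey.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(effect: float, sd: float, n: int) -> float:
    """Two-sided p-value of a one-sample z-test for an observed mean `effect`."""
    z = effect / (sd / sqrt(n))  # standardized test statistic
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same observed effect (hypothetical: 0.2 SD), three hypothetical sample sizes.
p_values = {n: two_sided_p_value(effect=0.2, sd=1.0, n=n) for n in (25, 100, 400)}

for n, p in p_values.items():
    print(f"n = {n:>3}: p = {p:.4f}")
```

With the effect held fixed, the p-value falls as the sample grows, so the same association can land on either side of a 0.05 threshold depending only on the study size.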

Taking a more radical approach, an article recently published in the journal Nature called on the entire scientific community to abandon the concept of “statistical significance” in scientific publications. More than 800 signatories supported this paper [10]. Even though we agree with most of the arguments cited in this publication, we considered that entirely abandoning statistical significance might be less useful than promoting its rational use.

Besides, the term “statistically significant” is increasingly used in the biomedical literature, as seen in PubMed searches (Fig. 1). Therefore, we wondered how feasible it would be to abolish this concept from our future publications. For this reason, we conducted a short survey among the signatories of the article calling for the retirement of statistical significance, asking how likely they were to stop using the term. We also considered it pertinent to validate the support of these researchers for the recommendation mentioned above.

Fig. 1 Count of references obtained in PubMed with the term “statistically significant” (1977–2018)

Methods

We sent a short survey by email to each of the 854 signatories between May 5 and 10, 2019 (approximately six weeks after publication). The questionnaire (attached in the supplementary information) included three questions to identify duplicate answers (country of residence, gender, and date of birth). Moreover, we added three questions to validate the support of the signatories:

  • One of them presented a scenario for the interpretation of a p-value. The correct option in this multiple-choice question described the p-value as the probability of finding values at least as extreme as those observed, assuming that the null hypothesis was true. We also counted a response as correct when participants did not select that option but gave a proper interpretation in the comments.

  • Another question aimed to confirm the support for the recommendation, and was formulated as “Currently, how much do you agree with the retiring of the statistical significance of future scientific publications?” The options were presented on a Likert scale with five possible responses.

  • We also asked about the factors that influenced the decision to sign the paper on retiring statistical significance. This question allowed choosing multiple answers from four suggested options (arguments against statistical significance, arguments in favor of alternative terms, the prestige of the authors, and the prestige of the journal of the publication) and writing down other motivations.

In addition to these questions, we asked the signatories how likely they are to use the concept of “statistical significance” in their future publications. The options included:

  • Never (I expect never to use it again)

  • Unlikely (It is unlikely that I will use it again)

  • Neutral, or it depends on the occasion

  • Likely (It is likely that I will use it again)

  • Always (I will use it every time I have the chance)

We presented the distribution of answers to this question as raw frequencies and, to address potential bias from non-response, we also calculated values weighted by the inverse of the probability of responding. To calculate this probability by sex and country, we grouped the geographical origin into five categories for women and ten for men: respectively, the three and the eight countries with the highest number of responders, plus two regions grouping the remaining countries. Four participants (two men and two women) did not provide data on residence, so we assigned them the average probability of response for their sex category. We subtracted the weight of these participants from the total and distributed the remainder among the other participants in the corresponding sex category. In this way, the sum of all the weights remained equal to the number of signatories.
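The weighting described above can be sketched as follows. This is only an illustration under stated assumptions: the strata, the counts, and the names `strata`, `neutral_or_more`, and `inverse_probability_weights` are hypothetical and do not reproduce the study data (Table 1), and the special handling of the four respondents with missing residence is omitted.

```python
# Hypothetical strata (sex, region) -> (signatories, responders).
# These counts are invented for illustration; they are NOT the study data.
strata = {
    ("women", "country A"): (40, 4),
    ("women", "other"): (60, 5),
    ("men", "country A"): (200, 50),
    ("men", "other"): (300, 60),
}

# Hypothetical number of responders per stratum giving a certain answer.
neutral_or_more = {
    ("women", "country A"): 2,
    ("women", "other"): 2,
    ("men", "country A"): 20,
    ("men", "other"): 26,
}


def inverse_probability_weights(strata):
    """Weight of each responder = 1 / P(response) = signatories / responders."""
    return {key: signatories / responders
            for key, (signatories, responders) in strata.items()}


weights = inverse_probability_weights(strata)

# The weights of all responders add up to the number of signatories.
total_weight = sum(weights[k] * strata[k][1] for k in strata)
total_signatories = sum(signatories for signatories, _ in strata.values())

# Raw versus weighted proportion of the answer of interest.
total_responders = sum(responders for _, responders in strata.values())
raw_share = sum(neutral_or_more.values()) / total_responders
weighted_share = sum(weights[k] * neutral_or_more[k] for k in strata) / total_weight

print(f"sum of weights = {total_weight:.0f}, signatories = {total_signatories}")
print(f"raw share = {raw_share:.3f}, weighted share = {weighted_share:.3f}")
```

Each responder stands in for the signatories of the same stratum who did not answer, which is why strata with low response rates receive larger weights.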

At the end, the questionnaire had an open question to record additional comments; those we considered relevant to the discussion are presented in the supplementary material. We excluded any data that could potentially identify the respondent or another person. Since the questionnaire was anonymous, we were not able to send personalized reminders to non-respondents.

This study was evaluated and approved by the Ethics Committee of the School of Public Health of the University of São Paulo. A link to an informed consent form was sent to the participants in the invitation email. The survey was intended to be answered anonymously, and any data suggesting an individual identity were treated confidentially.

Results

Out of the forms sent to the 854 signatories, we received 153 responses. We excluded two (one because the respondent claimed to be lying and another because it was considered a duplicate), obtaining a total of 151 valid responses, mostly from men (n = 136), with a median age of 43 years (interquartile range: 36 to 56). Considering the characteristics of all signatories, we observed a higher response rate among men than among women, and notable differences between countries (Table 1).

Table 1 Response rates according to sex and country categories of signatories

Regarding the question on the interpretation of the p-value, we counted 136 correct answers (133 closed selections and three open responses), corresponding to 90.1% of all respondents. Five participants neither answered this question nor provided a justification. Ten participants answered it incorrectly: all of them were male and came from nine different countries.

Regarding the current degree of agreement with the decision to retire statistical significance, 98 (65%) answered that they strongly agreed, 49 (32%) partially agreed, three neither agreed nor disagreed, and only one indicated strong disagreement. The latter stated that they did “not agree with the title of the essay” and that “it is unfortunate that the press and colleagues (…) are focusing on the title”.

Concerning the motivations to sign, the majority (142/151, 94%) answered that it was because of the arguments against statistical significance, followed by the arguments in favor of the alternatives (91/151, 60.3%). Only a minority of the respondents acknowledged that the prestige of the authors (n = 9) or of the journal (n = 12) was part of their motivation; however, for none of them was this the only motivation. Additionally, 20 respondents reported other motives, such as problems of misinterpretation and misuse of p-values.

Regarding the probability of using the concept of statistical significance in future publications, 35 (23%) responded that they expected never to use it again, and 52 (34%) said it would be unlikely. On the other hand, 34 (23%) were neutral or answered that it depends on the occasion, 29 (19%) indicated that they would likely use it again, and one stated that they would use it whenever they had the opportunity (Fig. 2a). We obtained similar results when we weighted the frequencies by sex and country of residence (Fig. 2b).
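As a quick arithmetic check, the raw percentages can be recomputed from the counts reported in the paragraph above. The short labels used as dictionary keys are abbreviations of the answer options, and the variable names are illustrative.

```python
# Raw counts reported in the text for each answer option (151 valid responses).
counts = {
    "Never": 35,
    "Unlikely": 52,
    "Neutral": 34,
    "Likely": 29,
    "Always": 1,
}

total = sum(counts.values())
shares = {option: 100 * n / total for option, n in counts.items()}

# Respondents who were neutral or expected to use the concept again.
neutral_or_more = sum(counts[option] for option in ("Neutral", "Likely", "Always"))
share_neutral_or_more = 100 * neutral_or_more / total

for option, share in shares.items():
    print(f"{option:<8} {counts[option]:>3}  {share:.1f}%")
print(f"Neutral, likely or always: {neutral_or_more} ({share_neutral_or_more:.1f}%)")
```

The last line reproduces the "about 42%" figure: 64 of the 151 respondents were neutral or expected to use the concept again.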

Fig. 2 In your future publications, how likely are you to use the concept of “statistical significance”?

We received several comments on the matter (supplementary material), of which we highlight the following:

  • ""Significance" with firm thresholds is the problem. The credibility of a result is multi-determined. P-level is one of the determinants, but only one.

    The main - really the only - thing that’s needed to determine the credibility of a result is replication. There is no shortcut; you can’t know what the study would find if you repeated it, unless you repeat it. The use of "significance" and even exact p-levels typically is an attempt to avoid this stubborn truth."

  • "Although I signed in agreement with the article, I do not think the title was properly reflecting the spirit with which it was written. We are not advocating to drop statistical significance altogether, but to make a more mindful use of it. The main mistakes are 1) to think the p-value gives us a measure of the strength or magnitude of a relationship, for example. 2) a p value can help use supporting or rejecting alternative hypothesis. We need to incorporate measures that make sense in the system we are studying. Effect sizes, confidence intervals, Bayesian or Information Theory approaches in addition to the classical stats."

  • "It is impossible to interpret a p-value in the absence of some prior estimate of the probability of the null hypothesis being true (or false). I am much more in favor of presenting the Bayes Factor Bound."

  • "The paper in question proposed to stop using the term "statistical significance". It did NOT propose that p values should be banned, but only that they should not be dichotomized. I proposed that p values should be supplemented by a number that represents the risk that a "positive" test is a false positive."

Discussion

This independent survey may be considered a validation of the support of a group of researchers for the recommendation to abandon the concept of statistical significance. Most of the signatories correctly interpreted the p-value. This result is not trivial, because some studies suggest that misinterpretation of the p-value can be frequent even among academics [11,12,13]. Besides, most responders strongly agreed with abandoning “statistical significance” and, for the most part, were motivated by the arguments presented against this concept.

However, regarding the feasibility of abandoning statistical significance, only about a quarter expected never to use this concept again. On the other hand, about 42% declared being neutral or that they would likely use it in future publications. Assuming that the researchers surveyed represent those opposed to the concept, the distribution of answers to this question suggests that fully retiring statistical significance does not seem feasible. Although there were evident differences in the response rate by sex and country category, weighting for these variables led to similar results for this question. Therefore, we considered that this finding would not be explained by selection bias.

Because we sought a high response rate, we did not include questions about the reasons why signatories would use statistical significance again in future publications. Therefore, we cannot infer that those who use the concept will base their conclusions solely or primarily on it. It is also possible that those who continue to use the term would be motivated by compliance with the expectations of journals, reviewers, or readers rather than by their own way of interpreting results.

The p-value will continue to be reported, and some dichotomization of results seems inevitable regardless of the criterion chosen. Despite this, based on the validation we have made, we consider that Amrhein et al. identified in their paper a legitimate concern of researchers from different areas [10]. In that sense, we agree on the importance of not basing a research finding only on statistical significance [14, 15].

An aspect worth highlighting is the need to differentiate the scenarios in which statistical significance is applied [12]. For example, there is a critical distinction between studies of causal inference and those for prediction purposes. In the latter, the interpretability of the estimates may be optional, and statistical criteria can guide decisions on whether to use a predictor [16, 17].

However, in studies of causal inference, statistical significance should not be a primary concern. In those cases, efforts must focus on adequate research design and analysis to avoid bias, control confounding, and consider possible effect modification [18, 19]. After that, measures of association and impact should define when a result is significant in clinical and public health terms [20].

Therefore, it is not surprising that one of the major concerns expressed by several signatories is the misuse and misinterpretation of the p-value. Also, well-documented publication biases in favor of “positive results” are a consequence of the overvaluation of statistical significance [21, 22]. These are frequent concerns among editors and statistical consultants of biomedical journals. For example, The New England Journal of Medicine recently modified its guidelines for statistical reporting by including a requirement to replace P values with estimates and confidence intervals when neither the protocol nor the analysis plan has specified methods to adjust for multiplicity [23].

We agree that the value of a result must be based on the interpretation of the spectrum of values compatible with the data, as Amrhein et al. suggested [10]. However, removing a term such as statistical significance is far from a solution to publication bias. Regrettably, even with point and interval estimates, associations compatible with the null value would undoubtedly continue to be under-reported. Moreover, the absence of a preset threshold for interpreting a p-value could increase subjectivity [24].

Faced with the seemingly inevitable use of statistical significance [21], we must give statistical tests their due value while promoting an understanding of their limitations [25]. In that sense, one critical issue is the widespread application of an arbitrary significance level (i.e., 0.05) [12, 26]. As an analogy, diagnostic tests may need different cut-offs according to disease prevalence to maintain high predictive values [27]. Similarly, it would be negligent to use the same significance level for all research problems. The pre-test probability of an association would help to define a cut-off that increases the chance of both identifying true associations and discarding spurious ones [7]. Nevertheless, no value should become a new rule of thumb applicable to all situations.
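The diagnostic-test analogy can be made concrete with a minimal sketch. It assumes the standard predictive-value formula applied to hypothesis tests, with the significance level playing the role of the false positive rate and power playing the role of sensitivity. The function name `positive_predictive_value` and the priors, power, and thresholds used are hypothetical illustrations, not values from the survey.

```python
def positive_predictive_value(prior: float, alpha: float, power: float) -> float:
    """Probability that a 'significant' result reflects a true association.

    prior: pre-test probability that the association is real
    alpha: significance level (false positive rate under the null)
    power: probability of detecting the association when it is real
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)


# Hypothetical scenarios: (pre-test probability, significance level).
scenarios = [(0.10, 0.05), (0.50, 0.05), (0.10, 0.005)]
ppv = {s: positive_predictive_value(prior=s[0], alpha=s[1], power=0.80)
       for s in scenarios}

for (prior, alpha), value in ppv.items():
    print(f"prior = {prior:.2f}, alpha = {alpha:.3f}: PPV = {value:.3f}")
```

Under these assumptions, the same threshold of 0.05 yields a much lower predictive value when the pre-test probability is low, which is the sense in which a single cut-off cannot serve every research problem. The comparison at a fixed power of 0.80 is a simplification, since lowering the threshold in a given study would also lower its power.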

Reducing the significance level lowers the false positive rate but raises the false negative rate; that is, it reduces the power of a study. This can be a problem when decisions have to be based on studies with small sample sizes, such as in preliminary outbreak investigations or in research on extremely rare but severe diseases.
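The trade-off between the significance level and power can be sketched numerically. The example assumes a one-sample two-sided z-test with known standard deviation; the function name `power_two_sided_z`, the effect size of 0.5 standard deviations, and the sample size of 32 are hypothetical choices made only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sided_z(effect: float, sd: float, n: int, alpha: float) -> float:
    """Power of a one-sample two-sided z-test to detect a true mean `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the test
    shift = effect / (sd / sqrt(n))             # mean of the test statistic
    upper_tail = 1 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail


# Same hypothetical small study evaluated at two significance levels.
power_at_005 = power_two_sided_z(effect=0.5, sd=1.0, n=32, alpha=0.05)
power_at_0005 = power_two_sided_z(effect=0.5, sd=1.0, n=32, alpha=0.005)

print(f"alpha = 0.05 : power = {power_at_005:.2f}")
print(f"alpha = 0.005: power = {power_at_0005:.2f}")
```

In this sketch, moving the threshold from 0.05 to 0.005 without increasing the sample size roughly turns a study with about 80% power into one with about 50% power.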

On the other hand, a study aimed at replicating or confirming results from other well-designed studies would not need to use the same level of significance, since the state of knowledge has changed. A higher significance level could be justified when previous studies suggest a high pre-test probability.

Moreover, other issues may need to be considered in each case, such as the implications of false positive and false negative results. For example, it does not seem sensible to use the same significance level to approve a drug with a high risk of adverse effects as for a low-risk educational intervention. In the former, we would probably be more interested in ruling out an alpha error. As with diagnostic tests, the cut-off point for significance should also be adjusted to increase the likelihood that our research will cause more benefit than harm.

For all the above reasons, it is likely that we have not yet found a magic formula for choosing levels of significance. Therefore, we share the frustration of decisions being guided by an arbitrary or poorly justified rule. Statistical significance may play a supporting role, but not a leading one. However, rather than trying to abolish this concept, we consider it necessary to develop strategies to justify and predefine significance levels, considering both the evidence and the implications of errors resulting from the statistical tests. Moreover, efforts to define what is clinically or epidemiologically significant may be more useful to guide research and interventions [18, 19].

Conclusions

The responses obtained from a proportion of signatories validated their declared position against the use of statistical significance. However, even in this group, the full application of this recommendation does not seem feasible (at least in the short term). The arguments against the inappropriate use of statistical tests should promote more education among researchers and users of scientific evidence. Probably, the main problem does not lie in choosing a cutoff for the p-value, but in our difficulty recognizing the limitations of both statistics and rules.

Availability of data and materials

All data generated and analyzed during this study are included in this published article and its supplementary material file.

References

  1. Lash TL. The harm done to reproducibility by the culture of null hypothesis significance testing. Am J Epidemiol. 2017;186:627–35.

  2. Fisher R. Statistical methods for research workers. 14th ed. Edinburgh: Oliver and Boyd; 1970.

  3. Goodman S. A dirty dozen: twelve P-value misconceptions. Semin Hematol. 2008;45:135–40.

  4. Altman N, Krzywinski M. Interpreting P values. Nat Methods. 2017;14:213–4. https://doi.org/10.1038/nmeth.4210.

  5. Greenland S. Nonsignificance plus high power does not imply support for the null over the alternative. Ann Epidemiol. 2012;22:364–8.

  6. Mark DB, Lee KL, Harrell FE. Understanding the Role of P Values and Hypothesis Tests in Clinical Research. JAMA Cardiol. 2016;1:1048–54.

  7. Ioannidis JPA. The Proposal to Lower P Value Thresholds to .005. JAMA. 2018:E1–2. https://doi.org/10.1001/jama.2018.1536.

  8. Colquhoun D. The false positive risk: a proposal concerning what to do about p-values. Am Stat. 2019;2019:192–201.

  9. Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers E-J, Berk R, et al. Redefine statistical significance. Nat Hum Behav. 2018;2:6–10. https://doi.org/10.1038/s41562-017-0189-z.

  10. Amrhein V, Greenland S, McShane BB. Retire statistical significance. Nature. 2019;567:305–7. https://www.nature.com/articles/d41586-019-00857-9.

  11. Badenes-Ribera L, Frías-Navarro D, Monterde-I-Bort H, Pascual-Soler M. Interpretation of the p value: a national survey study in academic psychologists from Spain. Psicothema. 2015;27:290–5.

  12. Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016;31:337–50. https://doi.org/10.1007/s10654-016-0149-3.

  13. Cassidy SA, Dimova R, Giguère B, Spence JR, Stanley DJ. Failing Grade: 89% of Introduction-to-Psychology Textbooks That Define or Explain Statistical Significance Do So Incorrectly. Adv Methods Pract Psychol Sci. 2019;2:233–9. https://doi.org/10.1177/2515245919858072.

  14. Wasserstein RL, Schirm AL, Lazar NA. Moving to a World Beyond “p < 0.05.”. Am Stat. 2019;73:1–19.

  15. Ho J, Tumkaya T, Aryal S, Choi H, Claridge-Chang A. Moving beyond P values: data analysis with estimation graphics. Nat Methods. 2019;16:565–6.

  16. Johansson U, Sönströd C, Norinder U, Boström H. Trade-off between accuracy and interpretability for predictive in silico modeling. Futur Med Chem. 2011;3:647–63.

  17. Schooling CM, Jones HE. Clarifying questions about “risk factors”: predictors versus explanation. Emerg Themes Epidemiol. 2018;15:10.

  18. Jakobsen JC, Gluud C, Lange T, Wetterslev J. The thresholds for statistical and clinical significance - a five-step procedure for evaluation of intervention effects in randomised clinical trials. BMC Med Res Methodol. 2014;14:1–12.

  19. Koretz RL. Assessing the evidence in evidence-based medicine. Nutr Clin Pract. 2019;34:60–72.

  20. Glass TA, Goodman SN, Hernán MA, Samet JM. Causal inference in public health. Annu Rev Public Health. 2013;34:61–75.

  21. Kyriacou DN. The enduring evolution of the P value. JAMA. 2016;315:1113–5.

  22. Perneger TV, Combescure C. The distribution of P-values in medical research articles suggested selective reporting associated with statistical significance. J Clin Epidemiol. 2017;87:70–7. https://doi.org/10.1016/j.jclinepi.2017.04.003.

  23. Harrington D, D’Agostino RB, Gatsonis C, Hogan JW, Hunter DJ, Normand S-LT, et al. New guidelines for statistical reporting in the journal. N Engl J Med. 2019;381:285–6. https://doi.org/10.1056/NEJMe1906559.

  24. Ioannidis JPA. Retiring significance: a free pass to bias. Nature. 2019;567:461. https://www.nature.com/magazine-assets/d41586-019-00969-2/d41586-019-00969-2.pdf. Accessed 6 Apr 2019.

  25. Wasserstein RL, Lazar NA. The ASA’s statement on p-values: context, process, and purpose. Am Stat. 2016;70:129–33.

  26. Lakens D. On the challenges of drawing conclusions from p-values just below 0.05. PeerJ. 2015;3:e1142.

  27. Weitkunat R, Kaelin E, Vuillaume G, Kallischnigg G. Effectiveness of strategies to increase the validity of findings from association studies: size vs replication. BMC Med Res Methodol. 2010;10:47.

Acknowledgments

The authors would like to thank the signatories who kindly responded to the survey.

Funding

The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors. FADQ was granted a fellowship for research productivity from the Brazilian National Council for Scientific and Technological Development – CNPq, process/contract identification: 312656/2019–0.

Author information

Contributions

FADQ conceived the study, participated in the design and coordination thereof, analyzed the data and prepared the first draft of the manuscript. FMC and JMNS participated in the preparation of the questionnaire, data collection and the interpretation and discussion of results. All authors read, carried out critical reviews of the manuscript and approved the final manuscript.

Corresponding author

Correspondence to Fredi Alexander Diaz-Quijano.

Ethics declarations

Ethics approval and consent to participate

This study was evaluated and approved by the Ethics Committee of the School of Public Health of the University of São Paulo. The committee waived the need for written or verbal consent. However, a link to an informed consent form was sent to the participants in the invitation email. Moreover, the text of this email stated that “by answering the survey, I [FADQ] am assuming you have read the form, and you consent freely to participate.”

Consent for publication

Not applicable because the manuscript does not contain individual data.

Competing interests

None of the authors has any competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Cite this article

Diaz-Quijano, F.A., Calixto, F.M. & da Silva, J.M.N. How feasible is it to abandon statistical significance? A reflection based on a short survey. BMC Med Res Methodol 20, 140 (2020). https://doi.org/10.1186/s12874-020-01030-x

Keywords

  • p values
  • Null-hypothesis
  • Statistical inference
  • Statistical significance