 Research Article
 Open Access
Analysis of Bayesian posterior significance and effect size indices for the two-sample t-test to support reproducible medical research
BMC Medical Research Methodology volume 20, Article number: 88 (2020)
Abstract
Background
The replication crisis hit the medical sciences about a decade ago, but most of the flaws inherent in null hypothesis significance testing (NHST) remain unsolved today. While the drawbacks of p-values have been detailed in countless venues, only a few attractive alternatives have been proposed to replace p-values and NHST in clinical research. Bayesian methods are one of them, and they are gaining increasing attention in medical research because, in contrast to the frequentist framework, they describe model parameters in terms of probability and allow prior information to be incorporated. While Bayesian methods are not the only remedy to the situation, there is increasing agreement that they are an essential way to avoid common misconceptions and false interpretations of study results. The requirements for applying Bayesian statistics have shifted from detailed programming knowledge to simple point-and-click programs like JASP. Still, the multitude of Bayesian significance and effect measures that compete with the gold standard of significance in medical research, the p-value, causes a lack of agreement on which measure to report.
Methods
Therefore, in this paper, we conduct an extensive simulation study to compare common Bayesian significance and effect measures which can be obtained from a posterior distribution. In it, we analyse the behaviour of these measures for one of the most important statistical procedures in medical research and in particular clinical trials, the two-sample Student’s (and Welch’s) t-test.
Results
The results show that some measures cannot state evidence for both the null and the alternative. While the different indices behave similarly regarding increasing sample size and noise, the prior modelling influences the obtained results, and extreme priors allow for cherry-picking similar to p-hacking in the frequentist paradigm. The indices behave quite differently regarding their ability to control the type I error rates and regarding their ability to detect an existing effect.
Conclusion
Based on the results, two of the commonly used indices can be recommended for more widespread use in clinical and biomedical research, as they improve the type I error control compared to the classic two-sample t-test and enjoy multiple other desirable properties.
Background
In randomised clinical trials (RCT), the two-sample Student’s and Welch’s t-test is one of the most popular statistical procedures conducted. The goal is often to test the efficacy of a new treatment or medication and to investigate the size of an effect. Common settings use a treatment and a control group, and the goal is to measure differences in a response variable like blood pressure. The gold standard in medical research for deciding if a new treatment or drug was more effective than the control treatment or drug is the p-value. The p-value states whether the researcher can deem the observed difference significant, that is, unlikely to have occurred under the assumption of the null hypothesis. The dominance of p-values when comparing two groups in medical (and other) research is overwhelming: Nuijten et al. [1] showed in a meta-analysis that of 258,105 p-values reported in journals between 1985 and 2013, 26% belonged to a t-statistic, see also Wetzels et al. [2].
In its most restricted setting, the two-sample Student’s t-test assumes normally distributed data with identical variances, that is \(Y_{1i}\sim \mathcal {N}(\mu _{1},\sigma ^{2}), Y_{2j}\sim \mathcal {N}(\mu _{2},\sigma ^{2})\), and tests the null hypothesis of no difference at all, that is H_{0}:μ_{2}=μ_{1}, assuming equal sample sizes \(i,j=1,...,n, n\in \mathbb {N}\). Removing the restriction of homoscedasticity (the assumption of identical variances \(\sigma _{1}^{2} = \sigma _{2}^{2}\) in both groups) and the assumption of identical sample sizes in both groups leads to the well-known Behrens-Fisher problem, which remains unsolved to this day. The typical practice is to proceed with an approximate solution, known as Welch’s two-sample t-test. These approximate solutions are quite reliable, but as frequentist testing makes use of sampling statistics, which only allow rejecting the null hypothesis via p-values, confirming a research hypothesis is not possible. The general procedure of null hypothesis significance testing (NHST), which uses sampling statistics to reject a null hypothesis via p-values, makes formulating a reasonable research hypothesis complicated, as the research hypothesis first has to be rephrased in the form of a rejectable null hypothesis. In some cases, this is not possible at all, further limiting the usefulness of NHST in applied research. Countless papers have criticised the misuse and abuse of p-values, in particular in medical research, and official statements of the American Statistical Association (ASA) in 2016 and 2019 by Wasserstein & Lazar [3] and Wasserstein et al. [4] make clear that tensions have not eased. Current practice shows that the p-value as a measure of significance is still widely used and resilient to the repeated criticism [5], while being prone to overestimating effects, stating effects where none exist in reality, and false interpretation by scientists [6]. This problem is observed especially in clinical research, see Ioannidis [7].
Among the proposed solutions to the problems of NHST is a shift to Bayesian statistics [4]. It is commonly agreed that a more widespread use of Bayesian methods can at least partially improve the reliability of medical research on a statistical basis [8–10]. Recently, the development of Bayesian counterparts to frequently used statistical tests in medical and social science, including Student’s and Welch’s two-sample t-test, has opened up new possibilities for researchers: Open-source programs like JASP (https://jasp-stats.org) implement a broad spectrum of Bayesian methods and make them available to a wide range of researchers via a simple point-and-click user interface similar to SPSS.
Given the general recommendation of a shift towards the Bayesian paradigm, it is sensible to ask what benefits come with this shift. While NHST focusses on hypothesis testing via p-values and stating the significance of an observed effect, the Bayesian philosophy proceeds by the formulation of a statistical model, the inclusion of available prior information into the analysis, and the derivation of the posterior distribution of the parameters of interest, for example, the effect size in the setting of Student’s two-sample t-test. By employing the posterior distribution instead of point estimates, the Bayesian philosophy directly fosters estimation under uncertainty, in contrast to NHST, which commonly uses point estimates like maximum likelihood estimates with confidence intervals, which are often misinterpreted.
In NHST, testing for the significance of an effect is the standard approach, but the significance of an effect does not imply that the discovered relationship is also scientifically meaningful. It only means that the observed effect is unlikely to be observed under the assumption of the null hypothesis, no matter how large or small it is. Also, a non-significant result does not indicate that the null hypothesis is correct. Together, these drawbacks of NHST can be seen as the reason why multiple measures of significance and magnitude of an effect based on the posterior distribution have been proposed in the Bayesian literature. In the Bayesian paradigm, inferences about the parameters of interest are drawn from the posterior distribution, and testing is optional. In practice, drawing conclusions from the posterior distribution is achieved by using different posterior indices. There are measures which state the significance of an effect, and measures which also gauge its size. Among them are the Bayes factor introduced by Jeffreys [11], the region of practical equivalence (ROPE) championed by Kruschke [12], the probability of direction (PD) as detailed in Makowski et al. [13], the MAP-based p-value proposed by Mills [14], and the Full Bayesian Significance Test (FBST) featuring the e-value, which was introduced by Pereira, Stern and Wechsler [15, 16]. The appropriateness of these indices is still debated in the literature, which makes it challenging to choose among them, because there is so far no explicit agreement on which index researchers should use to report the results of a Bayesian analysis [10, 17–19].
What is missing are specific investigations of which of the available measures of significance and effect size are appropriate for a specific statistical method like the two-sample Student’s and Welch’s t-test. The results of such studies could guide scientists in the selection of an appropriate index to assess the result of a two-sample Student’s or Welch’s t-test performed in the analysis of clinical trial data. In order to provide such guidance, this paper investigates the behaviour of common Bayesian posterior indices for the presence and size of an effect in the setting of the two-sample Student’s and Welch’s t-test.
Indices of significance and magnitude of an observed effect
In this section, we briefly review the existing Bayesian indices of significance and magnitude of an observed effect. Reviewing the most commonly used indices provides a firm basis for understanding the simulation study reported later in this paper, and also encourages a critical reflection on each of the indices.
The Bayes factor (BF)
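The oldest and still widely used index is the Bayes factor (BF). Bayesian hypothesis testing is often associated with the Bayes factor BF_{01}, the predictive updating factor which measures the change in relative beliefs about both hypotheses H_{0} and H_{1} given the data x:
$$\frac{P(H_{0}\mid x)}{P(H_{1}\mid x)} = \underbrace{\frac{p(x\mid H_{0})}{p(x\mid H_{1})}}_{BF_{01}(x)} \cdot \frac{P(H_{0})}{P(H_{1})} \qquad (1)$$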
The Bayes factor BF_{01} can be rewritten as the ratio of the marginal likelihoods of the two models, each of which is calculated by integrating out the respective model parameters according to their prior distribution. Generally, the calculation of these marginals can be complex for non-trivial models. In the setting of the two-sample Student’s t-test, the Bayes factor is used for testing a null hypothesis H_{0}:δ=0 of no effect against a one- or two-sided alternative H_{1}:δ>0, H_{1}:δ<0 or H_{1}:δ≠0, where δ=(μ_{1}−μ_{2})/σ is the effect size according to Cohen [20, p. 20], under the assumption of two independent samples and identical standard deviation σ in each group. An often lamented problem with Bayes factors, as detailed in Kamary et al. [21] and Robert [17], is the dependence on the prior distributions assigned to the model parameters. Nevertheless, the Bayes factor has deep roots in Bayesian thinking and is one of the most widely used measures for hypothesis testing. Over the years, several authors including Jeffreys [11], Kass and Raftery [22] or Van Doorn et al. [23] have offered thresholds for interpreting different values of it. For example, according to Van Doorn et al. [23], a Bayes factor BF_{10}>3 can be interpreted as moderate evidence for the alternative H_{1} relative to the null hypothesis H_{0}, and a Bayes factor BF_{10}>10 can be interpreted as strong evidence in the same way. Note that the Bayes factor BF_{10} can be obtained by inverting BF_{01} in equation (1), that is: BF_{10}=p(x|H_{1})/p(x|H_{0})=1/BF_{01}. So, if for example BF_{01}=4 states moderate evidence for the null hypothesis H_{0}:δ=0, then BF_{10}=1/BF_{01} is obtained as 1/4 for the alternative hypothesis H_{1}:δ≠0.
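The relationship between BF_{01} and BF_{10} and the two thresholds quoted above can be made concrete with a small sketch. The function names are hypothetical, and only the cut-offs named in the text (3 and 10) are encoded; anything weaker is simply reported as below the moderate threshold.

```python
def invert_bf(bf):
    """Turn BF_01 into BF_10 (or vice versa) by taking the reciprocal."""
    if bf <= 0:
        raise ValueError("a Bayes factor must be positive")
    return 1.0 / bf


def evidence_label(bf10):
    """Return (favoured hypothesis, strength) for a given BF_10.

    Only the thresholds 3 (moderate) and 10 (strong) from the text are used.
    """
    favoured = "H1" if bf10 >= 1 else "H0"
    # Express the evidence as a ratio >= 1 in favour of the favoured hypothesis.
    ratio = bf10 if bf10 >= 1 else invert_bf(bf10)
    if ratio > 10:
        strength = "strong"
    elif ratio > 3:
        strength = "moderate"
    else:
        strength = "below the moderate threshold"
    return favoured, strength


bf01 = 4.0                 # the example value used in the text
bf10 = invert_bf(bf01)     # reciprocal, i.e. 1/4
print(bf10, evidence_label(bf10))
```

Reading the evidence as a ratio of at least one in favour of whichever hypothesis is supported avoids having to remember two mirrored sets of thresholds.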
The region of practical equivalence (ROPE)
The region of practical equivalence was championed by Kruschke [24], who stresses that such a region is often observed in different scientific domains under different names “such as indifference zone, range of equivalence, equivalence margin, margin of non-inferiority, smallest effect size of interest, and good-enough belt” (Kruschke [19, p. 272]). The essential idea is that in applied research, parameter values can often be termed practically equivalent if they lie in a given range. Starting from the posterior distribution of the parameter of interest, researchers should interpret values inside the region of practical equivalence (ROPE) as equivalent. For example, when conducting a clinical trial which compares the weight in kilograms of patients in two groups, one could define that the difference of means μ_{2}−μ_{1} is practically equivalent to zero if it lies inside the ROPE [−1,1]. That means a difference of at most one kilogram is interpreted as practically equivalent to zero. If the posterior distribution of μ_{2}−μ_{1} is entirely located inside the ROPE, the difference μ_{2}−μ_{1} is interpreted as practically equivalent to zero a posteriori. On the other hand, if the total probability mass of the posterior distribution of μ_{2}−μ_{1} is located outside the ROPE, the null hypothesis μ_{2}=μ_{1} of no difference can be rejected. The same procedure can be applied to any parameter θ of interest. If the probability mass of the posterior lies partially inside and partially outside the ROPE, the situation is inconclusive.
There are two versions of the ROPE, one in which the 95% Highest Posterior Density (HPD) interval is used for the analysis (95% ROPE), and one in which the full posterior distribution is used (full ROPE). For the effect size δ, Kruschke [24] proposed to use [−0.1,0.1] as the ROPE for the null hypothesis H_{0}:δ=0 of no effect, which is half of the effect size necessary for at least a small effect according to Cohen [20] (a small effect is defined as 0.2≤δ<0.5 or −0.5<δ≤−0.2 according to Cohen [20]).
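Both ROPE variants can be illustrated with posterior draws. The following is a minimal sketch, assuming the posterior is available as a list of draws and that the HPD interval is approximated by the shortest window containing the requested share of sorted draws. The function names and the simulated draws are hypothetical, and the details need not match any particular package.

```python
import math
import random


def hpd_interval(draws, mass=0.95):
    """Shortest interval containing at least `mass` of the sorted draws."""
    xs = sorted(draws)
    n = len(xs)
    k = max(1, math.ceil(mass * n))  # number of draws the interval must hold
    # Slide a window of k consecutive sorted draws and keep the narrowest one.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


def rope_share(draws, rope=(-0.1, 0.1), mass=None):
    """Share of posterior draws inside the ROPE.

    mass=None uses all draws (full ROPE); mass=0.95 first restricts the
    draws to the 95% HPD interval (95% ROPE).
    """
    if mass is not None:
        lo, hi = hpd_interval(draws, mass)
        draws = [d for d in draws if lo <= d <= hi]
    inside = sum(rope[0] <= d <= rope[1] for d in draws)
    return inside / len(draws)


random.seed(1)
# Hypothetical posterior draws of the effect size delta.
posterior_draws = [random.gauss(-0.9, 0.5) for _ in range(5000)]
print("full ROPE:", rope_share(posterior_draws))
print("95% ROPE :", rope_share(posterior_draws, mass=0.95))
```

A share of zero means the (restricted) posterior lies completely outside the ROPE, a share of one means it lies completely inside, and anything in between is the inconclusive case described above.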
The probability of direction (PD)
The probability of direction is detailed in Makowski et al. [13] and varies between 50% and 100%. It is defined as the proportion of the posterior distribution of the parameter that has the same sign as the median. Therefore, if the posterior distribution assigns probability mass to both positive and negative parameter values, and the median is positive, it is the percentage of the posterior distribution’s probability mass located on the positive real numbers (0,∞).
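Given posterior draws, this definition translates almost directly into code. The sketch below uses a hypothetical function name and assumes the posterior is represented by a plain list of draws.

```python
import statistics


def probability_of_direction(draws):
    """Share of posterior draws that have the same sign as the median."""
    median = statistics.median(draws)
    if median >= 0:
        same_sign = sum(d > 0 for d in draws)
    else:
        same_sign = sum(d < 0 for d in draws)
    return same_sign / len(draws)


# Toy draws: the median is negative and three of the four draws are negative.
print(probability_of_direction([-2.0, -1.0, -0.5, 0.3]))
```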
The MAP-based p-value
The MAP-based p-value was proposed by Mills [14] (see also Makowski et al. [13]), and can be related to the odds that a parameter has against the null hypothesis: It is defined as the ratio of the posterior density at the null value and the value of the posterior density at the maximum a posteriori (MAP) value, which is the equivalent of the mode for continuous probability distributions.
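As a rough illustration of this ratio, the sketch below evaluates a posterior density on a grid, takes its maximum as the MAP density and divides the density at the null value by it. The normal posterior, the grid and the function name are assumptions made for illustration only.

```python
import math
from statistics import NormalDist


def map_p_value(pdf, grid, null_value=0.0):
    """Posterior density at the null value divided by the density at the MAP."""
    map_density = max(pdf(t) for t in grid)
    return pdf(null_value) / map_density


# Hypothetical posterior of the effect size: normal with mean -1 and sd 0.5.
posterior = NormalDist(mu=-1.0, sigma=0.5)
grid = [i / 1000 for i in range(-4000, 4001)]  # -4.000, -3.999, ..., 4.000

p_map = map_p_value(posterior.pdf, grid)
# For a normal posterior the ratio has the closed form exp(-mu^2 / (2 sigma^2)).
closed_form = math.exp(-(1.0 ** 2) / (2 * 0.5 ** 2))
print(p_map, closed_form)
```

The closed form is only available because the illustrative posterior is normal; for a posterior known through draws, the density would first have to be estimated.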
The e-value and the full Bayesian significance test (FBST)
The Full Bayesian Significance Test (FBST) was originally developed by Pereira and Stern [15] and created under the assumption that a significance test of a sharp hypothesis had to be conducted. A sharp hypothesis refers to any submanifold of the parameter space of interest, see [16], which includes for example point hypotheses like H_{0}:δ=0. Considering a standard parametric statistical model, where \(\theta \in \Theta \subseteq \mathbb {R}^{p}\) is a (vector) parameter of interest, p(x|θ) is the likelihood function associated with the observed data x, and p(θ) is the prior distribution of θ, the posterior distribution p(θ|x) is proportional to the product of the likelihood and prior density:
$$p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta)$$
A hypothesis H then makes the statement that the parameter θ lies in the corresponding null set Θ_{H}. Following [25] in notation, the Full Bayesian Significance Test (FBST) then defines two quantities: ev (H), which is the e-value supporting (or in favour of) the hypothesis H, and \(\overline {\text {ev}}(H)\), the e-value against H, also called the Bayesian evidence value against H, see Pereira and Stern [15]. First, the posterior surprise function s(θ) and its maximum s^{∗} restricted to the null set Θ_{H} are denoted as
$$s(\theta) := \frac{p(\theta \mid x)}{r(\theta)}, \qquad s^{*} := \sup_{\theta \in \Theta_{H}} s(\theta)$$
In the definition of the posterior surprise function s(θ), the denominator r(θ) is a reference density. If the improper flat prior r(θ)∝1 is used, the surprise function becomes the posterior distribution p(θ|x). Otherwise, a non-informative prior distribution can be used as a reference density, see Stern [25]. The next step towards the e-value is to define
$$T(\nu) := \{\theta \in \Theta \mid s(\theta) \leq \nu\}, \qquad \overline{T}(\nu) := \Theta \setminus T(\nu)$$
and \(\overline {T}(s^{*})\) is then called the tangential set to the hypothesis H, which contains the points of the parameter space with higher surprise (relative to the reference density r(θ)) than any point in the null set Θ_{H}. Integrating the posterior p(θ|x) over this set can be interpreted as the Bayesian evidence against H, the e-value \(\overline {\text {ev}}(H)\):
$$\overline{\text{ev}}(H) := \overline{W}(s^{*}), \qquad \text{where } \overline{W}(\nu) := \int_{\overline{T}(\nu)} p(\theta \mid x)\, d\theta$$
Of course, the e-value ev (H) supporting H is obtained as \(\text {ev}(H):=1-\overline {\text {ev}}(H)\). In the above, W(ν) is called the cumulative surprise function, and \(\overline {W}(\nu):=1-W(\nu)\). Therefore, large values of \(\overline {\text {ev}}(H)\) indicate that the hypothesis H traverses low-density regions (or equivalently, that the alternative hypothesis traverses high-density regions) so that the evidence against H is large. The theoretical properties of the FBST and the e-value(s) have been detailed in Pereira and Stern [16] and Stern [25]. Here, we focus on the behaviour of the e-value \(\overline {\text {ev}}(H)\) against H:δ=0 in the context of the Bayesian two-sample t-test. Note that one can use ev (H) to reject H if ev (H) is sufficiently small (or when \(\overline {\text {ev}}(H)\) is large), but not to confirm H, which may be seen as a drawback of the FBST. Note also that there exist asymptotic arguments using the distribution of ev (H), which make it possible to obtain critical values based on this distribution to reject a hypothesis H, similar to p-values in NHST. In the simulation study reported later, we do not make use of any asymptotic argument and solely report the e-value \(\overline {\text {ev}}(H)\) against H.
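The construction of the evidence against a point hypothesis can be sketched numerically. The example below assumes a one-dimensional parameter, a hypothetical normal posterior and a simple Riemann sum on an equally spaced grid. For a point null the supremum of the surprise function over the null set is just its value at the null point. All names are illustrative.

```python
import math
from statistics import NormalDist


def cauchy_pdf(x, scale=1.0):
    """Density of a Cauchy distribution centred at zero."""
    return 1.0 / (math.pi * scale * (1.0 + (x / scale) ** 2))


def e_value_against(post_pdf, grid, null_value=0.0, ref_pdf=None):
    """Posterior mass of the tangential set of the point hypothesis.

    ref_pdf=None corresponds to the flat reference density r = 1.
    """
    ref = ref_pdf if ref_pdf is not None else (lambda t: 1.0)

    def surprise(t):
        return post_pdf(t) / ref(t)

    s_star = surprise(null_value)   # supremum over a one-point null set
    step = grid[1] - grid[0]        # the grid is assumed equally spaced
    # Riemann sum of the posterior over all grid points with higher surprise.
    return sum(post_pdf(t) * step for t in grid if surprise(t) > s_star)


# Hypothetical posterior of delta: normal with mean -1 and sd 0.5.
posterior = NormalDist(mu=-1.0, sigma=0.5)
grid = [-6.0 + i * 0.001 for i in range(10001)]  # -6 to 4 in steps of 0.001

ev_bar_flat = e_value_against(posterior.pdf, grid)
ev_bar_cauchy = e_value_against(posterior.pdf, grid, ref_pdf=cauchy_pdf)
print("flat reference  :", ev_bar_flat, "support:", 1 - ev_bar_flat)
print("Cauchy reference:", ev_bar_cauchy)
```

With the flat reference and a symmetric posterior, the tangential set is the interval of points closer to the posterior mode than the null value, so the result can be checked against the normal distribution function.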
Additional remarks
Makowski et al. [13] also proposed the Bayes factor versus ROPE index, which does not compare the point null hypothesis H_{0}:δ=0 against an alternative H_{1}:δ≠0 as the standard BF does, but uses a null H_{0}:δ∈[−0.1,0.1] given by the ROPE and tests it against the alternative H_{1}:δ∉[−0.1,0.1], the complement of the ROPE. While this approach is highly similar to the traditional ROPE and indeed shows similar behaviour [13], it will not be used here. Also, the frequentist p-value is used as a reference index; it is the probability, under the null hypothesis, of obtaining a result equal to or more extreme than the one observed for the statistical model used, see Wasserstein & Lazar [3].
Figures 1 and 2 show the different posterior Bayesian indices for significance and size of an effect for a Bayesian two-sample t-test. Group one was simulated as \(\mathcal {N}(0.5,1)\) and group two as \(\mathcal {N}(2,1)\), each with n=10 samples, so that the true effect size is δ=−1.5. The FBST is visualised in Fig. 1, where the left plot shows a Cauchy prior C(0,1) (dashed line) and the resulting posterior p(δ|x) (solid black line), which is obtained by the Bayesian two-sample t-test of Rouder et al. [26]. s^{∗} is computed as s(0)=0.1103 (indicated by the blue point), and the integral \(W(s^{*})\) over the set \(T(s^{*})\) is shown as the red area under the posterior. This area is ev (H), which is 0.0418 in this case. The blue area corresponds to the integral \(\overline {W}(s^{*})\) over the set \(\overline {T}(s^{*})\), which consists of all parameter values δ attaining a posterior density p(δ|x) larger than p(0|x)=0.1103, indicated by the horizontal dashed blue line. The value of this integral is the evidence against H_{0}:δ=0, namely \(\overline {\text {ev}}(H)=0.9582\), which advises the researcher to reject H_{0}:δ=0 if a threshold of \(\overline {\text {ev}}(H)>0.95\) is used for making a decision in light of the obtained evidence. The right plot in Fig. 1 shows the same situation, but now the reference density r(δ) used in the surprise function has been changed from the improper flat prior r(δ)∝1 to the wide Cauchy prior C(0,1) actually used when conducting the Bayesian two-sample t-test of Rouder et al. [26]. Therefore, the surprise function values differ (see the scaling of the y-axis), and values of p(δ|x)/p(δ)>1 indicate that the posterior p(δ|x) assigns a larger probability to a given parameter value than the prior p(δ). This can be interpreted as the data having increased the probability of this parameter value.
The Bayes factor BF_{10} for H_{1}:δ≠0 against H_{0}:δ=0 is shown in the upper left plot of Fig. 2 and can be interpreted as the ratio of the prior density at the point-null value δ_{0}=0, visualised as the grey lollipop, and the posterior density at the point-null value δ_{0}=0, visualised as the red lollipop. After observing the data, H_{0} becomes less probable, which is reflected in the Bayes factor of BF_{10}=3.38. This magnitude indicates only moderate evidence for H_{1}, which is due to the small sample size of n=10. Note that the Bayes factor BF_{01} can be obtained by inverting the ratio.
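This density-ratio reading of the Bayes factor can be written down in a few lines. The sketch below assumes a Cauchy prior C(0,1) on δ and uses a made-up posterior density at δ=0, so it does not reproduce the value shown in Fig. 2. The function names are hypothetical.

```python
import math


def cauchy_pdf(x, scale=1.0):
    """Density of a Cauchy distribution centred at zero."""
    return 1.0 / (math.pi * scale * (1.0 + (x / scale) ** 2))


def density_ratio_bf10(prior_at_null, posterior_at_null):
    """BF_10 as prior density over posterior density at the point-null value."""
    return prior_at_null / posterior_at_null


prior_at_zero = cauchy_pdf(0.0, scale=1.0)   # height of the prior at delta = 0
posterior_at_zero = 0.08                     # hypothetical posterior height at delta = 0
bf10 = density_ratio_bf10(prior_at_zero, posterior_at_zero)
print("BF_10:", bf10, "BF_01:", 1 / bf10)
```

If the data pull posterior mass away from δ=0, the posterior height at zero drops below the prior height and BF_{10} exceeds one; if the posterior piles up at zero, the ratio falls below one and favours the null.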
The MAP-based p-value is shown in the upper right plot and is defined as the ratio of the height of the posterior density at the null value δ_{0}=0 and at the MAP value δ_{MAP}, the maximum a posteriori parameter. As can be seen, the MAP estimate is near δ=−1, indicating a clear shift away from the null hypothesis. Still, the MAP-based p-value is given as p_{MAP}=0.203, which is not significant.
The lower left plot visualises the 95% and full ROPE, where the ROPE is defined as [−0.1,0.1], following the recommendations of Kruschke [27]. When using the 95% ROPE, 2.38% of the probability mass of the posterior distribution is located inside the ROPE, and when using the full ROPE, 3.00% is located inside. In a test of practical equivalence, where the null is only rejected if the posterior is located entirely outside the ROPE, the null hypothesis H_{0} cannot be rejected based on the ROPE. Still, if an estimation-oriented perspective is used, avoiding the classical testing stance, the ROPE analysis shows evidence for the alternative H_{1} for both the 95% and full ROPE.
The lower right plot in Fig. 2 shows the probability of direction (PD). It enjoys some desirable properties: First, it clearly shows that the effect is more likely to be of negative than positive sign, as 97.70% of the posterior is located on the negative real numbers. Also, the PD embraces estimation under uncertainty instead of hypothesis testing, in the same way as the ROPE does when avoiding an explicit testing stance. The posterior distribution can then be used in a second step to obtain, for example, the mean and standard deviation as estimates for the parameter. Still, hypothesis testing is also possible via rejecting the null H_{0}:δ≥0 if at least 95% of the posterior of δ is located on the negative real axis.
Methods
A simulation study was performed to analyse the behaviour of the different measures in the setting of Welch’s two-sample t-test. Pairs of data were simulated, consisting of two samples, one for each group, each normally distributed. Four settings were selected: In the first, no effect was present, and both groups were identically distributed as standard normal \(\mathcal {N}(0,1)\). In the second, a small effect was present, and the first group was simulated as \(\mathcal {N}(2.89,1.84)\) and the second as \(\mathcal {N}(3.5,1.56)\), resulting in a true effect size of
$$\delta = \frac{2.89-3.5}{\sqrt{(1.84^{2}+1.56^{2})/2}} \approx -0.36$$
In the third simulation setting, a medium effect was present. The first group was simulated as \(\mathcal {N}(254.08,2.36)\) and the second as \(\mathcal {N}(255.84,3.04)\), resulting in a true effect size of
$$\delta = \frac{254.08-255.84}{\sqrt{(2.36^{2}+3.04^{2})/2}} \approx -0.65$$
The last setting used \(\mathcal {N}(15.01,3.4)\) and \(\mathcal {N}(19.91,5.8)\) distributions for the first and second group, yielding a true effect size of
$$\delta = \frac{15.01-19.91}{\sqrt{(3.4^{2}+5.8^{2})/2}} \approx -1.03$$
For each of the four effect size settings, 10,000 datasets following the corresponding group distributions as detailed above were simulated. This procedure was repeated for different sample sizes n, ranging from n=10 to n=100 in steps of 10, to investigate the influence of sample size on the indices. In each case, the traditional p-value, the Bayes factor BF_{10}, the 95% ROPE, the full ROPE, the probability of direction, the MAP-based p-value and the e-value \(\overline {\text {ev}}(H_{0})\), that is, the evidence against H_{0}:δ=0, were computed. The Bayes factor was calculated as the Jeffreys-Zellner-Siow Bayes factor for the null hypothesis H_{0}:δ=0 of no effect against the alternative H_{1}:δ≠0, see Rouder et al. [26] and Gronau et al. [28]. More precisely, the calculated quantities are (1) the Bayes factor, a single number that quantifies the evidence for the presence or absence of an effect and (2) the posterior distribution, which quantifies the uncertainty about the size of the effect under the assumption H_{1}:δ≠0 that it exists. This posterior distribution (2) of the effect size δ was then used to compute the 95% ROPE, the full ROPE, the PD and the MAP-based p-value as well as the e-value \(\overline {\text {ev}}(H_{0})\). The traditional p-value was obtained via a two-sample Welch’s t-test.
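The frequentist reference index for a single simulated dataset can be sketched without external libraries. The example below computes Welch’s statistic with the Welch-Satterthwaite degrees of freedom and obtains the two-sided p-value by integrating the t density with Simpson’s rule. The function names, the seed and the reading of the second parameter of each normal distribution as a standard deviation are assumptions of this sketch, not taken from the replication script.

```python
import math
import random
import statistics


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    se2 = vx / nx + vy / ny
    t = (statistics.fmean(x) - statistics.fmean(y)) / math.sqrt(se2)
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df


def t_pdf(u, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|]; steps must be even."""
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 1.0 - 2.0 * central_half)


random.seed(42)
n = 30
# One dataset from the small-effect setting (second parameter read as an SD).
group1 = [random.gauss(2.89, 1.84) for _ in range(n)]
group2 = [random.gauss(3.5, 1.56) for _ in range(n)]
t_stat, df = welch_t(group1, group2)
print("t =", t_stat, "df =", df, "p =", two_sided_p(t_stat, df))
```

Repeating the last block 10,000 times per setting and sample size, and recording the p-value each time, mirrors the structure of the simulation for the frequentist index.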
The above procedure was conducted three times with the prior on the effect size δ set to three different hyperparameters to investigate the influence of the prior modelling: A non-informative Jeffreys prior was always put on the standard deviation of the normal population, while a Cauchy prior was placed on the standardised effect size. The Cauchy prior \(C(0,\sqrt {2}/2)\) was used in the first setting, C(0,1) in the second and \(C(0,\sqrt {2})\) in the third, corresponding to a medium, wide and ultra-wide prior on the effect size δ. This way, the influence of the prior modelling on the resulting indices can be measured. To gain more insight into the e-value \(\overline {\text {ev}}(H_{0})\), for each prior setting \(\overline {\text {ev}}(H_{0})\) was computed once using a flat improper reference density r(δ)∝1 (that is, the surprise function equals the posterior distribution), and once using the Cauchy prior assigned to δ as a reference density in the surprise function s(δ).
Finally, the above procedure was repeated for the fixed sample size n=30 to investigate the influence of noise. In each group, n=30 samples were simulated to control for the influence of sample size, and Gaussian noise \(\mathcal {N}(0,\varepsilon)\) was added to the group data x and y, where ε ranged from 0.5 to 5 in steps of 0.5.
The percentage of significant results was computed for samples of increasing size n as the number of significant results divided by 10,000. Under the null setting, this number is an estimate of the type I error probability of each index, a quantity crucial for reproducible research [29]. Significance is defined here as follows: A Bayes factor is significant when BF_{10}≥3. A posterior distribution using the 95% ROPE or full ROPE is significant when it is located completely outside the corresponding ROPE [−0.1,0.1] around δ=0. The MAP-based p-value is significant when p_{MAP}<0.05. The p-value is significant when p<0.05. The PD is significant when PD=1 or PD=0, and the e-value is significant when \(\overline {\text {ev}}(H)>0.95\) (no matter whether a flat reference density or the Cauchy reference density was used).
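These decision rules can be collected in one place. The sketch below encodes the thresholds listed above and applies them to the index values reported for the example of Figs. 1 and 2, with the ROPE entries expressed as the share of the posterior inside the ROPE. The dictionary keys and function names are hypothetical.

```python
def is_significant(index, value):
    """Apply the significance rule of the simulation study to one index value."""
    rules = {
        "p": lambda v: v < 0.05,
        "p_map": lambda v: v < 0.05,
        "bf10": lambda v: v >= 3,
        # ROPE values are the share of the posterior inside [-0.1, 0.1].
        "rope95": lambda v: v == 0.0,
        "rope_full": lambda v: v == 0.0,
        "pd": lambda v: v in (0.0, 1.0),
        "e_value": lambda v: v > 0.95,
    }
    return rules[index](value)


def rejection_rate(flags):
    """Share of simulated datasets for which an index was significant."""
    return sum(flags) / len(flags)


# Index values reported for the example of Figs. 1 and 2.
example = {
    "bf10": 3.38,
    "p_map": 0.203,
    "rope95": 0.0238,
    "rope_full": 0.03,
    "pd": 0.977,
    "e_value": 0.9582,
}
decisions = {name: is_significant(name, value) for name, value in example.items()}
print(decisions)
```

Applied to datasets simulated under the null setting, `rejection_rate` gives the estimated type I error rate of an index; under the three effect settings it gives the share of datasets in which the effect was detected.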
The statistical programming language R [30] was used for the simulations. The Bayes factor was computed via Gaussian quadrature in the R package BayesFactor [31], which was also used to obtain the posterior distribution of δ under the alternative H_{1} of an existing effect. The package bayestestR [32] was used to compute the 95% ROPE, full ROPE, PD and MAP-based p-value. The evidence \(\overline {\text {ev}}\) against H_{0}:δ=0 in the FBST was computed from the Markov chain Monte Carlo draws of the posterior distribution of δ provided by the BayesFactor package [31]. These posterior draws were interpolated to construct a posterior density of δ, which was then integrated numerically over the tangential set of H_{0} as required for \(\overline {\text {ev}}(H_{0})\). For more details, including the random number generator seed, a commented replication script, which can reproduce all results and figures, is provided at the Open Science Framework under https://osf.io/fbz4s/.
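The step from posterior draws to \(\overline {\text {ev}}(H_{0})\) can be illustrated in outline. The sketch below is not the replication script: it assumes a Gaussian kernel density estimate with a rule-of-thumb bandwidth in place of the interpolation described above, a flat reference density, and simulated normal draws standing in for the Markov chain Monte Carlo output. All names are hypothetical.

```python
import math
import random
import statistics


def kde(draws):
    """Gaussian kernel density estimate with a rule-of-thumb bandwidth."""
    n = len(draws)
    h = 1.06 * statistics.stdev(draws) * n ** (-1 / 5)
    norm = 1.0 / (n * h * math.sqrt(2 * math.pi))

    def density(t):
        return norm * sum(math.exp(-0.5 * ((t - d) / h) ** 2) for d in draws)

    return density


def e_value_from_draws(draws, null_value=0.0, grid_points=301):
    """Evidence against a point null under a flat reference density."""
    density = kde(draws)
    # The grid spans the draws and is widened to include the null value.
    lo = min(min(draws), null_value)
    hi = max(max(draws), null_value)
    step = (hi - lo) / (grid_points - 1)
    values = [density(lo + i * step) for i in range(grid_points)]
    threshold = density(null_value)  # estimated posterior density at the null
    # Share of the gridded posterior mass with density above the threshold.
    return sum(v for v in values if v > threshold) / sum(values)


random.seed(2020)
# Stand-in for MCMC draws of delta (hypothetical posterior: mean -1, sd 0.5).
draws = [random.gauss(-1.0, 0.5) for _ in range(1000)]
ev_bar = e_value_from_draws(draws)
print("evidence against H0:", ev_bar, "support for H0:", 1 - ev_bar)
```

Because the density at the null value is estimated from few draws when the null lies in the tail of the posterior, the result is sensitive to the smoothing method and the number of draws.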
Results
Influence of sample size and prior modelling
Figure 3 shows the dependence of the Bayesian indices on sample size for four different effect sizes using the ultra-wide prior \(C(0,\sqrt {2})\). The four plots in each row show the succession of the results for no effect, a small effect, a medium effect and finally a large effect, while the x-axis shows increasing sample size from n=10 to n=100 in each group in steps of 10.
The left plot of the first row shows that the p-value is distributed uniformly under the null hypothesis H_{0}:δ=0. If the alternative H_{1}:δ≠0 is true, the three plots to the right show that for increasing sample size n, the p-value becomes significant, where the necessary sample size for stating significance decreases with increasing actual effect size δ.
The second row shows the succession for the Bayes factor BF_{10}. The left plot shows that under the null hypothesis H_{0}:δ=0 the Bayes factor correctly converges to zero (in contrast to the p-value). This property opens the possibility of confirming the null hypothesis, which is not possible via an ordinary p-value. The three plots to the right show the progression of the Bayes factor BF_{10} for increasing effect size. Here, the Bayes factor accumulates more and more evidence for the alternative H_{1}:δ≠0 for small, medium and large effect sizes. For more substantial effect sizes, the Bayes factor requires a much smaller sample size to state evidence for the alternative. The plots are limited to a y-range of [0,100] (except for the first plot) for better visibility, as BF_{10} becomes very large quickly.
The third and fourth rows show the results for the 95% and full ROPE [−0.1,0.1] around the effect size δ=0. Under the null, in both cases, the percentage of the posterior’s probability mass inside the ROPE increases. As δ=0 under the null, for n→∞, the posterior will eventually concentrate completely inside the ROPE, but the necessary sample size can be considerable. From the figure, it becomes clear that for n=100, about 50% of the probability mass of the posterior is located inside the ROPE [−0.1,0.1] around δ=0. For increasing sample size n, this percentage will finally become 100%. Considering the 95% and full ROPE, even for small sample sizes like n=10 the majority of values shows that at least 10% of the posterior is located inside the ROPE, so that hardly any false-positive statements are produced.
Under the alternative H_{1}:δ≠0, both the 95% and full ROPE show that the percentage of the posterior located inside the ROPE [−0.1,0.1] of no effect converges to zero for increasing sample size n. For increasing effect size δ, the necessary sample size n needed to reject the null hypothesis H_{0} (based on an equivalence test or an estimation under uncertainty perspective as detailed by Kruschke [19]) becomes smaller.
The fifth row shows the results for the probability of direction (PD). Under the null hypothesis H_{0}:δ=0, the PD is not uniformly distributed as was the case for p-values. The PD concentrates at about 70% here (see the scaling of the y-axis), which does not reflect the true effect size of δ=0, which should yield a PD near 50%. Still, under the alternative H_{1}:δ≠0, the PD converges to 100% as sample sizes grow. The speed of convergence is faster for larger effect sizes δ≠0.
The MAP-based p-value shown in the sixth row behaves similarly to the classic p-value. One difference is that under the null hypothesis H_{0}, it is much larger on average than the traditional p-value. Still, this behaviour is robust to increasing sample size n, and a correct interpretation of the MAP-based p-value only allows stating significance when p_{MAP} is smaller than a significance threshold. Interpreting large p_{MAP} as evidence for H_{0} is not allowed at all. Under the alternative H_{1}, the behaviour is quite similar to the classic p-value: For increasing sample size n, the MAP-based p-value becomes significant, where the necessary sample size n for stating significance decreases with increasing effect size δ.
The evidence \(\overline {\text {ev}}(H_{0})\) (in the following denoted as \(\overline {\text {ev}}\)) under the flat improper reference density r(δ)∝1 is shown in the seventh row and concentrates around \(\overline {\text {ev}}=0.5\) under the null hypothesis H_{0}:δ=0. The reason for this can be seen in the fact that the posterior of δ concentrates for n→∞ around δ=0 if H_{0}:δ=0 is true, and the posterior density p(δ|x) also concentrates around δ=0 with slight fluctuations happening due to the randomness in simulation. The only thing that changes when increasing sample size n is thus the scaling of the x-axis of the posterior p(δ|x), so that \(\overline {\text {ev}}\) is not influenced at all by increasing sample size. The support for H_{0} can easily be obtained by calculating ev\((H_{0})=1-\overline {\text {ev}}(H_{0})\), which in this case also concentrates around 0.5, instead of concentrating around 1. If on the other hand H_{1}:δ≠0 is true, \(\overline {\text {ev}}\) quickly signals evidence against H_{0} for increasing sample size n and increasing effect size δ, as shown by the three right-hand plots in the seventh row. When using the medium Cauchy prior \(C(0,\sqrt {2}/2)\) instead of the improper reference density r(δ)∝1, the situation is similar, but the plots in the last row in Fig. 5 show that the evidence \(\overline {\text {ev}}\) against H_{0} then accumulates faster if H_{1} is true.
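The scaling argument can be illustrated with a small sketch under simplifying assumptions that are not part of the study: the posterior is treated as normal with mean m and standard deviation s, and the reference density is flat. The tangential set, which collects all values of δ with higher posterior density than the null value, is then the interval around m that reaches out to δ=0, so that \(\overline {\text {ev}}=2\Phi (|m|/s)-1\). If m fluctuates around zero with the same standard deviation s, the ratio m/s is standard normal for every n. The names `ev_bar_flat` and `simulate_null` are hypothetical.

```python
import random
from statistics import NormalDist, mean

def ev_bar_flat(post_mean, post_sd, null_value=0.0):
    """Posterior mass of the tangential set for a normal posterior, flat reference."""
    z = abs(post_mean - null_value) / post_sd
    return 2 * NormalDist().cdf(z) - 1

def simulate_null(n, reps=5000, seed=0):
    """Evidence against H0 across replications when no effect is present."""
    rng = random.Random(seed)
    sd = (2 / n) ** 0.5  # assumed posterior sd for n observations per group
    return [ev_bar_flat(rng.gauss(0.0, sd), sd) for _ in range(reps)]

avg_small = mean(simulate_null(10))    # close to 0.5
avg_large = mean(simulate_null(1000))  # also close to 0.5
```

In this simplified setting the evidence against H_{0} averages about 0.5 for n=10 and n=1000 alike, which mirrors the observation that increasing the sample size only rescales the posterior.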
Figure 4 shows the results of the simulation when using a wide prior C(0,1) instead of the ultrawide prior \(C(0,\sqrt {2})\). The classic p-value is of course not affected at all by this prior change. The BF_{10} shown in the second row is slightly larger under the alternative H_{1}:δ≠0, as the wide prior C(0,1) becomes more informative compared to the ultrawide prior \(C(0,\sqrt {2})\). The probability mass located around δ=0 becomes more concentrated when using the wide C(0,1) prior instead of the ultrawide \(C(0,\sqrt {2})\) prior, and therefore BF_{10} is increased (compare the boxplots in Figs. 3 and 4).
For the same reasons, the percentage of probability mass inside the 95% and full ROPE increases under the null H_{0}:δ=0, as shown by the third and fourth row in Fig. 4. More prior mass around δ=0 due to the narrower C(0,1) prior on δ leads to more posterior mass inside the ROPE [−0.1,0.1] around δ=0. Under the alternative H_{1}, the 95% and full ROPE suffer from this change, as shown in the boxplots for small, medium and large effects in rows three and four, which are shifted up slightly. The increase of probability mass near δ=0 draws the posterior towards δ=0, and it becomes harder for the posterior to concentrate outside of the ROPE. Nevertheless, for increasing sample size, the ROPEs finally reveal evidence for the alternative H_{1}. Note that due to the concentration of probability mass around zero when using the C(0,1) prior, the boxplots of the ROPEs are shifted slightly up under the null hypothesis of no effect.
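How much prior mass each Cauchy prior places near δ=0 can be checked directly from the Cauchy distribution function. The sketch below is an illustration only; the helper names `cauchy_rope_mass` and `cauchy_density_at_zero` are hypothetical.

```python
import math

def cauchy_rope_mass(scale, low=-0.1, high=0.1):
    """Prior probability that a C(0, scale) variable falls inside [low, high]."""
    def cdf(x):
        return 0.5 + math.atan(x / scale) / math.pi
    return cdf(high) - cdf(low)

def cauchy_density_at_zero(scale):
    """Value of the C(0, scale) density at the point-null value 0."""
    return 1 / (math.pi * scale)

scales = {"medium": math.sqrt(2) / 2, "wide": 1.0, "ultrawide": math.sqrt(2)}
rope_mass = {name: cauchy_rope_mass(s) for name, s in scales.items()}
density_at_zero = {name: cauchy_density_at_zero(s) for name, s in scales.items()}
```

The prior mass inside the ROPE [−0.1,0.1] is about 8.9% for the medium, 6.3% for the wide and 4.5% for the ultrawide prior, and the density at zero is about 0.45, 0.32 and 0.23, respectively, so narrower priors start with more mass near the null value.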
The same holds for the PD, which also needs a larger sample size now to achieve the same evidence for the alternative when an effect is present. No matter whether a small, medium or large effect size is present, all boxplots shift down slightly, indicating that less probability mass is strictly positive in the posteriors produced. The narrower prior distribution seems to shrink the complete posterior distribution towards smaller values, leading in turn to a smaller PD.
The MAPbased pvalue is also influenced by the narrower prior: Due to the increased probability mass near δ=0, the MAPestimate of δ shrinks towards δ=0. In combination with the larger value of the prior C(0,1) at the pointnull value δ_{0}=0 compared to the pointnull value of the ultrawide prior \(C(0,\sqrt {2})\), the ratio calculated for the MAPbased pvalue decreases, leading to larger MAPbased pvalues and slightly upshifted boxplots under the alternative H_{1}.
The last two rows show \(\overline {\text {ev}}\). Under the improper reference density r(δ)∝1, barely any change can be observed compared to the setting using the ultrawide prior \(C(0,\sqrt {2})\), which is confirmed in the seventh row. Under the wide Cauchy prior reference density r(δ)=C(0,1), the evidence against H_{0}:δ=0 again concentrates around \(\overline {\text {ev}}=0.5\), indicating neither strong evidence against H_{0} nor support for H_{0}. Compared to the ultrawide prior used in Fig. 3, under the alternative H_{1}:δ≠0 the evidence \(\overline {\text {ev}}\) against H_{0}:δ=0 also barely changes. These results show that the e-value is quite robust against variations in the prior modelling.
Figure 5 shows the results when using a medium prior instead of a wide one. The classic p-value is again not affected by this prior change, so the results are identical. In contrast to Figs. 3 and 4, the Bayes factor now accumulates evidence even faster, because the medium prior is even more informative than the wide and ultrawide ones.
The 95% and full ROPE boxplots are therefore shifted up even higher under H_{0}, showing that switching from the noninformative ultrawide and weakly informative wide prior to the medium prior yields larger percentages of the posterior distribution’s probability mass inside the ROPE under the null hypothesis H_{0}, as even more probability mass now concentrates around δ_{0}=0. From a Bayesian perspective, the null hypothesis is thus confirmed faster. Under the alternative H_{1}:δ≠0, the medium prior now makes it even harder for the 95% and full ROPE to reject the null hypothesis. This is again due to the fact that the medium prior \(C(0,\sqrt {2}/2)\) allocates more probability mass to values near δ_{0}=0 than the ultrawide \(C(0,\sqrt {2})\) or wide Cauchy prior C(0,1). Therefore, the posterior shifts more slowly away from the ROPE [−0.1,0.1] of no effect, and for the same sample size n, the posterior mass located inside the ROPE is larger when using the medium prior on δ. Still, for increasing sample size, this effect vanishes and even under the medium prior distribution, the posterior mass inside the ROPE converges to zero.
The same phenomenon holds for the PD and the MAPbased pvalue. Here too, under the alternative the narrower prior on δ around zero makes it harder for the PD and MAPbased pvalue to accumulate evidence for the alternative H_{1}. For increasing sample size n, both the PD and the MAPbased pvalue still finally reject the null hypothesis. For a fixed sample size n, the same is achieved faster under the ultrawide and wide prior, which have less prior probability mass near δ_{0}=0.
Considering \(\overline {\text {ev}}\) in the last two rows, under the improper reference density r(δ)∝1 again barely any changes can be observed compared to the setting using the ultrawide \(C(0,\sqrt {2})\) or wide C(0,1) prior, which is confirmed in the seventh row of Fig. 5. Under the medium Cauchy prior reference density \(r(\delta)=C(0,\sqrt {2}/2)\), the evidence against H_{0}:δ=0 again concentrates around \(\overline {\text {ev}}=0.5\), indicating neither strong evidence against H_{0} nor support for H_{0}. Compared to the ultrawide and wide priors used in Figs. 3 and 4, under the alternative H_{1}:δ≠0 the evidence \(\overline {\text {ev}}\) against H_{0}:δ=0 again is barely influenced by shifting to the medium Cauchy prior, showing strong robustness of the evalue against the prior modelling.
At this point, the results show that the MAP-based p-value, the classic p-value and the e-value \(\overline {\text {ev}}\) can state evidence for the alternative but cannot state evidence for the null hypothesis. These measures can only reject the null hypothesis H_{0} and offer no possibility to confirm it. For practical research, this is limiting. Also, the PD stabilises at about 75%, which is the middle of its possible extremes, 50% and 100%. It would be desirable that the PD converges to 50% under the null H_{0}:δ=0, to show that both a positive and a negative effect are equally possible. Given the behaviour of the PD under the null, it seems that the PD favours the directed alternative δ>0 although the null H_{0}:δ=0 is true. Under the alternative, H_{1}:δ≠0, the PD as well as the p-value and MAP-based p-value behave as expected. Note that Pereira and Stern [15] created the e-value to test a sharp hypothesis H_{0}, and rejection of H_{0} was the intended goal of the procedure. In contrast to the p-value and MAP-based p-value, the e-value enjoys a multitude of highly desirable properties like compliance with the likelihood principle, being a probability value derived from the posterior distribution, and possessing a version which is invariant to alternative parameterisations, see also [16]. Therefore, the e-value is preferable over the standard p-value and MAP-based p-value, also because of its robustness to the prior selection.
The Bayes factor BF_{10} and the 95% and full ROPE have two desirable properties: Under the null, all three measures indicate evidence for H_{0}:δ=0, while under the alternative H_{1}:δ≠0, they indicate evidence for H_{1}. It is somewhat problematic, though not surprising, that both the BF and the ROPEs accumulate evidence faster under the null H_{0} when using a medium prior than when using a wide or ultrawide prior. Under the alternative, evidence for H_{1} accumulates faster when using a wide or ultrawide prior instead of a medium one. Thus, when using a medium prior, finding evidence for H_{0} is easier than finding evidence for H_{1} both with the BF and the ROPEs. Using a wide or ultrawide prior, finding evidence for H_{1} is easier than finding evidence for H_{0} with the BF and the ROPEs. Therefore, we recommend using the wide prior C(0,1), which places itself in the middle between these two extremes. Using a medium or ultrawide prior needs further justification, because otherwise, some kind of cherry-picking could happen by combining Bayes factors or ROPEs with a medium, wide or ultrawide prior depending on the goal of rejection or confirmation of the null hypothesis. Note that the e-value showed strong robustness to the prior selection. Therefore, if the rejection of a research hypothesis is the formulated goal of the scientific enterprise, the e-value based on the FBST procedure with the corresponding Cauchy prior as reference density in the surprise function may prevent such cherry-picking.
The takeaway message regarding the prior modelling here is that the combination of prior and significance and effect size measure together can make it easier to find evidence for some hypotheses, which is problematic. Also, taking into account that the focus of research is to reveal relevant differences (clinically, in biomedical research for example), it is recommended to use at least n=100 patients in each group to ensure that also small effects can be detected reliably.
Influence of noise
Figure 6 shows the results for the influence of noise on Bayesian indices of significance and effect size. As expected and shown in the first row, the influence of noise on the classic pvalue under the null H_{0} is negligible. Under the alternative, the pvalue gets disturbed more and more with increasing noise ε. The number of significant pvalues reduces for increasing noise as shown by the boxplots, which are shifted upwards more and more when noise ε increases.
The BF_{10} has the same problems: When the null hypothesis H_{0}:δ=0 is true, the Bayes factor is not influenced much by noise. When on the other hand H_{1}:δ≠0 is true, adding noise to the observations makes it more difficult for the Bayes factor to state evidence for the alternative H_{1}:δ≠0. This behaviour is also revealed when comparing Figs. 3 and 6: The boxplots in the fourth plot of the second row in Fig. 3 show that the Bayes factor achieves higher values compared to the situation where noise is present, as shown in the fourth plot of the second row in Fig. 6.
The 95% ROPE and full ROPE also suffer from increasing noise. Under the null hypothesis, the noise does not influence the percentage of posterior mass inside the ROPE, but under the alternative H_{1} increasing noise ε causes increasing amounts of posterior mass to be located inside the ROPE. This behaviour makes it harder for the ROPE to signal evidence for the alternative H_{1}:δ≠0.
The PD suffers from the same problem, as increasing noise causes the posterior to be more and more symmetric around δ_{0}=0, indicated by the boxplots successively shifted down for increasing noise under H_{1}.
The MAPbased pvalue is also not influenced by noise under the null hypothesis H_{0}, but the boxplots are shifted up under the alternative, indicating that increasing noise leads to larger pvalues and less significant ones, which makes it harder for the MAPbased pvalue to reject the null hypothesis in the presence of noise.
The evalue \(\overline {\text {ev}}\) is also barely influenced by noise under the null hypothesis H_{0} both when used in combination with the flat reference density r(δ)∝1 and the wide Cauchy reference density r(δ)=C(0,1). Under the alternative, increasing noise makes it harder for \(\overline {\text {ev}}\) to state evidence against H_{0} as shown in the last two rows of Fig. 6.
Sensitivity and type I error rates
Table 1 shows Monte Carlo estimates for the type I error rates and the percentage of significant indices based on the results of the previous simulations. For increasing sample size n, the type I error rates were estimated as the number of significant indices divided by 10,000 when no effect was present.
In the cases where a small, medium or large effect was present, the percentage shows the number of significant measures divided by 10,000. Significant was defined as follows here: p<.05 for pvalues, BF_{10}≥3 for the Bayes factor, which equals moderate evidence according to Van Doorn et al. [23], a posterior which is located completely outside the 95% or full ROPE, and for the PD 100% of the posterior’s mass needed to be strictly positive or negative. The evalue \(\overline {\text {ev}}\) against H_{0}:δ=0 was required to be larger than 0.95, both when used with the improper reference density r(δ)∝1 and the wide Cauchy prior r(δ)=C(0,1) in the surprise function.
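The decision rules listed above translate into a simple classification step. The following sketch shows how the share of significant indices could be computed from simulated index values; the names `is_significant` and `significant_share` and the example values are hypothetical, and applying the p<.05 threshold to the MAP-based p-value as well is an assumption.

```python
def is_significant(index, value):
    """Apply the significance definition of the given index to one value."""
    rules = {
        "p": lambda v: v < 0.05,       # classic p-value
        "p_map": lambda v: v < 0.05,   # MAP-based p-value (assumed same threshold)
        "bf10": lambda v: v >= 3,      # moderate evidence for H1
        "rope": lambda v: v == 0.0,    # share inside the 95% or full ROPE
        "pd": lambda v: v == 1.0,      # all posterior mass on one side of zero
        "ev_bar": lambda v: v > 0.95,  # evidence against H0
    }
    return rules[index](value)

def significant_share(index, values):
    """Number of significant index values divided by the number of simulations."""
    return sum(is_significant(index, v) for v in values) / len(values)

# Hypothetical index values from four simulated data sets.
share_p = significant_share("p", [0.01, 0.20, 0.04, 0.50])
share_bf = significant_share("bf10", [0.4, 3.0, 12.5, 1.1])
share_rope = significant_share("rope", [0.0, 0.12, 0.0, 0.31])
```

When no effect is present, this share estimates the type I error rate; when an effect is present, it estimates the sensitivity of the index.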
Figure 7 visualises the results: The left plot corresponds to the table row of no effect and shows the type I error rates of the indices. As shown in the figure, the classic p-value fluctuates around its nominal significance level of α=.05, although there is no effect present. In contrast, most Bayesian indices have lower type I error rates, about half the size of that of the classic p-value. A comparison of the Bayesian posterior indices reveals three groups: The first group consists of the Bayes factor BF_{10}, the 95% ROPE and the MAP-based p-value. These indices concentrate around a false-positive rate of about 1% for increasing sample size. Still, the Bayes factor and ROPE make more type I errors for small sample sizes, while the MAP-based p-value makes more for large sample sizes. The second group consists of the PD and the full ROPE, both of which make practically no type I error independent of the sample size n. This fact can be attributed to the quite conservative behaviour of both indices compared to the indices in group one. The third group consists of the e-value with improper or wide Cauchy prior, which achieves type I error rates slightly smaller than the traditional p-value, but larger than the other Bayesian indices.
The second plot corresponds to the small effect part of Table 1. Now the desired behaviour is that the indices detect the existing effect for the smallest possible sample size n. The classic p-value has the most liberal behaviour in stating that an effect is present, which reflects the often criticised fact that p-values overstate the significance of an effect compared to other indices of effect size and significance, see Wasserstein and Lazar [3]. The Bayesian indices signal evidence for the alternative more slowly than their frequentist counterparts, and again the three groups already discovered in the first plot reveal themselves here: The BF_{10}, the 95% ROPE and the MAP-based p-value detect the small effect more often than the indices of the second group, which again includes the full ROPE and the PD. The third group consisting of the two versions of the e-value shows similar behaviour as the p-value: They signal the existence of an effect more quickly than their Bayesian competitors, which comes at the cost of increased type I errors as shown in the left plot previously.
The third and fourth plots correspond to the medium and large effect parts of Table 1 and confirm the previous analysis. The p-value and e-value(s) state significance more often than every other index, but BF_{10}, the 95% ROPE and the MAP-based p-value now behave similarly for increasing effect size δ. Also, from the succession of the PD and full ROPE, it becomes clear that the PD more often states the presence of an effect in contrast to the full ROPE, which is more conservative, even for increasing effect size. Still, for increasing sample size, these “slow” indices eventually state the presence of the effect, too. Interestingly, the MAP-based p-value has a similar behaviour for large effect sizes as the full ROPE and PD, as shown in the right plot of Fig. 7. The behaviour of the e-value again shows substantial similarity to the behaviour of the p-value under the medium and large effect setting.
Discussion
This paper studied the behaviour of common Bayesian significance and effect size indices for the setting of twosample Welch’s ttest, which is often applied in the analysis of clinical trial data. To guide researchers in choosing an appropriate index when the Bayesian counterpart to Welch’s twosample ttest as proposed by Rouder et al. [26] is used instead, an extensive simulation study analysed the influence of sample size n, the prior modelling and noise ε. Also, the type I error rates and sensitivities to detect an existing effect were studied.
The results show that one can split Bayesian significance and effect size indices into two categories: Indices which can state evidence for the null hypothesis H_{0}:δ=0 and the alternative H_{1}:δ≠0, and indices which can only state evidence for the alternative. The first group consists of the Bayes factor, the 95% and full ROPE. The MAP-based p-value, the PD and the e-value belong to the second group, the MAP-based p-value and the e-value showing a similar behaviour as the classic p-value. Note that formally the e-value belongs to the first group, but the simulation results showed that the e-value does not state evidence for the null hypothesis H_{0} even when H_{0} is true. On the other hand, the e-value showed the best performance compared to all other indices when H_{1} was true, and based on its other properties (for a review see Pereira, Stern and Wechsler [16]) it is preferable over the MAP-based p-value, PD and classic p-value. The PD suffers from the fact that under H_{0} it stabilizes at about 0.7, which is unintuitive and has to be interpreted as a tendency to favour evidence for the alternative when in fact the null hypothesis H_{0} is true, see Figs. 3, 4 and 5. Thus, when rejection of a null hypothesis is the goal, we recommend using the FBST and reporting the e-value based on the corresponding Cauchy prior as reference density in the surprise function. Also, the e-value follows the likelihood principle and is robust against the prior modelling, which avoids cherry-picking.
If the goal of the scientific enterprise is to confirm a research hypothesis, based on the results, the Bayes factor, the 95% ROPE or the full ROPE should be considered. All three indices show similar behaviour regarding increasing sample size n, and state both evidence for H_{0} and H_{1} depending on the presence of an effect.
The prior modelling showed that both the ultrawide and medium prior on δ could lead to cherrypicking by combining a selected index like a ROPE or BF with the prior: For example, choosing a medium prior when the goal is to confirm H_{0}, evidence for H_{0} accumulates faster than when using a wide or ultrawide prior. If the goal is to find evidence for the alternative, evidence for H_{1} accumulates faster when using a wide or ultrawide prior instead of a medium one.
Therefore, we recommend using the wide prior C(0,1) when the goal is to confirm a hypothesis, as this choice places itself in the middle between the two other extremes and prevents cherrypicking in the case where no prior information is available.
The analysis of the influence of noise showed that all Bayesian indices suffered from increasing noise under H_{1} with no apparent patterns or regularities, or one of the indices being more robust to noise than the others.
The type I error rates, and the sensitivity to detect an existing effect revealed that all Bayesian indices should be preferred to the classic pvalue, although the evalue showed only slightly reduced type I error rates compared to the traditional pvalue. This result is essential, as the control of type I error rates is one of the most critical aspects in clinical trials, see McElreath [29] and Ioannidis [7]. The results showed further that the full ROPE and the PD achieve the best control of type I errors. As the PD cannot transparently state evidence for the null as shown previously, we recommend using the full ROPE to control type I errors in clinical trials.
While the Bayes factor, the MAPbased pvalue, the evalue and the 95% ROPE are more sensitive and detect more effects when using the same sample size n, their type I error rate control is weaker.
Conclusion
To guide researchers in the selection of an appropriate index for clinical trials, we recommend using the full ROPE in general for the following reasons: Like the Bayes factor and 95% ROPE, the full ROPE can state evidence for both the null and the alternative hypothesis. The influence of sample size n, noise ε and prior modelling is similar for all three indices, but the type I error rate control is better for the full ROPE. The slightly weaker sensitivity to existing effects can be overcome by simply increasing the study sample size n, as shown in Fig. 7: For sample sizes of n=100, the sensitivity is nearly equal to the sensitivity of the Bayes factor and 95% ROPE when a large effect is present. When medium or small effects are present, larger sample sizes are required, but as multiple hundreds of patients often participate in clinical trials, the benefits of type I error control outweigh the higher costs incurred by increased sample size.^{Footnote 1}
Therefore, researchers and clinicians should benefit from using the full ROPE in the analysis of clinical trial data when conducting a twosample Bayesian ttest through better type I error control and precise effect size estimation.
Availability of data and materials
The datasets generated and/or analysed during the current study as well as a full replication script to reproduce all results are available in the Open Science Framework (OSF) repository, https://osf.io/fbz4s/.
Notes
 1.
In the rare situation where the type I error rate is of less importance, we recommend using the e-value instead, as it has the best sensitivity to detect an existing effect of all indices analysed, and is an attractive Bayesian replacement of the traditional p-value.
Abbreviations
NHST: Null hypothesis significance testing
BF: Bayes factor
ROPE: Region of practical equivalence
PD: Probability of direction
MAP-based p-value: Maximum a posteriori based p-value
RCT: Randomized clinical trial
ASA: American statistical association
JASP: Jeffreys awesome statistics package (software)
SPSS: Statistics package for the social sciences
References
 1
Nuijten MB, Hartgerink CHJ, van Assen MALM, Epskamp S, Wicherts JM. The prevalence of statistical reporting errors in psychology (19852013). Behav Res Methods. 2016; 48(4):1205–26. https://doi.org/10.3758/s1342801506642.
 2
Wetzels R, Matzke D, Lee MD, Rouder JN, Iverson GJ, Wagenmakers EJ. Statistical evidence in experimental psychology: An empirical comparison using 855 t tests. Perspect Psychol Sci. 2011; 6(3):291–8. https://doi.org/10.1177/1745691611406923.
 3
Wasserstein RL, Lazar NA. The ASA’s Statement on pValues: Context, Process, and Purpose. The American Statistician. 2016; 70(2):129–133. https://doi.org/10.1080/00031305.2016.1154108. http://arxiv.org/abs/1011.1669.
 4
Wasserstein RL, Schirm AL, Lazar NA. Moving to a World Beyond “p<0.05”. Am Stat. 2019; 73(sup1):1–19. https://doi.org/10.1080/00031305.2019.1583913.
 5
Matthews R, Wasserstein R, Spiegelhalter D. The ASA’s pvalue statement, one year on. Significance. 2017; 14(2):38–41. https://doi.org/10.1111/j.17409713.2017.01021.x.
 6
Ioannidis JPA. What Have We (Not) Learnt from Millions of Scientific Papers with pValues? Am Stat. 2019; 73:20–5. https://doi.org/10.1080/00031305.2018.1447512.
 7
Ioannidis JPA. Why Most Clinical Research Is Not Useful. PLoS Med. 2016; 13(6):1002049. https://doi.org/10.1371/journal.pmed.1002049.
 8
Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers EJ, Berk R, Bollen KA, Brembs B, Brown L, Camerer C, Cesarini D, Chambers CD, Clyde M, Cook TD, De Boeck P, Dienes Z, Dreber A, Easwaran K, Efferson C, Fehr E, Fidler F, Field AP, Forster M, George EI, Gonzalez R, Goodman S, Green E, Green DP, Greenwald AG, Hadfield JD, Hedges LV, Held L, Hua Ho T, Hoijtink H, Hruschka DJ, Imai K, Imbens G, Ioannidis JPA, Jeon M, Jones JH, Kirchler M, Laibson D, List J, Little R, Lupia A, Machery E, Maxwell SE, McCarthy M, Moore DA, Morgan SL, Munafó M, Nakagawa S, Nyhan B, Parker TH, Pericchi L, Perugini M, Rouder J, Rousseau J, Savalei V, Schönbrodt FD, Sellke T, Sinclair B, Tingley D, Van Zandt T, Vazire S, Watts DJ, Winship C, Wolpert RL, Xie Y, Young C, Zinman J, Johnson VE. Redefine statistical significance. Nat Hum Behav. 2018; 2(1):6–10. https://doi.org/10.1038/s415620170189z.
 9
Etz A, Wagenmakers EJ. J. B. S. Haldane’s Contribution to the Bayes Factor Hypothesis Test. Stat Sci. 2015; 32(2):313–29. https://doi.org/10.1214/16STS599. http://arxiv.org/abs/1511.08180.
 10
Ly A, Verhagen J, Wagenmakers EJ. An evaluation of alternative methods for testing hypotheses, from the perspective of Harold Jeffreys. J Math Psychol. 2016; 72:43–55. https://doi.org/10.1016/j.jmp.2016.01.003.
 11
Jeffreys H. Theory of Probability, 3rd edn. Oxford: Oxford University Press; 1961.
 12
Kruschke JK, Liddell TM. The Bayesian New Statistics : Hypothesis testing, estimation, metaanalysis, and power analysis from a Bayesian perspective. Psychon Bull Rev. 2018; 25:178–206. https://doi.org/10.3758/s1342301612214.
 13
Makowski D, BenShachar MS, Chen SHA, Lüdecke D. Indices of Effect Existence and Significance in the Bayesian Framework. Front Psychol. 2019; 10:2767. https://doi.org/10.3389/fpsyg.2019.02767.
 14
Mills J. Objective Bayesian Hypothesis Testing; 2017. https://economics.ku.edu/sites/economics.ku.edu/files/files/Seminar/papers1718/april20.pdf.
 15
De Bragança Pereira CA, Stern JM. Evidence and credibility: Full Bayesian significance test for precise hypotheses. Entropy. 1999; 1(4):99–110. https://doi.org/10.3390/e1040099.
 16
Pereira CADB, Stern JM, Wechsler S. Can a significance test be genuinely bayesian? Bayesian Analysis. 2008; 3(1):79–100. https://doi.org/10.1214/08BA303.
 17
Robert CP. The expected demise of the Bayes factor. J Math Psychol. 2016; 72(2009):33–7. https://doi.org/10.1016/j.jmp.2015.08.002. http://arxiv.org/abs/1506.08292.
 18
Ly A, Verhagen J, Wagenmakers EJ. Harold Jeffreys’s default Bayes factor hypothesis tests: Explanation, extension, and application in psychology. J Math Psychol. 2016; 72:19–32. https://doi.org/10.1016/j.jmp.2015.06.004.
 19
Kruschke JK. Rejecting or Accepting Parameter Values in Bayesian Estimation. Adv Methods Pract Psychol Sci. 2018; 1(2):270–80. https://doi.org/10.1177/2515245918771304.
 20
Cohen J. Statistical Power Analysis for the Behavioral Sciences, 2nd edn. Hillsdale: Routledge; 1988.
 21
Kamary K, Mengersen K, Robert CP, Rousseau J. Testing hypotheses via a mixture estimation model. arXiv preprint. 2014:1–37. https://doi.org/10.16373/j.cnki.ahr.150049. http://arxiv.org/abs/1412.2044.
 22
Kass RE, Raftery AE. Bayes factors. J Am Stat Assoc. 1995; 90(430):773–95.
 23
van Doorn J, van den Bergh D, Bohm U, Dablander F, Derks K, Draws T, Evans NJ, Gronau QF, Hinne M, Kucharský Š, Ly A, Marsman M, Matzke D, Raj A, Sarafoglou A, Stefan A, Voelkel JG, Wagenmakers EJ. The JASP Guidelines for Conducting and Reporting a Bayesian Analysis. PsyArxiv Preprint. 2019. https://doi.org/10.31234/osf.io/yqxfr. http://arxiv.org/abs/osf.io/yqxfr.
 24
Kruschke JK. Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan, Second Edition. Oxford: Academic Press; 2015, pp. 1–759. https://doi.org/10.1016/B9780124058880.099992. http://arxiv.org/abs/arXiv:1011.1669v3.
 25
Stern JM, Pereira CAdB. The evalue: A Fully Bayesian Significance Measure for Precise Statistical Hypotheses and its Research Program. arXiv preprint. 2020:0–3. https://doi.org/arXiv:2001.10577v1. http://arxiv.org/abs/arXiv:2001.10577v2.
 26
Rouder JN, Speckman PL, Sun D, Morey RD, Iverson G. Bayesian t tests for accepting and rejecting the null hypothesis. Psychon Bull Rev. 2009; 16(2):225–37. https://doi.org/10.3758/PBR.16.2.225.
 27
Kruschke JK. Bayesian estimation supersedes the t test. J Exp Psychol Gen. 2013; 142(2):573–603. https://doi.org/10.1037/a0029146.
 28
Gronau QF, Ly A, Wagenmakers EJ. Informed Bayesian t Tests. Am Stat. 2019; 00(0):1–7. https://doi.org/10.1080/00031305.2018.1562983.
 29
McElreath R, Smaldino PE. Replication, communication, and the population dynamics of scientific discovery. PLoS ONE. 2015; 10(8):1–16. https://doi.org/10.1371/journal.pone.0136088.
 30
R Core Team. R: A Language and Environment for Statistical Computing. 2019. https://www.rproject.org/.
 31
Morey RD, Rouder JN. BayesFactor: Computation of Bayes Factors for Common Designs. 2018. https://cran.rproject.org/package=BayesFactor.
 32
Makowski D, BenShachar MS, Lüdecke D. bayestestR: Describing Effects and their Uncertainty, Existence and Significance within the Bayesian Framework. J Open Source Softw. 2019; 4(40). https://doi.org/10.21105/joss.01541.
Acknowledgements
The quality of a first draft of the manuscript was improved by the helpful comments of Julio Michael Stern, who pointed the author towards the FBST and the evalue. Also, the author thanks Bruno Mario Cesana, M.D., whose comments clearly helped in improving the overall quality of the manuscript. The author also thanks the Center for Media and Computing Technology at University of Siegen for access to their highperformance computing cluster.
Funding
Not applicable.
Author information
Contributions
The author(s) read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Competing interests
The author declares that he has no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Kelter, R. Analysis of Bayesian posterior significance and effect size indices for the twosample ttest to support reproducible medical research. BMC Med Res Methodol 20, 88 (2020). https://doi.org/10.1186/s12874020009682
DOI: https://doi.org/10.1186/s12874020009682
Keywords
 Bayesian significance and effect measures
 Bayesian testing
 Student’s ttest
 Bayesian biostatistics