baymedr: an R package and web application for the calculation of Bayes factors for superiority, equivalence, and non-inferiority designs

Background: Clinical trials often seek to determine the superiority, equivalence, or non-inferiority of an experimental condition (e.g., a new drug) compared to a control condition (e.g., a placebo or an already existing drug). The use of frequentist statistical methods to analyze data for these types of designs is ubiquitous even though they have several limitations. Bayesian inference remedies many of these shortcomings and allows for intuitive interpretations, but is currently difficult to implement for the applied researcher.
Results: We outline the frequentist conceptualization of superiority, equivalence, and non-inferiority designs and discuss its disadvantages. Subsequently, we explain how Bayes factors can be used to compare the relative plausibility of competing hypotheses. We present baymedr, an R package and web application that provides user-friendly tools for the computation of Bayes factors for superiority, equivalence, and non-inferiority designs. Instructions on how to use baymedr are provided and an example illustrates how existing results can be reanalyzed with baymedr.
Conclusions: Our baymedr R package and web application enable researchers to conduct Bayesian superiority, equivalence, and non-inferiority tests. baymedr is characterized by a user-friendly implementation, making it convenient for researchers who are not statistical experts. Using baymedr, it is possible to calculate Bayes factors based on raw data and summary statistics.


Introduction
Researchers generally agree that the clinical trial is the best method to determine and compare the effects of medications and treatments (E. Christensen, 2007; Friedman et al., 2010). Although clinical trials are often similar in design, different statistical procedures need to be employed depending on the nature of the research question.
Commonly, clinical trials seek to determine the superiority, equivalence, or non-inferiority of an experimental condition (e.g., subjects receiving a new medication) compared to a control condition (e.g., subjects receiving a placebo or an already existing medication; Lesaffre, 2008; Piaggio et al., 2012). For these goals, statistical inference is often conducted in the form of testing.
Usually, the frequentist approach to statistical testing forms the framework in which data for these research designs are analyzed (Chavalarias et al., 2016). In particular, researchers often rely on null hypothesis significance testing (NHST), which quantifies evidence through a p-value. This p-value represents the probability of obtaining a test statistic (e.g., a t-value) at least as extreme as the one observed, assuming that the null hypothesis is true. In other words, the p-value is an indicator of the unusualness of the obtained test statistic under the null hypothesis, forming a "proof by contradiction" (R. Christensen, 2005, p. 123). If the p-value is smaller than a predefined Type I error rate (α), typically set to α = .05 (but see, e.g., Benjamin et al., 2018; Lakens, Adolfi, et al., 2018), rejection of the null hypothesis is warranted; otherwise the obtained data do not justify rejection of the null hypothesis.
An alternative to NHST is statistical testing within a Bayesian framework.
Bayesian statistics is based on the idea that the credibilities of well-defined parameter values (e.g., effect size) or models (e.g., null and alternative hypotheses) are updated based on new observations (Kruschke, 2015). With exploding computational power and the rise of Markov chain Monte Carlo methods (e.g., Gilks et al., 1995; van Ravenzwaaij et al., 2018) that are used to estimate probability distributions that cannot be determined analytically, applications of Bayesian inference have recently become tractable. Indeed, Bayesian methods are seeing more and more use in the biomedical field (Berry, 2006) and other disciplines (van de Schoot et al., 2017).
Despite the fact that statistical inference is slowly changing from frequentist methods towards Bayesian methods, a majority of biomedical research still employs frequentist statistical techniques (Chavalarias et al., 2016). To some extent, this might be due to a biased statistical education in favor of frequentist inference. Moreover, researchers might perceive statistical inference through NHST and reporting of p-values as prescriptive and, hence, adhere to this convention (Gigerenzer, 2004; Winkler, 2001). We believe that one of the most crucial factors is the unavailability of easy-to-use Bayesian tools and software, leaving Bayesian hypothesis testing largely to statistical experts. Fortunately, important advances have been made towards user-friendly interfaces for Bayesian analyses with the release of the BayesFactor software (Morey & Rouder, 2018), written in R (R Core Team, 2021), and point-and-click software like JASP (JASP Team, 2021) and Jamovi (The jamovi project, 2021), the latter two of which are based to some extent on the BayesFactor software. However, these tools are mainly tailored towards research designs in the social sciences. Easy-to-use Bayesian tools and corresponding accessible software for the analysis of biomedical research designs specifically (e.g., superiority, equivalence, and non-inferiority) are still missing and, thus, urgently needed.
In this article, we provide a software package and a web application for conducting Bayesian hypothesis tests for superiority, equivalence, and non-inferiority designs.
Although implementations for the superiority and equivalence test exist elsewhere, the implementation of the non-inferiority test is novel. Firstly, we outline the traditional frequentist approach to statistical testing for each of these designs. Secondly, we discuss the key disadvantages and potential pitfalls of this approach and motivate why Bayesian inferential techniques are better suited for these research designs. Thirdly, we explain the conceptual background of Bayes factors (Goodman, 1999b; Jeffreys, 1939, 1948, 1961; Kass & Raftery, 1995). Fourthly, we provide and introduce baymedr (Linde & van Ravenzwaaij, 2021), an open-source software written in R (R Core Team, 2021) that comes together with a web application (https://maxlinde.shinyapps.io/baymedr/), for the computation of Bayes factors for common biomedical designs. We provide step-by-step instructions on how to use baymedr. Finally, we present a reanalysis of an existing empirical study to illustrate the most important features of the baymedr R package and the accompanying web application.

Frequentist Inference for Superiority, Equivalence, and Non-Inferiority Designs
The superiority, equivalence, and non-inferiority tests are concerned with research settings in which two conditions (e.g., control and experimental) are compared on some outcome measure (E. Christensen, 2007; Lesaffre, 2008). For instance, researchers might want to investigate whether a new antidepressant medication is superior, equivalent, or non-inferior compared to a well-established antidepressant. For a continuous outcome variable, the between-group comparison is typically made with one or two t-tests. The three designs differ, however, in the precise specification of the t-tests (see Fig 1).
In the following, we will assume that higher scores on the outcome measure of interest represent a more favorable outcome (i.e., superiority or non-inferiority) than lower scores. For example, high scores are favorable when the measure of interest represents the number of social interactions in patients with social anxiety, whereas low scores are favorable when the outcome variable is the number of depressive symptoms in patients with major depressive disorder. We will also assume that the outcome variable is continuous and that the residuals within both conditions are normally distributed in the population, sharing a common population variance. Throughout this article, the true population effect size (δ) reflects the true standardized difference in the outcome between the experimental condition (i.e., e) and the control condition (i.e., c):

$$\delta = \frac{\mu_e - \mu_c}{\sigma}. \tag{1}$$

The Superiority Design
The superiority design tests whether the experimental condition is superior to the control condition (see the first row of Fig 1). Conceptually, the superiority design consists of a one-sided test due to its inherent directionality. The null hypothesis $H_0$ states that the true population effect size is zero, whereas the alternative hypothesis $H_1$ states that the true population effect size is larger than zero:

$$H_0: \delta = 0, \qquad H_1: \delta > 0. \tag{2}$$

To test these hypotheses, a one-sided t-test is conducted. 1
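As a minimal sketch of the frequentist superiority test described above, the following base R code runs a one-sided, equal-variance t-test and computes the sample counterpart of δ. The data are simulated and all variable names are our own choices for illustration; nothing here is taken from an empirical study.

```r
# Simulated outcome scores (made up for illustration; higher = better)
set.seed(1)
control      <- rnorm(40, mean = 10, sd = 2)
experimental <- rnorm(40, mean = 11, sd = 2)

# One-sided, equal-variance t-test of H0: delta = 0 against H1: delta > 0
res     <- t.test(experimental, control,
                  alternative = "greater", var.equal = TRUE)
t_value <- unname(res$statistic)
p_value <- res$p.value

# Sample counterpart of delta: mean difference divided by the pooled SD
n_e <- length(experimental)
n_c <- length(control)
sd_pooled <- sqrt(((n_e - 1) * var(experimental) +
                   (n_c - 1) * var(control)) / (n_e + n_c - 2))
d_hat <- (mean(experimental) - mean(control)) / sd_pooled

print(c(t = t_value, p = p_value, d = d_hat))
```

Under the equal-variance assumption, the t-value and the standardized effect size estimate are linked by $t = \hat{d}\sqrt{n_e n_c / (n_e + n_c)}$, so either can be recovered from the other once the sample sizes are known.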

1 Researchers often conduct a two-sided t-test and then confirm that the observed effect goes in the expected direction. We do not describe this approach because we have the opinion that a one-sided t-test should be conducted for the superiority test, whose name already implies a uni-directional alternative hypothesis.

The Equivalence Design
The equivalence design tests whether the experimental and control conditions are practically equivalent (see the second row of Fig 1). There are multiple approaches to equivalence testing (see, e.g., Meyners, 2012). A comprehensive treatment of all approaches is beyond the scope of this article. Here, we focus on one popular alternative: the two one-sided tests procedure (TOST; Hodges & Lehmann, 1954; Schuirmann, 1987; Westlake, 1976; see also Meyners, 2012; Senn, 2008). An equivalence interval must be defined, which can be based, for example, on the smallest effect size of interest (Lakens, 2017; Lakens, Scheel, et al., 2018). The specification of the equivalence interval is not a statistical question; thus, it should be set by experts in the respective fields (Meyners, 2012; Schuirmann, 1987) or comply with regulatory guidelines (Garrett, 2003). Importantly, however, the equivalence interval should be determined independent of the obtained data.
TOST involves conducting two one-sided t-tests, each one with its own null and alternative hypotheses. For the first test, the null hypothesis states that the true population effect size is smaller than the lower boundary of the equivalence interval, whereas the alternative hypothesis states that the true population effect size is larger than the lower boundary of the equivalence interval. For the second test, the null hypothesis states that the true population effect size is larger than the upper boundary of the equivalence interval, whereas the alternative hypothesis states that the true population effect size is smaller than the upper boundary of the equivalence interval. Assuming that the equivalence interval is symmetric around the null value, these hypotheses can be summarized as follows:

$$H_0: \delta < -c \ \text{ OR } \ \delta > c, \qquad H_1: \delta > -c \ \text{ AND } \ \delta < c, \tag{3}$$

where $c$ represents the margin of the standardized equivalence interval. Two p-values ($p_{-c}$ and $p_c$) result from the application of the TOST procedure. We reject the null hypothesis of non-equivalence and, thus, establish equivalence if $\max(p_{-c}, p_c) < \alpha$ (cf. Meyners, 2012; Walker & Nowacki, 2011). In other words, both tests need to reach statistical significance.
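The TOST decision rule can be sketched in base R as follows. This is an illustration under stated assumptions: the data are simulated, and the equivalence margin is specified directly in raw outcome units (`margin_raw`, a hypothetical value) rather than in standardized units, because `t.test()` operates on the raw mean difference.

```r
# Simulated outcome scores (made up for illustration)
set.seed(2)
control      <- rnorm(50, mean = 10.0, sd = 2)
experimental <- rnorm(50, mean = 10.1, sd = 2)

alpha      <- 0.05
margin_raw <- 1  # hypothetical equivalence margin in raw outcome units

# Test 1: H0: difference < -margin  vs.  H1: difference > -margin
p_lower <- t.test(experimental, control, mu = -margin_raw,
                  alternative = "greater", var.equal = TRUE)$p.value

# Test 2: H0: difference > margin  vs.  H1: difference < margin
p_upper <- t.test(experimental, control, mu = margin_raw,
                  alternative = "less", var.equal = TRUE)$p.value

# Equivalence is established only if BOTH tests are significant
p_tost     <- max(p_lower, p_upper)
equivalent <- p_tost < alpha

print(c(p_lower = p_lower, p_upper = p_upper, p_tost = p_tost))
print(equivalent)
```

The same decision is obtained by checking whether the two-sided $(1 - 2\alpha)$ confidence interval for the mean difference lies entirely inside the equivalence interval, which is a common way to report TOST results.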

The Non-Inferiority Design
In some situations, researchers are interested in testing whether the experimental condition is non-inferior or not worse than the control condition by a certain amount. This is the goal of the non-inferiority design, which consists of a one-tailed test (see the third row of Fig 1). [...] (..., 1999), is cheaper (Kaul & Diamond, 2006), or is easier to administer than the current medication (Van de Werf et al., 1999). In these cases, we need to weigh the cost of a somewhat lower or equal effectiveness of the new treatment against the value of the just mentioned benefits (Hills, 2017). The null hypothesis states that the true population effect size is equal to a predetermined threshold, whereas the alternative hypothesis states that the true population effect size is higher than this threshold:

$$H_0: \delta = -c, \qquad H_1: \delta > -c, \tag{4}$$

where $c$ represents the standardized non-inferiority margin. As with the equivalence interval, the non-inferiority margin should be defined independent of the obtained data.
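A frequentist non-inferiority test of this kind could be sketched in base R as a one-sided t-test against a shifted null value. The data are simulated, and the margin is again given in raw outcome units (`margin_raw`, a hypothetical value chosen only for illustration).

```r
# Simulated outcome scores (made up for illustration; higher = better)
set.seed(3)
control      <- rnorm(60, mean = 10.0, sd = 2)
experimental <- rnorm(60, mean = 9.8, sd = 2)

alpha      <- 0.05
margin_raw <- 1  # hypothetical non-inferiority margin in raw outcome units

# H0: difference = -margin  vs.  H1: difference > -margin
res <- t.test(experimental, control, mu = -margin_raw,
              alternative = "greater", var.equal = TRUE)

non_inferior <- res$p.value < alpha
print(res$p.value)
print(non_inferior)
```

Equivalently, non-inferiority is concluded when the lower bound of the one-sided confidence interval for the mean difference (available as `res$conf.int[1]`) lies above the negative margin.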

Limitations of Frequentist Inference
Tests of superiority, equivalence, and non-inferiority have great value in biomedical research. It is the way researchers conduct their statistical analyses that, we argue, should be critically reconsidered. There are several disadvantages associated with the application of NHST to superiority, equivalence, and non-inferiority designs. Here, we limit our discussion to two disadvantages; for a more comprehensive exposition we refer the reader to other sources (e.g., Goodman, 1999a; International Committee of Medical Journal Editors, 1997; Rennie, 1978; Wagenmakers et al., 2018).
First, researchers need to stick to a predetermined sampling plan (Rouder, 2014; Schönbrodt & Wagenmakers, 2018; Schönbrodt et al., 2017). That is, it is not legitimate to decide based on interim results to stop data collection (e.g., because the p-value is already smaller than α) or to continue data collection beyond the predetermined sample size (e.g., because the p-value almost reaches statistical significance). In principle, researchers can correct for the fact that they inspected the data by reducing the required significance threshold through one of several techniques (Ranganathan et al., 2016). However, such correction methods are rarely applied. Especially in biomedical research, the possibility of optional stopping could reduce the waste of resources for expensive and time-consuming trials (Chalmers & Glasziou, 2009).
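Why uncorrected interim looks are problematic under NHST can be illustrated with a small simulation. The sketch below is our own illustration, not an analysis from the literature cited above: the number of simulated trials, the sample sizes at which the data are inspected, and the use of a two-sided t-test are all arbitrary choices. Data are generated with the null hypothesis being true, so every rejection is a Type I error.

```r
set.seed(123)
n_sim <- 2000                      # number of simulated trials
looks <- c(20, 40, 60, 80, 100)    # per-group sample sizes at each look
alpha <- 0.05

reject_with_peeking <- logical(n_sim)
reject_fixed_n      <- logical(n_sim)

for (i in seq_len(n_sim)) {
  # Both groups come from the same population, so H0 is true
  x <- rnorm(max(looks))
  y <- rnorm(max(looks))

  # p-value at every interim look, without any correction
  p_at_look <- sapply(looks, function(n) {
    t.test(x[1:n], y[1:n], var.equal = TRUE)$p.value
  })

  # Optional stopping: declare an effect as soon as any look is significant
  reject_with_peeking[i] <- any(p_at_look < alpha)
  # Fixed sampling plan: test only once, at the final sample size
  reject_fixed_n[i] <- p_at_look[length(looks)] < alpha
}

type1_peeking <- mean(reject_with_peeking)
type1_fixed   <- mean(reject_fixed_n)
print(c(peeking = type1_peeking, fixed = type1_fixed))
```

With a fixed sampling plan the rejection rate stays close to α, whereas stopping at the first significant look pushes it noticeably above α, which is why frequentist sequential designs require adjusted thresholds.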
Second, with the traditional frequentist framework it is impossible to quantify evidence in favor of the null hypothesis (Gallistel, 2009; Rouder et al., 2009; van Ravenzwaaij et al., 2019; Wagenmakers, 2007; Wagenmakers et al., 2018). Oftentimes, the p-value is erroneously interpreted as a posterior probability, in the sense that it represents the probability of the null hypothesis (Berger & Sellke, 1987; Gelman, 2013; Goodman, 2008; Haller & Krauss, 2002). However, a non-significant p-value does not only occur when the null hypothesis is in fact true but also when the alternative hypothesis is true, yet there was not enough power to detect an effect (Bakan, 1966; van Ravenzwaaij et al., 2019). As Altman and Bland (1995, p. 485) put it: "Absence of evidence is not evidence of absence".
Still, a large proportion of biomedical studies falsely claim equivalence based on statistically non-significant t-tests (Greene et al., 2000). Yet, estimating evidence in favor of the null hypothesis is essential for certain designs like the equivalence test (Blackwelder, 1982; Hoekstra et al., 2018; van Ravenzwaaij et al., 2019).
The TOST procedure for equivalence testing provides a workaround for the problem that evidence for the null hypothesis cannot be quantified with frequentist techniques by defining an equivalence interval around δ = 0 and conducting two tests. Without this interval the TOST procedure would inevitably fail (see Meyners, 2012, for an explanation of why this is the case). As we will see, the Bayesian equivalence test does not have this restriction; it allows for the specification of interval as well as point null hypotheses.

Bayesian Tests for Superiority, Equivalence, and Non-Inferiority Designs
The Bayesian statistical framework provides a logically sound method to update beliefs about parameters based on new data (Goodman, 1999b; Kruschke, 2015). Bayesian inference can be divided into parameter estimation (e.g., estimating a population correlation) and model comparison (e.g., comparing the relative probabilities of the data under the null and alternative hypotheses) procedures (see, e.g., Kruschke & Liddell, 2018b, for an overview). Here, we will focus on the latter approach, which is usually accomplished with Bayes factors (Goodman, 1999b; Jeffreys, 1939, 1948, 1961; Kass & Raftery, 1995). In our exposition of Bayes factors in general and specifically for superiority, equivalence, and non-inferiority designs, we mostly refrain from complex equations and derivations. Formulas are only provided when we think that they help to communicate the ideas and concepts. We refer readers interested in the mathematics of Bayes factors to other sources (e.g., Etz & Vandekerckhove, 2018; Jeffreys, 1961; Kass & Raftery, 1995; O'Hagan & Forster, 2004; Rouder et al., 2009; Wagenmakers et al., 2010). The precise derivation of Bayes factors for superiority, equivalence, and non-inferiority designs in particular is treated elsewhere (van Ravenzwaaij et al., 2019; see also Gronau et al., 2020).

The Bayes Factor
Let us suppose that we have two hypotheses, $H_0$ and $H_1$, that we want to contrast. Without considering any data, we have initial beliefs about the probabilities of $H_0$ and $H_1$, which are given by the prior probabilities $p(H_0)$ and $p(H_1) = 1 - p(H_0)$. Now, we collect some data $D$. After having seen the data, we have new and refined beliefs about the probabilities that $H_0$ and $H_1$ are true, which are given by the posterior probabilities $p(H_0 \mid D)$ and $p(H_1 \mid D) = 1 - p(H_0 \mid D)$. In other words, we update our prior beliefs about the probabilities of $H_0$ and $H_1$ by incorporating what the data dictates we should believe and arrive at our posterior beliefs. This relation is expressed in Bayes' rule:

$$\underbrace{p(H_i \mid D)}_{\text{posterior}} = \frac{\overbrace{p(D \mid H_i)}^{\text{likelihood}} \; \overbrace{p(H_i)}^{\text{prior}}}{\underbrace{p(D \mid H_0)\, p(H_0) + p(D \mid H_1)\, p(H_1)}_{\text{marginal likelihood}}}, \qquad i \in \{0, 1\}. \tag{5}$$

As we will see, the likelihood in Equation 5 is actually a marginal likelihood because each model (i.e., $H_0$ and $H_1$) contains certain parameters that are integrated out. The denominator in Equation 5 (labeled marginal likelihood) serves as a normalization constant, ensuring that the sum of the posterior probabilities is 1. Without this normalization constant the posterior is still proportional to the product of the likelihood and the prior. Therefore, for $H_0$ and $H_1$ we can also write:

$$p(H_0 \mid D) \propto p(D \mid H_0)\, p(H_0), \qquad p(H_1 \mid D) \propto p(D \mid H_1)\, p(H_1), \tag{6}$$

where $\propto$ means "is proportional to".
Rather than using posterior probabilities for each hypothesis, let the ratio of the posterior probabilities for $H_0$ and $H_1$ be:

$$\underbrace{\frac{p(H_0 \mid D)}{p(H_1 \mid D)}}_{\text{posterior odds}} = \underbrace{\frac{p(H_0)}{p(H_1)}}_{\text{prior odds}} \times \underbrace{\frac{p(D \mid H_0)}{p(D \mid H_1)}}_{\text{Bayes factor, } BF_{01}}. \tag{7}$$

The quantity $p(H_0 \mid D) / p(H_1 \mid D)$ represents the posterior odds and the quantity $p(H_0) / p(H_1)$ is called the prior odds. To get the posterior odds, we have to multiply the prior odds by $p(D \mid H_0) / p(D \mid H_1)$, a quantity known as the Bayes factor (Goodman, 1999b; Jeffreys, 1939, 1948, 1961; Kass & Raftery, 1995), which is a ratio of marginal likelihoods:

$$BF_{01} = \frac{p(D \mid H_0)}{p(D \mid H_1)} = \frac{\int p(D \mid \theta_0, H_0)\, p(\theta_0 \mid H_0)\, d\theta_0}{\int p(D \mid \theta_1, H_1)\, p(\theta_1 \mid H_1)\, d\theta_1}, \tag{8}$$

where $\theta_0$ and $\theta_1$ are vectors of parameters under $H_0$ and $H_1$, respectively. In other words, the marginal likelihoods in the numerator and denominator of Equation 8 are weighted averages of the likelihoods, for which the weights are determined by the corresponding prior. In the case where one hypothesis has fixed values for the parameter vector $\theta_i$ (e.g., a point null hypothesis), integration over the parameter space and the specification of a prior is not required. In that case, the marginal likelihood becomes a likelihood.
The Bayes factor is the amount by which we would update our prior odds to obtain the posterior odds, after taking into consideration the data. For example, if we had prior odds of 2 and the Bayes factor is 24, then the posterior odds would be 48. In the special case where the prior odds is 1, the Bayes factor is equal to the posterior odds. A major advantage of the Bayes factor is its ease of interpretation. For example, if the Bayes factor ($BF_{01}$, denoting the fact that $H_0$ is in the numerator and $H_1$ in the denominator) equals 10, the data are ten times more likely to have occurred under $H_0$ compared to $H_1$. With $BF_{01} = 0.2$, we can say that the data are five times more likely under $H_1$ compared to $H_0$ because we can simply take the reciprocal of $BF_{01}$ (i.e., $BF_{10} = 1/BF_{01}$). What constitutes enough evidence is subjective and certainly depends on the context. Nevertheless, rules of thumb for evidence thresholds have been proposed. For instance, Kass and Raftery (1995) labeled Bayes factors between 1 and 3 as "not worth more than a bare mention", Bayes factors between 3 and 20 as "positive", those between 20 and 150 as "strong", and anything above 150 as "very strong", with corresponding thresholds for the reciprocals of the Bayes factors. An alternative classification scheme was already proposed before, with thresholds at 3, 10, 30, and 100 and similar labels (Jeffreys, 1961; see also Lee & Wagenmakers, 2013, for updated labels).
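The arithmetic in this paragraph can be reproduced in a few lines of R. The helper `label_kass_raftery()` is a hypothetical function written for this illustration; in particular, how values falling exactly on a threshold (3, 20, or 150) are assigned is our own assumption, since the verbal labels do not settle it.

```r
# Updating prior odds with a Bayes factor (numbers from the text)
prior_odds     <- 2
bf_01          <- 24
posterior_odds <- prior_odds * bf_01                   # 2 * 24 = 48
posterior_p_h0 <- posterior_odds / (1 + posterior_odds)  # odds -> probability

# Reciprocal: BF01 = 0.2 corresponds to BF10 = 5
bf_10 <- 1 / 0.2

# Hypothetical helper: verbal labels after Kass and Raftery (1995).
# Bayes factors below 1 are inverted first, so the label describes the
# strength of evidence for whichever hypothesis is favored.
label_kass_raftery <- function(bf) {
  bf <- ifelse(bf < 1, 1 / bf, bf)
  as.character(cut(bf,
                   breaks = c(1, 3, 20, 150, Inf),
                   labels = c("not worth more than a bare mention",
                              "positive", "strong", "very strong"),
                   include.lowest = TRUE))
}

print(posterior_odds)
print(label_kass_raftery(c(24, 0.2, 2, 500)))
```

Note that the labels describe the Bayes factor only; the posterior odds additionally depend on the prior odds that a reader brings to the comparison.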
Of course, we need to define $H_0$ and $H_1$. In other words, both models contain certain parameters for which we need to determine a prior distribution. Here, we will assume that the residuals of the two groups are normally distributed in the population with a common population variance. The shape of a Normal distribution is fully determined by the location (mean; $\mu$) and the scale (variance; $\sigma^2$) parameters. Thus, in principle, both models contain two parameters. Now, we make two important changes.
Secondly, $\mu$ under $H_1$ can be expressed in terms of a population effect size δ (Gönen et al., 2005; Rouder et al., 2009). This establishes a common and comparable scale across experiments and populations (Rouder et al., 2009). The prior on δ could reflect certain hypotheses that we want to test. For instance, we could compare the null hypothesis ($H_0: \delta = 0$) to a two-sided alternative hypothesis ($H_1: \delta \neq 0$) or to one of two one-sided alternative hypotheses ($H_1: \delta < 0$ or $H_1: \delta > 0$). Alternatively, we could compare an interval hypothesis for the null hypothesis ($H_0: -c < \delta < c$) with a corresponding alternative hypothesis ($H_1: \delta < -c$ OR $\delta > c$). The choice of the specific prior for δ is a delicate matter, which is discussed in the next section.
In the most general case, the Bayes factor (i.e., $BF_{01}$) can be calculated through division of the posterior odds by the prior odds (i.e., rearranging Equation 7):

$$BF_{01} = \frac{p(H_0 \mid D)}{p(H_1 \mid D)} \Big/ \frac{p(H_0)}{p(H_1)}; \tag{9}$$

accordingly, we can also calculate $BF_{10}$:

$$BF_{10} = \frac{p(H_1 \mid D)}{p(H_0 \mid D)} \Big/ \frac{p(H_1)}{p(H_0)}. \tag{10}$$

Calculating Bayes factors this way often involves solving complex integrals (see, e.g., Equation 8; also cf. Wagenmakers et al., 2010). Fortunately, there is a computational shortcut for the specific but very common scenario where we have a point null hypothesis and a complementary interval alternative hypothesis. This shortcut, which is called the Savage-Dickey density ratio, takes the ratio of the density of the prior and posterior at the null value under the alternative hypothesis to calculate the Bayes factor; this is explained in more detail elsewhere (Dickey & Lientz, 1970; Kass & Raftery, 1995; van Ravenzwaaij & Etz, 2021; Wagenmakers et al., 2010).
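The Savage-Dickey shortcut is easiest to verify in a model where both routes to the Bayes factor are simple. The sketch below deliberately swaps in a toy binomial model (a rate θ with a point null θ = 0.5 and a Beta(1, 1) prior under the alternative) instead of the t-test setting of this article, because there the posterior is available in closed form. The data (`k` successes in `n` trials) are made up.

```r
# Toy data (made up): k successes in n trials
k <- 7
n <- 20
theta_0 <- 0.5   # point null value
a <- 1; b <- 1   # Beta(a, b) prior on theta under H1

# Route 1: ratio of marginal likelihoods
# Under H0 there is nothing to integrate: the likelihood at theta_0
m0 <- dbinom(k, n, theta_0)
# Under H1 the likelihood is averaged over the prior
m1 <- integrate(function(theta) dbinom(k, n, theta) * dbeta(theta, a, b),
                lower = 0, upper = 1)$value
bf01_marginal <- m0 / m1

# Route 2: Savage-Dickey density ratio
# Posterior under H1 is Beta(a + k, b + n - k); compare its height at
# theta_0 with the height of the prior at theta_0
bf01_savage_dickey <- dbeta(theta_0, a + k, b + n - k) / dbeta(theta_0, a, b)

print(c(marginal = bf01_marginal, savage_dickey = bf01_savage_dickey))
```

Both routes give the same value of $BF_{01}$ up to numerical integration error: if the posterior is higher than the prior at the null value, the data have made the null value more credible, and the ratio of the two heights quantifies by how much.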

Default Priors
Until this point in our exposition, we were quite vague about the form of the prior for δ under $H_1$. In principle, the prior for δ within $H_1$ can be defined as desired, conforming to the beliefs of the researcher. In fact, this is a fundamental part of Bayesian inference because various priors allow for the expression of a theory or prior beliefs (Morey et al., 2016; Vanpaemel, 2010). Most commonly, however, default or objective priors are employed that aim to increase the objectivity in specifying the prior or serve as a default when no specific prior information is available (Consonni et al., 2018; Jeffreys, 1961; Rouder et al., 2009). We employ objective priors in baymedr.
In the situation where we have a point null hypothesis and an alternative hypothesis that involves a range of values, Jeffreys (1961) proposed to use a Cauchy prior with a scale parameter of $r = 1$ for δ under $H_1$. This Cauchy distribution is equivalent to a Student's t distribution with 1 degree of freedom and resembles a standard Normal distribution, except that the Cauchy distribution has less mass at the center but instead heavier tails (see Fig 2; Rouder et al., 2009). Mathematically, the Cauchy distribution corresponds to the combined specification of (1) a Normal prior with mean $\mu_\delta$ and variance $\sigma^2_\delta$ on δ; and (2) an inverse Chi-square distribution with 1 degree of freedom on $\sigma^2_\delta$. Integrating out $\sigma^2_\delta$ yields the Cauchy distribution (Liang et al., 2008; Rouder et al., 2009). The scale parameter $r$ defines the width of the Cauchy distribution; that is, half of the mass lies between $-r$ and $r$.
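The properties of the Cauchy prior stated in this paragraph can be checked numerically in base R. The following sketch is only a verification aid; `cauchy_from_mixture()` is a hypothetical helper that integrates a zero-mean Normal density over an inverse Chi-square(1) distribution on its variance, and the evaluation points are arbitrary.

```r
# Half of the prior mass lies between -r and r, whatever the scale r
r <- 1 / sqrt(2)
mass_within_r <- pcauchy(r, location = 0, scale = r) -
                 pcauchy(-r, location = 0, scale = r)

# The standard Cauchy is a Student's t distribution with 1 degree of freedom
x <- seq(-5, 5, by = 0.5)
same_as_t1 <- isTRUE(all.equal(dcauchy(x), dt(x, df = 1)))

# Less mass at the center, heavier tails than the standard Normal
lower_peak   <- dcauchy(0) < dnorm(0)
heavier_tail <- pcauchy(-3) > pnorm(-3)

# Scale mixture: delta | g ~ Normal(0, g), with g ~ inverse Chi-square(1).
# If X ~ Chi-square(1), then g = 1 / X has density dchisq(1 / g, 1) / g^2.
cauchy_from_mixture <- function(delta) {
  integrate(function(g) {
    dnorm(delta, mean = 0, sd = sqrt(g)) * dchisq(1 / g, df = 1) / g^2
  }, lower = 0, upper = Inf)$value
}

print(mass_within_r)
print(c(mixture = cauchy_from_mixture(1), cauchy = dcauchy(1)))
```

Integrating out the variance reproduces the standard Cauchy density, which is why the Cauchy prior can be read as a Normal prior on δ whose spread is itself uncertain.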
Choosing a Cauchy prior with a location parameter of 0 and a scale parameter of $r = 1$ has the advantage that the resulting Bayes factor is 1 in case of completely uninformative data. In turn, the Bayes factor approaches infinity (or 0) for decisive data (Bayarri et al., 2012; Jeffreys, 1961). Still, by varying the Cauchy scale parameter, we can set a different emphasis on the prior credibility of a range of effect sizes. More recently, a Cauchy prior scale of $r = 1/\sqrt{2}$ is used as a default setting in the BayesFactor software (Morey & Rouder, 2018), the point-and-click software JASP (JASP Team, 2021), and Jamovi (The jamovi project, 2021). We have adopted this value in baymedr as a default setting. Nevertheless, objective priors are often criticized (see, e.g., Kruschke & Liddell, 2018a; Tendeiro & Kiers, 2019); researchers are encouraged to use more informed priors if relevant knowledge is available (Gronau et al., 2020; Rouder et al., 2009).
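To make the role of the Cauchy prior concrete, the following base R sketch computes a two-sample Bayes factor from a t-value by numerical integration. It is an illustration under stated assumptions and not necessarily how baymedr or the BayesFactor software compute their results internally: the function `bf10_two_sample()` and its arguments are hypothetical, it relies on the fact that the t-statistic follows a noncentral t distribution with noncentrality $\delta\sqrt{n_x n_y / (n_x + n_y)}$ for a given δ, and one-sided hypotheses are represented by truncating and renormalizing the prior.

```r
# Hypothetical helper: BF10 for a two-sample t-test with a zero-centered
# Cauchy prior (scale r) on delta, optionally truncated to [lower, upper]
bf10_two_sample <- function(t, n_x, n_y, r = 1 / sqrt(2),
                            lower = -Inf, upper = Inf) {
  df    <- n_x + n_y - 2
  n_eff <- n_x * n_y / (n_x + n_y)

  # Prior mass inside [lower, upper], used to renormalize a truncated prior
  prior_mass <- pcauchy(upper, 0, r) - pcauchy(lower, 0, r)

  # Likelihood of t given delta, weighted by the (truncated) prior density
  integrand <- function(delta) {
    suppressWarnings(dt(t, df = df, ncp = delta * sqrt(n_eff))) *
      dcauchy(delta, 0, r) / prior_mass
  }

  m1 <- integrate(integrand, lower = lower, upper = upper)$value  # under H1
  m0 <- dt(t, df = df)                                            # under H0
  m1 / m0
}

# Made-up example: t = 2.5 with 40 subjects per group
bf_two   <- bf10_two_sample(2.5, 40, 40)              # H1: delta != 0
bf_plus  <- bf10_two_sample(2.5, 40, 40, lower = 0)   # H1: delta > 0
bf_minus <- bf10_two_sample(2.5, 40, 40, upper = 0)   # H1: delta < 0
print(c(two_sided = bf_two, positive = bf_plus, negative = bf_minus))
```

Changing `r` shifts prior weight towards smaller or larger effect sizes, which is one way to examine how sensitive a conclusion is to the choice of prior scale.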

Using baymedr
With the baymedr software (BAYesian inference for MEDical designs in R; Linde & van Ravenzwaaij, 2021), written in R (R Core Team, 2021), and the corresponding web application (accessible at https://maxlinde.shinyapps.io/baymedr/) one can easily calculate Bayes factors for superiority, equivalence, and non-inferiority designs. The R package can be used by researchers who have only rudimentary knowledge of R; if that is not the case, researchers can use the web application, which does not require any knowledge of programming. In the following, we will demonstrate how Bayes factors for superiority, equivalence, and non-inferiority designs can be calculated with the baymedr R package; a thorough explanation of the web application is not necessary as it strongly overlaps with the R package. Subsequently, we will showcase (1) the baymedr R package and (2) the corresponding web application by reanalyzing data of an empirical study by Basner et al. (2019).

Install and Load baymedr
To install the latest release of the baymedr R package from The Comprehensive R Archive Network (CRAN; https://CRAN.R-project.org/package=baymedr), use the following command:
install.packages("baymedr")
The most recent version of the R package can be obtained from GitHub (https://github.com/maxlinde/baymedr) with the help of the devtools package (Wickham et al., 2019):
devtools::install_github("maxlinde/baymedr")
Once baymedr is installed, it needs to be loaded into memory, after which it is ready for usage:
library("baymedr")
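As a quick check that the installation works, one might run a first test on summary statistics. The sketch below is an assumption-laden illustration: the numbers are invented, and the argument names (n_x, n_y, mean_x, mean_y, sd_x, sd_y, direction) are the ones described in the sections that follow. The call is guarded so that the code also runs, with a message, when baymedr is not installed.

```r
# Made-up summary statistics, for illustration only
# ("x" = control condition, "y" = experimental condition)
summary_stats <- list(n_x = 50, n_y = 50,
                      mean_x = 10.0, mean_y = 11.2,
                      sd_x = 2.1, sd_y = 1.9)

if (requireNamespace("baymedr", quietly = TRUE)) {
  # Bayesian superiority test; higher scores are assumed to be favorable
  result <- do.call(baymedr::super_bf,
                    c(summary_stats, list(direction = "high")))
  print(result)
} else {
  message("baymedr is not installed; run install.packages(\"baymedr\") first.")
}
```

If the package is available, the printed output should restate the hypotheses, the prior scale, and the resulting Bayes factor, as described under "Commonalities Across Designs".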

Commonalities Across Designs
For all three research designs, the user has three options for data input (function arguments that have "x" as a name or suffix refer to the control condition and those with "y" as a name or suffix to the experimental condition): (1) provide the raw data; the Once a superiority, equivalence, or non-inferiority test is conducted, an informative and accessible output message is printed in the console.For all three designs, this output states the type of test that was conducted and whether raw or summary data were used.
Moreover, the corresponding null and alternative hypotheses are restated and the specified Cauchy prior scale is shown.In addition, the lower and upper bounds of the equivalence interval are presented in case an equivalence test was employed; similarly, the non-inferiority margin is printed when the non-inferiority design was chosen.Lastly, the resulting Bayes factor is shown.To avoid any confusion, it is declared in brackets whether the Bayes factor quantifies evidence towards the null (e.g., equivalence) or alternative (e.g., non-inferiority or superiority) hypothesis.

Conducting Superiority, Equivalence, and Non-inferiority Tests
The Bayesian superiority test is performed with the super_bf() function.
Depending on the research setting, low or high scores on the measure of interest represent "superiority", which is specified by the argument direction.Since we seek to find evidence for the alternative hypothesis (superiority), the Bayes factor quantifies evidence for H 1 relative to H 0 (i.e., BF 10 ).
The Bayesian equivalence test is done with the equiv_bf() function.The desired equivalence interval is specified with the interval argument.Several options are possible: A symmetric equivalence interval around δ = 0 can be indicated by providing one value (e.g., interval = 0.2) or by providing a vector with the negative and the positive values (e.g., interval = c(-0.2,0.2)).An asymmetric equivalence interval can be specified by providing a vector with the negative and the positive values (e.g., interval = c(-0.3,0.2)).The implementation of a point null hypothesis is achieved by using either interval = 0 or interval = c(0, 0), which also serves as the default specification.The argument interval_std can be used to declare whether the equivalence interval was specified in standardized or unstandardized units.Since we seek to quantify evidence towards equivalence, we contrast the evidence for H 0 relative to H 1 (i.e., BF 01 ).
The Bayes factor for the non-inferiority design is calculated with the infer_bf() function.The value for the non-inferiority margin can be specified with the ni_margin argument.The argument ni_margin_std can be used to declare whether the non-inferiority margin was given in standardized or unstandardized units.Lastly, depending on whether high or low values on the measure of interest represent "non-inferiority", one of the options "high" or "low" should be set for the argument direction.We wish to determine the evidence in favor of H 1 ; therefore, the evidence is expressed for H 1 relative to H 0 (i.e., BF 10 ).

Demonstration of baymedr
To illustrate how the R package and the web application can be used, we provide one example of an empirical study that employed non-inferiority tests to investigate differences in the amount of sleep, sleepiness, and alertness among medical trainees following either standard or flexible duty-hour programs (Basner et al., 2019)  Shown is part of the baymedr web application demonstrating how summary statistics can be inserted and further parameters specified for a Bayesian non-inferiority test.In this specific case, the summary statistics correspond to the ones obtained from Basner et al. (2019).See text for details.
truncated, meaning that they are cut off at δ = c.Similarly, the lower plot shows the truncated prior and posterior for contrasting H 0 : δ = c with H 1 : δ > c.Through a heuristic called the Savage-Dickey density ratio (Dickey & Lientz, 1970;Kass & Raftery, 1995;van Ravenzwaaij & Etz, 2021;Wagenmakers et al., 2010), the ratio of the heights of the colored dots gives us the Bayes factor (see the colored expressions in the formula on the right side

Figure 4
Shown is part of the baymedr web application showing the results of a Bayesian non-inferiority test.In this specific case, the results correspond to a reanalysis using summary statistics obtained from Basner et al. (2019).See text for details.
of the results output).The text above the two plots explains the plots as well.

Discussion
Tests of superiority, equivalence, and non-inferiority are important means to compare the effectiveness of medications and treatments in biomedical research.Despite several limitations, researchers overwhelmingly rely on traditional frequentist inference to analyze the corresponding data for these research designs (Chavalarias et al., 2016).We believe that Bayes factors (Goodman, 1999b;Jeffreys, 1939Jeffreys, , 1948Jeffreys, , 1961;;Kass & Raftery, 1995) are an attractive alternative to NHST and p-values because they allow researchers to quantify evidence in favor of the null hypothesis (Gallistel, 2009;van Ravenzwaaij et al., 2019;Wagenmakers, 2007;Wagenmakers et al., 2018) and permit sequential testing and optional stopping (Rouder, 2014;Schönbrodt & Wagenmakers, 2018;Schönbrodt et al., 2017).In fact, we believe that the possibility for optional stopping and sequential testing has the potential to largely reduce the waste of scarce resources.This is especially important in the field of biomedicine, where clinical trials might be expensive or even harmful for participants.
Our baymedr R package and web application (Linde & van Ravenzwaaij, 2021) enable researchers to conduct Bayesian superiority, equivalence, and non-inferiority tests.
baymedr is characterized by a user-friendly implementation, making it convenient for researchers who are not statistical experts. Furthermore, using baymedr, it is possible to calculate Bayes factors based on raw data and summary statistics, allowing for the reanalysis of published studies for which the full data set is not available.
Figure 2
Comparison of the standard Normal probability density function (solid line) and the standard Cauchy probability density function (dashed line).
relevant arguments are x and y; (2) provide the sample sizes, sample means, and sample standard deviations; the relevant arguments are n_x and n_y for sample sizes, mean_x and mean_y for sample means, and sd_x and sd_y for sample standard deviations; (3) provide the sample sizes, sample means, and the confidence interval for the difference in group means; the relevant arguments are n_x and n_y for sample sizes, mean_x and mean_y for sample means, and ci_margin for the confidence interval margin and ci_level for the confidence level.

The Cauchy distribution is used as the prior for δ under the alternative hypothesis for all three tests. The user can set the width of the Cauchy prior with the prior_scale argument, thus allowing the specification of different ranges of plausible effect sizes. In all three cases, the Cauchy prior is centered on δ = 0. Further, baymedr uses a default Cauchy prior scale of r = 1/√2, complying with the standard settings of the BayesFactor software (Morey & Rouder, 2018), JASP (JASP Team, 2021), and Jamovi (The jamovi project, 2021).
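To give a feel for what the prior scale implies, the following sketch evaluates a zero-centered Cauchy density with scale r = 1/√2 using only the Python standard library. It is an illustration of the distribution itself, not of baymedr's internals; the helper names are hypothetical.

```python
import math

DEFAULT_SCALE = 1.0 / math.sqrt(2.0)  # default prior scale r = 1/sqrt(2)


def cauchy_pdf(delta, scale=DEFAULT_SCALE):
    """Density of a Cauchy distribution centered on 0 with the given scale."""
    return 1.0 / (math.pi * scale * (1.0 + (delta / scale) ** 2))


def cauchy_cdf(delta, scale=DEFAULT_SCALE):
    """Cumulative distribution function of the zero-centered Cauchy."""
    return 0.5 + math.atan(delta / scale) / math.pi


def prior_mass_within(bound, scale=DEFAULT_SCALE):
    """Prior probability that |delta| < bound."""
    return cauchy_cdf(bound, scale) - cauchy_cdf(-bound, scale)


def normal_pdf(x):
    """Standard Normal density, for comparison of tail heaviness."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


# Half of the prior mass lies between -r and r, whatever the scale r is.
mass_default = prior_mass_within(DEFAULT_SCALE)
# The standard Cauchy has much heavier tails than the standard Normal.
tail_ratio = cauchy_pdf(3.0, scale=1.0) / normal_pdf(3.0)
```

Because half of the prior mass always lies within one scale unit of zero, a larger prior scale spreads plausibility over larger effect sizes, and a smaller one concentrates it near δ = 0.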
Fig 4 shows the output of the calculations. The top of the left column displays the same output that is given with the R package. Further, upon clicking on "Show frequentist results", the results of the frequentist non-inferiority test are shown and clicking on "Hide frequentist results" in turn hides those results. Below that output is the formula for the Bayes factor, with different elements printed in colors that correspond to dots in matching colors in the plots on the right column of the results output. The upper plot shows the prior and posterior for contrasting H0: δ = c with H1: δ < c. The two distributions are

p(H_i) represents the prior probability of H_i, p(D | H_i) denotes the likelihood of the data under H_i, p(D | H_0) p(H_0) + p(D | H_1) p(H_1) is the marginal likelihood (also called evidence; Kruschke, 2015), and p(H_i | D) is the posterior probability of H_i.
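The quantities p(H_i), p(D | H_i), the marginal likelihood, and p(H_i | D) are tied together by Bayes' rule. The sketch below works through that arithmetic for two hypotheses; the numbers are made up for illustration and the function names are hypothetical.

```python
def posterior_probabilities(likelihoods, priors):
    """Bayes' rule: p(H_i | D) = p(D | H_i) p(H_i) / sum_j p(D | H_j) p(H_j)."""
    joint = [lik * prior for lik, prior in zip(likelihoods, priors)]
    evidence = sum(joint)  # marginal likelihood of the data
    return [j / evidence for j in joint]


def posterior_prob_h0_from_bf(bf01, prior_h0=0.5):
    """Posterior probability of H0 from the Bayes factor BF01 and the prior p(H0)."""
    prior_odds = prior_h0 / (1.0 - prior_h0)
    posterior_odds = bf01 * prior_odds  # posterior odds = Bayes factor x prior odds
    return posterior_odds / (1.0 + posterior_odds)


# Illustrative values only: the data are four times more likely under H0 than under H1.
post_h0, post_h1 = posterior_probabilities([0.2, 0.05], [0.5, 0.5])
post_h0_via_bf = posterior_prob_h0_from_bf(bf01=0.2 / 0.05)
```

With equal prior probabilities, a Bayes factor of 4 in favor of H0 corresponds to a posterior probability of 0.8 for H0; both routes give the same answer because the Bayes factor is the ratio p(D | H_0) / p(D | H_1).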
Basner et al. (2019) p. 916). As outlined above, the calculation of Bayes factors for equivalence and superiority tests is done quite similarly to the non-inferiority test, so we do not provide specific examples for those tests. For the purpose of this demonstration, we will only consider the outcome variable sleepiness. Participants were monitored over a period of 14 days and were asked to indicate each morning how sleepy

The null hypothesis was that medical trainees in the flexible program are sleepier by more than a non-inferiority margin than trainees in the standard program. Conversely, the alternative hypothesis was that trainees in the flexible program are not sleepier by more than a non-inferiority margin than trainees in the standard program. The non-inferiority margin was defined as 1 point on the 9-point Likert scale. All relevant summary statistics can be obtained or calculated from Table 1 and the Results section of Basner et al. (2019). Table 1 indicates that the flexible program had a mean of M e = 4.
Karolinska sleepiness scale (Åkerstedt & Gillberg, 1990), a 9-point Likert scale ranging from 1 (extremely alert) to 9 (extremely sleepy, fighting sleep). The dependent variable consisted of the average sleepiness score over the whole observation period of 14 days. The research question was whether the flexible duty-hour program was non-inferior to the standard program in terms of sleepiness.