Comparison of Bayesian and frequentist group-sequential clinical trial designs

Background There is a growing interest in the use of Bayesian adaptive designs in late-phase clinical trials. This includes the use of stopping rules based on Bayesian analyses in which the frequentist type I error rate is controlled as in frequentist group-sequential designs.
Methods This paper presents a practical comparison of Bayesian and frequentist group-sequential tests. Focussing on the setting in which data can be summarised by normally distributed test statistics, we evaluate and compare boundary values and operating characteristics.
Results Although Bayesian and frequentist group-sequential approaches are based on fundamentally different paradigms, in a single arm trial or two-arm comparative trial with a prior distribution specified for the treatment difference, Bayesian and frequentist group-sequential tests can have identical stopping rules if particular critical values with which the posterior probability is compared or particular spending function values are chosen. If the Bayesian critical values at different looks are restricted to be equal, O'Brien and Fleming's design corresponds to a Bayesian design with an exceptionally informative negative prior, Pocock's design to a Bayesian design with a non-informative prior and frequentist designs with a linear alpha spending function are very similar to Bayesian designs with slightly informative priors. This contrasts with the setting of a comparative trial with independent prior distributions specified for treatment effects in different groups. In this case Bayesian and frequentist group-sequential tests cannot have the same stopping rule as the Bayesian stopping rule depends on the observed means in the two groups and not just on their difference. In this setting the Bayesian test can only be guaranteed to control the type I error for a specified range of values of the control group treatment effect.
Conclusions Comparison of frequentist and Bayesian designs can encourage careful thought about design parameters and help to ensure appropriate design choices are made.

rate. Hueber et al. [7] (see also [8] for additional statistical details) describe a Bayesian group-sequential trial comparing secukinumab with placebo for the treatment of Crohn's disease. The outcome is the change in Crohn's Disease Activity Index (CDAI), which was taken to be normally distributed. Prior distributions were specified separately for the placebo and secukinumab effects, with the former being informative and the latter non-informative. Analyses were planned after 30 and 60 patients, when the trial could be stopped if both (i) the posterior probability that secukinumab was superior to the placebo exceeded 95%, and (ii) there was at least a 50% posterior probability that the change in CDAI due to secukinumab was superior to that for placebo by at least fifty. The type I error rate for this design was calculated using the R package gsbDesign [9] and shown to be 1.2% if the change in CDAI due to placebo was as anticipated.
A Bayesian group-sequential trial with a binary primary outcome is described by Wilber et al. [10]. This randomised trial compared antiarrhythmic drug therapy with catheter ablation for the treatment of paroxysmal atrial fibrillation. The primary outcome was the observation of protocol-defined treatment failure. Analyses were planned after 150, 175, 200 and 230 patients, with a stopping rule based on the posterior probability of superiority of the experimental treatment over the control exceeding 98%, giving a type I error rate of 0.025.
The increasing use of Bayesian sequential designs that control the frequentist type I error rate has led to a growing body of work comparing Bayesian and frequentist group-sequential trial methods [3,5,8,11-14]. This paper adds to this work. In contrast to some authors who draw comparisons between underlying Bayesian and frequentist paradigms, our focus is a practical one, in which we compare Bayesian and frequentist group-sequential tests in terms of their boundary values and operating characteristics. We consider specifically the setting of normally distributed data or test statistics. This facilitates comparison between Bayesian and frequentist group-sequential methods as the latter have been largely developed in this setting.
We consider separately Bayesian designs in which a single treatment effect is considered, either in a single-arm trial or with a prior specified directly for the difference between experimental and control treatments, and designs in which treatment effects in the two groups have independent prior distributions. In the one-parameter setting frequentist and Bayesian group-sequential designs can be identical if sufficient flexibility in choice of design parameters is allowed [12], and we show that frequentist and Bayesian group-sequential designs may be very similar for common choices of stopping rules. In the two-parameter setting we show that the frequentist and Bayesian designs cannot correspond, and show that in this case the Bayesian group-sequential designs can only control the type I error rate for specified values of the control group treatment effect.

Notation and problem formulation

Single arm trials with normally distributed data
Suppose we conduct a group-sequential single-arm clinical trial of some experimental treatment with up to K analyses of a single sample of normally distributed data with a cumulative total of n k observations at look k, k = 1, . . . , K.
At each look the data observed up to that point will be analysed and a decision made whether or not to continue to the next look. We will only consider stopping the trial for a positive result, that is for efficacy. Additional stopping for futility is considered in the "Discussion" section.
Denoting by Y i the observed value for patient i, we will assume this is normally distributed with mean θ and known variance σ 2 . We wish to draw inference on θ and will assume that parameterisation is such that θ = 0 corresponds to the experimental treatment being of equal efficacy to some specified reference value or standard treatment effect, with positive values of θ (and hence of Y i ) indicative of superiority of the experimental treatment.
Let $\bar{Y}_k = \sum_{i=1}^{n_k} Y_i / n_k$ denote the mean value from the cumulative sample at look $k$. This is the sufficient statistic for $\theta$ at look $k$. It is helpful to write the distribution in terms of the inverse of the variance, known as the information, and set $I_k = n_k/\sigma^2$. We then have $\bar{Y}_1, \ldots, \bar{Y}_K$ multivariate normal with
$$\bar{Y}_k \sim N\left(\theta, I_k^{-1}\right), \qquad \mathrm{cov}\left(\bar{Y}_k, \bar{Y}_{k'}\right) = I_{k'}^{-1}, \quad 1 \le k \le k' \le K, \tag{1}$$
with a similar multivariate normal distribution for the standardised test statistics, $\bar{Y}_1\sqrt{I_1}, \ldots, \bar{Y}_K\sqrt{I_K}$. In a frequentist setting, we will test the null hypothesis, $H_0: \theta \le 0$, against the one-sided alternative, $\theta > 0$, concluding that the experimental treatment is superior to the standard if this null hypothesis is rejected. The test will be based on the observed values of $\bar{Y}_1, \ldots, \bar{Y}_K$, stopping and rejecting the null hypothesis at look $k$ if $\bar{Y}_k$ is sufficiently large as described in more detail below.
In a Bayesian setting, inference will be based on the posterior distribution for $\theta$ given the observed data. Basing the likelihood on (1), a normal prior for $\theta$ is conjugate. Given prior distribution $\theta \sim N\left(\theta_0, I_0^{-1}\right)$, the posterior distribution for $\theta$ following observation of $\bar{Y}_k = \bar{y}_k$ at look $k$ is given by
$$\theta \mid \bar{y}_k \sim N\left(\frac{\theta_0 I_0 + \bar{y}_k I_k}{I_0 + I_k}, \frac{1}{I_0 + I_k}\right) \tag{2}$$
(see [15] Section 5.2). If this posterior distribution is sufficiently indicative of a positive treatment effect the trial will be stopped with the conclusion that the experimental treatment is superior to the standard or reference value. More details are given below. The value of $I_0$ gives a measure of the prior information. In particular, letting $I_0$ approach 0 gives a flat improper normal prior.
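As a minimal sketch of the conjugate update just described, the following Python functions compute the posterior mean and variance of θ and the posterior probability that θ exceeds 0. The function names and the numerical inputs (an observed mean of 0.9 with information 2) are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

def posterior(theta0, I0, ybar_k, I_k):
    """Conjugate normal update: (mean, variance) of theta given ybar_k."""
    precision = I0 + I_k                      # posterior information
    mean = (theta0 * I0 + ybar_k * I_k) / precision
    return mean, 1.0 / precision

def prob_positive(theta0, I0, ybar_k, I_k):
    """Posterior probability that theta > 0."""
    mean, var = posterior(theta0, I0, ybar_k, I_k)
    return 1.0 - NormalDist(mean, var ** 0.5).cdf(0.0)

# Hypothetical data: observed mean 0.9 at a look with information I_k = 2.
flat = posterior(0.0, 0.0, 0.9, 2.0)        # I0 = 0: posterior centred on the data
sceptical = posterior(0.0, 1.0, 0.9, 2.0)   # prior centred on 0 shrinks the mean
print(flat, sceptical, prob_positive(0.0, 0.0, 0.9, 2.0))
```

With $I_0 = 0$ the posterior mean equals the observed mean, while a prior centred at 0 with $I_0 = 1$ pulls it towards 0 and reduces the posterior variance.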

Single arm trials with non-normal data
For non-normal data, tests can be based on the assumed distributional form parameterised in terms of the treatment effect, which will again be denoted by θ. An analytic form of the posterior distribution may be available if a conjugate prior distribution is used.
Alternatively, in many cases, if $n_1, \ldots, n_K$ are sufficiently large, we can obtain an estimate $\hat{\theta}_k$ for the treatment effect based on the data at look $k$, with $\hat{\theta}_1, \ldots, \hat{\theta}_K$ approximately following the multivariate normal distribution (1) for some $I_1, \ldots, I_K$. It is common to use this approximate distributional form in a frequentist group-sequential test [16], enabling use of these estimates in place of the single sample means and applying methods based on the normal distribution (1) even without normally distributed data, or with normal data when the variance cannot be assumed to be known.
An illustration in the setting of a single sample of binomial data is given below.

Comparative trials
Suppose now we have two groups; group 0, the control group and group 1, the experimental treatment group. Let Y ji denote the response from patient i in group j, assumed to be normally distributed with known variance, with Y ji ∼ N μ j , σ 2 j , j = 0, 1. We wish to draw inference on the treatment difference given by θ = μ 1 − μ 0 . We will again assume larger values of Y ji are preferable so that larger values of θ correspond to the superiority of the experimental treatment to the control treatment.
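In a frequentist setting, we will test $H_0: \theta \le 0$ against $\theta > 0$ based on the observed values of $D_1, \ldots, D_K$, where $D_k = \bar{Y}_{1k} - \bar{Y}_{0k}$ denotes the difference between the sample means in the two groups at look $k$ and $I_k$ now denotes the inverse of the variance of $D_k$, so that $D_1, \ldots, D_K$ have a joint distribution of the same form as (1). We stop and reject the null hypothesis at look $k$, concluding that the experimental treatment is superior to the control, if $D_k$ is sufficiently large, as described in more detail below.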
In a Bayesian setting, we may specify the prior distribution for the treatment effect in two ways. The first is to specify a prior distribution for the treatment difference, $\theta$, directly. Suppose again that $\theta$ has a normal prior distribution with $\theta \sim N\left(\theta_0, I_0^{-1}\right)$. At look $k$ the posterior distribution for $\theta$ given observed value $D_k = d_k$ is given by
$$\theta \mid d_k \sim N\left(\frac{\theta_0 I_0 + d_k I_k}{I_0 + I_k}, \frac{1}{I_0 + I_k}\right). \tag{3}$$
The alternative is to specify independent prior distributions for $\mu_0$ and $\mu_1$, update these separately to obtain posterior distributions for $\mu_0$ and $\mu_1$ and then use these to obtain a posterior distribution for $\theta$. This approach is considered in detail below in the section entitled "Comparison of frequentist and Bayesian group-sequential approaches -two parameter case".
For non-normal data, or when the variance cannot be assumed known, we often again have estimates of the treatment effect, $\hat{\theta}_k$, approximately normally distributed, so that the distributional form (1) can be used. As in the two-sample case with normally distributed data, in the Bayesian setting we can either specify a prior for $\theta$ directly or specify independent prior distributions for treatment effects in the two groups.

Bayesian group-sequential approach
In a Bayesian sequential trial, inference at look k will be based on the posterior distribution for θ given in the single group case by (2), in the two sample case when a prior distribution is specified for θ directly by (3) and in the two sample case when prior distributions are given for μ 0 and μ 1 by the expression (10) given below.
A common approach is to stop the trial, concluding that the experimental treatment is superior to the control, if the posterior probability that $\theta$ exceeds 0 given the observed data is sufficiently large. In detail, critical values, $p_k$, $k = 1, \ldots, K$, will be specified and the trial will stop as soon as
$$\Pr(\theta > 0 \mid \text{data at look } k) \ge p_k. \tag{4}$$
Considering stopping to conclude the experimental treatment is superior to the control to be equivalent to rejection of $H_0$, the frequentist type I error rate of this Bayesian sequential procedure can be calculated by noting that $\Pr(\theta > 0 \mid \text{data at look } k)$ is a random variable since it depends on the observed data. Control of the type I error rate is thus achieved if
$$\Pr\bigl(\Pr(\theta > 0 \mid \text{data at look } k) \ge p_k \text{ for some } k = 1, \ldots, K;\ \theta = 0\bigr) \le \alpha. \tag{5}$$
It has been suggested that $p_1, \ldots, p_K$ should be chosen to satisfy this condition [2].
A number of alternatives to the stopping criterion (4) above have also been proposed. For example, the trial might be stopped to declare the experimental treatment superior at look k if the posterior probability that θ exceeds some specified positive target value, or the predictive probability that the experimental treatment would be found superior if the trial continued to the final analysis, is sufficiently large [8,17,18].
Although, in general, different values for p 1 , . . . , p K could be specified, often a common value p 1 = · · · = p K is used [2], with this value chosen to satisfy (5). We will consider both the general and this specific case in the examples below.
In many settings the probability on the left hand side of (5) can most easily be calculated via simulation methods [2]. In the case of single- or two-sample normally distributed data considered here, since, for a specified prior distribution, the posterior probability in (4) depends on the data only through $\bar{Y}_k$, it can be calculated analytically from the joint distribution (1), for example in R using the gsbDesign package [9] or code available from the first author.
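The simulation route can be sketched in a few lines of Python. This is a Monte Carlo stand-in for the analytic calculation, not the gsbDesign implementation: it simulates the score statistic $\bar{Y}_k I_k$ under θ = 0 as a sum of independent normal increments and applies the posterior probability rule with a common critical value p. The function name, seed and number of simulations are assumptions made for illustration; the value 2.41 is the critical value quoted later in the text for the non-informative prior.

```python
import random
from statistics import NormalDist

def bayes_type1_error(p, theta0, I0, info, n_sims=100_000, seed=1):
    """Monte Carlo estimate of Pr(stop at some look; theta = 0)."""
    rng = random.Random(seed)
    z_p = NormalDist().inv_cdf(p)
    rejections = 0
    for _ in range(n_sims):
        score, prev = 0.0, 0.0   # score = ybar_k * I_k, built from independent increments
        for I_k in info:
            score += rng.gauss(0.0, (I_k - prev) ** 0.5)
            prev = I_k
            # posterior mean divided by posterior sd under the conjugate update
            z_post = (theta0 * I0 + score) / (I0 + I_k) ** 0.5
            if z_post >= z_p:    # equivalent to Pr(theta > 0 | data) >= p
                rejections += 1
                break
    return rejections / n_sims

info = [2, 4, 6, 8, 10]
p_flat = NormalDist().cdf(2.41)          # posterior probability threshold for u = 2.41
err_flat = bayes_type1_error(p_flat, 0.0, 0.0, info)
err_sceptical = bayes_type1_error(p_flat, -0.25, 20.0, info)
print(err_flat, err_sceptical)
```

For a fixed p, a more informative negative prior makes stopping harder and so lowers the estimated type I error rate; in practice p would be tuned separately for each prior until (5) holds.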

Frequentist group-sequential approaches
In a frequentist setting, the null hypothesis, $H_0: \theta \le 0$, will be rejected, and the trial stopped at look $k$, if $\bar{Y}_k\sqrt{I_k} \ge u_k$ for some $u_k$ in the single-sample case or if $D_k\sqrt{I_k} \ge u_k$ in the two-sample case. As the forms of the joint distributions for $\bar{Y}_1, \ldots, \bar{Y}_K$ and $D_1, \ldots, D_K$ are identical, we will here consider only the single-sample case.
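To control the type I error rate at some specified level $\alpha$, it is required to choose $u_1, \ldots, u_K$ such that
$$\Pr\bigl(\bar{Y}_k\sqrt{I_k} \ge u_k \text{ for some } k = 1, \ldots, K;\ \theta = 0\bigr) \le \alpha. \tag{6}$$
As the requirement (6) is insufficient to specify $u_1, \ldots, u_K$, a number of approaches have been proposed as described in the next two subsections.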

Pocock's test and O'Brien and Fleming's test
Pocock [19] and O'Brien and Fleming [20] propose methods with equally-spaced looks, that is, using the notation introduced above, with $I_k = kI_K/K$, $k = 1, \ldots, K$. O'Brien and Fleming suggest stopping if $\bar{Y}_k I_k$ exceeds some fixed value, that is taking $u_k = c/\sqrt{I_k}$. Pocock suggests stopping if the standardised difference $\bar{Y}_k I_k^{1/2}$ exceeds a fixed value, that is taking $u_k = c$. In each case, the constant value for $c$ is found so as to satisfy (6). These values are tabulated for certain $K$ and $\alpha$ [19,20], or can be obtained from a numerical search, noting that the probability in (6) can be expressed in terms of the multivariate normal distribution function, which may be evaluated numerically, for example in R using the function pmvnorm in the mvtnorm package [21].
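To make the two boundary shapes concrete, the sketch below evaluates them on the standardised scale for five equally spaced looks. The constants 2.413 (Pocock) and 2.040 (O'Brien and Fleming, final-look value) are assumptions taken from standard published tables for K = 5 and a one-sided level of 0.025; they are not derived here, and the information levels are illustrative.

```python
K = 5
info = [2 * k for k in range(1, K + 1)]   # I_k = k * I_K / K with I_K = 10 (illustrative)

C_POCOCK = 2.413   # assumed tabulated constant, K = 5, one-sided alpha = 0.025
C_OBF = 2.040      # assumed tabulated final-look critical value for the same K and alpha

# Pocock: the same critical value for the standardised statistic at every look.
pocock = [C_POCOCK for _ in info]

# O'Brien and Fleming: u_k = c / sqrt(I_k); taking c = C_OBF * sqrt(I_K)
# makes the final-look value equal to C_OBF.
obf = [C_OBF * (info[-1] / I_k) ** 0.5 for I_k in info]

for k, (u_p, u_o) in enumerate(zip(pocock, obf), start=1):
    print(f"look {k}: Pocock {u_p:.3f}, O'Brien-Fleming {u_o:.3f}")
```

The O'Brien and Fleming values fall steeply from the first look to the last, which is why early stopping is so unlikely under that design, whereas the Pocock values are flat.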

Spending function approaches
Slud and Wei [22] suggest introducing greater flexibility to sequential designs that satisfy (6) by specifying the type I error rate "spent" at each look. In detail, they specify $\alpha_1 \le \cdots \le \alpha_K = \alpha$, then obtain $u_k$, $k = 1, \ldots, K$, such that the probability under the null hypothesis of stopping at or before look $k$, say at some look $k'$ with $k' \le k$, is equal to $\alpha_k$, that is
$$\Pr\bigl(\bar{Y}_{k'}\sqrt{I_{k'}} \ge u_{k'} \text{ for some } k' \le k;\ \theta = 0\bigr) = \alpha_k. \tag{7}$$
This approach was extended by Lan and DeMets [23], who proposed that $\alpha_1, \ldots, \alpha_K$ be given by a function $\alpha^*(t)$ of the information time, with $t$ at look $k$ equal to $I_k/I_K$. For general choice of non-decreasing $\alpha^*$ with $\alpha^*(0) = 0$ and $\alpha^*(1) = \alpha$, the approaches of Slud and Wei and Lan and DeMets are equivalent provided $I_1, \ldots, I_K$ are specified in advance. By defining the functional form of $\alpha^*$, the Lan and DeMets approach enables calculation of $u_1, \ldots, u_K$ to satisfy (6) when $I_1, \ldots, I_K$ are not given in advance, providing they are independent of $\bar{Y}_1, \ldots, \bar{Y}_K$.
Lan and DeMets give forms for the spending function $\alpha^*(t)$ corresponding approximately to the Pocock test, with $\alpha^*(t) = \alpha \log(1 + (e - 1)t)$, and the O'Brien and Fleming test, with $\alpha^*(t) = 2\left(1 - \Phi\left(z_{\alpha/2}/\sqrt{t}\right)\right)$, where $\Phi$ denotes the distribution function for a standard normal and $z_\alpha$ denotes $\Phi^{-1}(1 - \alpha)$, the upper $100\alpha$ percentile of the standard normal distribution. Exact spending functions for these tests for a given number of looks can be obtained numerically from the joint distribution (1) [24]. Alternative spending function forms have been suggested [1,25], including as a special case the linear spending function $\alpha^*(t) = \alpha t$.
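The three spending functions can be evaluated directly, as in the sketch below. The O'Brien and Fleming type function is written in the Lan and DeMets form with $z_{\alpha/2} = \Phi^{-1}(1 - \alpha/2)$, which is the choice that gives $\alpha^*(1) = \alpha$; the function names and the level 0.025 with five equally spaced looks are assumptions used only for illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def spend_pocock_type(t, alpha):
    """Lan-DeMets spending function approximating the Pocock test."""
    return alpha * math.log(1.0 + (math.e - 1.0) * t)

def spend_obf_type(t, alpha):
    """Lan-DeMets spending function approximating the O'Brien-Fleming test."""
    z = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return 2.0 * (1.0 - STD_NORMAL.cdf(z / math.sqrt(t)))

def spend_linear(t, alpha):
    """Linear spending function alpha * t."""
    return alpha * t

alpha = 0.025
times = [k / 5 for k in range(1, 6)]      # information times t = I_k / I_K
spent = {f.__name__: [f(t, alpha) for t in times]
         for f in (spend_pocock_type, spend_obf_type, spend_linear)}
for name, values in spent.items():
    print(name, [round(v, 5) for v in values])
```

All three functions reach α at t = 1 but distribute it very differently: the O'Brien and Fleming type function spends almost nothing at the first look, while the Pocock type function spends more than the linear function early on.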
The stopping boundary values $u_1, \ldots, u_K$ may be computed recursively [1]; at look $k$, supposing $u_1, \ldots, u_{k-1}$ and $I_1, \ldots, I_k$ are known, we can use the joint distribution of $\bar{Y}_1, \ldots, \bar{Y}_k$ for $\theta = 0$ from (1) along with a numerical search to find $u_k$ to satisfy (7). These calculations can be performed in R using the function gsBound in the gsDesign package [26] or code available from the first author.
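The recursive idea can be imitated by simulation, as in the following sketch. It is a Monte Carlo substitute for the numerical integration used by packages such as gsDesign, not a reproduction of gsBound: paths of the score statistic are simulated under θ = 0, and at each look the boundary is set so that the required additional fraction of all paths stops. The function name, seed and number of simulations are assumptions.

```python
import random

def spending_boundaries(alphas, info, n_sims=200_000, seed=2):
    """Approximate u_1..u_K with Pr(stop at or before look k; theta = 0) = alphas[k-1]."""
    rng = random.Random(seed)
    scores = [0.0] * n_sims        # score statistics of paths that have not yet stopped
    bounds, prev_I, spent = [], 0.0, 0.0
    for alpha_k, I_k in zip(alphas, info):
        sd = (I_k - prev_I) ** 0.5
        scores = [s + rng.gauss(0.0, sd) for s in scores]   # independent increments
        n_stop = round((alpha_k - spent) * n_sims)          # paths that should stop now
        scores.sort()
        if n_stop > 0:
            u_k = scores[-n_stop] / I_k ** 0.5              # smallest stopped standardised value
            scores = scores[:-n_stop]                       # keep only continuing paths
        else:
            u_k = float("inf")                              # nothing spent at this look
        bounds.append(u_k)
        prev_I, spent = I_k, alpha_k
    return bounds

# Linear spending, alpha*(t) = 0.025 t, at five equally spaced looks.
linear_alphas = [0.005, 0.010, 0.015, 0.020, 0.025]
bounds = spending_boundaries(linear_alphas, [2, 4, 6, 8, 10])
print([round(u, 2) for u in bounds])
```

The first boundary should be close to the upper 0.5% point of the standard normal distribution, since no earlier look has removed any paths; later values depend on which paths have already stopped, which is the dependence the recursive calculation handles exactly.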

Examples
To compare the Bayesian and frequentist group-sequential methods, we illustrate the two approaches using three simplified examples. These are described below.

Example 1: Single-arm trial with normally distributed data
Consider a single-arm trial with the outcome for patient $i$ equal to $Y_i$ with $Y_i \sim N(\theta, \sigma^2)$ for some known $\sigma$. Suppose that $\theta = 0$ corresponds to a null value and $\theta = 1$ to a worthwhile treatment effect. We will assume that the trial is conducted in up to five stages, that is $K = 5$, with these of equal size so that the number of patients included in the first $k$ stages is $n_k = k n_K/K$. We will further assume that $n_K = 10\sigma^2$. With this sample size a fixed sample size trial with a hypothesis test conducted at a two-sided 5% level would have power of approximately 90% to detect $\theta = 1$. This gives $I_1, \ldots, I_5 = 2, \ldots, 10$.
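The arithmetic behind these design values can be checked with a short calculation. The sketch below assumes the usual fixed-sample power formula for a normal mean, $1 - \Phi(z_{0.975} - \theta\sqrt{I_K})$, ignoring the negligible chance of rejecting in the wrong direction; variable names are illustrative.

```python
from statistics import NormalDist

K = 5
I_K = 10.0          # n_K / sigma^2 with n_K = 10 * sigma^2
theta_alt = 1.0     # worthwhile treatment effect

# Information at each of the K equally sized stages: I_k = k * I_K / K.
info = [k * I_K / K for k in range(1, K + 1)]

# Power of a fixed-sample test at the two-sided 5% level.
z_crit = NormalDist().inv_cdf(0.975)
power = 1.0 - NormalDist().cdf(z_crit - theta_alt * I_K ** 0.5)
print(info, round(power, 3))
```

The power evaluates to a little under 0.89, consistent with the "approximately 90%" stated in the text.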
We will consider a range of prior distributions for θ. We will take I 0 equal to 0 (non-informative), 0.5 and 1 (that is with weight equivalent to one twentieth and one tenth of the total information available from the trial) as well as a very informative prior distribution with I 0 = 20, and will take θ 0 equal to −0.25, 0, 0.25 and 0.5, recalling that 0 and 1 correspond to null and worthwhile treatment effects. Density functions for the range of prior distributions considered are shown in Fig. 1. The prior mean, θ 0 , increases across the columns moving from left to right and the prior information, I 0 , decreases as we move down the rows. The vertical lines correspond to the null and worthwhile treatment effects of 0 and 1. Only one plot is given in the lowest row as when I 0 = 0 the prior distribution does not depend on θ 0 .

Example 2: Single-arm trial with binary data
Consider, as a second example, a single-arm trial with a binary outcome corresponding to success or failure for each patient. Suppose that the trial has up to four looks with 25, 50, 75 and 100 patients and assume that we wish to determine whether the true success rate, which will be denoted by π, exceeds a control rate, π 0 , assumed to be 0.5, using a non-informative prior distribution for π.

Example 3: Two-arm trial with normally distributed data
The third example is a two-arm trial with up to five equally-sized stages with the outcome for patient $i$ in group $j$ ($j = 0, 1$) equal to $Y_{ji}$ with $Y_{ji} \sim N\left(\mu_j, \sigma_j^2\right)$ for some known $\sigma_j$, where we assume $\sigma_1 = \sigma_0 = \sigma$.
Denoting the treatment difference $\mu_1 - \mu_0$ by $\theta$, we will, as in Example 1 above, assume that $\theta = 1$ represents a worthwhile treatment effect. Assuming at stage $k$ we have included a total of $n_k$ patients in each of the two trial arms, we will set $I_k = n_k/(2\sigma^2)$ and, again as in Example 1, take $I_1, \ldots, I_5 = 2, \ldots, 10$.
Suppose that μ 1 and μ 0 have independent normal prior distributions with μ j ∼ N μ j0 , I −1 j0 , with a moderately informative prior distribution for μ 0 with μ 00 = 0 and I 00 = 0.5, and a noninformative prior distribution for μ 1 with I 10 = 0. The treatment difference θ thus has a noninformative prior distribution with I 0 = 0.

Comparison of frequentist and Bayesian group-sequential approaches -single parameter case
In this section we consider the setting in which we either have a single sample or are comparing two groups but specify a prior distribution for the treatment effect, $\theta$, directly rather than giving separate prior distributions for $\mu_1$ and $\mu_0$. As noted above, in this case the two-sample setting is essentially identical to the single-sample setting, so that we will consider only the latter specifically.
Suppose that the maximum number of looks, K, the information at these looks, I 1 , . . . , I K and, for the Bayesian design, the prior distribution parameters, θ 0 and I 0 are specified.
The posterior distribution for $\theta$ at look $k$ in this case is given by (2) so that the posterior probability that $\theta$ exceeds 0 is given by
$$\Pr(\theta > 0 \mid \bar{y}_k) = 1 - \Phi\left(-\frac{\theta_0 I_0 + \bar{y}_k I_k}{\sqrt{I_0 + I_k}}\right). \tag{8}$$
Given some choice of $p_1, \ldots, p_K$, for the Bayesian design using stopping criterion (4), expression (8) means that the trial will be stopped at look $k$ if
$$\bar{y}_k\sqrt{I_k} \ge u_k^B = -\frac{\Phi^{-1}(1 - p_k)\sqrt{I_0 + I_k} + \theta_0 I_0}{\sqrt{I_k}}, \tag{9}$$
so that the Bayesian trial, like the frequentist one, will stop whenever $\bar{Y}_k$, or equivalently the standardised $\bar{Y}_k\sqrt{I_k}$, is sufficiently large.

Sequential tests with general $\alpha_1, \ldots, \alpha_K$ or $p_1, \ldots, p_K$
With $u_k^B$ as given by (9), let $\alpha_k^B = \Pr\bigl(\bar{Y}_{k'}\sqrt{I_{k'}} \ge u_{k'}^B \text{ for some } k' \le k;\ \theta = 0\bigr)$. This may be calculated from the multivariate normal distribution of $\bar{Y}_1\sqrt{I_1}, \ldots, \bar{Y}_k\sqrt{I_k}$ following from (1). Setting $k = K$ enables analytic calculation of the frequentist type I error rate for the Bayesian test.
Setting $\alpha_k = \alpha_k^B$ and constructing a frequentist design using these $\alpha_1, \ldots, \alpha_K$ values will give a frequentist group-sequential boundary identical to the Bayesian one.
Thus, as noted by Emerson et al. [12], if we allow full flexibility over the choice of p 1 , . . . , p K for the Bayesian group-sequential design and α 1 , . . . , α K for the frequentist design, subject respectively to the constraint on overall type I error rate (5) or (6), the classes of frequentist group sequential and Bayesian designs are identical.
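The mapping from Bayesian critical values to boundaries on the standardised scale can be sketched as follows, assuming the normal posterior with prior mean $\theta_0$ and prior information $I_0$ and a common critical value p, so that $u_k^B = -\{\Phi^{-1}(1 - p)\sqrt{I_0 + I_k} + \theta_0 I_0\}/\sqrt{I_k}$. The function name and the value p = 0.99 are hypothetical and used only to show the shape of the boundary.

```python
from statistics import NormalDist

def bayes_boundary(p, theta0, I0, info):
    """Standardised stopping boundaries implied by Pr(theta > 0 | data) >= p."""
    z = NormalDist().inv_cdf(1.0 - p)     # Phi^{-1}(1 - p), negative for p > 0.5
    return [-(z * (I0 + I_k) ** 0.5 + theta0 * I0) / I_k ** 0.5 for I_k in info]

info = [2, 4, 6, 8, 10]
flat = bayes_boundary(0.99, 0.0, 0.0, info)          # non-informative prior
sceptical = bayes_boundary(0.99, -0.25, 20.0, info)  # strongly informative negative prior
print([round(u, 3) for u in flat])
print([round(u, 3) for u in sceptical])
```

With $I_0 = 0$ the boundary is constant across looks, whereas the informative negative prior gives a boundary that is very high at the first look and falls as information accumulates.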
Similarly, if Bayesian sequential boundaries are constructed using the posterior probability that $\theta$ exceeds a positive target value or the posterior predictive probability of a final positive result, the fact that both of these are monotonically increasing in $\bar{Y}_k$ means that the stopping boundaries are again of the form $\bar{Y}_k\sqrt{I_k} \ge u_k^B$ for some $u_1^B, \ldots, u_K^B$, so that these still correspond to a frequentist boundary for appropriate choice of $\alpha_1, \ldots, \alpha_K$ and vice versa [12]. The same result holds for sequential tests based on Bayes factors provided these are constructed so as to be monotonically increasing in $\bar{Y}_k$, as is the case, for example, when a point null at $\theta = 0$ is compared to a 'one-sided' prior with support for positive $\theta$ only.

Specific group-sequential tests: Single-arm trial with normally distributed data
Although in principle, p 1 , . . . , p K and α 1 , . . . , α K may be chosen arbitrarily, in practice, constraints may be put on the values used. In this case frequentist and Bayesian group sequential tests may not correspond. In this section we construct frequentist group-sequential designs with a linear alpha spending function and with alpha spending functions corresponding to the Pocock design and the O'Brien and Fleming design, comparing these with Bayesian tests with stopping criteria given by (4) with p 1 = · · · = p K .
Consider Example 1 above with the range of prior distributions illustrated in Fig. 1. In each case we used stopping criterion (4) and took p 1 = · · · = p K , finding the common value to give overall type I error rate of α = 0.025. Figure 2 shows critical values, u B 1 , . . . , u B 5 , (plotted as circles) for the Bayesian tests with different prior distributions. Each plot corresponds to a different prior distribution, the layout of plots in the figure matching those in Fig. 1. Note that a different scale is used for the plots in the uppermost row. Using a similar format, Fig. 3 shows the cumulative type I error spent by each look for the tests shown in Fig. 2. Critical values and cumulative type I error spent are also given in Table 1.
It can be seen that more informative or more negative priors lead to a smaller chance of stopping at earlier interim analyses; this makes sense as more information is required to overcome the prior and obtain a posterior probability $\Pr(\theta > 0 \mid \bar{y}_k) \ge p_k$. Other than for the most informative priors considered, it appears that the choice of $\theta_0$ has relatively little impact; in these cases the value of $I_0$ is small relative to $I_K$ so that the prior distribution makes relatively little contribution to the posterior distribution and hence to the stopping decision. Figures 2 and 3 and Table 1 also show stopping boundaries and type I error spending functions for O'Brien and Fleming's test, Pocock's test and the frequentist test with a linear spending function, that is with $\alpha^*(t) = \alpha t$, for five equally-spaced analyses. Boundary values and type I error spent at each look for the different tests (omitting those with $I_0 = 20$ and $\theta_0 > -0.25$) are also given in Table 1, together with the value of $p_1 = \cdots = p_K$ required to give overall type I error rate of 0.025 for the Bayesian designs.
It can be seen that stopping boundaries and type I error spent for the O'Brien and Fleming test are nearly identical to those for the Bayesian test with prior distribution with θ 0 = −0.25 and I 0 = 20. In this case the form of the stopping boundary, with stopping very unlikely at interim analyses but relatively likely at the final analysis, is only achieved if very strong negative prior opinion is held. This prior distribution was included specifically because of this similarity; it is hard to imagine anyone conducting a trial if they had such a strongly negative prior opinion of the effect of the treatment under investigation.
The similarity between Pocock's test and the Bayesian test with a non-informative prior distribution for $\theta$ can also be noted. For a non-informative prior, that is with $I_0 = 0$, (9) gives $u_k^B = -\Phi^{-1}(1 - p_k)$ so that taking $p_1 = \cdots = p_K$ corresponds to taking $u_1^B = \cdots = u_K^B$. Thus in this case the Bayesian test with $p_k$ chosen to control the overall error rate is identical to Pocock's test when looks are equally spaced in terms of information.
For moderately informative prior distributions, that is for I 0 equal to 0.5 or 1, the Bayesian test appears to be similar to the frequentist test with α * (t) = αt for the reasonably wide range of θ 0 values considered.

Specific group-sequential tests: Single-arm trial with binary data
Consider next Example 2 above. In this case a Bayesian sequential test can be based on the exact binomial distribution of the data. In detail, denoting by X k the number of successes observed from the n k patients observed up to look k, k = 1, . . . , 4, we can take X k ∼ Bin(n k , π). A beta prior distribution is conjugate and a non-informative prior is π ∼ Beta(1, 1), or equivalently π ∼ U[ 0, 1]. The posterior distribution at look k after observing X k = x k is then π | x k , n k ∼ Beta(x k + 1, n k − x k + 1).
To be consistent with the notation above, where θ denotes the treatment effect with θ = 0 corresponding to the null hypothesis, we can take θ = π − π 0 . The trial will stop to claim that θ > 0, or equivalently, π > π 0 , if the posterior probability Pr(π > π 0 | x k , n k ) ≥ p k for some p k .
Taking $p_1 = \cdots = p_K$, for a given value of $p_1$, critical values in terms of the required number of successes at each look can be found by calculating this posterior probability for a range of possible $x_k$ values. These in turn can be used to calculate the resulting frequentist type I error rate under the null hypothesis $H_0: \theta = 0$, or equivalently in this case, $\pi = \pi_0 = 0.5$, either by simulation or by calculation and summation of the appropriate binomial probabilities. A numerical search can then be used to find the value of $p_1$ at which the type I error rate is controlled at a specified level.
For a four-look test with a non-informative Beta(1, 1) prior distribution for π, the type I error rate is controlled at level 0.05 for p 1 = · · · = p 4 = 0.977. The critical values for the test in terms of the total number of successes observed at looks 1 to 4 are then respectively 18, 33, 47 and 61.
A frequentist group-sequential analysis can be based on the normal approximation (1) for $\hat{\theta}_k = X_k/n_k - \pi_0$, with $I_k = n_k/\{\pi_0(1 - \pi_0)\} = 4n_k$ under the null hypothesis. A four-look frequentist group-sequential Pocock test constructed based on this approximation would stop for $\hat{\theta}_k\sqrt{I_k} \ge u_k$ with $u_k = 2.067$, that is for $X_k \ge 0.5n_k + 2.067\sqrt{n_k}/2$, giving stopping boundaries in terms of $X_k$ for $n_k = 25$, 50, 75 and 100 of 17.7, 32.3, 46.5 and 60.3. Rounding these up to integers gives stopping boundary values identical to those for the Bayesian test with a non-informative prior distribution.
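Both sets of critical values can be recomputed with the standard library alone, as sketched below. The Bayesian calculation uses the identity $\Pr(\text{Beta}(x+1, n-x+1) > \pi_0) = \Pr(\text{Bin}(n+1, \pi_0) \le x)$, which holds for integer x and avoids the need for a beta distribution routine; the function names are hypothetical. The text reports 18, 33, 47 and 61 for the Bayesian test with the rounded value p = 0.977.

```python
from math import ceil, comb

def post_prob_exceeds(x, n, pi0):
    """Pr(pi > pi0 | x successes in n trials) under a Beta(1, 1) prior.

    Uses Pr(Beta(x + 1, n - x + 1) > pi0) = Pr(Bin(n + 1, pi0) <= x).
    """
    return sum(comb(n + 1, j) * pi0 ** j * (1 - pi0) ** (n + 1 - j)
               for j in range(x + 1))

def critical_successes(n, pi0, p):
    """Smallest number of successes with posterior probability at least p."""
    return next(x for x in range(n + 1) if post_prob_exceeds(x, n, pi0) >= p)

looks = (25, 50, 75, 100)
bayes_crit = [critical_successes(n, 0.5, 0.977) for n in looks]

# Pocock boundary from the normal approximation: X_k >= 0.5 n_k + 2.067 sqrt(n_k) / 2.
pocock_raw = [0.5 * n + 2.067 * n ** 0.5 / 2 for n in looks]
pocock_crit = [ceil(b) for b in pocock_raw]

print(bayes_crit)
print([round(b, 1) for b in pocock_raw], pocock_crit)
```

Because the outcome is discrete, a range of p values gives the same critical numbers of successes, so small changes in the rounded value of p need not alter the stopping rule.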

Specific group-sequential tests: Two-arm trial with normally distributed data
We next consider Example 3 above, using only the prior information given by the prior distribution for the treatment difference θ, that is the non-informative prior distribution with I 0 = 0.
The observed differences between the treatment means at looks 1 to $K$, $D_1, \ldots, D_K$, follow a multivariate normal distribution of the same form as that of the mean values $\bar{Y}_1, \ldots, \bar{Y}_K$ in the single-group case, with $I_k$ now taken to be $n_k/(2\sigma^2)$. Setting $p_1 = \cdots = p_K$ and taking this value so as to control the overall type I error rate to be 0.025 thus gives critical values, $u_k$, now for $D_k\sqrt{I_k}$, equal to 2.41 at all looks, exactly as in the single-arm case with a non-informative prior distribution for $\theta$.

Comparison of frequentist and Bayesian group-sequential approaches -two parameter case
Consider now the setting in which we are comparing two groups of normally distributed data and, in the Bayesian setting, specify separate independent normal prior distributions for μ 1 and μ 0 .
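For reference, the form of the posterior distribution in this setting follows from applying the conjugate normal update to each group separately. Writing $\bar{y}_{jk}$ for the observed mean in group $j$ at look $k$ and $I_{jk}$ for the corresponding information (the inverse of the variance of the group $j$ sample mean, a notation assumed here), independent priors $\mu_j \sim N(\mu_{j0}, I_{j0}^{-1})$ give independent posteriors for $\mu_1$ and $\mu_0$, so that the posterior for $\theta = \mu_1 - \mu_0$ would be:

```latex
\theta \mid \bar{y}_{1k}, \bar{y}_{0k} \sim
N\!\left(
  \frac{\mu_{10} I_{10} + \bar{y}_{1k} I_{1k}}{I_{10} + I_{1k}}
  - \frac{\mu_{00} I_{00} + \bar{y}_{0k} I_{0k}}{I_{00} + I_{0k}},\;
  \frac{1}{I_{10} + I_{1k}} + \frac{1}{I_{00} + I_{0k}}
\right)
```

The two sample means enter the posterior mean with weights $I_{1k}/(I_{10} + I_{1k})$ and $I_{0k}/(I_{00} + I_{0k})$. Only when these weights are equal does the posterior mean depend on the data through the difference $\bar{y}_{1k} - \bar{y}_{0k}$ alone.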
It is shown in Appendix A that the posterior variance of $\theta$ when separate prior distributions are given for $\mu_1$ and $\mu_0$, given by (10), is always smaller than that given by (3) when only the prior distribution for $\theta$ is used. With independent prior distributions for $\mu_1$ and $\mu_0$, the posterior distribution depends on $\bar{y}_{1k}$ and $\bar{y}_{0k}$, and not just on the difference $d_k = \bar{y}_{1k} - \bar{y}_{0k}$. Assuming $\mu_1$ and $\mu_0$ are independent means that $\theta$ is not independent of $\mu_1 + \mu_0$. Thus although $D_k$ is sufficient for $\theta$, we can also learn about $\theta$ by learning about $\mu_1 + \mu_0$, for which $D_k$ is not sufficient. We therefore gain information by knowing $\bar{y}_{1k} + \bar{y}_{0k}$ as well as $\bar{y}_{1k} - \bar{y}_{0k}$, that is by having information on both $\bar{y}_{1k}$ and $\bar{y}_{0k}$, leading to a smaller posterior variance.
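The dependence on the individual group means, rather than only on their difference, can be seen in a small numerical sketch. It assumes independent conjugate normal updates for the two group means and, purely for illustration, per-arm information of 4 at the look in question together with the Example 3 priors (information 0.5 and mean 0 for the control group, non-informative for the experimental group). The function name and data values are hypothetical.

```python
from statistics import NormalDist

def prob_theta_positive(ybar1, ybar0, I1k, I0k, mu10, I10, mu00, I00):
    """Pr(mu1 - mu0 > 0 | data) with independent normal priors on mu1 and mu0."""
    mean1 = (mu10 * I10 + ybar1 * I1k) / (I10 + I1k)   # posterior mean of mu1
    mean0 = (mu00 * I00 + ybar0 * I0k) / (I00 + I0k)   # posterior mean of mu0
    var = 1.0 / (I10 + I1k) + 1.0 / (I00 + I0k)        # sum of posterior variances
    return 1.0 - NormalDist(mean1 - mean0, var ** 0.5).cdf(0.0)

# Two hypothetical data sets with the same observed difference of 0.5.
prior = dict(mu10=0.0, I10=0.0, mu00=0.0, I00=0.5)
p_low = prob_theta_positive(0.5, 0.0, 4.0, 4.0, **prior)    # control mean 0.0
p_high = prob_theta_positive(1.5, 1.0, 4.0, 4.0, **prior)   # control mean 1.0

# With no prior information on either mean only the difference matters.
flat = dict(mu10=0.0, I10=0.0, mu00=0.0, I00=0.0)
q_low = prob_theta_positive(0.5, 0.0, 4.0, 4.0, **flat)
q_high = prob_theta_positive(1.5, 1.0, 4.0, 4.0, **flat)
print(p_low, p_high, q_low, q_high)
```

With the informative control prior, the observed control mean of 1.0 is shrunk towards the prior mean of 0, which increases the posterior probability of a positive difference; this is the mechanism by which the type I error rate comes to depend on the true control group mean.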
Suppose that, as in the single parameter case, we stop the trial as soon as we have Pr(θ > 0 | data at look k) ≥ p k , and that we wish to choose p 1 , . . . , p K so as to control the type I error rate to be at most α, that is to satisfy (5).
It is shown in Appendix B that, irrespective of the values of $p_1, \ldots, p_K$, the stopping regions for frequentist and Bayesian group-sequential tests cannot coincide other than in the special case with $I_{1k}/(I_{10} + I_{1k}) = I_{0k}/(I_{00} + I_{0k})$, $k = 1, \ldots, K$, when the posterior distribution for $\theta$ is exactly the same as that obtained directly from a single prior distribution for $\theta$ without considering prior distributions for the means of the two groups separately. With independent prior distributions for $\mu_1$ and $\mu_0$ the posterior distribution of $\theta$ depends on $\bar{y}_{1k}$ and $\bar{y}_{0k}$. The probability in (5) thus depends on $\mu_0$ and $\mu_1$, and the requirement that this is controlled at level $\alpha$ when $\theta = 0$ requires that it is controlled when $\mu_1 = \mu_0$ for all values of $\mu_0$. Appendix B shows that because the mean of the posterior distribution for $\theta$ when $\mu_1 = \mu_0$ depends on $\mu_0$, this is impossible.
For the two-arm Bayesian group-sequential trial with five looks in Example 3 above, controlling the one-sided type I error rate to be 0.025 when μ 1 = μ 0 = 0 requires p 1 = · · · = p 5 = 0.9884. Figure 4 shows the one-sided type I error rate for this design for a range of μ 0 values with, in each case, μ 1 = μ 0 so that θ = 0. It can be seen that in this case although the type I error rate is controlled for μ 0 = 0, the type I error rate increases above the desired level for μ 0 > 0. The figure also shows the prior distribution for μ 0 , showing that error rate inflation would occur for plausible values of μ 0 .

Discussion
Our comparison has been restricted on the whole to group-sequential tests based on normally distributed test statistics. Although some exact or non-normal frequentist group-sequential test methods have been proposed [27][28][29], the assumption of normality is common in this setting. In Bayesian group-sequential tests it is more common to use non-normal distributions, with simulation methods being used if necessary to calculate operating characteristics. The decision to focus on normally distributed test statistics was made so as to put Bayesian and frequentist designs in a similar setting, facilitate comparison and identify relationships, such as that between the Pocock test and the Bayesian test with a non-informative prior distribution, which might otherwise not be apparent. As can be seen from the binary data example above, where the Pocock test and the exact Bayesian test give identical stopping rules, in practice asymptotic normality can be a reasonable assumption.
Fig. 4 Type I error rate for the Bayesian test with K = 5 and p 1 = · · · = p 5 = 0.9884 for a range of true μ 0 values, along with the density (not to scale) of the prior distribution for μ 0
We have considered stopping for a positive result only. In practice, with both frequentist and Bayesian group-sequential designs, it is often desirable to allow stopping when a lack of efficacy is clear, that is, for futility. Futility stopping rules can be divided into those that are binding, when the rule is specified in advance and must be adhered to in order to maintain the required properties of the design, and those that are non-binding, where a more flexible approach can be taken. As stopping for futility cannot lead to a positive claim of efficacy, it can only decrease the type I error rate. Thus with a non-binding futility stopping rule, it is desirable to control the type I error rate even if no futility stopping occurs, that is, in the case when the trial is only stopped for a positive result as considered above. The use of a binding futility stopping rule will change the operating characteristics of the group-sequential tests.
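The effect of futility stopping on the type I error rate can be illustrated with a small simulation. This is a sketch only: it uses a frequentist Pocock-type test with K = 5 equally spaced looks and the standard tabulated Pocock constant 2.413 for a one-sided level of 0.025, and the futility rule (stop if the standardised statistic falls below a threshold) is a hypothetical choice that is not taken from the designs considered above.
```python
import random

POCOCK_C = 2.413  # tabulated Pocock constant, K = 5, one-sided alpha = 0.025


def simulate(n_sims=20000, K=5, futility_z=None, seed=1):
    """Estimate the type I error rate under theta = 0.

    futility_z: if given, stop without rejecting when Z_k < futility_z
    (a hypothetical futility rule); if None, stop for efficacy only.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Draw all K stage-wise increments first so that runs with and
        # without the futility rule see identical data.
        increments = [rng.gauss(0.0, 1.0) for _ in range(K)]
        score = 0.0
        for k, inc in enumerate(increments, start=1):
            score += inc
            z_k = score / k ** 0.5  # standardised statistic at look k
            if z_k >= POCOCK_C:
                rejections += 1
                break
            if futility_z is not None and z_k < futility_z:
                break
    return rejections / n_sims


if __name__ == "__main__":
    print("efficacy stopping only:", simulate())
    print("with futility stop at Z_k < 0:", simulate(futility_z=0.0))
```
Because both runs use the same simulated data, every trial that rejects under the futility rule also rejects without it, so the estimated error rate with futility stopping cannot exceed the estimate with efficacy stopping only.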
We have focussed on comparison of Bayesian and frequentist group-sequential designs for single-arm and comparative studies. These are just one type of adaptive design, which can include many other features including adaptive exploration of a dose-response relationship, adaptive randomisation, dropping of arms in multi-arm trials, incorporation of multiple endpoints and sample size re-estimation. Frequentist methods that guarantee control of error rates are available for some of these problems, such as sample size re-estimation [30], but in some other cases construction of decision rules for frequentist methods can be challenging. Bayesian methods can be accompanied by simulations to verify operating characteristics under a likely range of scenarios for a wide variety of adaptations for which rigorous proof of error rate control is not available.

Conclusions
Although Bayesian and frequentist group-sequential approaches are based on fundamentally different paradigms, in practice, when used for the analysis of a clinical trial, both provide an indication of the efficacy of an experimental treatment. This means that a comparison of Bayesian and frequentist tests can be helpful to understand the frequentist operating characteristics for Bayesian tests and the Bayesian model and prior distributional assumptions that could lead to a particular frequentist test. This has been our aim in this paper.
Focussing on a setting in which test statistics can be assumed to be normally distributed, we have shown that in comparative trials with independent prior distributions specified for treatment effects in different groups, stopping rules from Bayesian and frequentist group-sequential designs cannot generally correspond. In this case the Bayesian group-sequential design can only control the type I error rate for specified values of the control group treatment effect. Conversely, in single-arm trials, or when a prior distribution is specified for the treatment difference, stopping rules for Bayesian and frequentist group-sequential tests can be identical if full flexibility for both classes of designs is allowed, or can closely correspond for common choices of design parameters.
O'Brien and Fleming's design was found to correspond closely to a Bayesian design with an exceptionally informative negative prior, this prior leading to the very small probability of early stopping for this design. The fact that such a prior is unlikely to represent prior belief suggests that the use of this design might not be appropriate without very careful thought.
In a similar way, the correspondence between the Pocock design and the Bayesian design with a non-informative prior and p 1 = · · · = p K suggests that the latter might also not be generally appropriate, given the criticism that the Pocock design gives too high a probability of early stopping [31]. This illustrates the importance of appropriate choice of a prior distribution, rather than the general use of a non-informative prior. Evaluation of the frequentist properties can be useful in understanding the influence of the prior distribution in a Bayesian group-sequential design in which the overall type I error rate is controlled.
Bayesian adaptive methods are often more bespoke than frequentist approaches, with simulations used to evaluate their performance not only for a range of treatment effect scenarios but also allowing for anticipated data patterns arising from, for example, delayed responses, multiple endpoints including early outcomes, or different recruitment and drop-out rates. This can require more design work than the use of a more standard frequentist method but can be advantageous in that design choices and their consequences are considered carefully. It is recommended that if frequentist methods are used, equal care should be taken over design choices and their properties explored, using simulations if necessary.

Appendix A: Comparison of posterior variances for comparative trials with single or independent prior distributions
Suppose we are in the two-group setting and have independent prior distributions with $\mu_j \sim N(\mu_{j0}, I_{j0}^{-1})$, $j = 0, 1$, and that we observe $\bar{Y}_{jk}$ with $\bar{Y}_{jk} \sim N(\mu_j, I_{jk}^{-1})$, $j = 0, 1$, $k = 1, \ldots, K$, so that the posterior distribution for $\theta$ is given by (10).
with $a = -(r_0 I_{00} + 2r_0 + 1)$, $b = 2(r_0 - I_{00})$ and $c = I_{00}(r_0 + 2) + 1$. Note that the derivative is defined for all $k \geq 0$ as $I_{00}$ and $r_0$ are both positive. Setting the numerator to zero and solving the quadratic (noting that $a + b + c = 0$, so that $k = 1$ is a root, and that the product of the roots is $c/a$), we find that $R_k$ has stationary points at $k = 1$ and $k = -(r_0 I_{00} + 2I_{00} + 1)/(r_0 I_{00} + 2r_0 + 1)$. The second of these is negative as $I_{00}$ and $r_0$ are positive, so that the only stationary point with $k \geq 0$ is at $k = 1$, when $R_k = 1$. The second derivative of $R_k$ with respect to $k$ at $k =$