 Research
 Open access
 Published:
Bayesian multilevel multivariate logistic regression for superiority decisionmaking under observable treatment heterogeneity
BMC Medical Research Methodology volumeĀ 23, ArticleĀ number:Ā 220 (2023)
Abstract
Background
In medical, social, and behavioral research we often encounter datasets with a multilevel structure and multiple correlated dependent variables. These data are frequently collected from a study population that distinguishes several subpopulations with different (i.e., heterogeneous) effects of an intervention. Despite the frequent occurrence of such data, methods to analyze them are less common and researchers often resort to either ignoring the multilevel and/or heterogeneous structure, analyzing only a single dependent variable, or a combination of these. These analysis strategies are suboptimal: Ignoring multilevel structures inflates Type I error rates, while neglecting the multivariate or heterogeneous structure masks detailed insights.
Methods
To analyze such data comprehensively, the current paper presents a novel Bayesian multilevel multivariate logistic regression model. The clustered structure of multilevel data is taken into account, such that posterior inferences can be made with accurate error rates. Further, the model shares information between different subpopulations in the estimation of average and conditional average multivariate treatment effects. To facilitate interpretation, multivariate logistic regression parameters are transformed to posterior success probabilities and differences between them.
Results
A numerical evaluation compared our framework to less comprehensive alternatives and highlighted the need to model the multilevel structure: Treatment comparisons based on the multilevel model had targeted Type I error rates, while singlelevel alternatives resulted in inflated Type I errors. Further, the multilevel model was more powerful than a singlelevel model when the number of clusters was higher. A reanalysis of the Third International Stroke Trial data illustrated how incorporating a multilevel structure, assessing treatment heterogeneity, and combining dependent variables contributed to an indepth understanding of treatment effects. Further, we demonstrated how Bayes factors can aid in the selection of a suitable model.
Conclusion
The method is useful in prediction of treatment effects and decisionmaking within subpopulations from multiple clusters, while taking advantage of the size of the entire study sample and while properly incorporating the uncertainty in a principled probabilistic manner using the full posterior distribution.
Background
In medical, social, and behavioral research we often encounter datasets with a multilevel structure and multiple correlated dependent variables. An example of such a study is the Cognition and Radiation Study B [1, 2] that investigated whether local brain radiation (stereotactic radiosurgery) preserves cognitive functioning and quality of life better than whole brain radiation in cancer patients with multiple brain metastases. Patients were recruited from multiple hospitals and the treatment was executed in two treatment centers, giving the data a multilevel structure. Many other examples of such datasets can be found in a paper by Biswas and colleagues [3], who presented a nonexhaustive overview of hundreds of Bayesian trial protocols executed in a specialized center for cancer treatment. The authors noted that a) almost half of the reviewed studies were multicenter trials; and b) many studies were designed to assess effectiveness and side effects simultaneously, thus including at least two dependent variables.
Often, these multilevel, multivariate data are collected from a study population that consists of several subpopulations with potentially distinctive (i.e., heterogeneous) effects of an intervention. Examples of such studies are the two International Stroke Trials (International Stroke Trial (IST) and Third International Stroke Trial (IST3); [4,5,6,7]), which investigated the effects of antiplatelet and antithrombotic treatments on various (neuro)psychological, functional and psychosocial dependent variables respectively. Both trials covered multiple treatment centers from multiple countries and included a variety of patient characteristics that could potentially predict treatment effects. We discuss the IST3 in more depth as it serves as a running example throughout the paper. The IST3 investigated the effects of an intravenous thrombolysis treatment on shortterm (e.g., recurrent stroke, functional deficits) and longterm (e.g., dependency, depression, pain) indicators of health status among patients who suffered from an acute ischaemic stroke. The IST3 data revealed considerable variation in characteristics of patients and disease  such as subtype or severity of stroke, blood pressure, and age  that can be predictive of treatment effects and call for exploration of treatment heterogeneity to gain insight into subpopulationspecific effects [8].
All of the abovementioned trials made treatment comparisons in the context of Randomized Controlled Trials (RCTs): Randomized experiments in which an experimental or a control treatment is randomly assigned and administered to a random sample of patients. RCTs often aim to evaluate whether the experimental treatment is superior or (non)inferior to the control condition and ultimately guide clinicians in evidencebased assignment of treatments and interventions [9]. Whereas RCTs are considered a golden standard for treatment comparison, their implementation is challenged by a growing demand for personalized treatment [10,11,12,13]. That is, clinical practice relies more and more on the idea that different patients react differently to treatments. Treatment prescription is increasingly guided by a tradeoff between patientspecific risks and benefits, making the research context for these decisions multivariate and heterogeneous [14]. While demanding more complex methodology, personalization of treatments can impede the collection of sufficient data for rigorous treatment evaluation. Development of more targeted treatments limits eligibility for participation in trials, thereby making the recruitment of subjects more difficult. As a solution, trials more often span multiple treatment centers or countries. This adds another layer of complexity to the research context: clustered data that require multilevel analysis. To meet the methodological demands of these increasingly complex research problems, RCTs ideally provide a) a broad understanding of the treatmentās effects on multiple dependent variables; and b) insights potential dependencies of treatment effects on characteristics of patients; and c) an accurate handling of clustered data structures. In practice, such comprehensive methods are less common, and often researchers resort to either ignoring the multilevel and/or heterogeneous structure, analyzing only a single dependent variable, or a combination of these. Below, we discuss how the abovementioned three aspects can be implemented in Randomized Controlled Trial methodology to support research in personalized treatment.
First, many RCTs evaluate more than one dependent variable, which are analysed separately in multiple univariate analyses [15]. As an example, the investigators of the IST3 were primarily interested in living independently six months after stroke and secondarily in several other dependent variables, such as recurrent events, adverse reactions to the treatment, and mental health indicators. Analyzing dependent variables independently provides useful insights in treatment effects on each of these dependent variables individually, but discards available information about the relation between them. When the effects on individual dependent variables are complemented with information about their cooccurrences via multivariate analysis, a more detailed picture of treatment effects emerges. Multivariate analysis models relationships between dependent variables and can a) be helpful to detect outcome patterns that would be ignored when dependent variables are considered in isolation; and b) improve the accuracy of sample size computations and error rates in statistical decisionmaking [15,16,17,18].
Second, incorporating patient and/or disease characteristics in treatment comparison can result in a considerable improvement of the practical value of RCTs. The IST3 used a sample of diverse patients with different personal and disease characteristics. This variation contains valuable information regarding differences in treatment effects. For example, knowing whether patients with different weights or blood pressures have different chances of a recurrent stroke or independent living has the potential to inform treatment recommendations. When treatments have distinct effects on patients with different characteristics, treatment effects are considered heterogeneous among (sub)populations of patients. In this case, average treatment effects (ATEs) give a global idea of treatment results among the trial population, but have limited value in targeting treatments to specific patients with their individual (disease) characteristics [19,20,21]. Conditional average treatment effects (CATEs) among specific patient groups provide insight in the variation of treatment effects among the population and help to distinguish patients who ultimately benefit from the treatment from those who do not or may even experience adverse treatment effects. Unfortunately, subgroupspecific treatment comparisons are insufficiently implemented as part of standard trial methodology yet [22]. If subgroups are targeted at all, their effects are often analyzed independently via stratified (or subgroup) analysis. Such a subgroup analysis disregards information from related subgroups and suffers from suboptimal power due to subsetting. Modelling heterogeneity is a more powerful alternative that directly uses the relation between subgroups and allows subgroups to borrow strength from each other [23,24,25].
Third, multilevel data are characterized by observational units that are grouped in clusters. For example, the IST3 spans multiple treatment centers and multiple countries. Reasons to use multilevel analysis can be both substantive and statistical. From a substantive perspective, multilevel analysis can be useful to explain differences between clusters, while using the information from the entire sample [26, 27]. Different trials may  for example  have overlapping but nonidentical target populations that can be distinguished by covariate information and may contribute to the understanding of treatment effects. Statistically, differences between clusters should be taken into account for the sake of validity, even if these differences are not of direct interest [28,29,30]. Clustered data require specific analysis methods that are flexible enough to treat observations from different clusters as more similar to each other than to observations from other clusters. If observations within clusters are indeed more similar, the clustered structure is reflected in variance partitioning, where the withincluster and the betweencluster variances are modelled separately. This induces a dependence between the observations within clusters when marginalizing over the clusterspecific effects. When clustered observations are treated as independent observations on the other hand, variance originating from differences between clusters is then erroneously attributed to differences between a manifold of observational units and the unique amount of information is overestimated. As a result, standard errors are overestimated, Type I error rates are inflated, and validity of statistical inference is compromised. The larger the variance between clusters relative to the variance between observational units within clusters, the larger the effect on standard errors. Properly modelling the multilevel structure of clustered data and allowing the parameters to vary over clusters is therefore crucial for accurate statistical decisionmaking [28, 29].
The current paper presents a Bayesian multilevel multivariate logistic regression (BMMLR) framework to capture the three abovementioned methodological aspects in a comprehensive analysis and decision procedure for treatment comparison. We build upon an existing Bayesian multivariate logistic regression (BMLR) framework for singlelevel data to analyze multivariate binary data in the presence of treatment heterogeneity and present a multilevel extension to deal with multilevel data. The multilevel aspect adds another layer of complexity, making the analysis a nontrivial endeavour. We discuss the existing BMLR framework first. This framework consists of three coherent elements [25]:

1
a multivariate modelling procedure to find unknown regression parameters;

2
a transformation procedure to convert regression parameters to the probability scale to make analysis results more interpretable;

3
a compatible decision procedure to draw conclusions regarding treatment superiority or inferiority with targeted Type I error rates.
The first element, the modelling procedure, assumes multivariate Bernoulli distributed dependent variables and assigns them a multinomial parametrization. A multinomial parametrization is helpful for two reasons, since it a) allows statisticians to draw and build upon existing, established multinomial techniques with tractable (conditional) posterior distributions; and b) has the flexibility to model correlations between dependent variables on the subpopulation level, which contributes to the accuracy of inference under treatment heterogeneity [18, 25, 31]. Several other multivariate modelling procedures, such as the multivariate probit model [32] or multivariate logistic regression models [33, 34], have a more restrictive correlation structure and are therefore theoretically less suitable to detect treatment heterogeneity with adequate error control. Moreover, the multivariate logistic regression model by Malik and Abraham [33] does not provide insight in the treatment effects on individual dependent variables. Copula structures have been proposed as promising multivariate alternatives as well, but these models can be difficult to apply to binary dependent variables [35,36,37]. The second element, the transformation procedure, builds upon the close relation between the multinomial and multivariate parametrizations to express results on the scale of (multivariate) success probabilities and differences between them, as a more intuitive alternative to multinomial (log)odds. The transformed parameters provide understandable insights in the treatmentās performance on the trial population (i.e., ATEs) as well as subpopulations of interest (i.e., CATEs). The third element, the decision procedure, conveniently uses the Bayesian nature of the modelling procedure, allowing for inference on the posterior samples of transformed parameters. Decisions can be made in several ways to flexibly combine and weigh multiple dependent variables into a single decision for a population of interest, while taking correlations between dependent variables into account.
The main contribution of the current paper is the extension of the singlelevel BMLR framework to the multilevel context. The novel Bayesian multilevel multivariate logistic regression (BMMLR) framework provides BMLR with a multilevel model component and adjusts the transformation and decision procedure accordingly, to make the framework suitable for the multilevel context, resulting in accurate type I errors. The remainder of the paper is structured as follows. Section āBMMLR: Bayesian multilevel multivariate logistic regressionā introduces the multilevel multivariate logistic regression model to obtain a sample from the posterior distribution of regression coefficients. Section āTransformation of posterior regression coefficients to the probability scaleā outlines how to transform the obtained regression coefficients to more interpretable treatment effect parameters. Section āDecisionmaking based on multivariate treatment effectsā discusses the decision procedure to use the treatment effect parameters for treatment comparison. Section āNumerical evaluationā demonstrates the performance of the model numerically via simulation and in Section āIllustration with IST3 dataā the methodology is illustrated with data from the IST3. The paper concludes with a discussion in Section āDiscussionā.
BMMLR: Bayesian multilevel multivariate logistic regression
Consider the general case with \(K \in \{1,\dots ,K\}\) binary dependent variables \(y^{k}_{ji}\) for subject \(i \in \{1,\dots ,n_{j}\}\) in cluster \(j \in \{1,\dots ,J\}\). Outcome \(y^{k}_{ji}\) is Bernoulli distributed with success probability \(\theta ^{k}_{ji}\).Ā MultivariateĀ vector of K dependent variables, \(\varvec{y}_{ji} = \left(y^{1}_{ji}, \dots , y^{K}_{ji}\right)\) is multivariate Bernoulli distributed [31]. The multivariate Bernoulli distribution relies on a hybrid parameterization where a Kvariate success probability in \(\varvec{\theta }_{ji} = \left(\theta ^{1}_{ji}, \dots , \theta ^{K}_{ji}\right)\) is expressed in terms of \(Q = 2^{K}\) multinomial joint response probabilities in \(\varvec{\phi }_{ji} = \left(\phi ^{1}_{ji}, \dots , \phi ^{Q}_{ji}\right)\) [31]. The \(q^{\text {th}}\) joint response probability in \(\varvec{\phi }_{ji}\) corresponds to multinomial response combination \(\varvec{h}^{q}\), which has length K and is given in the \(q^{th}\) row of the matrix of joint response combinations denoted by \(\varvec{H}\):
Hence, joint response probability \(\phi ^{q}_{ji} = p\left(\varvec{y}_{ji} = \varvec{h}^{q}\right)\). Note that the joint response probability \(\varvec{\phi }_{j}\) and the success probability \(\varvec{\theta }_{j}\) are identical in the univariate situation (i.e., \(K=1\)).
Likelihood of the data
The multinomial parametrization of multivariately Bernoulli distributed data allows to model the relation between dependent variables \(\varvec{y}_{ji}\) and one or multiple predictor variables via multinomial logistic regression. Joint response probability \(\phi ^{q}_{ji}\) is then regressed on a vector of P covariates, \(\varvec{x}_{ji} = \left(x_{ji0}, \dots , x_{ji(P1)}\right)\). Covariate \(x_{ji0} = 1\) is a constant to estimate the intercept and covariate \(x_{jip}\) for \(p \in \{1,\dots ,P1\}\) can, for example, be a treatment indicator, a patient characteristic, or an interaction between these.
The relation between outcome vector \(\varvec{y}_{ji}\) and covariate vector \(\varvec{x}_{ji}\) is mapped with a multinomial logistic function that expresses the probability of \(\varvec{y}_{ji}\) being in response category q, conditional on \(\varvec{x}_{ji}\):
Here, \(\psi ^{q}_{ji}\) is a linear predictor:
In Eq. 3, regression coefficients for response category q, \(\varvec{\gamma }^{q}_{j} = \left(\gamma ^{q}_{0j},\dots ,\gamma ^{q}_{(P1)j}\right)\) are unknown parameters of interest. Regression coefficients of response categories \(1,\dots ,Q1\) are estimated, while regression coefficients of response category Q are fixed at zero (i.e., \(\varvec{\gamma }^{Q}_{j} = \varvec{0}\)) to ensure identifiability of the model. The entire set of regression coefficients in cluster j is denoted with \(\varvec{\gamma }_{j}\).
A key aspect of multilevel models is that the regression coefficients \(\varvec{\gamma }_{j}^{q}\) are allowed to vary over clusters according to a common normal distribution on the second level. The common distribution for the random effects on the second level induces a dependency structure of the observations within clusters. The observations of diffferent individuals in the same clusters are assumed to be conditionally independent conditional on the clusterspecific random effects. The random effects distribution on the second level can be written as:
Equation 4 consists of two elements that reflect the distributional parameters:

1
The parameter \(\gamma ^{q}_{p0}\) is the common effect in the population and does not vary over clusters.

2
The random effect \(u^{q}_{pj}\) quantifies the cluster specific deviation from the common effect \(\gamma ^{q}_{p0}\).
Equation 4 can be adjusted to model clusterspecific predictors or crosslevel interactions between clusterlevel predictors and individual levelpredictors. Further, Eq. 4 can be extended to model mixed effects, which combine regression coefficients that vary over clusters, which are called random effects, and regression coefficients that are identical for all clusters, which are called fixed effects. More information on the specification of more complex linear predictors can be found in general resources on multilevel models, such as Hox et al. [28] or Gelman and Hill [27]. In general, it should be noted that each additional random effect increases the number of parameters, affecting computational burden and estimation precision.
Posterior distribution of regression coefficients
The primary goal of BMMLR is estimating the joint posterior distribution of unknown regression coefficients \(\varvec{\gamma }^{q}_{j}\), their means \(\varvec{\gamma }^{q}\), and their covariance matrices \(\varvec{\Sigma }^{q}\) for category \(q \in 1,\dots ,(Q1)\). The posterior probability distribution of these parameters for category q is given by:
where \(\varvec{\gamma }^{q}\) reflects the vector of average effects for category q, \(\varvec{\Sigma }^{q}\) is the covariance matrix of the effects across clusters for category q, and \(\varvec{\gamma }_{j}^{q}\) reflects the vector of cluster specific effects of cluster j for category q. The posterior probability distribution in Eq. 5 is proportional to the product of three types of probability distributions:

1
The likelihood of the data quantifies the probability of the dependent variables conditional on clusterspecific regression coefficients, \(p(\varvec{y}_{j}\varvec{\gamma }^{q}_{j})\), which is the multinomial logistic function given by Eq. 2;

2
The probability distribution of the clusterspecific regression coefficients \(\varvec{\gamma }^{q}_{j}\) conditional on their means \(\varvec{\gamma }^{q}\) and covariance matrix \(\varvec{\Sigma }^{q}\) for category q, \(p(\varvec{\gamma }^{q}_{j}  \varvec{\gamma }^{q}, \varvec{\Sigma }^{q})\);

3
The prior probability distributions of regression coefficientās means \(\varvec{\gamma }^{q}\), \(p(\varvec{\gamma }^{q})\), and covariance matrix \(\varvec{\Sigma }^{q}\), \(p(\varvec{\Sigma }^{q})\) for category q, before observing the data.
As the multinomial logistic function (Eq. 2) does not have a (conditionally) conjugate prior distribution, the functional form of the posterior distribution is unknown and the regression coefficients cannot be sampled directly from the posterior distribution. In the Supplemental material, we present a Gibbs sampling algorithm based on a PĆ³lyaGamma auxiliary variable expansion of the likelihood proposed by Polson et al. [38]. The expanded likelihood has a Gaussian form and can be combined with normal prior distributions on regression coefficients \(\varvec{\gamma }^{q}\) and an inverseWishart distribution on covariance matrix \(\varvec{\Sigma }^{q}\). The parameters are known to have conditionally conjugate posterior distributions and allow for direct sampling from their multivariate normal and inverseWishart distributions respectively, resulting in MCMC chains of the joint posterior distribution in Eq. 5. We also include a few comments on prior specification for the proposed Gibbs sampling procedure in the Supplemental material.
As an alternative to the proposed Gibbs sampling procedure, sampling from the posterior distribution(s) of multinomial logistic regression coefficients can theoretically be done with other standard MCMCmethods for nonconjugate priorlikelihood combinations, such as MetropolisHastings (e.g., [39], Ch.3 and 5; [40]; [41]) or Hamiltonian Monte Carlo (e.g., [42,43,44]) sampling algorithms.
Transformation of posterior regression coefficients to the probability scale
The output of the BMMLR model from Section āBMMLR: Bayesian multilevel multivariate logistic regressionā is an MCMC sample of posterior multinomial regression coefficients. These regression coefficients reflect the importance of a predictor on a specific joint response combination and represent  in exponentiated form  the odds compared to reference category Q. While these regression coefficients can be insightful in a truly multinomial research problem, they have no straightforward interpretation in multivariate treatment comparison where marginal effects on individual dependent variables play a central role [15].
Transformation of regression coefficients to the multivariate probability scale forms a convenient solution to gain more intuitive insights in both joint and marginal treatment effects. These transformations rely on the close relationship between multinomial and multivariate parametrizations and can be flexibly obtained for the trial population (i.e., average treatment effects) or for subpopulations (i.e., conditional average treatment effects). They are directly suitable for statistical decisionmaking regarding treatment comparison.
We use the framework for transformation to the probability scale and decisionmaking with a posterior sample of multivariate treatment differences introduced in [18] and [25]. Technical details of these procedures are presented in the Supplemental material. We use the remainder of this section to summarize and illustrate the procedure with a toy example from the IST3data, where we assume interest in the effect of Alteplase in the experimental condition (\(T_{A}\)) compared to no treatment in the control group (\(T_{C}\)).
Assume that we reanalyze a part of the IST3 data using the BMMLR framework and take one of originally presented analyses as a starting point [6]. In the selected analysis, the researchers compared the effects of Alteplase vs. control on their primary outcome, longterm independent living after six months (Indep6), among subgroups of patients based on the severity of their initial stroke. In our example, we perform a multivariate analysis of the treatment effects on the primary outcome (Indep6) and one of the secondary (shortterm) dependent variables: being strokefree in the first seven days after the initial stroke (Strk7). We incorporate severity of the initial stroke as a predictor variable to study heterogeneity, using the grouping criteria from the original trial for the estimation of conditional average treatment effects. We aim to investigate the average treatment effect among the trial population as specified by the original eligibility criteria for inclusion. We are also interested in a potential interaction between the treatment and stroke severity, and investigate the conditional average treatment effects among patients with various severities of stroke. To take the clustered structure of the data into account, we specified a BMMLR mixedeffects model with random slopes for the intercept and the main treatment effect, resulting in the following linear predictor:
In Eq. 6, \(\varvec{x}_{ji} = (1,T_{ji}, NIHSS_{ji}, NIHSS_{ji}T_{ji})\) with treatment indicator \(T_{ji}\) and \(NIHSS_{ji}\) being the stroke severity score of subject i in hospital j. The \(Q=4\) resulting joint response categories are \((\{Strk7 = 1, Indep6 = 1\}, \{Strk7 = 1, Indep6 = 0\}, \{Strk7 = 0, Indep6 = 1\}, \left\{Strk7 = 0, Indep6 = 0\right\})\), which we refer to as \((\{11\}, \{10\}, \{01\}, \{00\})\).
Transformation to clusterspecific (differences between) probabilities
The main quantity of interest, the (clusterspecific) marginal multivariate treatment difference, is defined as the difference between clusterspecific multivariate success probabilities of the two treatments:
where subscripts Aj and Cj indicate clusterspecific parameters of the (experimental) Alteplase and control treatments respectively. The elements on the righthand sides of Eq. 7, success probabilities \(\theta _{Tj}^{k}\), are sums of the multinomial joint response probabilities of all response categories with a success on outcome k:
The multinomial joint response probabilities \(\varvec{\phi }_{Tj}\) that form the elements of success probabilities \(\varvec{\theta }_{Tj}\) follow from plugging in posterior regression coefficients \(\varvec{\gamma }^q_{j}\) in the linear predictor (Eq. 6) and the multinomial logistic link function (Eq. 2) for prespecified covariates \(\varvec{x}_{j}\) and for the relevant response category q.
The information in covariate vector \(\varvec{x}_{j}\), which directly affects \(\psi _{Tj}^q\), determines the treatment as well as the subpopulation of interest. Subpopulations can be defined as a value, such as a stroke severity score of one standard deviation below or above the mean, that can be plugged in directly into Eqs. 2Ā and 6.Ā When interested in a subpopulation that is defined by an interval, such as the groups of stroke severity in the IST3, the joint response probability is marginalized over the specified interval or averaged over a sample of observations in this interval. In the latter case, joint response probability \(\phi ^{q}_{Tj}\) is computed for each observed subject \(i \in 1,\dots ,n_{j}\) via Eq. 2. The joint response probability for each treatment T is then computed by averaging over all subjects i in treatment T and cluster j.
Since the model in Section āBMMLR: Bayesian multilevel multivariate logistic regressionā resulted in a sample of L posterior draws of each regression coefficient, multivariate treatment differences are computed for each draw (l) separately. The resulting posterior samples can be summarized with standard descriptive methods.
Pooling treatment effects over clusters
As a last step, clusterspecific estimates are pooled into estimates of average or conditional treatment effects among (sub)populations of interest via the following procedure:
This pooling strategy weighs clusterspecific estimates by cluster size, thereby balancing data with unequal cluster sizes.
Decisionmaking based on multivariate treatment effects
The obtained sample of posterior treatment differences can be used for statistical decisionmaking regarding treatment superiority and inferiority. The multivariate context has multiple options to define superiority and inferiority, leaving much flexibility to combine and prioritize dependent variables in a suitable way. We shortly discuss four different decision rules to give some idea of possibilities, without intending to be exhaustive or complete. The presented rules have different theoretical underpinnings and distinct statistical properties, such as acceptance regions, a priori estimated sample sizes, cutoff values, and error rates. The acceptance regions for superiority decisions of the four presented rules are graphically presented in Fig. 1. More details to guide an informed choice for one of these decision rules in practice can be found in Kavelaars et al. [18].
Three of these rules originate from guidelines of the Food and Drug Administration (FDA) [15]. The FDA defines superiority as a treatment difference larger than zero on the primary outcome (which we refer to as āSingle ruleā), on all dependent variables (āAll ruleā) or on any of the dependent variables (āAny ruleā). The Single rule reduces the statistical analysis to a univariate problem, using only the treatment difference of independent living after 6 months as a primary outcome (Single rule). The All and Any rules make no distinction in the importance of dependent variables and assume that the shortterm and longterm outcome are either both required for superiority or inferiority (All rule), or are interchangeable (Any rule).
In practice, these rules can oversimplify decisionmaking. Secondary outcome variables often contribute to treatment evaluation as well, but are given a coprimary status in the All and Any rules or are not formally included in the statistical decision procedure when the Single rule is used [17, 45]. To handle outcomes that differ in relative importance, linear combinations of dependent variables with preassigned (importance) weights have been proposed as a flexible alternative [14, 16, 18, 46, 47]. We refer to a linear combination as a Compensatory rule, referring to its inherent mechanism that allows (weighted) positive and negative effects to compensate each other. The Compensatory rule allows the IST3 data to consider the effects on the longterm much more important than the shortterm effect without completely excluding the risk of a recurrent stroke from the final decision. In such a situation, we can assign the primary outcome (Indep6)  for example  four times more weight than the secondary outcome (Strk7) and consider Alteplase superior to no treatment if a lower chance of dependency is outweighed by a small increase in the risk of a recurrent stroke.
Evidence in favor of the decision rule can be quantified by the proportion posterior draws of the pooled treatment difference \(\varvec{\delta }\) that lie in the decisionrule specific acceptance region, denoted by \(\mathcal {S}_{R}\). A conclusion is reached via comparison to \(p_{cut}\), which is a cutoff value to balance the required amount of evidence with anticipated Type I error rates [48]:
In the multivariate logistic regression model, the probability in Eq. 11 has no analytical solution. Therefore, decisions are made via the posterior MCMCsample of L draws. Superiority is concluded when:
Similarly, inferiority is concluded when:
In Section āIllustration with IST3 dataā, we demonstrate these decision with data from the IST3 as part of an illustration of the BMMLR framework.
Numerical evaluation
The current section presents an evaluation of the performance of the proposed BMMLR framework. The goal of the evaluation was twofold and we aimed to demonstrate:

1
how well the obtained regression coefficients and treatment effects correspond to their true values to examine bias;

2
how often the BMMLR framework results in an (in)correct superiority or inferiority conclusion to learn about decision error rates;
Setup
Fitted models
The performance of the multilevel model was evaluated in a treatment comparison based on a twolevel model with two dependent variables and one covariate at the subject level. We compared the method to two different (singlelevel) reference approaches, resulting in the following three modelling procedures:

1
The BMMLR model presented in Section āBMMLR: Bayesian multilevel multivariate logistic regressionā. We generated response data from a mixed effects model to include random effects while keeping the number of estimated parameters limited. We included an interaction between the treatment and the covariate as well, resulting in the following linear predictor:
$$\begin{aligned} \psi ^{q}_{ji}= & {} \gamma ^{q}_{0j} + \gamma ^{q}_{1j} T_{ji} + \beta ^{q}_{2} w_{ji} + \beta ^{q}_{3} w_{ji} T_{ji}\\ \nonumber \gamma ^{q}_{0j}= & {} \gamma ^{q}_{00} + u_{0j}\\ \nonumber \gamma ^{q}_{1j}= & {} \gamma ^{q}_{10} + u_{1j}.\nonumber \end{aligned}$$(14)In line with previous notation, \(\varvec{x}_{ji} = (1,T_{ji}, w_{ji}, w_{ji}T_{ji})\) in Eq. 14. Further, vector \(\varvec{\gamma }^{q}_{j} = (\gamma ^{q}_{0j}, \gamma ^{q}_{1j})\) reflects random effects with multivariate normally distributed errors (i.e., \((u^{q}_{0j},u^{q}_{1j}) \sim N(\varvec{0},\varvec{\Sigma }^{q})\)) for the intercept and main effect of the treatment. Regression coefficients \(\varvec{\beta }^{q} = (\beta ^{q}_{2}, \beta ^{q}_{3})\) reflect fixed effects for the covariate and covariatebytreatment interaction.

2
Singlelevel Bayesian multivariate logistic regression model (BMLR; [25]), as a first reference approach. For this model, we use a restricted version of Eq. 14 with fixed regression coefficients only:
$$\begin{aligned} \psi ^{q}_{ji} = \beta ^{q}_{0} + \beta ^{q}_{1} T_{ji} + \beta ^{q}_{2} w_{ji} + \beta ^{q}_{3} w_{ji}T_{ji}, \end{aligned}$$(15)MCMC chains were sampled with a simplified version of the Gibbs sampling procedure in the Supplemental material, that iterates over \(\varvec{\beta }\) and \(\varvec{\Omega }\). The model shares information in the estimation of conditional treatment effects with sufficient power, but does not take the multilevel structure of the data into account.

3
Singlelevel unconditional Bayesian multivariate Bernoulli analysis (BMB; [18]), as a second reference approach. Bayesian multivariate Bernoulli analysis relies on a conjugate multinomial likelihood and Dirichlet prior. MCMC draws are sampled directly from the posterior Dirichlet distribution with parameters \(\sum _{j=1}^{J} \sum _{i=1}^{n_{j}} I(\varvec{y}_{ji} = \varvec{h}^{q}) + \alpha ^{0q}\), where we assigned prior hyperparameters \(\varvec{\alpha }^{0} = (0.01, 0.01, 0.01, 0.01)\). The approach can estimate homogeneous treatment effects accurately and fast, but cannot deal with multilevel data. Moreover, conditional treatment effects originate from subsampling, which is less powerful than regression due to the isolation from other information.
Effect size
We specified a heterogeneous treatment effect, with pooled average treatment differences of zero (\(\varvec{\delta } = (0,0)\), \(\delta (\varvec{w}) = 0\)) and pooled conditional treatment differences larger than zero (\(\varvec{\delta } = (0.25,0.15)\), \(\delta (\varvec{w}) = 0.20\)). This scenario aimed to demonstrate the Type I error rate among the trial population. It reflects a least favorable treatment difference for the Any and Compensatory rules and should therefore result in the targeted Type I error rate for these rules to be considered accurate. The conditional treatment effect provided insight in the power to conclude superiority among the subpopulation under consideration. Outcome variables were negatively correlated (\(\rho _{ATE} = .157\); \(\rho _{CATE} = .20\)). The regression parameters used to generate these effects are presented in Table 1.
For the BMMLR model, the covariance matrix of random effects, \(\varvec{\Sigma }^{q}\), was specified as:
for all \(q \in 1,\dots ,Q1\).
Sample size
We varied the sample sizes at the cluster and subject level. Since there are no clear guidelines regarding sample size computations in multilevel multivariate logistic regression, we explored performance of the model for different numbers of clusters and different sample sizes within clusters. Specifically, we used number of clusters \(J \in \{10,100\}\) and observations per cluster \(n_{j} \in \{10,100\}\) for each treatment, resulting in four different sample size combinations.
Procedure
Data generation
For each sample size, we sampled 1000 datasets under the mixed effects model in Eq. 14 with the true regression parameters in Table 1. We assigned \(n_{j}\) participants to each treatment T and generated covariate x from a standard normal distribution. We sampled response vector \(\varvec{y}_{ji}\) from a multinomial distribution with probabilities \(\varvec{\phi }_{ji}\).
Gibbs sampling
Regression coefficients for the BMMLR and BMLR models were estimated via the Gibbs sampling procedure in the Supplemental material. We ran two MCMCchains via the Gibbs sampler introduced in Section āBMMLR: Bayesian multilevel multivariate logistic regressionā with \(L = 50,000\) iterations plus 10,Ā 000 burnin iterations. This large number of iterations aims to minimize the influence of the potentially high autocorrelations between parameters in multilevel models on the stationary distribution of the parameters. Autocorrelations were highest among random effect parameters \(\varvec{\gamma }_{j}\) and ranged between 0.107 and 0.781 at lag 1 and reduced to a range of \(0.012  0.276\) at lag 10. Further, following the guidelines in Gelman et al. [49], we ensured that the multivariate potential scale reduction factor was below 1.10.
Prior specification
For the multilevel model (BMMLR), we specified diffuse priors, which were multivariate normally distributed for regression coefficients:
The specified variance matrices of regression coefficients were motivated by a paper of Gelman et al. [50], who recommend to choose a variance parameter that results in realistic support for the probability parameter after nonlinear transformation in logistic regression. We specified an inverseWishart prior distribution for the covariance matrix:
The regression parameters \(\varvec{\beta }^{q}\) in the singlelevel regression model (BMLR) were the same as in the multivariate approach (i.e., independent normal priors with means of 0 and variances of 10).
Transformation and decisionmaking
We applied the procedure presented in the Supplemental material to use the obtained MCMCchains of posterior regression coefficients for superiority decisionmaking. We thinned the chains in the transformation procedure with a factor 10 to reduce the computational burden.
We considered two different effects:

1
an average treatment effect for the trial population;

2
a conditional treatment effect for a subpopulation scoring one standard deviation below the mean or lower;
The treatment effects required marginalization over the interval that defined the (sub)population, which we accomplished by averaging over joint response probabilities computed for the empirical sample of data. Clusterspecific treatment effects were weighed by their sample sizes to produce a pooled estimate of the treatment difference.
Decisions were made with a rightsided test for the All, Any, and Compensatory (equal weights, \(\varvec{w} = (0.50,0.50)\)) rules with formal superiority regions:

1
Any rule: \(\mathcal {S}_{R} = \{ \varvec{\delta }  \max _{1<k<K} \delta ^{k} > 0\}  \varvec{y}, \varvec{x}\) and cutoff value \(p_{cut} = 1  \frac{\alpha }{K}\)

2
All Rule: \(\mathcal {S}_{R} = \{ \varvec{\delta }  \min _{1<k<K} \delta ^{k} > 0\}  \varvec{y}, \varvec{x}\) and cutoff value \(p_{cut} = 1  \alpha\)

3
Compensatory rule: \(\mathcal {S}_{R} = \{ \varvec{\delta }  \delta (\varvec{w}) > 0\}  \varvec{y}, \varvec{x}\) and cutoff value \(p_{cut} = 1  \alpha\)
We computed the probability to conclude superiority (\(p_{Sup}\)) as the proportion of posterior treatment differences in the superiority region via Eq. 11. The targeted Type Ierror rate of \(\alpha = .05\) corresponded to decision threshold \(p_{cut} = 1  \alpha = 0.95\) (Compensatory and All rules) and a for multiple tests corrected threshold \(p_{cut} = 1  \frac{\alpha }{K} = 0.975\) (Any rule) [17, 18, 48].
Software
We conducted our analyses in R and made use of several existing packages [51]. PĆ³lyaGamma variables were drawn with the pgdraw package [52]. Further, we drew variables from the multivariate normal, truncated normal, and Dirichlet distributions with the MASS, msm, and MCMCpack packages respectively [53,54,55]. MCMC chains were diagnosed with the coda and mcmcse packages [56, 57]. We parallellized the simulation procedure with the foreach and doParallel packages [58, 59] and created LaTeX tables with the xtable package [60]. The R code used to generate results can be found on GitHub https://github.com/XynthiaKavelaars/Bayesianmultilevelmultivariatelogisticregression.
Results
The current subsection presents the results of the simulation study. Presented decision error rates are in Table 2.
Bias
Regression coefficients, variance matrices and treatment effects (success probabilities, treatment differences) could be estimated without bias in all sample sizes and data generating mechanisms. The absolute average deviation of mean point estimates from true values was smaller than .01.
Decision error rates
Type I error rates
The average treatment effect demonstrated that the probability to incorrectly conclude superiority in multilevel regression (BMMLR) was close to the targeted .05 under a least favorable scenario (i.e., Any and Compensatory decision rules). In general, both reference approaches suffered from inflated Type I error to a similar extent.
The amount of inflation in BMMLR was affected by sample size: A large number of clusters (\(J=100\)) and/or a large subjects per cluster (\(n_{j}=100\)) had the largest Type I error rates, with the combination \(J=100,n_{j}=100\) resulting in the most severe inflation. On the other hand, a small number of clusters and a small number of subjects per cluster (\(J=10, n_{j}=10\)) resulted in an acceptable Type I error rate for the singlelevel BMLR model as well, suggesting some robustness against the violation of the assumption of independent observations in the current setup. In general, the number of subjects per cluster appeared more influential on the Type I error rate inflation than the number of clusters, as demonstrated by the two scenarios with an identical total sample size (\(J=10, n_{j}=100\) and \(J=100, n_{j}=10\)): A small number of clusters and a large sample size per cluster resulted in larger Type I error rates than a large number of clusters with a small sample size per cluster. Keeping everything else constant, a larger number of clusters meant more independent units, implying that the assumption of independent observations was violated less severely. In other words, the need for a multilevel model was more prominent when the number of clusters was small. A similar pattern was seen under the All rule, although Type I errors were small in general. This was expected, since a) the All rule is known to be the most conservative of the three introduced rules; and b) the treatment difference was smaller than the least favorable scenario of this decision rule.
Power
The conditional treatment effect demonstrated the power to correctly conclude superiority for all three rules. Three results were highlighted. First, the multilevel model (BMMLR) is more powerful when the number of clusters is higher. The two conditions with an equal total sample size showed a .30 difference in power under the All rule. The other rules showed the same patterns, but had too high proportions of superiority conclusions to clearly distinguish the sample size conditions: The power in the other conditions equaled or was close to the maximum of 1.000.
Second, the singlelevel regression model (BMLR) resulted in more superiority conclusions than the multilevel regression model, implying that the posterior distributions of treatment differences of the singlelevel regression model had smaller variances. Again, differences were best illustrated by the All rule and the condition with small sample sizes for the Any and Compensatory decision rules, as these proportions were well below the maximum. Similar to the Type I error rates, the differences between the proportions of superiority conclusions appeared to be subject to the number of clusters, as demonstrated by a comparison of the two conditions with an identical total sample size under the All rule. The multilevel model was less powerful than the singlelevel model when the number of clusters was low in particular, being in line with nonindependence of clustered observations.
Third, the multivariate Bernoulli model (BMB) has low power overall, despite the underestimation of variance due to falsely assuming independent observations. As a subsampling approach, conditional treatments were fitted on the part of the data that makes up the subpopulation of interest. Especially the \(J=10\), \(n_{j}=10\) condition suffered from a small remaining sample size.
Illustration with IST3 data
To illustrate the proposed framework with real data, we reanalyzed a subset of data from the Third International Stroke Trial using the BMMLR framework [6, 7]. The included 3,Ā 035 subjects in the IST3 were recruited from 156 different hospitals in 12 different countries, resulting in multilevel data from patients clustered within hospitals and hospitals clustered within countries. We selected a twolevel subset of 1,Ā 447 subjects from 75 hospitals in the United Kingdom with a known health and survival status at six months after the initial stroke and a known or predicted severity score of the initial stroke (NIH Stroke Score; NIHSS) at randomisation. The cluster sizes were skewed and ranged from 1 to 117, with a median cluster size of 7 (SD: 26.66). Of the selected subset of data, \(n_{A} = 716\) subjects were in the Alteplase group (treatment = 1) and \(n_{C} = 731\) subjects were in the control group (treatment = 0). We compared the effects of the two treatments on a) being strokefree for seven days (0 = no; 1 = yes) and b) longterm independent living at six months (0 = no, 1 = yes), while taking the severity of the initial stroke into account. The NIHSS can range from 0 to 42 with a higher score indicating a more severe stroke. The average stroke severity score in the IST3 was 13.12 (SD: 6.91) and comparable in both treatment groups.
Method
We fitted our model with random slopes for the intercept and the treatment effect. We sought to compare our multilevel model (BMMLR) to the two singlelevel models (BMLR and BMB) from the āNumerical evaluationā section in treatment comparison of Alteplase and control on dependency after six months (\(\delta ^{Indep6}\)) and recurrent stroke within seven days (\(\delta ^{Strk7}\)). The multilevel model (BMMLR) was fitted with the linear predictor in Eq. 6 and the linear predictor of the singlelevel regression model (BMLR) was:
Prior specification
For the regression coefficients in the multilevel model (BMMLR) and the singlelevel regression model (BMLR), we specified independent normal prior distributions with means of 0 and variances of 10. For covariance matrix \(\Sigma ^{q}\), we specified an improper uniform prior for the random effects covariance matrix for each category q, to enable testing for the presence of random effects in the model comparison step discussed later.
Gibbs sampling
We ran two MCMCchains via the Gibbs samplers. Since the chains of regression coefficients were highly autocorrelated in the multilevel model (lag 10: \(\varvec{\beta }: 0.47  0.59\); \(\varvec{\gamma }: 0.62  0.80\), \(\varvec{\Sigma }: 0.010.38\)), we sampled a large number of 500,Ā 000 iterations plus 10,Ā 000 burnin iterations. The multivariate potential scale reduction factor was below 1.01 for all parameters, implying that there were no signals of nonconvergence. We thinned MCMCchains in followup posterior transformations with a factor 10 to reduce computational demands, resulting in inference based on \(L = 50,000\) draws.
Transformation and decisionmaking
We applied the procedures presented in the Supplemental material to the thinned MCMCchains of posterior regression coefficients to make superiority decisions. We considered (conditional) average treatment effects among seven different (sub)populations:

1
ATE: average treatment effects for all patients in the trial population;

2
CATE  Low range: conditional average treatment effects for patients with a stroke severity score between 0 and 5;

3
CATE  MidLow range: conditional average treatment effects for patients with a stroke severity score between 6 and 14;

4
CATE  MidHigh range: conditional average treatment effects for patients with a stroke severity score between 15 and 24;

5
CATE  High range: conditional average treatment effects for patients with a stroke severity score above 25;

6
CATE  Low value: conditional treatment effects for patients with a stroke severity score of 5.18, corresponding to 1 standard deviation below the mean;

7
CATE  High value: conditional treatment effects for patients with a stroke severity score of 19.03, corresponding to 1 standard deviation above the mean.
The grouping criteria for CATEs of ranges were taken from the original IST3 paper [6].
We performed twosided tests for the All, Any, and Compensatory rules. Similar to the IST3, we used living independently as the most important outcome in the Compensatory rule and specified weights \(\varvec{w} = (0.20,0.80)\) for remaining free of strokes and independent living respectively. This specification implied that the longterm outcome had four times more impact on the decision than the shortterm outcome. The targeted twosided Type Ierror rate of \(\alpha = .05\) corresponded to decision threshold \(p_{cut} = 1  \frac{\alpha }{2} = 0.975\) (Compensatory and All rules) and a for multiple tests corrected threshold \(p_{cut} = 1  \frac{\alpha }{2K} = 0.9875\) (Any rule).
Model comparison
Since the true model of these realworld data is unknown, we followed up on the analysis with a comparison of model fit via Bayes factors. Bayes factors [61] quantify the relative evidence in the data between competing statistical models. Here we use default Bayes factors which avoid the need to manually specify prior distributions [62,63,64].
BMLR vs. BMB
To compare the two singlelevel models, we computed a Bayes factor on the probabilities that the regression coefficients of the covariate (\(\beta ^{q}_{2}\)) and the interaction between the covariate and the interaction (\(\beta ^{q}_{3}\)) was equal to zero for all \(q \in q,\dots ,Q1\) using the BF()function from the Rpackage BFpack [62].
BMMLR vs. BMLR
To compare the proposed multilevel model (BMMLR) and the singlelevel model (BMLR), we computed empirical Bayes factors as proposed by VieiraGeneroso et al. [64], which tests whether the random effects are equal across clusters using uniform priors for the random effects covariance matrices. This test is executed separately for all six different random effects in the multilevel model.
Software
In addition to the software packages used in Section āNumerical evaluationā, we used R packages haven to import the dataset [65]Ā andĀ BFpack [62] to compute Bayes factors for comparison of the two singlelevel models. Similar to āNumerical evaluationā, the R code used to generate results in this section can be found on GitHub https://github.com/XynthiaKavelaars/Bayesianmultilevelmultivariatelogisticregression.
Results
Results of different (sub)populations
Table 3 show how different analysis models and different decision rules provide elaborate insights in the effects of Alteplase vs. control on a combination of dependent variables among different (sub)populations. Analysis of the selected data with the BMMLR, BMLR, and BMB models gave the following results.
Average treatments effects
The average treatment effect (ATE) among the UKbased part of the trial population showed that the Alteplase group had a lower estimated probability of remaining free of strokes, a higher estimated probability of living independently, and a weighted probability difference close to zero. The three modelling procedures produced similar estimates and unanimously resulted in the conclusions that Alteplase was inferior according to the Any rule due to the effect on being free of strokes, while neither superiority nor inferiority could be concluded from the All or Compensatory rules.
Conditional average treatment effects
The four conditional average treatment effects (CATEs) that reflected subpopulations as ranges sketched a more heterogeneous picture than the average treatment effects. Whereas all ranges showed a lower probability of being free of strokes after treatment with Alteplase, these probabilities increased with the severity of the stroke. Differences between success probabilities of the two treatments appeared to increase with severity of the stroke, such that Alteplase appeared to have the largest negative effect on being strokefree when the severity of the initial stroke was highest. A more diffuse relation between stroke severity and treatment difference emerged on longterm independent living. Alteplase resulted in a slightly lower point estimate of the probability of independent living among patients with a Low stroke severity, but resulted in a higher estimated probability of independent living in all categories of more severe strokes. Patients in the MidLow and MidHigh ranges of stroke severity had the largest positive effect of Alteplase on independent living. The Low and High stroke severity patients had slightly higher weighted probabilities after Alteplase, while patients with a MidLow and MidHigh stroke severity had weighted probabilities close to zero. These nonzero point estimates were not unanimously supported by sufficient evidence to conclude superiority or inferiority. The All and Compensatory rules remained inconclusive for all models among all subpopulations. The BMMLR and BMLR were unanimous in their conclusions for the Any rule: Inferiority was concluded for patients with a Low, MidLow and High stroke severity, while both superiority and inferiority were concluded for patients with a MidHigh range stroke severity. The BMB model remained inconclusive in the Low and High ranges and concluded inferiority among patients with a MidLow or MidHigh stroke severity, according to the Any rule.
The two conditional average treatment effects (CATEs) that specified subpopulations by values illustrated treatment differences for two hypothetical individual patients. After receiving Alteplase, both patients would have a lower probability of remaining free of strokes. Only the patient with a High stroke severity value had a higher probability of longterm independent living. The weighted failure probability difference was slightly below zero for the patient with a Low stroke severity and around zero for the patient with a High stroke severity. Again, the All and Compensatory rules remained inconclusive, whereas the Any rule would result in an inferiority conclusion for the patient with a Low stroke severity and in both inferiority and superiority for the patient with a High stroke severity.
Model comparison
Bayes factors [66] are computed to test whether there is evidence that a dependency structure is present in the data that is caused by the multilevel structure. The results are presented in Table 4. These results indicate that there is evidence that each of the six different random effects do not vary across clusters. This implies that the parsimonious singlelevel model (BMLR) is preferred over the multilevel model (BMMLR) for these specific data. This result is also in agreement with the obtained estimates which are virtually identical under both models. Model comparison between the two singlelevel models (BMLR vs. BMB) resulted in a logtransformed Bayes factor of 16.348, reflecting strong evidence that the a regression model (BMLR) fitted the data better than the multivariate Bernoulli (BMB) model. We give a general recommendation on model selection in the āDiscussionā section.
Conclusions and discussion
Several conclusions regarding the BMMLR framework could be drawn from the presented results. First, multilevel analysis did not affect point estimates in the used subset of IST3 data: BMMLR and BMLR models resulted in similar point estimates of \(\varvec{\delta }\) and \(\delta (\varvec{w})\), as expected from the negligible bias in the results of the simulation study. The posterior probabilities of the BMMLR and the BMLR model were similar and did not lead to different superiority or inferiority conclusions. A model comparison based on Bayes factors resulted in evidence in favor of a singlelevel model. It would be helpful to have information about clustering beforehand and we concluded that these results call for a proper method to quantify the degree of dependence among observations within clusters prior to the analysis. Such insights could help in clarifying the statistical urgency of a multilevel model and the appropriateness of a singlelevel model in advance.
Second, average treatment effects indicated an increased probability of recurrent events and a slightly decreased probability of longterm independent living after receiving the experimental treatment. However, different decision rules led to different conclusions. When the individual treatment effects had to be better on both dependent variables (All rule) or were weighted (Compensatory rule), no superiority or inferiority could be concluded. When any of the dependent variables had to demonstrate a relevant treatment difference (Any rule), both inferiority on recurrent events and superiority on longterm independent living could be concluded. This demonstrated a general potential problem with the Any rule: Contrasting decisions can result from the same analysis. Recall that the Any rule treats all outcome variables as equally important, raising the question which conclusion to favor for patients in the MidHigh range or with a High value of severity. This problem does not occur with the other rules: The All and Compensatory rules are unambiguous in their conclusions.
Third, conditional (average) treatment effects suggested a trend in heterogeneity on the individual dependent variables that was not reflected by the average treatment effect. These trends were partially supported by superiority and/or inferiority decisions, depending on the specified decision rule. Even without clear conclusions, conditional treatment effect sizes provided detailed insights: Considering average treatment effects only would have overlooked these trends. Further, the BMB model in the High range demonstrated that subgroup analysis can be a suboptimal approach to estimate conditional average treatment effects, as it can suffer from power loss. The High range subgroup is a relatively small fraction of the total sample size and performing an independent analysis on this group reduces the amount of evidence. This is reflected in the comparison to the BMMLR and BMLR methods: BMB has less extreme posterior probabilities, while treatment effect estimates are similar.
Discussion
The current paper presented the BMMLR framework as a multilevel extension to the Bayesian multivariate logistic regression (BMLR) analysis framework. The BMMLR framework consisted of three elements:

1
a Bayesian multilevel multivariate logistic regression model;

2
a transformation procedure to interpret results on the (multivariate) probability scale;

3
a statistical decision procedure to draw superiority and inferiority conclusions with targeted frequentist Type I errors
The presented framework accurately handled the multilevel structure of the data in the presence of heterogeneous treatment effects on multiple (correlated) binary dependent variables. A simulation study demonstrated that the proposed model indeed a) estimated average and conditional treatment effects in multilevel data without bias; and b) resulted in statistical decisions with targeted Type I error rates. A multilevel model was clearly superior for clustered data: Naive models that did not take the multilevel structure into account resulted in inflated Type Ierror rates. Further, the logistic model promoted informationsharing between clusters and subpopulations, being a more powerful alternative than subgroup analysis to identify heterogeneous treatment effects. A reanalysis of the IST3 provided another perspective on the data than the original paper [6]. Detailed insights as well as the varying treatment effects among subpopulations demonstrated the importance of a) a wellconsidered and specific decision rule; and b) the assessment of treatment heterogeneity. The statistical need for a multilevel model has not clearly become evident for this specific analysis. The results suggested that a substantive cluster structure in the data does not necessarily imply a relevant statistical dependency structure between observations. We demonstrated that an implied dependency structure can be tested using empirical Bayes factors [64]. If these Bayes factors provide evidence that none of the random effects varies, a singlelevel model gives a more parsimonious description of the data. In case of evidence for the presence of random effects due to the multilevel structure in the data, the proposed multilevel multivariate model is preferred as it gives more accurate type I errors. If there is evidence that some of the random effects do not vary across clusters, it is recommended to fix these parameters to give a more parsimonious description of the data.
Application of the BMMLR framework is not limited to the presented analyses. Theoretically, the model can be adapted to the longitudinal setting, may be used to borrow strength from different trials, or may be extended to data with multiple levels of clustering for example. In practice, such extensions require additional exploration of the (computational) properties of the model, since MCMC sampling procedures appeared sensitive to the amount of autocorrelation and the number of parameters. In a related fashion, carefully choosing which random effects to include is helpful for smooth execution of multilevel analysis. The model has a large number of options regarding specification of the model, giving a lot of flexibility to model cluster effects precisely. This flexibility reduces parsimony however, as it easily increases the number of model parameters. While it is technically possible to expand the model, some care must be taken when adding many outcome variables and many covariates however. This would result in many more model parameters, which results in considerably less parsimonious description of the data and can intensify computations notably. Similarly, the multinomial setup is most suitable for a limited number of dependent variables. Increasing the number of dependent variables results in a large number of response categories, which may lead to sparsity issues.
Future research might advance the design of the BMMLR framework in multiple ways. First, a priori sample size computation and power analysis have priority in medical research. Sample sizes in logistic regression should not be too small and preferably take the success probability into account [67, 68]. In line with our findings, larger numbers of clusters generally appear to be more powerful than larger numbers of subjects within clusters [69], although a study into sample sizes for multilevel logistic regression analysis provided less clear results [70]. Expanding and refining knowledge regarding sample sizes in multilevel models aids in strategic experimental design [71,72,73] Additionally, ethical aspects, such as risks and burden of (potentially inferior) treatment, and practical considerations, such as limited access to (large numbers of) subjects, require more indepth understanding of power and sample sizes. Especially in precision medicine  where treatments are targeted at specific patient populations  numbers of eligible subjects are limited and a priori power analysis helps to manage expectations in terms of duration.
Second, the methodology can be placed into a broader framework of Bayesian statistics. The framework can be extended with the computation of Bayes factors to aid in decisionmaking regarding superiority and inferiority as well, for example following the ideas presented in Van Ravenzwaaij et al. [74]. Further, the specification of prior distributions requires consideration. Specification of noninformative priors may not be trivial. The general tendency to choose relatively large variance parameters for normally distributed prior distributions [50], does not necessarily work well with the proposed model. Covering a range far beyond realistic parameter values, can (negatively) affect the efficiency of the sampling procedure and even the resulting posterior distribution. Thus, concrete guidelines for the specification of noninformative priors would be helpful.
Third, pooling of treatment estimates can be done in several other ways than presented. In general, the pooled treatment effect over clusters is a weighted combination of clusterspecific estimates, where the weights aim to balance aspects that influence estimation and are imbalanced over clusters (e.g., cluster size or variance). Whereas we applied a cluster sizebased approach, several advanced weighing procedures balance unequal variances within clusters via regularization methods (for overviews, see [75,76,77]). These weighing methods generally produce shrinkage to the mean a) when group level variance is smaller; and/or b) when sample sizes are smaller ([27],Ā p. 269). Such weighing procedures have interesting balancing properties but are probably less suitable for trials with clusters of single subjects, such as IST3. These clusters have no variance, should not be discarded or merged inconsiderately, and call for the exploration of suitable weighing procedures for such data.
Finally, the BMMLR framework and multilevel models for discrete data in general lack a standard way to quantify the degree of clustering and the corresponding need for a multilevel model. Often, the degree of clustering is quantified as the variance between clusters relative to the variance within clusters, expressed via an intraclass correlation coefficient (ICC). The computation of ICCs in binary data is not straightforward: The variance within clusters  and therefore the ICC  is a function of the predictors in the model and the ICC depends on the prevalence, requiring an alternative approximation to obtain an appropriate estimate of the ICC [78,79,80,81]. We leave the extension of our framework in this direction for future research.
Conclusion
The presented Bayesian method aimed to capture a multilevel structure and treatment heterogeneity simultaneously in data with multiple correlated binary outcome variables and observed covariates. The framework was built upon three major components: a multivariate logistic regression analysis, a subsequent transformation of regression coefficients to the multivariate probability scale, and a procedure to make decisions regarding treatment superiority or inferiority. When the sample is sufficiently large, treatment effects can be estimated unbiasedly and decisions regarding average and conditional treatment effects can be made with targeted error rates and a priori estimated sample sizes. The method is useful in prediction of treatment effects and decisionmaking within subpopulations from multiple clusters, while taking advantage of the size of the entire study sample and while properly incorporating the uncertainty in a principled probabilistic manner using the full posterior distribution.
Availability of data and materials
The Third International Stroke Trial data that support the findings of this study are available with the identifiers [https://doi.org/10.1016/S01406736(16)304147] and [https://doi.org/10.7488/ds/1350]. The R code used to generate results in Sections āNumerical evaluationā and āIllustration with IST3 dataā can be found on GitHub https://github.com/XynthiaKavelaars/Bayesianmultilevelmultivariatelogisticregression.
Abbreviations
 ATE:

Average Treatment Effect
 BMB:

Bayesian multivariate Bernoulli
 BMLR:

Bayesian multivariate logistic regression
 BMMLR:

Bayesian multilevel multivariate logistic regression
 CATE:

Conditional Average Treatment Effect
 FDA:

Food and Drug Administration
 IST:

Internation Stroke Trial
 IST3:

Third International Stroke Trial
 NIHSS:

National Institutes of Health Stroke Score
 RCT:

Randomized Controlled Trial
References
Schimmel WC, Verhaak E, Hanssens PE, Gehring K, Sitskoorn MM. A randomised trial to compare cognitive outcome after gamma knife radiosurgery versus whole brain radiation therapy in patients with multiple brain metastases: research protocol CARstudy B. BMC Cancer. 2018;18(1):218.
Schimmel WCM, Verhaak E, Hanssens PEJ, Kavelaars XM, Mulder J, Kaptein MC, etĀ al. P01.06.B Interim results from CARStudy B: An ongoing randomized trial on the effect of SRS or WBRT on cognitive performance in patients with 1120 brain metastases. NeuroOncol. 2022;24(Supplement_2):ii24.
Biswas S, Liu DD, Lee JJ, Berry DA. Bayesian clinical trials at the University of Texas MD Anderson cancer center. Clin Trials. 2009;6(3):205ā16.
International Stroke Trial Collaborative Group. The International Stroke Trial (IST): a randomised trial of aspirin, subcutaneous heparin, both, or neither among 19,435 patients with acute ischaemic stroke. Lancet. 1997;349(9065):1569ā81.
Sandercock PA, Niewada M, CzÅonkowska A. The International Stroke Trial database. Trials. 2011;12(1):101.
The International Stroke Trial3 Collaborative Group. The benefits and harms of intravenous thrombolysis with recombinant tissue plasminogen activator within 6 h of acute ischaemic stroke (the third international stroke trial [IST3]): a randomised controlled trial. Lancet. 2012;379(9834):2352ā63.
Sandercock P, Wardlaw J, Lindley R, Cohen G, Whiteley W. The third International Stroke Trial (IST3), 20002015 [dataset]. University of Edinburgh & Edinburgh Clinical Trials Unit; 2016. Available from: https://doi.org/10.7488/ds/1350.
Lindley RI, Wardlaw JM, Whiteley WN, Cohen G, Blackwell L, Murray GD, et al. Alteplase for Acute Ischemic Stroke. Stroke. 2015;46(3):746ā56.
Food and Drug Administration. NonInferiority Clinical Trials to Establish Effectiveness: Guidance for Industry. Rockville: Food and Drug Administration; 2016 [updated 2016 November]. Available at: https://www.fda.gov/regulatoryinformation/searchfdaguidancedocuments/noninferiorityclinicaltrials. Accessed 7 Oct 2022.
Evans D. Hierarchy of evidence: a framework for ranking evidence evaluating healthcare interventions. J Clin Nurs. 2003;12(1):77ā84.
Ng PC, Murray SS, Levy S, Venter JC. An agenda for personalized medicine. Nature. 2009;461(7265):724ā6.
Grol R, Grimshaw J. From best evidence to best practice: effective implementation of change in patientsā care. Lancet. 2003;362(9391):1225ā30.
Simon R. Clinical trials for predictive medicine: new challenges and paradigms. Clin Trials. 2010;7(5):516ā24.
Murray TA, Thall PF, Yuan Y. Utilitybased designs for randomized comparative trials with categorical outcomes. Stat Med. 2016;35(24):4285ā305.
Food and Drug Administration. Multiple Endpoints in Clinical Trials Guidance for Industry [Draft version]. Rockville: Food and Drug Administration; 2017. [updated January 2017]. Available at: https://www.fda.gov/files/drugs/published/MultipleEndpointsinClinicalTrialsGuidanceforIndustry.pdf. Accessed 19 Sept 2022.
Su TL, Glimm E, Whitehead J, Branson M. An evaluation of methods for testing hypotheses relating to two endpoints in a single clinical trial. Pharm Stat. 2012;11(2):107ā17.
Sozu T, Sugimoto T, Hamasaki T. Reducing unnecessary measurements in clinical trials with multiple primary endpoints. J Biopharm Stat. 2016;26(4):631ā43.
Kavelaars X, Mulder J, Kaptein M. Decisionmaking with multiple correlated binary outcomes in clinical trials. Stat Methods Med Res. 2020;29(11):3265ā77.
Hamburg MA, Collins FS. The Path to Personalized Medicine. N Engl J Med. 2010;363(4):301ā4.
Mirnezami R, Nicholson J, Darzi A. Preparing for Precision Medicine. N Engl J Med. 2012;366(6):489ā91.
Schork NJ. Personalized medicine: Time for oneperson trials. Nature. 2015;520(7549):609ā11.
Thall PF. Bayesian cancer clinical trial designs with subgroupspecific decisions. Contemp Clin Trials. 2020;90:105860.
Kaptein M. The use of Thompson sampling to increase estimation precision. Behav Res Methods. 2014;47(2):409ā23.
Kaptein M, Markopoulos P, de Ruyter B, Aarts E. Personalizing persuasive technologies: Explicit and implicit personalization using persuasion profiles. Int J HumComput Stud. 2015;77:38ā51.
Kavelaars XM, Mulder J, Kaptein MC. Bayesian multivariate logistic regression for superiority and inferiority decisionmaking under treatment heterogeneity. arXiv [Preprint]. 2022. p. 41.Ā https://doi.org/10.48550/arXiv.2206.04133. Accessed 12 Oct 2022.
Viele K, Berry S, Neuenschwander B, Amzal B, Chen F, Enas N, et al. Use of historical control data for assessing treatment effects in clinical trials. Pharm Stat. 2014;13(1):41ā54.
Gelman A, Hill J. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press; 2007.
Hox J, Moerbeek M, VanĀ de Schoot R. Multilevel Analysis. Taylor & Francis Ltd; 2017.
Raudenbush SW, Bryk AS. Hierarchical Linear Models: Applications and Data Analysis Methods. SAGE PUBN; 2001.
McGlothlin AE, Viele K. Bayesian Hierarchical Models. JAMA. 2018;320(22):2365.
Dai B, Ding S, Wahba G, et al. Multivariate bernoulli distribution. Bernoulli. 2013;19(4):1465ā83.
Chib S. Marginal Likelihood from the Gibbs Output. J Am Stat Assoc. 1995;90(432):1313ā21.
Malik HJ, Abraham B. Multivariate Logistic Distributions. Ann Stat. 1973;1(3):588ā90.
OBrien SM, Dunson DB. Bayesian Multivariate Logistic Regression. Biometrics. 2004;60(3):739ā46.
Braeken J, Tuerlinckx F, De Boeck P. Copula Functions for Residual Dependency. Psychometrika. 2007;72(3):393.
Nikoloulopoulos AK, Karlis D. Multivariate logit copula model with an application to dental data. Stat Med. 2008;27(30):6393ā406.
Panagiotelis A, Czado C, Joe H. Pair copula constructions for multivariate discrete data. J Am Stat Assoc. 2012;107(499):1063ā72.
Polson NG, Scott JG, Windle J. Bayesian inference for logistic models using PĆ³lyaGamma latent variables. J Am Stat Assoc. 2013;108(504):1339ā49.
Rossi PE, Allenby GM, McCulloch R. Bayesian Statistics and Marketing. Wiley; 2005.
Chib S, Chen Y. MCMC Methods for Fitting and Comparing Multinomial Response Models. 1998. Working Paper, Olin School of Business, Washington University. Available at: http://apps.olin.wustl.edu/faculty/chib/papers/chibGreenbergChenMCMCMN.pdf. Accessed 10 Aug 2023.
Forster JJ. Markov chain Monte Carlo exact inference for binomial and multinomial logistic regression models. Stat Comput. 2003;13(2):169ā77.
Betancourt M. A Conceptual Introduction to Hamiltonian Monte Carlo. arXiv [Preprint]. 2017. p. 60.Ā https://doi.org/10.48550/arXiv.1701.02434. Accessed 10 Aug 2023.
Thomas S, Tu W. Learning Hamiltonian Monte Carlo in R. Am Stat. 2021;75(4):403ā13.
Betancourt MJ, Girolami M. Hamiltonian Monte Carlo for Hierarchical Models. arXiv [Preprint]. 2013. p. 11.Ā https://doi.org/10.48550/arXiv.1312.0906. Accessed 10 Aug 2023.
Sozu T, Sugimoto T, Hamasaki T. Sample size determination in clinical trials with multiple coprimary binary endpoints. Stat Med. 2010;29:2169ā79.
OāBrien PC. Procedures for comparing samples with multiple endpoints. Biometrics. 1984;40(4):1079ā87.
Whitehead J, Branson M, Todd S. A combined score test for binary and ordinal endpoints from clinical trials. Stat Med. 2010;29(5):521ā32.
Marsman M, Wagenmakers EJ. Three Insights from a Bayesian Interpretation of the OneSided P Value. Educ Psychol Meas. 2016;77(3):529ā39.
Gelman A, Carlin JB, Stern HS, Rubin DB, Dunson DB. Bayesian Data Analysis. Taylor & Francis Ltd.; 2013.
Gelman A, Jakulin A, Pittau MG, Su YS. A weakly informative default prior distribution for logistic and other regression models. Ann Appl Stat. 2008;2(4):1360ā83.
R Core Team. R: A Language and Environment for Statistical Computing. Vienna; 2020.
Makalic E, Schmidt D. HighDimensional Bayesian Regularised Regression with the BayesReg Package. arXiv [Preprint]. 2016. p. 18. [accessed 7 Oct 2022]
Venables WN, Ripley BD. Modern Applied Statistics with S. 4th ed. New York: Springer; 2002.
Jackson CH. MultiState Models for Panel Data: The msm Package for R. J Stat Softw. 2011;38(8):1ā29.
Martin AD, Quinn KM, Park JH. MCMCpack: Markov Chain Monte Carlo in R. J Stat Softw. 2011;42(9):22.
Plummer M, Best N, Cowles K, Vines K. CODA: Convergence Diagnosis and Output Analysis for MCMC. R News. 2006;6(1):7ā11.
Flegal JM, Hughes J, Vats D, Dai N, Gupta K, Maji U. mcmcse: Monte Carlo Standard Errors for MCMC. Riverside, and Kanpur; 2021. R package version 1.50.
Microsoft, Weston S. foreach: Provides Foreach Looping Construct. 2020. R package version 1.5.1.
Microsoft, Weston S. doParallel: Foreach Parallel Adaptor for the āparallelā Package. 2020. R package version 1.0.16.
Dahl DB, Scott D, Roosen C, Magnusson A, Swinton J. xtable: Export Tables to LaTeX or HTML. 2019. R package version 1.8ā4.
Jeffreys H. Theory of Probability. Oxford: Clarendon Press; 1961.
Mulder J, Williams DR, Gu X, Tomarken A, BĆ¶ingMessing F, OlssonCollentine A, et al. BFpack: Flexible Bayes Factor Testing of Scientific Theories in R. J Stat Softw. 2021;100(18):1ā63.
Mulder J, Fox JP. Bayes Factor Testing of Multiple Intraclass Correlations. Bayesian Anal. 2019;14(2):521ā52.
VieiraGeneroso F, Leenders R, McFarland D, Mulder J. A Bayesian actororiented multilevel relational event model with hypothesis testing procedures. arXiv [Preprint]. 2023. p. 39.Ā https://doi.org/10.48550/arXiv.2204.10676. Accessed 8 June 2023.
Wickham H, Miller E. haven: Import and Export āSPSSā, āStataā and āSASā Files. 2021. R package version 2.4.3.
Kass RE, Raftery AE. Bayes Factors. J Am Stat Assoc. 1995;90(430):773ā95.
Jong VMT, Eijkemans MJC, Calster B, Timmerman D, Moons KGM, Steyerberg EW, et al. Sample size considerations and predictive performance of multinomial logistic prediction models. Stat Med. 2019;38(9):1601ā19.
Nemes S, Jonasson JM, Genell A, Steineck G. Bias in odds ratios by logistic regression modelling and sample size. BMC Med Res Methodol. 2009;9(1):56.
Snijders TAB. Power and Sample Size in Multilevel Linear Models. Ltd: John Wiley & Sons; 2005.
Moineddin R, Matheson FI, Glazier RH. A simulation study of sample size for multilevel logistic regression models. BMC Med Res Methodol. 2007;7(1):34.
Raudenbush SW, Liu X. Statistical power and optimal design for multisite randomized trials. Psychol Methods. 2000;5(2):199ā213.
Moerbeek M, van Breukelen GJP, Berger MPF. Design Issues for Experiments in Multilevel Populations. J Educ Behav Stat. 2000;25(3):271.
Moerbeek M, Breukelen GJPV, Berger MPF. Optimal Experimental Designs for Multilevel Logistic Models. J R Stat Soc Ser D Stat. 2001;50(1):17ā30.
Van Ravenzwaaij D, Monden R, Tendeiro JN, Ioannidis JPA. Bayes factors for superiority, noninferiority, and equivalence designs. BMC Med Res Methodol. 2019;19(1):71.
Lin Z. An issue of statistical analysis in controlled multicentre studies: how shall we weight the centres? Stat Med. 1999;18(4):365ā73.
Jones B, Teather D, Wang J, Lewis JA. A comparison of various estimators of a treatment difference for a multicentre clinical trial. Stat Med. 1998;17(15ā16):1767ā77.
Gallo PP. Centerweighting issues in multicenter clinical trials. J Biopharm Stat. 2000;10(2):145ā63.
Ridout MS, Demetrio CGB, Firth D. Estimating Intraclass Correlation for Binary Data. Biometrics. 1999;55(1):137ā48.
Gulliford MC, Adams G, Ukoumunne OC, Latinovic R, Chinn S, Campbell MJ. Intraclass correlation coefficient and outcome prevalence are associated in clustered binary data. J Clin Epidemiol. 2005;58(3):246ā51.
Paul S, Saha K, Balasooriya U. An empirical investigation of different operating characteristics of several estimators of the intraclass correlation in the analysis of binary data. J Stat Comput Simul. 2003;73(7):507ā23.
Goldstein H, Browne W, Rasbash J. Partitioning Variation in Multilevel Models. Underst Stat. 2002;1(4):223ā31.
Acknowledgements
We thank Peter Sandercock on behalf of The International Stroke Trial3 Collaborative Group for making the data from the Third International Stroke Trial publicly available. We gratefully acknowledge The IST3 Collaborative Group, the trial joint sponsors (The University of Edinburgh and the Lothian Health Board), and the chief funding agencies of the study: UK Medical Research Council, Health Foundation UK, Stroke Association UK, Research Council of Norway, Arbetsmarknadens Partners Forsakringsbolag (AFA) Insurances Sweden, Swedish Heart Lung Fund, The Foundation of Marianne and Marcus Wallenberg, Polish Ministry of Science and Education, the Australian Heart Foundation, Australian National Health and Medical Research Council (NHMRC), Swiss National ResearchFoundation, Swiss Heart Foundation, Assessorato alla Sanita, Regione dellāUmbria, Italy, and Danube University. We thank four reviewers for their helpful comments on earlier drafts of the manuscript.
Funding
The current work was supported by the Dutch Research Council (NWO) [no. 406.18.505].
Author information
Authors and Affiliations
Contributions
XK performed the analyses and drafted the manuscript. JM and MK verified analytical methods, supported the drafting of the manuscript and supervised the project. All authors critically read and approved the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisherās Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Kavelaars, X., Mulder, J. & Kaptein, M. Bayesian multilevel multivariate logistic regression for superiority decisionmaking under observable treatment heterogeneity. BMC Med Res Methodol 23, 220 (2023). https://doi.org/10.1186/s1287402302034z
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1287402302034z