### Bayesian model for network meta-analysis

We assume that an NMA contains *N* studies; each study compares a subset of a total of *K* treatments. The treatments compared in study *i* are denoted by the set \({\mathcal{T}}_i\) (*i* = 1, …, *N*). Let *y*_{ik} be the outcome measure in study *i*’s treatment group *k* (*k* ∈ \({\mathcal{T}}_i\)) and \({\mathcal{D}}_{ik}\) be the known, observed data in this study’s treatment group.

This article focuses on the case of binary outcomes, so *y*_{ik} is the event count, which is assumed to follow the binomial distribution, and \({\mathcal{D}}_{ik}\) represents the sample size *n*_{ik}; the methods below can nevertheless be generalized to other types of outcomes. In addition, let *b*_{i} be the baseline treatment for study *i*; it can be any treatment in \({\mathcal{T}}_i\) and can differ across studies. For simplicity, we denote it by *b* when this does not lead to confusion. The Bayesian hierarchical model for the NMA is [2, 37]:

(likelihood)

$${y}_{ik}\sim Bin\left({p}_{ik},{n}_{ik}\right),i=1,\dots, N,k\in {\mathcal{T}}_i;$$

(link function)

$$g\left({p}_{ik}\right)={\mu}_i+{\delta}_{ibk}I\left(k\ne b\right);$$

(random effects)

$${\delta}_{ibk}\sim N\left({d}_{bk},{\tau}_{bk}^2\right);$$

(multi-arm studies)

$$\mathrm{Corr}\left({\delta}_{ibk},{\delta}_{ibh}\right)={\gamma}_{bhk};$$

(priors)

$${\mu}_i,{d}_{bk},{\tau}_{bk},\mathrm{and}\ {\gamma}_{bhk}\sim \mathrm{priors}.$$

We use the canonical logit link function for binomial data, i.e., *g*(*t*) = log[*t*/(1 − *t*)], which transforms the underlying true event rate *p*_{ik} to a linear scale. The indicator function *I*(·) returns 0 if *k* = *b* and 1 if *k* ≠ *b*. Consequently, *δ*_{ibk} represents the underlying true log odds ratio (OR) of treatment *k* vs. *b* in study *i*, and *μ*_{i} represents the baseline effect of study *i*. Because baseline effects can differ greatly across studies, *μ*_{i} is commonly treated as a nuisance parameter. To account for potential heterogeneity, *δ*_{ibk} is modeled as a random effect, and \({\tau}_{bk}^2\) represents the between-study variance for the comparison *k* vs. *b*. The between-study variances are typically assumed equal for all comparisons (i.e., \({\tau}_{bk}^2={\tau}^2\)) [1, 2]; this assumption greatly reduces the model complexity. In cases where these variances differ dramatically, more general model specifications may be used instead [37]. The parameters *d*_{bk} are the overall log ORs and are of primary interest in NMAs. Within multi-arm studies (if any), *γ*_{bhk} denotes the correlation coefficient between the comparisons *k* vs. *b* and *h* vs. *b*. It is commonly set to 0.5, which follows from the assumption of equal between-study variances [2, 38].
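As a concrete illustration, the generative model above can be sketched in Python. This is a hypothetical simulation for intuition only, not the authors' JAGS implementation; the shared-component trick induces the 0.5 correlation between comparisons implied by the equal-variance assumption.

```python
import math
import random

def inv_logit(x):
    """Inverse of the logit link g(t) = log[t/(1 - t)]."""
    return 1.0 / (1.0 + math.exp(-x))

def simulate_study(mu_i, d, tau, n, arms, rng):
    """Draw binomial arm counts for one study under the NMA model.

    mu_i : study-specific baseline effect (nuisance parameter)
    d    : dict of overall log ORs d_1k vs. treatment 1 (d[1] = 0)
    tau  : common between-study SD (equal-variance assumption)
    arms : treatments in the study; arms[0] is the baseline b

    The 0.5 correlation between comparisons in a multi-arm study is
    induced by a shared component u with Var(u) = tau^2 / 2.
    """
    b = arms[0]
    u = rng.gauss(0.0, tau / math.sqrt(2.0))  # shared across comparisons
    counts = {}
    for k in arms:
        if k == b:
            delta = 0.0                       # delta_{ibb} = 0 by definition
        else:
            eps = rng.gauss(0.0, tau / math.sqrt(2.0))
            # mean d_bk = d_1k - d_1b by consistency; marginal N(d_bk, tau^2)
            delta = (d[k] - d[b]) + u + eps
        p = inv_logit(mu_i + delta)
        counts[k] = sum(rng.random() < p for _ in range(n))  # Binomial(n, p)
    return counts
```

Because Corr(u + ε_k, u + ε_h) = (τ²/2)/τ² = 0.5, each pair of comparisons in a multi-arm study has exactly the correlation γ_{bhk} = 0.5 described above.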

The posterior distributions of the parameters of interest can be obtained via the Markov chain Monte Carlo (MCMC) algorithm. Our analyses use the vague priors *N*(0, 100^{2}) for the study-specific baseline effects *μ*_{i} and for the log ORs of all treatments vs. the reference treatment, say treatment 1 (i.e., *d*_{1k}). The reference treatment may differ from the study-specific baseline treatments; it is usually a standard control (e.g., placebo) [39]. The overall log ORs of the other comparisons are obtained under the evidence consistency assumption, i.e., *d*_{hk} = *d*_{1k} − *d*_{1h} [2]. This article makes the consistency assumption, although it should be routinely checked in NMAs [40,41,42].

### Frequentist and Bayesian P-scores

We tentatively assume that the outcome is beneficial (e.g., successful smoking cessation); if the outcome is harmful (e.g., mortality), we may simply invert the direction of treatment comparisons in what follows. Under the frequentist setting, the P-score is built on the quantities

$${P}_{kh}=\Phi \left(\left({\hat{d}}_{1k}-{\hat{d}}_{1h}\right)/{s}_{kh}\right),$$

(1)

where \({\hat{d}}_{1k}\) and \({\hat{d}}_{1h}\) are the point estimates of treatment effects for *k* vs. 1 and *h* vs. 1, respectively, and *s*_{kh} is the standard error of \({\hat{d}}_{1k}-{\hat{d}}_{1h}\). Moreover, Φ(·) is the cumulative distribution function of the standard normal distribution. The quantity *P*_{kh} can be interpreted as the extent of certainty that treatment *k* is better than *h* [14]. The *frequentist P-score* of treatment *k* is calculated as \(\frac{1}{K-1}{\sum}_{h\ne k}{P}_{kh}\).
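Eq. (1) can be computed directly from the frequentist point estimates and standard errors. The following standard-library sketch mirrors the definition; the inputs are hypothetical, not taken from either example.

```python
import math

def phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def frequentist_p_scores(d_hat, se):
    """Frequentist P-scores from point estimates d_hat[k] (log OR of
    treatment k vs. the reference, with d_hat[0] = 0) and standard
    errors se[k][h] of d_hat[k] - d_hat[h]."""
    K = len(d_hat)
    return [
        sum(phi((d_hat[k] - d_hat[h]) / se[k][h]) for h in range(K) if h != k)
        / (K - 1)
        for k in range(K)
    ]
```

Because *P*_{kh} + *P*_{hk} = 1, the P-scores of all *K* treatments always average to 0.5, which is a handy sanity check on any implementation.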

Analogously to the frequentist P-score, conditional on *d*_{1h} and *d*_{1k}, the quantities *P*_{kh} from the Bayesian perspective can be considered as *I*(*d*_{1k} > *d*_{1h}), which are Bernoulli random variables. To quantify the overall performance of treatment *k*, we may similarly use

$${\overline{P}}_k=\frac{1}{K-1}{\sum}_{h\ne k}I\left({d}_{1k}>{d}_{1h}\right).$$

(2)

Note that \({\overline{P}}_k\) is a parameter under the Bayesian framework, while the frequentist P-score is a statistic. Moreover, ∑_{h ≠ k}*I*(*d*_{1k} > *d*_{1h}) is equivalent to *K* − *R*_{k}, where *R*_{k} is the true rank of treatment *k*. Thus, we may also write \({\overline{P}}_k=\left(K-{R}_k\right)/\left(K-1\right)\); this corresponds to the findings by Rücker and Schwarzer [14]. Consequently, we call \({\overline{P}}_k\) the *scaled rank in the NMA* for treatment *k*. It transforms the range of the original rank between 1 and *K* to a range between 0 and 1. In addition, note that *E*[*I*(*d*_{1k} > *d*_{1h}| Data)] = Pr(*d*_{1k} > *d*_{1h}| Data), which is analogous to the quantity in Eq. (1) under the frequentist framework. Therefore, we use the posterior mean of the scaled rank \({\overline{P}}_k\) as the *Bayesian P-score*; it is a counterpart of the frequentist P-score.

The scaled ranks \({\overline{P}}_k\) can be feasibly estimated via the MCMC algorithm. Let \({\left\{{d}_{1k}^{(j)};k=2,\dots, K\right\}}_{j=1}^J\) be the posterior samples of the overall relative effects *d*_{1k} of all treatments vs. the reference treatment 1 in a total of *J* MCMC iterations after the burn-in period, where *j* indexes the iterations. As *d*_{11} is trivially 0, we set \({d}_{11}^{(j)}\) to 0 for all *j*. The *j*th posterior sample of treatment *k*’s scaled rank is \({\overline{P}}_k^{(j)}=\frac{1}{K-1}{\sum}_{h\ne k}I\left({d}_{1k}^{(j)}>{d}_{1h}^{(j)}\right)\). We can make inferences for the scaled ranks from the posterior samples \({\left\{{\overline{P}}_k^{(j)}\right\}}_{j=1}^J\), and use their posterior means as the Bayesian P-scores. We may also obtain the posterior medians as another set of point estimates, and the 2.5% and 97.5% posterior quantiles as the lower and upper bounds of 95% credible intervals (CrIs), respectively. Because the posterior samples of the scaled ranks take discrete values, the posterior medians and the CrI bounds are also discrete.
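This computation reduces to counting pairwise wins within each MCMC draw. A minimal standard-library sketch, with a toy posterior-sample matrix in place of real MCMC output:

```python
def scaled_rank_samples(d_samples):
    """Per-iteration scaled ranks (K - R_k)/(K - 1).

    d_samples: list of J posterior draws, each a list of K values d_1k,
    with entry 0 fixed at 0 for the reference treatment."""
    K = len(d_samples[0])
    return [
        [sum(d[k] > d[h] for h in range(K) if h != k) / (K - 1)
         for k in range(K)]
        for d in d_samples
    ]

def bayesian_p_scores(d_samples):
    """Bayesian P-scores: posterior means of the scaled ranks."""
    ranks = scaled_rank_samples(d_samples)
    K = len(ranks[0])
    return [sum(r[k] for r in ranks) / len(ranks) for k in range(K)]
```

Posterior medians and 95% CrI bounds follow by sorting each treatment's scaled-rank samples and reading off the 50%, 2.5%, and 97.5% quantiles; as noted above, these take discrete values.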

### Predictive P-score

Based on the idea of the Bayesian P-score, we can similarly define the predictive P-score for a future study by accounting for the heterogeneity between the existing studies in the NMA and the new study. Specifically, we consider the probabilities in the new study

$${P}_{\mathrm{new},kh}=\Pr \left({\delta}_{\mathrm{new},1k}>{\delta}_{\mathrm{new},1h}\right),$$

(3)

conditional on the population parameters *d*_{1h}, *d*_{1k}, and *τ* from the NMA. Here, *δ*_{new, 1k} and *δ*_{new, 1h} represent the treatment effects of *k* vs. 1 and *h* vs. 1 in the new study, respectively. The *P*_{new, kh} corresponds to the quantity *P*_{kh} in the NMA; it represents the probability that treatment *k* is better than *h* in the new study. Due to heterogeneity, *δ*_{new, 1k} ∼ *N*(*d*_{1k}, *τ*^{2}) and *δ*_{new, 1h} ∼ *N*(*d*_{1h}, *τ*^{2}). Recall that the correlation coefficient between treatment comparisons is assumed to be 0.5, so the difference *δ*_{new, 1k} − *δ*_{new, 1h} follows *N*(*d*_{1k} − *d*_{1h}, *τ*^{2}); therefore, the probabilities in the new study can be explicitly calculated as *P*_{new, kh} = Φ((*d*_{1k} − *d*_{1h})/*τ*), a function of *d*_{1h}, *d*_{1k}, and *τ*. Finally, we use

$${\overline{P}}_{\mathrm{new},k}=\frac{1}{K-1}{\sum}_{h\ne k}{P}_{\mathrm{new},kh}$$

(4)

to quantify the performance of treatment *k* in the new study. The posterior samples of \({\overline{P}}_{\mathrm{new},k}\) can be derived from the posterior samples of *d*_{1k}, *d*_{1h}, and *τ* during the MCMC algorithm.
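In code, the predictive P-scores fall out of the same posterior draws via the closed form *P*_{new, kh} = Φ((*d*_{1k} − *d*_{1h})/*τ*). A sketch with hypothetical draws (not output from either example):

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def predictive_p_scores(d_samples, tau_samples):
    """Posterior means of the expected scaled ranks, from J joint MCMC
    draws of (d_11, ..., d_1K) and tau; d_samples[j][0] = 0 for the
    reference treatment."""
    K = len(d_samples[0])
    totals = [0.0] * K
    for d, tau in zip(d_samples, tau_samples):
        for k in range(K):
            # expected scaled rank of k in the new study, for this draw
            totals[k] += sum(
                phi((d[k] - d[h]) / tau) for h in range(K) if h != k
            ) / (K - 1)
    return [t / len(d_samples) for t in totals]
```

Keeping the per-draw values instead of only the running totals would likewise give posterior medians and 95% CrIs of \({\overline{P}}_{\mathrm{new},k}\).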

Note that the probabilities in Eq. (3) can be written as *E*[*I*(*δ*_{new, 1k} > *δ*_{new, 1h})]. Based on similar observations for the scaled ranks in the NMA, the \({\overline{P}}_{\mathrm{new},k}\) in the new study subsequently becomes

$${\overline{P}}_{\mathrm{new},k}=\frac{1}{K-1}E\left[{\sum}_{h\ne k}I\left({\delta}_{\mathrm{new},1k}>{\delta}_{\mathrm{new},1h}\right)\right]=E\left[\frac{K-{R}_{\mathrm{new},k}}{K-1}\right],$$

where *R*_{new, k} is the true rank of treatment *k* in the new study. Thus, we call \({\overline{P}}_{\mathrm{new},k}\) *the expected scaled rank in the new study*. Like the Bayesian P-score, we define the *predictive P-score* as the posterior mean of \({\overline{P}}_{\mathrm{new},k}\). The posterior medians and 95% CrIs can also be obtained using the MCMC samples of \({\overline{P}}_{\mathrm{new},k}\).

Of note, the predictive P-scores considered in this article are all derived under the Bayesian framework. Strictly speaking, they may be called Bayesian predictive P-scores, as contrasted with Bayesian P-scores. Nevertheless, for convenience, we call them predictive P-scores in short.

When *τ* decreases toward 0 (i.e., the fixed-effects setting where all studies share common treatment effects), *P*_{new, kh} converges to 0 if *d*_{1k} < *d*_{1h} or 1 if *d*_{1k} > *d*_{1h}; that is, *P*_{new, kh} becomes *I*(*d*_{1k} > *d*_{1h}). Therefore, the expected scaled rank in the new study \({\overline{P}}_{\mathrm{new},k}\) converges to the scaled rank in the NMA \({\overline{P}}_k\), and thus the predictive P-score converges to the Bayesian P-score. Conversely, when *τ* increases toward infinity, *P*_{new, kh} converges to 0.5 for all comparisons, so \({\overline{P}}_{\mathrm{new},k}\) (and thus the predictive P-score) converges to 0.5 for each treatment, representing a middle rank. This is consistent with the intuition that the NMA does not provide much information for the new study in the presence of large heterogeneity, as the treatment rankings in the new study are dominated by between-study variabilities.
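Both limits are easy to verify numerically from the closed form; the values below are hypothetical log ORs chosen only to illustrate the behavior.

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

d1k, d1h = 0.4, 0.1  # hypothetical log ORs with d1k > d1h

# tau -> 0 (fixed-effects limit): P_new,kh approaches I(d1k > d1h) = 1
p_small_tau = phi((d1k - d1h) / 1e-8)

# tau -> infinity (extreme heterogeneity): P_new,kh approaches 0.5
p_large_tau = phi((d1k - d1h) / 1e8)
```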

### Two examples

We give two examples of NMAs with binary outcomes to illustrate the different versions of the P-score. The first example is from Lu and Ades [40]; it was initially reported by Hasselblad [43] (without performing a formal NMA). It investigated the effects of four treatments on smoking cessation, including 1) no contact; 2) self-help; 3) individual counseling; and 4) group counseling. The outcome was the smoking cessation status of an individual after treatment. In the original NMA, the authors found that group counseling was most effective for smoking cessation, followed by individual counseling and self-help, and no contact was the least effective. The dataset contained a total of 16,737 subjects from 24 studies. A treatment was better if it yielded a higher rate of smoking cessation.

The second example was reported by Xu et al. [44]. It investigated the effects of seven immune checkpoint inhibitor (ICI) drugs on all-grade treatment-related adverse events (TrAEs), and aimed to provide a safety ranking of the ICI drugs for the treatment of cancer. The NMA was limited to phase II/III randomized controlled trials that compared two or three of the following treatments: 1) conventional therapy; 2) nivolumab; 3) pembrolizumab; 4) two ICIs; 5) ICI and conventional therapy; 6) atezolizumab; and 7) ipilimumab. The primary outcome was whether the patient had a TrAE. The authors found that there were clinically significant differences in safety between ICI drugs for patients with cancer. In general, atezolizumab was the safest drug, defined by the total number of severe or life-threatening adverse events, followed by nivolumab, pembrolizumab, ipilimumab, and tremelimumab; taking one ICI was found to be safer than taking two ICIs. The dataset contained a total of 126,621 subjects from 23 studies. Unlike the direction in the first example, a treatment was better if it yielded a lower rate of TrAEs.

In the following analyses, we will use the numerical labels above to refer to treatments. Appendix A in Additional file 1 gives the complete datasets of the two examples.

### Implementations

The Bayesian NMAs were implemented via the MCMC algorithm using JAGS (version 4.3.0) through the R (version 3.6.2) package “rjags” (version 4–10). We used the vague prior *U*(0, 5) for the heterogeneity standard deviation (SD). We obtained the posterior samples of the log ORs of all treatment comparisons, which were then used to derive the posterior distributions of the Bayesian P-scores and predictive P-scores.

In addition to the vague priors, secondary analyses were performed for each NMA using informative priors [32, 36]. Specifically, based on the recommendations from Turner et al. [32], we used the log-normal priors *LN*(−2.01, 1.64^{2}) and *LN*(−2.13, 1.58^{2}) for the heterogeneity variances in the smoking cessation and all-grade TrAEs data, respectively.

For each NMA, we used three Markov chains; each chain contained a 20,000-run burn-in period for achieving stabilization and convergence. The final posterior samples consisted of a run of 50,000 updates after the burn-in period with thinning rate 2. We examined the stabilization and convergence of MCMC using trace plots and the Gelman–Rubin convergence statistics \(\hat{R}\) of the log ORs and the heterogeneity SD [45]; \(\hat{R}\) values close to 1 indicate adequate convergence.
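The Gelman–Rubin diagnostic itself is simple to compute. Below is a minimal sketch of the (non-split) statistic for equal-length chains stored as plain lists; in practice one would rely on the diagnostic reported by "rjags"/"coda" rather than this illustration.

```python
import statistics

def gelman_rubin(chains):
    """Potential scale reduction factor R-hat from m chains of equal
    length n (post burn-in), following Gelman and Rubin (1992)."""
    n = len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    W = statistics.fmean(statistics.variance(c) for c in chains)  # within-chain
    B = n * statistics.variance(means)                            # between-chain
    # pooled posterior-variance estimate, then the scale reduction factor
    var_plus = (n - 1) / n * W + B / n
    return (var_plus / W) ** 0.5
```

When the chains mix well, the between-chain variance *B* is small relative to *W* and \(\hat{R}\) sits near 1; chains stuck in different regions inflate *B* and push \(\hat{R}\) well above 1.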

We used the posterior samples to form the posterior distributions and calculate the posterior means (i.e., Bayesian and predictive P-scores), posterior medians, and 95% CrIs for all treatments’ scaled ranks in the NMA and expected scaled ranks in a new study. Additionally, we calculated the frequentist P-scores using the R package “netmeta” (version 1.2.0). The code for all analyses is in Appendix B in Additional file 1.