 Research
 Open Access
 Published:
Predictive Pscore for treatment ranking in Bayesian network metaanalysis
BMC Medical Research Methodology volume 21, Article number: 213 (2021)
Abstract
Background
Network metaanalysis (NMA) is a widely used tool to compare multiple treatments by synthesizing different sources of evidence. Measures such as the surface under the cumulative ranking curve (SUCRA) and the Pscore are increasingly used to quantify treatment ranking. They provide summary scores of treatments among the existing studies in an NMA. Clinicians are frequently interested in applying such evidence from the NMA to decisionmaking in the future. This prediction process needs to account for the heterogeneity between the existing studies in the NMA and a future study.
Methods
This article introduces the predictive Pscore for informing treatment ranking in a future study via Bayesian models. Two NMAs were used to illustrate the proposed measure; the first assessed 4 treatment strategies for smoking cessation, and the second assessed treatments for allgrade treatmentrelated adverse events. For all treatments in both NMAs, we obtained their conventional frequentist Pscores, Bayesian Pscores, and predictive Pscores.
Results
In the two examples, the Bayesian Pscores were nearly identical to the corresponding frequentist Pscores for most treatments, while noticeable differences existed for some treatments, likely owing to the different assumptions made by the frequentist and Bayesian NMA models. Compared with the Pscores, the predictive Pscores generally had a trend to converge toward a common value of 0.5 due to the heterogeneity. The predictive Pscores’ numerical estimates and the associated plots of posterior distributions provided an intuitive way for clinicians to appraise treatments for new patients in a future study.
Conclusions
The proposed approach adapts the existing frequentist Pscore to the Bayesian framework. The predictive Pscore can help inform medical decisionmaking in future studies.
Background
Network metaanalysis (NMA) of randomized controlled trials is a statistical method widely used to draw conclusions about multiple treatment comparisons in evidencebased medicine [1,2,3,4,5,6,7]. It simultaneously synthesizes both direct and indirect evidence, where the direct evidence comes from headtohead trials and the indirect evidence comes from indirect comparisons with common treatment comparators. For example, the comparison between two active drugs A and B can be informed from the indirect comparisons of A vs. C and B vs. C, where C may be a placebo, or from the direct comparison in clinical trials comparing A vs. B. The synthesis of both direct and indirect evidence can provide more precise estimates of treatment effects (i.e., estimates with smaller variances and thus narrower confidence or credible intervals) than conventional metaanalyses that compare pairs of treatments separately [8,9,10,11].
One of the main purposes of NMAs is finding a succinct way to present summarized results among many competing treatment options and inform future clinical trial designs. Rank probability (i.e., the probability of a treatment having a certain rank r) and cumulative rank probability (i.e., the probability of a treatment being at least the r th best) are commonly used measures for treatment rankings in an NMA. A limitation to these ranking methods is that they do not yield a simple single number to summarize the rank of each treatment [12]. When many treatments are available for a certain disease, these may be hard to interpret and may not be useful measures to clinicians [13]. Some researchers might rely only on the probability of being the best treatment for decisionmaking, which could draw misleading conclusions. It may be more intuitive to summarize the multiple probabilities into a single number for each treatment to assess the overall performance of the treatments.
One such summarization method is the surface under the cumulative ranking curve (SUCRA) proposed by Salanti et al. [12]. The SUCRA is calculated by averaging cumulative rank probabilities; it essentially transforms the mean rank of a treatment to a value between 0 and 1 [14]. The SUCRA is advantageous over the mean rank because its common range from 0 to 1 facilitates consistent interpretations across different NMAs, while the mean rank depends on the number of treatments in a network. The higher value of the SUCRA indicates a better treatment; SUCRA = 0 or 1 indicates an always worst or best treatment, respectively.
A relevant concept is the Pscore, which was originally proposed under the frequentist framework by Rücker and Schwarzer [14]. The Pscore of a treatment is obtained by estimating the effect sizes of pairwise treatment comparisons and assuming their point estimates are normally distributed. Like the SUCRA, the Pscore ranges from 0 to 1, with 0 or 1 being the theoretically worst or best treatment, respectively. Although it is defined differently from the SUCRA, the Pscore has been shown to be identical to the SUCRA under the frequentist framework [14].
These approaches to treatment rankings are based on the existing studies in an NMA; they may not be directly used to inform treatment rankings in future studies due to potential heterogeneity. Heterogeneity is a critical factor in metaanalyses. It is usually quantified by the I^{2} statistic, which is interpreted as the percentage of variability due to betweenstudy heterogeneity rather than withinstudy sampling error [15]. Nevertheless, I^{2} has several limitations [16]. For example, researchers commonly report I^{2} without an interval estimate quantifying its uncertainty and may wrongly use it as an absolute measure [17]. Motivated by these limitations, the prediction interval is recommended for use in metaanalyses (including NMAs) [17,18,19,20,21]. While the conventional confidence or credible interval informs a treatment effect’s uncertainty based on the studies in a current NMA, the prediction interval gives a range of the true treatment effect in future studies. The prediction interval may be more straightforward than heterogeneity measures such as τ^{2} and I^{2} in appraising the impact of heterogeneity. It helps clinicians understand the full uncertainties in treatment effects in future studies.
Similar approaches are needed to predict treatment rankings in a future study setting [22,23,24]. Such attempts have been made under the frequentist framework and can be implemented using Stata commands [13, 25]. The predictive treatment ranking measures account for the heterogeneity between the future study and the studies in the existing NMA; they may provide important information for future clinical trial designs in the presence of many competing treatment options. For example, for ethical considerations, clinicians might want to compare a new drug with existing treatments that have relatively high predictive measures.
In the current literature, many NMAs are performed using Bayesian approaches alongside frequentist ones [26,27,28]. Bayesian approaches offer additional flexibility compared to frequentist approaches, e.g., by specifying informative priors and sophisticated variancecovariance structures within multiarm studies [29,30,31,32,33,34,35,36]. This article extends the Pscore to a future study setting under the Bayesian framework, with the focus on NMAs with binary outcomes.
Methods
Bayesian model for network metaanalysis
We assume that an NMA contains N studies; each study compares a subset of a total of K treatments. The treatments compared in study i are denoted by the set \({\mathcal{T}}_i\) (i = 1, …, N). Let y_{ik} be the outcome measure in study i’s treatment group k (k ∈ T_{i}) and \({\mathcal{D}}_{ik}\) be the known, observed data in this study’s treatment group.
This article focuses on the case of binary outcomes, so y_{ik} is the event count, which is assumed to follow the binomial distribution, and \({\mathcal{D}}_{ik}\) represents the sample size n_{ik}. Without loss of generality, the following materials can be generalized to other types of outcomes. In addition, let b_{i} be the baseline treatment for study i; this can be any treatment in \({\mathcal{T}}_i\), and can differ across studies. For simplicity, we denote it by b when it does not lead to confusion. The Bayesian hierarchical model for the NMA is [2, 37]:
(likelihood)
(link function)
(random effects)
(multiarm studies)
(priors)
We use the canonical logit link function for binomial data, i.e., g(t) = log[t/(1 − t)], which transforms the underlying true event rate p_{ik} to a linear form. The indicator function I (·) returns 0 if k = b and 1 if k ≠ b. Consequently, δ_{ibk} represents the underlying true log odds ratio (OR) of treatment k vs. b in study i. Moreover, μ_{i} represents the baseline effect of study i. Because baseline effects could differ greatly across studies, μ_{i} is commonly modeled as a nuisance parameter. To account for potential heterogeneity, δ_{ibk} is modeled as a random effect; \({\tau}_{bk}^2\) represents the betweenstudy variance for the comparison k vs. b. The betweenstudy variances are typically assumed equal for all comparisons (i.e., \({\tau}_{bk}^2={\tau}^2\)) [1, 2]. This assumption greatly reduces the model complexity. In cases that these variances dramatically differ, one may alter to use more general model specifications [37]. The overall log ORs are the parameters d_{bk} and are of primary interest in NMAs. Within multiarm studies (if any), γ_{bhk} denotes the correlation coefficient between comparisons k vs. b and h vs. b. It is commonly set to 0.5, which is a result of assuming equal betweenstudy variances [2, 38].
The posterior distributions of the parameters of interest can be obtained via the Markov chain Monte Carlo (MCMC) algorithm. Our analyses use the vague priors N(0, 100^{2}) for studyspecific baseline effects μ_{i} and the log ORs of all treatments vs. the reference treatment, say treatment 1 (i.e., d_{1k}). The reference treatment may be different from studyspecific baseline treatments; it is usually a standard control (e.g., placebo) [39]. The overall log ORs of other comparisons are obtained under the evidence consistency assumption, i.e., d_{hk} = d_{1k} − d_{1h} [2]. This article makes the consistency assumption, while it should be routinely checked in NMAs [40,41,42].
Frequentist and Bayesian Pscores
We tentatively assume that the outcome is beneficial (e.g., successful smoking cessation); if the outcome is harmful (e.g., mortality), we may simply invert the direction of treatment comparisons in the following materials. Under the frequentist setting, the Pscore is built on the quantities
where \({\hat{d}}_{1k}\) and \({\hat{d}}_{1h}\) are the point estimates of treatment effects for k vs. 1 and h vs. 1, respectively, and s_{kh} is the standard error of \({\hat{d}}_{1k}{\hat{d}}_{1h}\). Moreover, Φ(·) is the cumulative distribution function of the standard normal distribution. The quantity P_{kh} can be interpreted as the extent of certainty that treatment k is better than h [14]. The frequentist Pscore of treatment k is calculated as \(\frac{1}{K1}{\sum}_{h\ne k}{P}_{kh}\).
Analogously to the frequentist Pscore, conditional on d_{1h} and d_{1k}, the quantities P_{kh} from the Bayesian perspective can be considered as I(d_{1k} > d_{1h}), which are Bernoulli random variables. To quantify the overall performance of treatment k, we may similarly use
Note that \({\overline{P}}_k\) is a parameter under the Bayesian framework, while the frequentist Pscore is a statistic. Moreover, ∑_{h ≠ k}I(d_{1k} > d_{1h}) is equivalent to K − R_{k}, where R_{k} is the true rank of treatment k. Thus, we may also write \({\overline{P}}_k=\left(K{R}_k\right)/\left(K1\right)\); this corresponds to the findings by Rücker and Schwarzer [14]. Consequently, we call \({\overline{P}}_k\) the scaled rank in the NMA for treatment k. It transforms the range of the original rank between 1 and K to a range between 0 and 1. In addition, note that E[I(d_{1k} > d_{1h} Data)] = Pr(d_{1k} > d_{1h} Data), which is analogous to the quantity in Eq. (1) under the frequentist framework. Therefore, we use the posterior mean of the scaled rank \({\overline{P}}_k\) as the Bayesian Pscore; it is a counterpart of the frequentist Pscore.
The scaled ranks \({\overline{P}}_k\) can be feasibly estimated via the MCMC algorithm. Let \({\left\{{d}_{1k}^{(j)};k=2,\dots, K\right\}}_{j=1}^J\) be the posterior samples of the overall relative effects d_{1k} of all treatments vs. the reference treatment 1 in a total of J MCMC iterations after the burnin period, where j indexes the iterations. As d_{11} is trivially 0, we set \({d}_{11}^{(j)}\) to 0 for all j. The jth posterior sample of treatment k’s scaled rank is \({\overline{P}}_k^{(j)}=\frac{1}{K1}{\sum}_{h\ne k}I\left({d}_{1k}^{(j)}>{d}_{1h}^{(j)}\right)\). We can make inferences for the scaled ranks from the posterior samples \({\left\{{\overline{P}}_k^{(j)}\right\}}_{j=1}^J\), and use their posterior means as the Bayesian Pscores. We may also obtain the posterior medians as another set of point estimates, and the 2.5 and 97.5% posterior quantiles as the lower and upper bounds of 95% credible intervals (CrIs), respectively. Because the posterior samples of the scaled ranks take discrete values, the posterior medians and the CrI bounds are also discrete.
Predictive Pscore
Based on the idea of the Bayesian Pscore, we can similarly define the predictive Pscore for a future study by accounting for the heterogeneity between the existing studies in the NMA and the new study. Specifically, we consider the probabilities in the new study
conditional on the population parameters d_{1h}, d_{1k}, and τ from the NMA. Here, δ_{new, 1k} and δ_{new, 1h} represent the treatment effects of k vs. 1 and h vs. 1 in the new study, respectively. The P_{new, kh} corresponds to the quantity P_{kh} in the NMA; it represents the probability of treatment k being better than h in the new study. Due to heterogeneity, δ_{new, 1k} ∼ N(d_{1k}, τ^{2}) and δ_{new, 1h} ∼ N(d_{1h}, τ^{2}). Recall that the correlation coefficients between treatment comparisons are assumed to be 0.5; therefore, such probabilities in the new study can be explicitly calculated as P_{new, kh} = Φ((d_{1k} − d_{1h})/τ), which is a function of d_{1h}, d_{1k}, and τ. Finally, we use
to quantify the performance of treatment k in the new study. The posterior samples of \({\overline{P}}_{\mathrm{new},k}\) can be derived from the posterior samples of d_{1k}, d_{1h}, and τ during the MCMC algorithm.
Note that the probabilities in Eq. (3) can be written as E[I(δ_{new, 1k} > δ_{new, 1h})]. Based on similar observations for the scaled ranks in the NMA, the \({\overline{P}}_{\mathrm{new},k}\) in the new study subsequently becomes
where R_{new, k} is the true rank of treatment k in the new study. Thus, we call \({\overline{P}}_{\mathrm{new},\it{k}}\) the expected scaled rank in the new study. Like the Bayesian Pscore, we define the predictive Pscore as the posterior mean of \({\overline{P}}_{\mathrm{new},\it{k}}\). The posterior medians and 95% CrIs can also be obtained using the MCMC samples of \({\overline{P}}_{\mathrm{new},\it {k}}\).
Of note, the predictive Pscores considered in this article are all derived under the Bayesian framework. Strictly speaking, they may be called Bayesian predictive Pscores, as contrasted with Bayesian Pscores. Nevertheless, for convenience, we call them predictive Pscores in short.
When τ decreases toward 0 (i.e., the fixedeffects setting where all studies share common treatment effects), P_{new, kh} converges to 0 if d_{1k} < d_{1h} or 1 if d_{1k} > d_{1h}; that is, P_{new, kh} becomes I(d_{1k} > d_{1h}). Therefore, the expected scaled rank in the new study \({\overline{P}}_{\mathrm{new},k}\) converges to the scaled rank in the NMA \({\overline{P}}_k\), and thus the predictive Pscore converges to the Bayesian Pscore. Conversely, when τ increases toward infinity, P_{new, kh} converges to 0.5 for all comparisons, so \({\overline{P}}_{\mathrm{new},k}\) (and thus the predictive Pscore) converges to 0.5 for each treatment, representing a middle rank. This is consistent with the intuition that the NMA does not provide much information for the new study in the presence of large heterogeneity, as the treatment rankings in the new study are dominated by betweenstudy variabilities.
Two examples
We give two examples of NMAs with binary outcomes to illustrate the different versions of the Pscore. The first example is from Lu and Ades [40]; it was initially reported by Hasselblad [43] (without performing a formal NMA). It investigated the effects of four treatments on smoking cessation, including 1) no contact; 2) selfhelp; 3) individual counseling; and 4) group counseling. The outcome was the smoking cessation status of an individual after treatment. In the original NMA, the authors found that group counseling was most effective for smoking cessation, followed by individual counseling and selfhelp, and no contact was the least effective. The dataset contained a total of 16,737 subjects and 24 studies. A treatment was better if it yielded a higher rate of smoking cessation.
The second example was reported by Xu et al. [44]. It investigated the effects of seven immune checkpoint inhibitor (ICI) drugs on allgrade treatmentrelated adverse events (TrAEs), and aimed to provide a safety ranking of the ICI drugs for the treatment of cancer. The NMA was limited to phase II/III randomized controlled trials that compared two or three of the following treatments: 1) conventional therapy; 2) nivolumab; 3) pembrolizumab; 4) two ICIs; 5) ICI and conventional therapy; 6) atezolizumab; and 7) ipilimumab. The primary outcome was whether the patient had a TrAE. The authors found that there were clinically significant differences in safety between ICI drugs for patients with cancer. In general, atezolizumab was the safest drug, defined by the total number of severe or lifethreatening adverse events, followed by nivolumab, pembrolizumab, ipilimumab, and tremelimumab; taking one ICI was found to be safer than taking two ICIs. The dataset contained a total of 126,621 subjects from 23 studies. Unlike the direction in the first example, a treatment was better if it yields a lower rate of TrAEs.
In the following analyses, we will use the numerical labels above to refer to treatments. Appendix A in Additional file 1 gives the complete datasets of the two examples.
Implementations
The Bayesian NMAs were implemented via the MCMC algorithm using JAGS (version 4.3.0) through the R (version 3.6.2) package “rjags” (version 4–10). We used the vague priors U(0, 5) for the heterogeneity standard deviation (SD). We obtained the posterior samples of the log ORs of all treatment comparisons, which were then used to derive the posterior distributions of the Bayesian Pscores and predictive Pscores.
In addition to the vague priors, secondary analyses were performed for each NMA using informative priors [32, 36]. Specifically, based on the recommendations from Turner et al. [32], we used the lognormal priors LN(−2.01, 1.64^{2}) and LN(−2.13, 1.58^{2}) for the heterogeneity variances in the smoking cessation and allgrade TrAEs data, respectively.
For each NMA, we used three Markov chains; each chain contained a 20,000run burnin period for achieving stabilization and convergence. The final posterior samples consisted of a run of 50,000 updates after the burnin period with thinning rate 2. We examined the stabilization and convergence of MCMC using trace plots and the Gelman–Rubin convergence statistics \(\hat{R}\) of log ORs and the heterogeneity SD [45]. The \(\hat{R}\) values close to 1 indicate adequate convergence.
We used the posterior samples to form the posterior distributions and calculate the posterior means (i.e., Bayesian and predictive Pscores), posterior medians, and 95% CrIs for all treatments’ scaled ranks in the NMA and expected scaled ranks in a new study. Additionally, we calculated the frequentist Pscores using the R package “netmeta” (version 1.2.0). The code for all analyses is in Appendix B in Additional file 1.
Results
Tables 1 and 2 present the treatment ranking measures in the examples of smoking cessation and allgrade TrAEs, respectively. Appendix C in Additional file 1 presents the trace plots. The MCMC iterations stabilized and converged well in both examples; all values of \(\hat{R}\) were approximately equal to 1. Appendix D in Additional file 1 presents the treatment ranking measures in the secondary analyses using the informative priors. In the two examples, the informative priors produced similar treatment ranking measures to the vague priors.
The posterior means (Bayesian Pscores) and posterior medians of scaled ranks in the NMAs differed noticeably for both examples. Because the posterior samples of scaled ranks were discrete, as suggested by Eq. (2), the posterior medians took discrete values, while the posterior means (Bayesian Pscores) took continuous values. Due to heterogeneity, the expected scaled ranks in a new study were based on probabilities that could continuously range from 0 to 1 as in Eq. (4); thus, both their posterior means (predictive Pscores) and posterior medians took continuous values.
Example of smoking cessation
In the example of smoking cessation, Table 1 shows that treatment 4 had the highest Bayesian Pscore and thus was likely the best treatment, followed by treatments 3 and 2. Treatment 1 was likely the worst because its Bayesian Pscore, 0.038, was closest to 0. The Bayesian Pscores and frequentist Pscores slightly differed; their differences were up to 0.041. Their orders of treatment rankings were identical.
The order of treatment rankings based on the predictive Pscores for the new study also remained consistent with that based on the Pscores for the NMA. Treatment 4 continued to have both the highest Pscore in the new study, followed by treatments 3 and 2, and treatment 1 had the lowest value. Compared with Pscores, the predictive Pscores of all four treatments tended to shrink toward 0.5 due to the heterogeneity. For example, the Bayesian Pscore of treatment 1 increased from 0.038 to the predictive Pscore of 0.192, while the Bayesian Pscore of treatment 4 decreased from 0.879 to the predictive Pscore of 0.746.
Example of treatmentrelated adverse event
In the example of allgrade TrAEs, Table 2 shows that treatment 6 had the highest Bayesian Pscore and was thus likely the best treatment. It was followed by treatments 3 and 2 with very similar Bayesian Pscores (0.780 and 0.764, respectively), treatments 7 and 1 also with similar Bayesian Pscores (0.415 and 0.362, respectively), then treatment 4 with a Bayesian Pscore of 0.164. Treatment 5 was likely the worst with a Bayesian Pscore of 0.092.
Some frequentist Pscores were noticeably different from their Bayesian counterparts. The Bayesian Pscore of treatment 2 was 0.764, and its frequentist Pscore was 0.821; such Pscores of treatment 3 were 0.780 and 0.677, accordingly. Based on the Bayesian Pscores, treatment 2 was worse than treatment 3, but their rankings were reversed based on the frequentist Pscores. These differences were likely because the Bayesian and frequentist Pscores were derived using different models. The Bayesian model accounted for full uncertainties by modeling event counts with binomial likelihoods, while the frequentist model approximated the log OR of each treatment comparison to the normal distribution within each study.
Compared with the Pscores, the predictive Pscores of most treatments did not change much, likely because the heterogeneity was relatively small in this NMA. As in the example of smoking cessation, the predictive Pscores had the trend of shrinking toward 0.5.
Visualizations
Figures 1 and 2 present the posterior distributions of all treatments’ expected scaled ranks in a new study in both examples, where each bar in the histograms covers a range of 0.01. They offer an intuitive tool to compare all treatments simultaneously.
The differences between the posterior means (predictive Pscores) and posterior medians of the expected scaled ranks were relevant to the symmetry of the corresponding posterior distributions. For all treatments in both examples, the posterior distributions were unimodal. The posterior distributions for all treatments in the example of smoking cessation were roughly symmetric, so their posterior medians (predictive Pscores) were close to their posterior means (Table 1). For the example of allgrade TrAEs, the posterior distributions of treatments 4, 5, and 6 were markedly asymmetric; the posterior means (predictive Pscores) of these three treatments were noticeably different from their posterior medians (Table 2). The distributions for other treatments were approximately symmetric, and their posterior means and posterior medians were nearly identical.
Discussion
Implications
While the order of treatment rankings in the two examples remained mostly the same, there were noticeable differences between the frequentist Pscores and Bayesian Pscores for some treatments, primarily owing to the different specifications of the frequentist and Bayesian NMA models. In addition, there were possibly discrepancies between posterior means (predictive Pscores) and posterior medians of the expected scaled ranks in a new study (as well as for scaled ranks in the NMA that lead to Bayesian Pscores). These discrepancies depended on the symmetry of the posterior distributions. Though the posterior medians are conventionally used in Bayesian (network) metaanalyses, we used the posterior means for the Bayesian and predictive Pscores because they correspond to the original definition of the Pscore under the frequentist framework [14]. In practice, both the posterior mean and posterior median may be reported for measuring treatment rankings.
Like the conventional frequentist Pscores, the predictive Pscores should be interpreted with caution, and their uncertainties ought to be taken into considerations [46, 47]. The magnitudes of the predictive Pscores do not imply statistically significant differences between treatments. Instead of being designed to test for treatment differences, they are treatmentspecific summary scores that facilitate clinical interpretations in comparative effectiveness research. Therefore, in addition to the magnitudes of treatment ranking measures, researchers should also pay attention to their uncertainties [46]. The uncertainties can be reflected by the measures’ confidence intervals or CrIs. A benefit of the predictive Pscores is that their CrIs’ limits take continuous values. The CrIs’ limits of the conventional Pscores must take discrete values, which may not accurately reflect the uncertainties. Recently, Wu et al. [48] proposed the normalized entropy to quantify the uncertainties of SUCRA. Similar ideas may be used to measure the predictive Pscore’s uncertainties.
Limitations
This study had several limitations. We have focused on NMAs with binary outcomes and used the OR as the effect measure. The predictive Pscore can be applied to generic NMAs by modifying the likelihoods of outcome measures in the Bayesian hierarchical model. Moreover, we have used the Bayesian NMA model that assumes evidence consistency and used a common heterogeneity variance τ^{2} for all treatment comparisons. In practice, these assumptions should be carefully examined before applying the predictive Pscore to clinical decisionmaking [37, 49, 50].
Future directions
This study used two examples to illustrate the predictive Pscores; however, it is unclear how the predictive Pscores might differ from the conventional Pscores in broader applications of NMAs. As a future topic, it is worthwhile to empirically investigate the magnitudes and directions of their changes via a comprehensive collection of NMAs.
Conclusions
This article has proposed the predictive Pscore by extending the Bayesian Pscore to the future study setting. The predictive Pscore accounts for heterogeneity between the new study and the existing studies in an NMA. It can be used to select optimal treatments from a potentially large pool of options for new patients in the future.
Availability of data and materials
The Supplementary Information includes the code and data for the analyses presented in this article.
Abbreviations
 CrI:

Credible interval
 IC:

Immune checkpoint inhibitor
 MCMC:

Markov chain Monte Carlo
 NMA:

Network metaanalysis
 OR:

Odds ratio
 SD:

Standard deviation
 SUCRA:

Surface under the cumulative ranking curve
 TrAE:

Treatmentrelated adverse event
References
 1.
Lumley T. Network metaanalysis for indirect treatment comparisons. Stat Med. 2002;21(16):2313–24. https://doi.org/10.1002/sim.1201.
 2.
Lu G, Ades AE. Combination of direct and indirect evidence in mixed treatment comparisons. Stat Med. 2004;23(20):3105–24. https://doi.org/10.1002/sim.1875.
 3.
Salanti G. Indirect and mixedtreatment comparison, network, or multipletreatments metaanalysis: many names, many benefits, many concerns for the next generation evidence synthesis tool. Res Synth Methods. 2012;3(2):80–97. https://doi.org/10.1002/jrsm.1037.
 4.
Cipriani A, Higgins JPT, Geddes JR, Salanti G. Conceptual and technical challenges in network metaanalysis. Ann Intern Med. 2013;159(2):130–7. https://doi.org/10.7326/00034819159220130716000008.
 5.
Faltinsen EG, Storebø OJ, Jakobsen JC, Boesen K, Lange T, Gluud C. Network metaanalysis: the highest level of medical evidence? BMJ Evid Based Med. 2018;23(2):56–9. https://doi.org/10.1136/bmjebm2017110887.
 6.
Zhang J, Carlin BP, Neaton JD, Soon GG, Nie L, Kane R, et al. Network metaanalysis of randomized clinical trials: reporting the proper summaries. Clin Trials. 2014;11(2):246–62. https://doi.org/10.1177/1740774513498322.
 7.
Liu Y, DeSantis SM, Chen Y. Bayesian mixed treatment comparisons metaanalysis for correlated outcomes subject to reporting bias. J R Stat Soc: Ser C: Appl Stat. 2018;67(1):127–44. https://doi.org/10.1111/rssc.12220.
 8.
Riley RD, Jackson D, Salanti G, Burke DL, Price M, Kirkham J, et al. Multivariate and network metaanalysis of multiple outcomes and multiple treatments: rationale, concepts, and examples. BMJ. 2017;358:j3932. https://doi.org/10.1136/bmj.j3932.
 9.
Jackson D, White IR, Price M, Copas J, Riley RD. Borrowing of strength and study weights in multivariate and network metaanalysis. Stat Methods Med Res. 2017;26(6):2853–68. https://doi.org/10.1177/0962280215611702.
 10.
Lin L, Xing A, Kofler MJ, Murad MH. Borrowing of strength from indirect evidence in 40 network metaanalyses. J Clin Epidemiol. 2019;106:41–9. https://doi.org/10.1016/j.jclinepi.2018.10.007.
 11.
Lin L. Quantifying and presenting overall evidence in network metaanalysis. Stat Med. 2018;37(28):4114–25. https://doi.org/10.1002/sim.7905.
 12.
Salanti G, Ades AE, Ioannidis JPA. Graphical methods and numerical summaries for presenting results from multipletreatment metaanalysis: an overview and tutorial. J Clin Epidemiol. 2011;64(2):163–71. https://doi.org/10.1016/j.jclinepi.2010.03.016.
 13.
Chaimani A, Higgins JPT, Mavridis D, Spyridonos P, Salanti G. Graphical tools for network metaanalysis in STATA. PLoS One. 2013;8(10):e76654. https://doi.org/10.1371/journal.pone.0076654.
 14.
Rücker G, Schwarzer G. Ranking treatments in frequentist network metaanalysis works without resampling methods. BMC Med Res Methodol. 2015;15:58. https://doi.org/10.1186/s1287401500608.
 15.
Higgins JPT, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in metaanalyses. BMJ. 2003;327(7414):557–60. https://doi.org/10.1136/bmj.327.7414.557.
 16.
Rücker G, Schwarzer G, Carpenter JR, Schumacher M. Undue reliance on I^{2} in assessing heterogeneity may mislead. BMC Med Res Methodol. 2008;8:79. https://doi.org/10.1186/14712288879.
 17.
Borenstein M, Higgins JPT, Hedges LV, Rothstein HR. Basics of metaanalysis: I^{2} is not an absolute measure of heterogeneity. Res Synth Methods. 2017;8(1):5–18. https://doi.org/10.1002/jrsm.1230.
 18.
Higgins JPT, Thompson SG, Spiegelhalter DJ. A reevaluation of randomeffects metaanalysis. J Royal Stat Soc Ser A (Statistics in Society). 2009;172(1):137–59. https://doi.org/10.1111/j.1467985X.2008.00552.x.
 19.
Riley RD, Higgins JPT, Deeks JJ. Interpretation of random effects metaanalyses. BMJ. 2011;342:d549. https://doi.org/10.1136/bmj.d549.
 20.
IntHout J, Ioannidis JPA, Rovers MM, Goeman JJ. Plea for routinely presenting prediction intervals in metaanalysis. BMJ Open. 2016;6(7):e010247. https://doi.org/10.1136/bmjopen2015010247.
 21.
Lin L. Use of prediction intervals in network metaanalysis. JAMA Netw Open. 2019;2(8):e199735. https://doi.org/10.1001/jamanetworkopen.2019.9735.
 22.
DeSantis SM, Zhu H. A Bayesian mixedtreatment comparison metaanalysis of treatments for alcohol dependence and implications for planning future trials. Med Decis Mak. 2014;34(7):899–910. https://doi.org/10.1177/0272989x14537558.
 23.
Nikolakopoulou A, Mavridis D, Salanti G. Planning future studies based on the precision of network metaanalysis results. Stat Med. 2016;35(7):978–1000. https://doi.org/10.1002/sim.6608.
 24.
DeSantis SM, Hwang H. Sample size estimation for future studies using Bayesian multivariate network metaanalysis. Stat Interface. 2020;13(4):511–7. https://doi.org/10.4310/SII.2020.v13.n4.a8.
 25.
White IR. Network metaanalysis. Stata J. 2015;15(4):951–85. https://doi.org/10.1177/1536867x1501500403.
 26.
Nikolakopoulou A, Chaimani A, Veroniki AA, Vasiliadis HS, Schmid CH, Salanti G. Characteristics of networks of interventions: a description of a database of 186 published networks. PLoS One. 2014;9(1):e86754. https://doi.org/10.1371/journal.pone.0086754.
 27.
Petropoulou M, Nikolakopoulou A, Veroniki AA, Rios P, Vafaei A, Zarin W, et al. Bibliographic study showed improving statistical methodology of network metaanalyses published between 1999 and 2015. J Clin Epidemiol. 2017;82:20–8. https://doi.org/10.1016/j.jclinepi.2016.11.002.
 28.
Lee AW. Review of mixed treatment comparisons in published systematic reviews shows marked increase since 2009. J Clin Epidemiol. 2014;67(2):138–43. https://doi.org/10.1016/j.jclinepi.2013.07.014.
 29.
Hong H, Carlin BP, Shamliyan TA, Wyman JF, Ramakrishnan R, Sainfort F, et al. Comparing Bayesian and frequentist approaches for multiple outcome mixed treatment comparisons. Med Decis Mak. 2013;33(5):702–14. https://doi.org/10.1177/0272989x13481110.
 30.
Dias S, Sutton AJ, Ades AE, Welton NJ. Evidence synthesis for decision making 2: a generalized linear modeling framework for pairwise and network metaanalysis of randomized controlled trials. Med Decis Mak. 2013;33(5):607–17. https://doi.org/10.1177/0272989X12458724.
 31.
Efthimiou O, Debray TPA, van Valkenhoef G, Trelle S, Panayidou K, Moons KGM, et al. GetReal in network metaanalysis: a review of the methodology. Res Synth Methods. 2016;7(3):236–63. https://doi.org/10.1002/jrsm.1195.
 32.
Turner RM, Davey J, Clarke MJ, Thompson SG, Higgins JPT. Predicting the extent of heterogeneity in metaanalysis, using empirical data from the Cochrane database of systematic reviews. Int J Epidemiol. 2012;41(3):818–27. https://doi.org/10.1093/ije/dys041.
 33.
Wang Z, Lin L, Hodges JS, Chu H. The impact of covariance priors on armbased Bayesian network metaanalyses with binary outcomes. Stat Med. 2020;39(22):2883–900. https://doi.org/10.1002/sim.8580.
 34.
Jansen JP, Fleurence R, Devine B, Itzler R, Barrett A, Hawkins N, et al. Interpreting indirect treatment comparisons and network metaanalysis for healthcare decision making: report of the ISPOR task force on indirect treatment comparisons good research practices: part 1. Value Health. 2011;14(4):417–28. https://doi.org/10.1016/j.jval.2011.04.002.
 35.
Hoaglin DC, Hawkins N, Jansen JP, Scott DA, Itzler R, Cappelleri JC, et al. Conducting indirecttreatmentcomparison and networkmetaanalysis studies: report of the ISPOR task force on indirect treatment comparisons good research practices: part 2. Value Health. 2011;14(4):429–37. https://doi.org/10.1016/j.jval.2011.01.011.
 36.
Rosenberger KJ, Xing A, Murad MH, Chu H, Lin L. Prior choices of betweenstudy heterogeneity in contemporary Bayesian network metaanalyses: an empirical study. J Gen Intern Med. 2021;36(4):1049–57. https://doi.org/10.1007/s11606020063571.
 37.
Lu G, Ades AE. Modeling betweentrial variance structure in mixed treatment comparisons. Biostatistics. 2009;10(4):792–805. https://doi.org/10.1093/biostatistics/kxp032.
 38.
Higgins JPT, Whitehead A. Borrowing strength from external trials in a metaanalysis. Stat Med. 1996;15(24):2733–49. https://doi.org/10.1002/(SICI)10970258(19961230)15:24%3C2733::AIDSIM562%3E3.0.CO;20.
 39.
Hong H, Chu H, Zhang J, Carlin BP. A Bayesian missing data framework for generalized multiple outcome mixed treatment comparisons. Res Synth Methods. 2016;7(1):6–22. https://doi.org/10.1002/jrsm.1153.
 40.
Lu G, Ades AE. Assessing evidence inconsistency in mixed treatment comparisons. J Am Stat Assoc. 2006;101(474):447–59. https://doi.org/10.1198/016214505000001302.
 41.
Dias S, Welton NJ, Sutton AJ, Caldwell DM, Lu G, Ades AE. Evidence synthesis for decision making 4: inconsistency in networks of evidence based on randomized controlled trials. Med Decis Mak. 2013;33(5):641–56. https://doi.org/10.1177/0272989X12455847.
 42.
Lin L. Evidence inconsistency degrees of freedom in Bayesian network metaanalysis. J Biopharm Stat. 2021;31(3):317–30. https://doi.org/10.1080/10543406.2020.1852247.
 43.
Hasselblad V. Metaanalysis of multitreatment studies. Med Decis Mak. 1998;18(1):37–43. https://doi.org/10.1177/0272989x9801800110.
 44.
Xu C, Chen YP, Du XJ, Liu JQ, Huang CL, Chen L, et al. Comparative safety of immune checkpoint inhibitors in cancer: systematic review and network metaanalysis. BMJ. 2018;363:k4226. https://doi.org/10.1136/bmj.k4226.
 45.
Gelman A, Rubin DB. Inference from iterative simulation using multiple sequences. Stat Sci. 1992;7(4):457–72. https://doi.org/10.1214/ss/1177011136.
 46.
Trinquart L, Attiche N, Bafeta A, Porcher R, Ravaud P. Uncertainty in treatment rankings: reanalysis of network metaanalyses of randomized trials. Ann Intern Med. 2016;164(10):666–73. https://doi.org/10.7326/M152521.
 47.
Veroniki AA, Straus SE, Rücker G, Tricco AC. Is providing uncertainty intervals in treatment ranking helpful in a network metaanalysis? J Clin Epidemiol. 2018;100:122–9. https://doi.org/10.1016/j.jclinepi.2018.02.009.
 48.
Wu YC, Shih MC, Tu YK. Using normalized entropy to measure uncertainty of rankings for network metaanalyses. Med Decis Mak. 2021;41(6):706–13. https://doi.org/10.1177/0272989x21999023.
 49.
Higgins JPT, Jackson D, Barrett JK, Lu G, Ades AE, White IR. Consistency and inconsistency in network metaanalysis: concepts and models for multiarm studies. Res Synth Methods. 2012;3(2):98–110. https://doi.org/10.1002/jrsm.1044.
 50.
White IR, Barrett JK, Jackson D, Higgins JPT. Consistency and inconsistency in network metaanalysis: model estimation using multivariate metaregression. Res Synth Methods. 2012;3(2):111–25. https://doi.org/10.1002/jrsm.1045.
Acknowledgements
None.
Funding
This study was supported in part by the US National Institutes of Health/National Library of Medicine grant R01 LM012982 and National Institutes of Health/National Center for Advancing Translational Sciences grant UL1 TR001427.
Author information
Affiliations
Contributions
KJR: methodology, formal analysis, investigation, data curation, writing  original draft, writing  review & editing, visualization; RD: writing  review & editing; YC: writing  review & editing; LL: conceptualization, methodology, writing  original draft, writing  review & editing, supervision. All authors read and approved the final manuscript
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
All authors approved this article for publication.
Competing interests
None.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1: Appendix A
. Complete datasets. Appendix B. R code for data analyses. Appendix C. Trace plots. Appendix D. Secondary analyses.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Rosenberger, K.J., Duan, R., Chen, Y. et al. Predictive Pscore for treatment ranking in Bayesian network metaanalysis. BMC Med Res Methodol 21, 213 (2021). https://doi.org/10.1186/s12874021013975
Received:
Accepted:
Published:
Keywords
 Bayesian analysis
 Heterogeneity
 Network metaanalysis
 Pscore
 Prediction
 Treatment ranking