Suppose a network meta-analysis has been conducted using Bayesian methods. We first consider two treatments A and B. Let \(\mu_{A}\) and \(\mu_{B}\) be independent estimates representing the arm-based effects of treatments A and B, respectively, as estimated in the network meta-analysis. Let the effects be scaled such that higher values represent greater success. We are interested in the probability that A is more effective than B, that is, we want to compute \(P(\mu_{A} > \mu_{B})\).
Independent normally distributed posteriors
For simplicity, let us assume normal distributions for the posteriors; precisely, let \(\mu_{A} \sim N(\hat{\mu}_{A}, \sigma_{A}^{2})\) and \(\mu_{B} \sim N(\hat{\mu}_{B}, \sigma_{B}^{2})\). Then the distribution of \(\mu_{A} - \mu_{B}\) is normal with expectation \(\hat{\mu}_{A} - \hat{\mu}_{B}\) and variance \(\sigma_{A}^{2} + \sigma_{B}^{2}\), and we have
$$\begin{aligned} P\left(\mu_{A} > \mu_{B}\right) &= P\left(\mu_{A} - \mu_{B} > 0\right)\\ &= 1 - \Phi\left(-\frac{\hat{\mu}_{A} - \hat{\mu}_{B}}{\sqrt{\sigma_{A}^{2} + \sigma_{B}^{2}}}\right) = \Phi\left(\frac{\hat{\mu}_{A} - \hat{\mu}_{B}}{\sqrt{\sigma_{A}^{2} + \sigma_{B}^{2}}}\right) \end{aligned}$$
where Φ is the cumulative distribution function (cdf) of the standard normal distribution. It follows that \(P(\mu_{A} > \mu_{B}) > 0.5\) is equivalent to \(\Phi\left(\left(\hat{\mu}_{A} - \hat{\mu}_{B}\right)/\sqrt{\sigma_{A}^{2} + \sigma_{B}^{2}}\right) > 0.5\), which is true if and only if \(\hat{\mu}_{A} > \hat{\mu}_{B}\), independently of \(\sigma_{A}^{2} + \sigma_{B}^{2}\). In other words, whether A or B is thought more effective (‘better’) depends only on the sign of the difference of the point estimates: the treatment with the greater point estimate wins, regardless of the variances.
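This decision rule is easy to verify numerically. The following minimal Python sketch (the posterior means and standard deviations are invented for illustration) compares the closed form with a Monte Carlo estimate from independent normal posterior draws:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical posterior summaries (illustrative values only)
mu_hat_A, sd_A = 1.2, 0.8
mu_hat_B, sd_B = 1.0, 0.4

# Closed form: P(mu_A > mu_B) = Phi((mu_hat_A - mu_hat_B) / sqrt(sd_A^2 + sd_B^2))
p_closed = norm.cdf((mu_hat_A - mu_hat_B) / np.hypot(sd_A, sd_B))

# Monte Carlo check using independent draws from the two posteriors
A = rng.normal(mu_hat_A, sd_A, 1_000_000)
B = rng.normal(mu_hat_B, sd_B, 1_000_000)
p_mc = (A > B).mean()

print(p_closed, p_mc)  # agree up to simulation error; > 0.5 iff mu_hat_A > mu_hat_B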
Fictitious example
Figure 1 shows a fictitious example of two independent normal distributions with means 0.5 and 0 and variances 4 and 1 for treatments A and B, respectively. The theoretical 95 % credible interval of the broader distribution of treatment A (−3.42 to 4.42) completely covers that of the narrower distribution of treatment B (−1.96 to 1.96, dashed). It is natural to conclude that there is no evidence of a difference in effectiveness between the treatments, particularly given the lack of precision in estimating the effect of A. Note that the densities cut each other at two different points: there are regions both to the right and to the left where the density of the flat distribution (treatment A) is greater than that of the distribution of B. In these regions the flat distribution has more mass than the precise distribution, simply because it is flat. In particular, there is a high probability that A produces unfavorable effects below −2, which are unlikely to occur under treatment B. Nevertheless, the probability that A is better than B is computed as \(\Phi(0.5/\sqrt{5}) = 0.59\). Since this is greater than 0.5, A is thought better than B.
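The numbers in this fictitious example can be reproduced directly; a short sketch using scipy:

from scipy.stats import norm

mu_A, var_A = 0.5, 4.0  # treatment A: broad posterior
mu_B, var_B = 0.0, 1.0  # treatment B: narrow posterior

# Theoretical 95 % credible intervals of the two normal posteriors
print(norm.interval(0.95, loc=mu_A, scale=var_A ** 0.5))  # (-3.42, 4.42)
print(norm.interval(0.95, loc=mu_B, scale=var_B ** 0.5))  # (-1.96, 1.96)

# Probability that A is better than B
print(norm.cdf(0.5 / 5 ** 0.5))  # 0.59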
ROC curve
The probability \(P(\mu_{A} > \mu_{B})\) can be interpreted as the area under the curve (AUC) for the receiver operating characteristic (ROC) curve defined by
$$R(t) = 1-F_{A}(F_{B}^{-1}(1-t))$$
where \(F_{A}\) and \(F_{B}\) are the cdfs of the posterior distributions of \(\mu_{A}\) and \(\mu_{B}\) (see Additional file 1 for details). In the diagnostic accuracy setting, the AUC provides the probability that, given a randomly selected pair of a diseased and a non-diseased individual, the values of the diseased and the non-diseased individual are in the correct order, e.g., the value of the diseased individual is greater if higher values indicate illness.
For Bayesian posterior distributions, the AUC provides the probability that, given that treatment A is truly more effective than treatment B and we randomly select a pair of effect estimates for treatment A and treatment B, A proves better than B. Figure 2 shows the ROC curve and the AUC for the fictitious example. The large difference in variances is reflected in the asymmetric shape of the curve. Moreover, the curve crosses the dotted diagonal, which is due to the above-mentioned region on the left of Fig. 1 where more unfavorable effects occur under A. The AUC is 59 %. If this ROC curve arose from the distribution of a potential diagnostic marker, nobody would trust a diagnostic test based on that marker.
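For the same fictitious posteriors, the ROC curve \(R(t) = 1-F_{A}(F_{B}^{-1}(1-t))\) and its AUC can be evaluated numerically; a sketch:

import numpy as np
from scipy.stats import norm

mu_A, sd_A = 0.5, 2.0  # broad posterior of treatment A
mu_B, sd_B = 0.0, 1.0  # narrow posterior of treatment B

# R(t) = 1 - F_A(F_B^{-1}(1 - t)) on a fine grid over (0, 1)
t = np.linspace(1e-6, 1 - 1e-6, 100_001)
roc = 1 - norm.cdf(norm.ppf(1 - t, mu_B, sd_B), mu_A, sd_A)

print(roc.mean())       # AUC ~ 0.59, matching Phi(0.5/sqrt(5))
print((roc < t).any())  # True: the curve dips below the diagonal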
We have seen that, for normal posterior distributions, the treatment with the more favorable point estimate is ranked first, however small the difference may be and independently of the variances. If we look only at the ranks, we inevitably ignore potential differences in precision, and in the lengths of credible intervals, between the two posterior distributions.
Comparing more than two treatments
We now consider a network meta-analysis with n treatments and Bayesian posteriors \(\mu_{i}\) with means \(\hat{\mu}_{i}\ (i=1, \dots, n)\). We cannot assume that the \(\mu_{i}\) are independent, as they are all informed by the whole network. However, we still have an estimate for each difference \(\hat{\mu}_{i} - \hat{\mu}_{j}\) with standard deviation \(\sigma_{ij}\). Again assuming normality for the posteriors, we see as above
$$P(\mu_{i} > \mu_{j}) = \Phi\left(\frac{\hat{\mu}_{i} - \hat{\mu}_{j}}{\sigma_{ij}}\right) \tag{1}$$
where Φ is the cdf of the standard normal distribution. It follows that the order induced on all treatments by pairwise comparisons preserves the order of the means, independently of the variances. However, the variances do enter the above equation and govern the distances between the underlying probabilities \(P(\mu_{i} > \mu_{j})\): the greater the variances compared to the difference, the more the argument in (1) tends to zero and the more \(P(\mu_{i} > \mu_{j})\) tends to 0.5.
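A small numeric illustration of this behavior, with an invented difference of 0.3 between the posterior means:

from scipy.stats import norm

# Equation (1): P(mu_i > mu_j) = Phi((mu_hat_i - mu_hat_j) / sigma_ij)
diff = 0.3  # hypothetical difference of posterior means
for sigma_ij in (0.1, 0.5, 2.0, 10.0):
    print(sigma_ij, norm.cdf(diff / sigma_ij))
# 0.999, 0.726, 0.560, 0.512: the ordering never flips,
# but the probability shrinks toward 0.5 as sigma_ij grows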
Surface under the cumulative ranking (SUCRA)
We here recapitulate the definition and interpretation of the SUCRA probabilities introduced by Salanti et al. [6]. First, based on the Bayesian posterior distributions, for each treatment i (i=1,…,n) the probability P(i,k) that treatment i has rank k (k=1,…,n) is computed. For each treatment i, these rank probabilities form a discrete distribution, as \(\sum_{k=1}^{n} P(i,k) = 1\). The cdfs for these distributions can be obtained by
$$F(i,r) = \sum\limits_{k=1}^{r} P(i,k) $$
(r=1,…,n). F(i,r) gives the probability that treatment i has rank r or better and we have F(i,n)=1 for all i. The surface under the cumulative ranking distribution function for treatment i is then defined by
$$\text{SUCRA}(i) = \frac{1}{n-1}\sum\limits_{r=1}^{n-1} F(i,r). $$
To give an interpretation of SUCRA(i), we recall that the expectation of a discrete non-negative random variable with values 1,…,n can be expressed as the area between the cdf F and 1. For the mean rank we therefore have
$$\begin{aligned} \mathrm{E}(\text{rank}(i)) &= n - \sum\limits_{r=1}^{n-1} F(i,r) \\ &= n - (n-1)\,\text{SUCRA}(i) \end{aligned}$$
whence we obtain
$$\text{SUCRA}(i) = \frac{n - \mathrm{E}(\text{rank}(i))}{n-1}. $$
It follows that SUCRA(i) is the inversely scaled average rank of treatment i, scaled such that it is 1 if E(rank(i))=1 (that is, i always ranks first) and 0 if E(rank(i))=n (that is, i always ranks last) [6, 19].
SUCRA(i) can also be interpreted as the average proportion of treatments worse than i. The mean of all SUCRA values is 0.5.
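A minimal sketch of the SUCRA computation from joint posterior draws; the four treatments and their (independent, for simplicity) posteriors are invented for illustration:

import numpy as np

rng = np.random.default_rng(1)
n = 4  # number of treatments (hypothetical example)

# Posterior draws, one column per treatment; higher values = better
draws = rng.normal([1.0, 0.8, 0.5, 0.0], [0.3, 0.6, 0.2, 0.4], size=(100_000, n))

# ranks[s, i] = rank of treatment i in draw s (1 = best)
ranks = (-draws).argsort(axis=1).argsort(axis=1) + 1

# P[i, k-1] = P(i, k), the probability that treatment i has rank k
P = np.stack([(ranks == k).mean(axis=0) for k in range(1, n + 1)], axis=1)

F = P.cumsum(axis=1)                       # F(i, r) = P(rank of i <= r)
sucra = F[:, : n - 1].sum(axis=1) / (n - 1)

print(sucra)
print((n - ranks.mean(axis=0)) / (n - 1))  # identical: SUCRA = (n - E(rank))/(n-1)
print(sucra.mean())                        # 0.5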
A frequentist version of SUCRA: The P-score
We now look at equation (1) from a frequentist perspective. In the frequentist setting, instead of observing Bayesian posteriors with means and standard deviations, we suppose we have observed effect estimates, again written \(\hat{\mu}_{i}\), and standard errors \(s_{ij}\) for all pairwise differences \(\hat{\mu}_{i} - \hat{\mu}_{j}\). Again assuming normality, the equation corresponding to (1) is
$$P_{ij} = \Phi\left(\frac{\hat{\mu}_{i} - \hat{\mu}_{j}}{s_{ij}}\right). $$
We give an interpretation for \(P_{ij}\). Evidently, \((\hat{\mu}_{i} - \hat{\mu}_{j})/s_{ij}\) is the signed z-score of the contrast between treatments i and j, conditional on the standard errors. The two-sided p-value of this comparison is given by
$$p_{ij} = 2\left(1 - \Phi\left(\frac{|\hat{\mu}_{i} - \hat{\mu}_{j}|}{s_{ij}}\right)\right). $$
It represents the probability that an absolute difference of the observed size or larger occurs, given that the null hypothesis of no difference is true. Hence we have
$$P_{ij} = \left\{ \begin{array}{cl} p_{ij}/2, & \text{if}\,\hat{\mu}_{i} \le \hat{\mu}_{j} \\ 1 - p_{ij}/2, & \text{if}\, \hat{\mu}_{i} > \hat{\mu}_{j} \end{array} \right. $$
Thus, \(P_{ij}\) is one minus the one-sided p-value of rejecting the null hypothesis \(\mu_{i} \le \mu_{j}\) in favor of \(\mu_{i} > \mu_{j}\). \(P_{ij}\) is at least 0.5 if we observe \(\hat{\mu}_{i} \ge \hat{\mu}_{j}\), making it likely that \(\mu_{i} > \mu_{j}\); \(P_{ij}\) is less than 0.5 if we observe \(\hat{\mu}_{i} < \hat{\mu}_{j}\), which makes it less likely that \(\mu_{i} > \mu_{j}\).
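This relation between \(P_{ij}\) and the p-values is easy to check numerically; a small sketch with invented values:

from scipy.stats import norm

mu_i, mu_j, s_ij = 0.4, 0.1, 0.25  # hypothetical estimates and standard error

P_ij = norm.cdf((mu_i - mu_j) / s_ij)
p_two_sided = 2 * (1 - norm.cdf(abs(mu_i - mu_j) / s_ij))

# Here mu_i > mu_j, so P_ij = 1 - p_two_sided / 2, i.e. one minus the
# one-sided p-value for rejecting H0: mu_i <= mu_j
print(P_ij, 1 - p_two_sided / 2)  # both ~0.885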
We note that, as is often the case, it seems more natural to interpret \(P(\mu_{i} > \mu_{j})\) in the Bayesian setting than to explain the meaning of \(P_{ij}\) in the frequentist context. Nevertheless, both lead to the same decision rule: the greater \(P_{ij}\), the more certain we are that \(\mu_{i} > \mu_{j}\), and vice versa. Further, we note that we do not claim or need independence of the differences \(\hat{\mu}_{i} - \hat{\mu}_{j}\).
We may consider the means
$$\bar{P}_{i} = \frac{1}{n-1}\sum\limits_{j, \ j \ne i}^{n} P_{ij}. $$
As \(P_{ij}\) is interpreted as the extent of certainty that \(\mu_{i} > \mu_{j}\) holds, we may interpret \(\bar{P}_{i}\) as the mean extent of certainty that \(\mu_{i}\) is greater than any other \(\mu_{j}\), averaged over all competing treatments j (j≠i) with equal weights. In other words, \(\bar{P}_{i}\) measures the standing of treatment i within the given set of treatments, where 1 means theoretically best and 0 means worst. This corresponds to the interpretation of SUCRA(i). We will call \(\bar{P}_{i}\) the P-score of treatment i. P-scores can be seen as the frequentist equivalent of SUCRA values.
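A compact sketch of the P-score computation; the effect estimates and standard errors below are invented, and in practice they would come from a frequentist network meta-analysis:

import numpy as np
from scipy.stats import norm

# Hypothetical effect estimates (higher = better) and a symmetric matrix of
# standard errors s_ij of the pairwise differences
mu_hat = np.array([1.0, 0.8, 0.5, 0.0])
s = np.array([[0.0, 0.3, 0.4, 0.5],
              [0.3, 0.0, 0.6, 0.4],
              [0.4, 0.6, 0.0, 0.3],
              [0.5, 0.4, 0.3, 0.0]])
n = len(mu_hat)

diff = mu_hat[:, None] - mu_hat[None, :]
with np.errstate(divide="ignore", invalid="ignore"):
    P = norm.cdf(diff / s)  # P_ij = Phi((mu_hat_i - mu_hat_j) / s_ij)
np.fill_diagonal(P, 0.0)    # exclude the undefined self-comparisons

pscore = P.sum(axis=1) / (n - 1)  # mean certainty that i beats each j != i
print(pscore)  # 1 = theoretically best, 0 = worst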
From the definition of \(P_{ij}\) it follows that \(P_{ji} = 1 - P_{ij}\). Thus the sum over all off-diagonal elements of the matrix \((P_{ij})\) is n(n−1)/2. For the mean of the \(\bar{P}_{i}\) we obtain
$$\frac{1}{n}\sum\limits_{i=1}^{n} \bar{P}_{i} = \frac{1}{n(n-1)}\sum\limits_{i=1}^{n}\sum\limits_{j,\ j \ne i}^{n} P_{ij} = 0.5 $$
which is the same as the mean of all SUCRA values. In Additional file 2 we give a formal proof that P-scores and SUCRA values are identical if the true probabilities are known.
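Both properties can be confirmed in a few lines; a standalone sketch with arbitrary invented values (any symmetric standard-error structure would do):

import numpy as np
from scipy.stats import norm

mu_hat = np.array([0.7, 0.2, -0.1])  # invented estimates
s_ij = 0.4                           # common standard error, for illustration
n = len(mu_hat)

P = norm.cdf((mu_hat[:, None] - mu_hat[None, :]) / s_ij)
np.fill_diagonal(P, 0.0)

print(P.sum())                           # n(n-1)/2 = 3, since P_ji = 1 - P_ij
print((P.sum(axis=1) / (n - 1)).mean())  # mean P-score = 0.5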