Skip to main content

Answering complex hierarchy questions in network meta-analysis



Network meta-analysis estimates all relative effects between competing treatments and can produce a treatment hierarchy from the most to the least desirable option according to a health outcome. While about half of the published network meta-analyses present such a hierarchy, it is rarely the case that it is related to a clinically relevant decision question.


We first define treatment hierarchy and treatment ranking in a network meta-analysis and suggest a simulation method to estimate the probability of each possible hierarchy to occur. We then propose a stepwise approach to express clinically relevant decision questions as hierarchy questions and quantify the uncertainty of the criteria that constitute them. The steps of the approach are summarized as follows: a) a question of clinical relevance is defined, b) the hierarchies that satisfy the defined question are collected and c) the frequencies of the respective hierarchies are added; the resulted sum expresses the certainty of the defined set of criteria to hold. We then show how the frequencies of all possible hierarchies relate to common ranking metrics.


We exemplify the method and its implementation using two networks. The first is a network of four treatments for chronic obstructive pulmonary disease where the most probable hierarchy has a frequency of 28%. The second is a network of 18 antidepressants, among which Vortioxetine, Bupropion and Escitalopram occupy the first three ranks with frequency 19%.


The developed method offers a generalised approach of producing treatment hierarchies in network meta-analysis, which moves towards attaching treatment ranking to a clear decision question, relevant to all or a subset of competing treatments.

Peer Review reports


Providing a treatment hierarchy is often one of the objectives of systematic reviews that contain multiple interventions [1]. To this aim, several ranking metrics have been developed and are commonly used to accompany network meta-analysis (NMA) results [2]. A common output of NMA is a matrix showing the probability of each treatment being at each possible rank. The graphical display of such a matrix is called rankogram, the restriction of this matrix to the probabilities of occupying the highest rank constitutes the probability of being the best ranking metric, while the Surface Under the Cumulative RAnking curve (SUCRA) summarises the ranking distribution by calculating the area under the cumulative ranking curve [3]. Mean and median ranks are further options to present a treatment hierarchy, with the former being a linear transformation of SUCRA. The P-score measure bypasses the need to calculate probabilities of being at each rank by averaging over probabilities of each treatment being better than any other in the network [4]. P-scores are the frequentist analogue of SUCRAs given that the posterior distributions of relative effects are normal. P-scores have recently been adapted to the Bayesian framework and extended to the predictive P-scores for a future study setting [5].

Despite the plethora of ranking metrics and the popularity of deriving a treatment hierarchy in NMA, relying on such a treatment hierarchy is insufficient. Limitations of ranking treatments include the fact that small differences in outcome values could lead to different hierarchies even if these differences are not clinically relevant, the difficulty in interpreting the values of the ranking metrics and the lack of consideration of multiple outcomes [6, 7]. This has led to the development of several multi-dimensional approaches to treatment ranking including benefit-risk assessments [8,9,10], incorporation of clinically important values [3, 10, 11], consideration of multiple outcomes simultaneously [7, 10, 12] or consideration of a characteristic such as risk of bias when deriving a treatment hierarchy [13]. Moreover, Salanti et al. recently proposed that each ranking metric answers a different treatment hierarchy question [14], and thus differences in the produced hierarchies are to be expected [15]. Linking the ranking metrics with the respective hierarchy question they answer would greatly facilitate the interpretation of the derived hierarchy.

However, the hierarchy questions of interest are not limited to those that can be answered by the available ranking metrics. Often, more complex research questions are posed; in such a case, hierarchy should still depend on these questions, so that its interpretation is relevant and meaningful. In this paper, we suggest an approach for translating clinically relevant questions into hierarchy questions and quantify their uncertainty. To this aim, we use simulations to derive the relative frequency of all possible hierarchies in a network of interventions. Then, we define the set of all possible hierarchies that satisfy a specified criterion, for example that a specific order among treatments is retained in the network and/or a treatment is in a specific position, and the sum of their frequencies constitute the certainty around the criterion.


Definitions: treatment hierarchy and treatment ranking

Let the entire evidence base form a set \(\mathbbm{T}=\left\{{t}_1,{t}_2,\dots, {t}_T\right\}\) (ordered alphabetically) of T treatments. NMA aims to estimate the set of \(\left(\begin{array}{c}T\\ {}2\end{array}\right)\) relative treatment effects \({\mu}_{t_i{t}_j}\) where titj denotes the treatment contrast (i, j = 1, …, T; i < j) [16, 17]. The parameters \({\mu}_{t_i{t}_j}\) denote additive effects, e.g. mean differences or log-odds ratios, where \({\mu}_{t_i{t}_i}=0\). The model is parametrized using only T − 1 relative treatment effects versus a randomly selected reference treatment, here t1. The so called ‘basic parameters’ \({\mu}_{t_1{t}_i}\) are estimated and collected in a vector \(\hat{\boldsymbol{\mu}}\) and we denote their variance-covariance matrix as \(\hat{\boldsymbol{V}}\). The remaining \(\left(\begin{array}{c}T\\ {}2\end{array}\right)-T+1=\) \(\left(\begin{array}{c}T-1\\ {}2\end{array}\right)\) relative treatment effects are derived imposing the constraint of consistency \({\mu}_{t_i{t}_j}={\mu}_{t_k{t}_j}-{\mu}_{t_k{t}_i}\), k ≠ i, j, k = 1, …, T [18, 19].

The ‘true’ underlying treatment hierarchy for the set \(\mathbbm{T}\) is defined as the vector of treatment names, ordered from the most to the least effective. This hierarchy is imposed by the ascending ordering of the ‘true’ underlying relative treatment effects, \({\mu}_{t_1{t}_i}\), assuming a direction, e.g. that large positive values are associated with a beneficial effect for the first treatment. The ‘true’ underlying treatment ranking is defined as the vector of integers between 1 and T that indicates the rank of each treatment \({r}_{t_i}\). For example, if the treatment hierarchy vector is (A, C, D, B), then the treatment ranking vector is (1, 4, 2, 3). We denote the ‘true’ underlying treatment ranking as

$$\mathbf{R}:= \left({r}_{t_1},{r}_{t_2},\dots, {r}_{t_T}\right)$$

which has a 1:1 correspondence with the ‘true’ underlying treatment hierarchy

$$\mathbf{H}:= \mathbbm{T}\ ordered\ by\ \left({r}_{t_1},{r}_{t_2},\dots, {r}_{t_T}\right).$$

The estimated distribution of R is denoted as \(\hat{R_{\mathbbm{T}}}\) and is approximated from the estimated relative treatment effects \({\hat{\mu}}_{t_i{t}_j}\) as follows. First, we sample from the multivariate normal distribution with point estimate \(\hat{\boldsymbol{\mu}}\) as mean and variance-covariance matrix \(\hat{\mathbf{V}}\). In the case of a Bayesian analysis, we do not need to assume a normal approximation, but the whole posterior distribution of μ can be considered. Then, from the approximated normal distribution or the posterior, we can draw a μ vector and get corresponding ranks \(\left({r_{t_1}}^{\ast },{r_{t_2}}^{\ast },\dots, {r_{t_T}}^{\ast}\right)\). Repeating the process a large number of times, say n, will produce a matrix of dimension n × T, which is a sample from the distribution \(\hat{R_{\mathbbm{T}}}\). Then, by using the 1:1 correspondence with the treatment names, we can produce a sample from \(\hat{H_{\mathbbm{T}}}\), which is the estimated distribution of H. The larger the number of treatments included in a network, the greater n should be. A theoretical example of samples from \(\hat{R_{\mathbbm{T}}}\) and \(\hat{H_{\mathbbm{T}}}\) is presented in Table 1 panel a for a hypothetical network of three treatments, results of which are shown in Fig. 1.

Table 1 Sample from \(\hat{R_{\mathbbm{T}}\ }\) and \(\hat{H_{\mathbbm{T}}\ }\) (panel a), ranking matrix (panel b) and hierarchy matrix (panel c) of the hypothetical network of three treatments t1, t2 and t3 of Fig. 1
Fig. 1
figure 1

Network meta-analysis relative treatment effects for a hypothetical network of three treatments, t1, t2 and t3. MD: mean difference; CI: confidence interval

The sample from \(\hat{R_{\mathbbm{T}}}\) is summarized in what we will call the ranking matrix, in which each entry \({p}_{t_i,r}=P\left({\hat{r}}_{t_i}=r| data\right),\) r = 1, …, T shows the proportion of times (the frequency) each treatment ti being at each possible rank r. The estimated probability that treatment ti occupies the rth rank means that it produces better values than exactly T − r treatments. These probabilities have been presented in the literature in graphs called rankograms. Table 1 panel b shows the ranking matrix for the hypothetical network of Fig. 1.

The T! possible treatment hierarchies in a network of T treatments are denoted as hl, l = 1, …, T!. The probability mass function of hl is derived by summarizing the sample from \(\hat{H_{\mathbbm{T}}}\) in a matrix that we call the hierarchy matrix. A particular hierarchy hl features x times in the sample and this defines as \({p}_{h_l}=x/n\) the relative frequency that this hierarchy occurs. The hierarchy matrix is ordered by decreasing frequency that a hierarchy occurs, i.e., h1 corresponds to the most frequent hierarchy. It is also possible that more than one most probable hierarchy exists in a network due to ties. As is the case with \({p}_{t_i,r}\), the estimated probability of each possible hierarchy \({p}_{h_l}\) depends on the data. For small n, several hierarchies will have an estimated probability of 0. Table 1 panel c shows the hierarchy matrix for the example in Fig. 1.

An approach for answering clinically relevant decision questions

In the following, we propose a stepwise approach to express clinically relevant decision questions as hierarchy questions and quantify the uncertainty of the criteria that constitute them. We have developed an R package nmarank, hosted in CRAN [20], which allows users to implement the suggested approach [21]. In [22] readers can find the current version of the nmarank package which can be used to reproduce the results presented in the manuscript. The documentation of the package version 0.2–3 serves as a guide of the functions included in the nmarank package. The suggested approach is outlined below.

Step 1: define a clinically relevant research question

In the first step of the approach, investigators set a question that is considered clinically important. Different users of NMA would normally consider different questions to be of clinical importance; e.g. clinicians might be interested in general questions that capture the entire population of patients, policy decision makers might ask specialized questions while patients might seek answers to research questions focused on their specific patient group. Also, questions may include all treatments in the network or focus on a subset of them, as it is often the case in a clinical setting.

For example, a question of clinical relevance to a decision maker might be whether a treatment ti has better outcome than another (possibly effective but more expensive) treatment tj. Alternatively, clinicians might be interested to know the top three treatments in the network. Other examples of questions include that a specific order ti, tj, tk is retained anywhere in the hierarchy, that a treatment ti occupies a specific rank r or that it is among the best two ranks. It is also possible that we are interested in the case that a treatment ti has better outcome than treatment tj, but against a clinically important value c \(\left({\mu}_{t_i{t}_j}>\mathrm{c}\right)\) instead of their differences being zero \(\left({\mu}_{t_i{t}_j}>0\right)\). Depending on the context, it might also be the case that a combination of criteria constitutes a clinically relevant question. As an example, we might be interested in the case where ti is first or second and tj has better mean outcome value than that of treatment tk plus a clinically important value c = 0.5. As a special case of a clinically relevant question could be that a specific treatment hierarchy occurs, for example that imposed by the estimated relative treatment effects, expressed as ti is first, tj is second, tk is third and so on.

Step 2: find hierarchies compatible with the defined question

After setting the decision question of interest, the aim is to define the set of possible hierarchies out of all T! hierarchies that satisfy the criteria set in Step 1. This is done by selecting those hierarchies in the sample of \(\hat{H_{\mathbbm{T}}}\) for which the criterion is satisfied, thus translating the decision question into a hierarchy question. Depending on the context, the selected hierarchies might be of interest on their own or might only be used for proceeding to Step 3.

If we are interested in a question involving a clinically important value c, then the respective criteria need to be applied to mark the hierarchies that satisfy them in the sampling process of \(\hat{R_{\mathbbm{T}}}\) and \(\hat{H_{\mathbbm{T}}}\) described in section Definitions: treatment hierarchy and treatment ranking. For example, if we are interested in the combination of criteria that ti is first or second and tj has better mean outcome value than that of treatment tk plus a clinically important value c = 0.5, then we need to differentiate between the two types of hierarchies: those where in the sampling the condition \({\mu}_{t_j{t}_k}>\mathrm{c}\) will be satisfied and those where it will not.

Step 3: define certainty that the criterion is satisfied

In Step 3 of the framework, we add the frequencies of the hierarchies \({p}_{h_l}\) selected in Step 2 that satisfy the decision criterion set in Step 1. In our example from Fig. 1, the estimated probability that treatment t1 is at higher rank as t2 is the sum of the frequencies of h1, h4 and h6 hierarchies which amounts to 50% (see Table 1). Considering a case where we combine two criteria, say that we are interested in the case where t1 is first or second and t2 is higher in the hierarchy than t3. Then, we add the frequencies of h2 and h4 hierarchies, which amounts to \({p}_{h_2}+{p}_{h_4}=45\%.\) It might be the case that not all T! possible hierarchies are included in the sample of \(\hat{H_{\mathbbm{T}}}\); this, however, does not pose a problem in the process as the most frequent ones are recorded and used to estimate the certainty of the criterion.

Evaluation of certainty of the criterion

As an extra step of the approach, the amount of certainty of the criterion can be evaluated. This is done by comparing the frequencies derived in Step 3 with the respective frequencies corresponding to other relevant questions, in a similar manner that Bayes factors are derived [23]. For example, we may want to compare the estimated probability that three particular treatments from a family of interventions occupy the first three ranks, pA, versus the respective probability that three other treatments from another family of interventions do, pB, and we do so by taking their ratio \(\frac{p_A}{p_B}\). Alternatively, in a similar setting where we are interested in the optimal family of treatments, consider that ti and tk are the best candidates for family A and tj and tm are the best candidates for family B, m ≠ i, j, k, m = 1, …, T. Then, we may compare the frequencies that ti has better mean outcome value than tj and tk has better mean outcome value than treatment tm versus the probability that tj has better mean outcome value than ti and tm has better mean outcome value than treatment tk.

Relation of the hierarchy matrix with common ranking metrics

Estimated NMA relative treatment effects

The estimated \({H}_{\mathbbm{T}}\) and \({R}_{\mathbbm{T}}\) are T-dimensional distributions constructed from the estimated multivariate normal distribution of the relative treatment effects. A point estimate of the distribution \(\hat{H_{\mathbbm{T}}}\) can be obtained by ordering the point estimates of \({\hat{\mu}}_{t_1{t}_i}\) for each \({t}_i\in \mathbbm{T}\). This estimated treatment hierarchy might or might not be the same as the mode of \(\hat{H_{\mathbbm{T}}}\), which is the hierarchy h1 (the most probable hierarchy). Below we show a hypothetical example with three treatments, t1, t2, t3, of a network with three treatments where h1 differs from the hierarchy imposed by the NMA estimated relative treatment effects.

Figure 2 panel a shows the relative treatment effects of all treatments versus each other. The hierarchy that occurs from the resulted mean differences is {t3, t2, t1}. Fig. 2 panel b illustrates in a two-dimensional plot the probability density function for the bivariate normal distribution \(\hat{\boldsymbol{\mu}}=\left({\hat{\mu}}_{t_1{t}_2},{\hat{\mu}}_{t_1{t}_3}\right)=\left(\mathrm{0.5,0.6}\right)\) with variance-covariance matrix \(\hat{\mathbf{V}}=\left(\begin{array}{cc}25& 0\\ {}0& 25\end{array}\right)\). The T !  = 6 possible hierarchies are represented by the regions separated by the straight lines in Fig.2 panel b; each of the three straight lines divides the area according to whether each pairwise comparison favors the one or the other treatment. The region corresponding to {t3, t2, t1} includes the point estimate (0.5,0.6), noted as the black dot. However, this is not the most probable hierarchy (frequency 15%) as the probability mass is smaller than for other regions; hierarchies {t3, t1, t2} and {t2, t1, t3} are more probable than that of the mean effects, each one having 25% probability of occurring.

Fig. 2
figure 2

Network meta-analysis relative treatment effects of a hypothetical network of three treatments, t1, t2 and t3 (panel a) and two-dimensional plot of the probability density function for the bivariate normal distribution corresponding to the network meta-analysis of panel a with the associated frequencies of all possible hierarchies (panel b). MD: mean difference

Probability of producing the best value

The first column of the ranking matrix shows the frequency that each treatment occupies the highest rank and has been frequently used as a ranking metric usually referred as “probability of being best”. “Probability of being best”, however, has been recently indicated as an inaccurate name for the particular ranking metric as “being the best” may have a large variety of meanings and interpretations [14]. “Probability of producing the best value” has been suggested instead as better reflecting the nature of the particular ranking metric and this name is also adopted in this paper.

The hypothetical triangular network of Fig. 1 is sometimes used to criticize the probability of producing the best value ranking metric. It holds that \({\hat{\mu}}_{t_1{t}_2}={\hat{\mu}}_{t_1{t}_3}=0\), rendering each treatment to have 50% probability of being better than any other, but the former relative effect is associated with greater precision than the later. In such cases, probability of producing the best value favours treatments estimated with greater uncertainty. As indicated in Table 1 panel b (and can be also easily derived from Table 1 panel c), treatment t3 has a probability of producing the best value 45%, followed by t2 with a probability of 30%, while t1 is associated with 25% probability of producing the best value. However, hierarchy {t3, t1, t2} is equally probable with {t2, t1, t3} (25%) and hierarchies {t3, t2, t1} and {t1, t2, t3} are also equally probable (20%). The hierarchy matrix highlights that given that one of the three treatments is second, the other two have equal estimated probabilities of being first or third. The large tails of the distributions of relative treatment effects of t3 versus the other two treatments make it improbable that it ranks on the second position, rendering hierarchies {t2, t3, t1} and {t1, t3, t2} occurring each in 5% of the simulations.

SUCRAs and P-scores

As shown in the example of section Probability of producing the best value, the probability of producing the best value can be derived from the hierarchy matrix. For example, the probability that t1 produces the best value is the sum of probabilities of the hierarchies {t1, t2, t3} and {t1, t3, t2} which is 25%. Similarly, the SUCRAs or P-scores can also be derived from the hierarchy matrix. Consider for example the SUCRA (or P-score) for treatment t1 in the hypothetical triangular network of Fig. 1; it is defined as


which can be calculated from the hierarchy matrix as



Network of treatments for chronic obstructive pulmonary disease

We illustrate the process using a network comparing mortality rates in four treatments for chronic obstructive pulmonary disease: SFC (Salmeterol Fluticasone combination), Salmeterol, Fluticasone and Placebo [24]. Direct studies exist for all possible comparisons in the network, resulting in a fully connected network (Fig. 3 panel a). We synthesize data using the odds ratio as effect measure and assuming a random effects model and common heterogeneity variance across comparisons. SFC is associated with the greatest lowering in mortality compared to Placebo (odds ratio 0.29, 95% confidence interval 0.06 to 1.36), followed by Salmeterol (odds ratio 0.42, 95% confidence interval 0.14 to 1.26) and Fluticasone (odds ratio 0.55, 95% confidence interval 0.16 to 1.85) (Fig. 3 panel b). Heterogeneity standard deviation was estimated as \(\hat{\tau}=0\) using a generalized method of moments estimate [25]. The hierarchy matrix of Table 2 was created using 10,000 simulations from the multivariate normal distribution with mean and variance covariance matrix being the respective quantities from the estimated NMA relative treatment effects. Table 2 lists all possible treatment hierarchies (T !  = 24) along with their frequency of occurring.

Fig. 3
figure 3

Network plot with edges proportional to the number of studies of each direct comparison (panel a) and network meta-analysis relative treatment effects (panel b) for a network of four treatments for chronic obstructive pulmonary disease. OR: odds ratio; CI: confidence interval; SFC: salmeterol fluticasone combination

Table 2 Hierarchy matrix of the network of four treatments for chronic obstructive pulmonary disease of Fig. 3. Checks indicate the hierarchies that fulfil the criteria specified in the columns. Sum shows the probability of the criterion to hold. SFC: salmeterol fluticasone combination; Inf: infinity

We consider the following alternative hierarchy questions of interest as Step 1 of the suggested approach:

  • Criterion A: Hierarchy is exactly ‘SFC, Salmeterol, Fluticasone, Placebo’

  • Criterion B: Order ‘SFC, Fluticasone, Placebo’ is retained anywhere in the hierarchy

  • Criterion C: Fluticasone is among the best two options

  • Criterion D: SFC and Salmeterol are the two best options

  • Criterion E: Both criteria B and C are satisfied

In Step 2 of the approach, we identify the hierarchies that satisfy the above criteria and in Step 3 we add their estimated probabilities (Table 2). Criterion A requires that an exact predefined hierarchy occurs and is thus fulfilled by a single hierarchy, ‘SFC, Salmeterol, Fluticasone, Placebo’, corresponding to the order of mean effects, which appeared with frequency 28%. The criterion that SFC is better than Fluticasone and Fluticasone is better than Placebo (criterion B) is satisfied by all hierarchies for which order ‘SFC, Fluticasone, Placebo’ is retained, with or without other treatments in between. Four hierarchies fulfill this criterion with frequencies 28, 19, 12 and 3%. Thus, the frequency for the particular order to be retained in the hierarchy is 62%. Half of the possible hierarchies should have Fluticasone ranked first or second as required by criterion C. The sum of the frequencies of the respective 12 hierarchies is 44%. Criterion D specifies SFC and Salmeterol in the first two ranks, which is satisfied in four hierarchies with a total frequency of 53%. For the combination criterion E to be satisfied, hierarchies where SFC is better than Fluticasone, Fluticasone is better than Placebo and Fluticasone ranks among the best two options are the target hierarchies; these hierarchies are two (‘SFC, Fluticasone, Salmeterol, Placebo’ and ‘SFC, Fluticasone, Placebo, Salmeterol’) with frequencies 19 and 3% respectively.

Consider that we are interested in evaluating the certainty of criterion A. The frequency of 28% for the most probable hierarchy can be judged by comparing it with that of the subsequent hierarchies. The probability ratios of the frequency of the most probable hierarchy with the second and the third hierarchies are 1.47 and 2.33 respectively (Table 2).

Network of antidepressants for major depression

To illustrate the methods in a larger network of interventions, we take a published NMA of 18 antidepressants for major depression, illustrated in Fig. 4 panel a [26]. We focus on the primary binary outcome ‘efficacy’, defined as at least 50% reduction in the symptoms’ scales between baseline and 8 weeks of follow up. Studies were synthesized using odds ratio and NMA relative treatment effects of all antidepressants versus Fluoxetine are shown in Fig. 4 panel b. Heterogeneity standard deviation was assumed common across comparisons and estimated as \(\hat{\tau}=0.18\) using a generalized method of moments estimate [25]. The number of possible hierarchies is 18! rendering each hierarchy rare to occur. Out of the 500,000 simulations, only 7 hierarchies appeared twice, with the rest hierarchies appearing only once; in Table 3, the 7 most frequent hierarchies are listed.

Fig. 4
figure 4

Network plot of head-to-head studies (panel a) and network meta-analysis relative treatment effects (panel b) for a network of 18 antidepressants major depression. OR: odds ratio; CI: confidence interval

Table 3 Most probable hierarchies for the network of antidepressants of Fig. 4

We consider the following alternative hierarchy questions of interest as Step 1 of the suggested approach:

  • Criterion A: Vortioxetine ranks first, Bupropion second and Escitalopram third

  • Criterion B: Vortioxetine, Bupropion and Escitalopram are the best three treatments

  • Criterion C: Vortioxetine has better outcome value than that of Bupropion and Bupropion has better outcome value than that of Escitalopram

  • Criterion D: Vortioxetine, Bupropion and Escitalopram have an odds ratio of 1.25 or higher against Fluoxetine

  • Criterion E: Vortioxetine, Bupropion and Escitalopram have an odds ratio of 1 or higher against Fluoxetine

The estimated probability that Vortioxetine ranks first, Bupropion second and Escitalopram third (criterion A) is 9%. The estimated probability that these three treatments occupy any of the first three ranks (criterion B) is obviously higher; 19% of times Vortioxetine, Bupropion and Escitalopram were the three treatments with the highest odds ratios. The relative order of the first three treatments is also quite precise; in one third of simulations (33%), Vortioxetine performed better than Bupropion and Bupropion performed better than Escitalopram (criterion C).

Comparisons with a clinically important value are often more of interest than comparisons to the null effect. In the original NMA, an odds ratio of 0.8 and its reciprocal 1.25 was used for the examined outcome to judge upon imprecision of treatment effects and assess the confidence in NMA results. The estimated probability that all three treatments Vortioxetine, Bupropion and Escitalopram have an odds ratio of 1.25 or higher against the old, standard treatment Fluoxetine (criterion D) is 45%, while judging against the null (criterion E) the respective estimated probability is 92%.


In this paper, we suggest an approach to answer complex hierarchy questions in NMA and define the certainty around them. The approach that we took moves away from producing a plain, non-meaningful and difficult to interpret hierarchy and towards attaching treatment ranking to a clear decision question, relevant to all or a subset of competing treatments.

The work presented in this paper is somehow related to the precision in the treatment hierarchy from a NMA as a whole. Preliminary suggestions have associated the uncertainty of the entire treatment hierarchy with the shape of the rankograms: rankograms which show large differences between each treatment being at each rank indicate a precise treatment hierarchy, while flat rankograms reflect uncertainty in the treatment hierarchy [27]. This suggestion can be formalised with the use of the hierarchy matrix. A probability mass function where one hierarchy takes 100% probability and all others zero, has maximum possible precision. In contrast, when each of the T! possible hierarchies have probability 1/T!, the precision is the minimum possible. Precision can be reflected by the magnitude of \({p}_{h_1}\) (the closer to 1, the more precise the treatment hierarchy) or by summarizing the hierarchy matrix in various ways (e.g. taking the variance of all hl). Alternatively, the rate at which ratios of frequencies as those calculated in Table 2 increase also give an indication of the precision of the treatment hierarchy.

However, all these approaches suffer from the drawback that they are dependent on the number of treatments and thus cannot be used as a universal way to judge the precision of the entire treatment hierarchy. In contrast, looking at the certainty of the specified criteria of interest to hold is more meaningful and relevant and constitutes an alternative way of judging imprecision of NMA treatment effects and treatment ranking. Even examples with relatively imprecise results (reflected in wide, overlapping confidence intervals) may be associated with considerable certainty around specific criteria, relevant for decision making. Thus, carefully choosing criteria based on which to judge the imprecision of NMA results is particularly important.


The proposed method has several limitations. In decision making multiple outcomes are often of interest and the current approach cannot take this into consideration. Moreover, benefit-risk assessments between different outcomes may be of interest to decision makers but cannot currently be appropriately handled. It would be potentially helpful that the derived frequencies are accompanied by 95% confidence intervals; as, however, the interpretation of such intervals would be unclear, they are not incorporated in the current version of the nmarank package [21].

Future directions

The method presented in this paper can be extended to adapt to decision questions related to more than one outcome. For example, we could sample from two or more outcomes and measure the frequency for each combination of hierarchies for the considered outcomes. Taking for example a network with two outcomes, O1 and O2 we would calculate \({p}_{h_l,O1}\cap {p}_{h_l,O2}\). The two outcomes can be sampled either separately or simultaneously, calculating or imputing within and between outcomes correlation, as described elsewhere [28, 29]. As all combinations of hierarchies for two (or more) outcomes are to be considered, estimated probabilities for each specific combination will be smaller compared to those from a single outcome.


Medical societies, national and international agencies, guideline panels and clinicians very often use evidence synthesis to make informed decisions and recommendations about the clinical effectiveness of alternative treatment options [30]. Given the need to make treatment recommendations, producing a hierarchy is natural within the aims of NMA end-users. We recommend that treatment hierarchies are attached to a specific decision question, a practice which is currently rarely undertaken in NMA applications. Depending on the setting, estimated probabilities of set criteria might inform or guide decision making regarding the choice of the preferable treatments and modify accordingly clinical practice. In conclusion, the method described in this paper offers an approach to produce clinically relevant output from NMA, which is specifically related to the research question of the particular systematic review.

Availability of data and materials

Outcome data and the code for applying our methods are available in



Network Meta-Analysis


Surface Under the Cumulative RAnking curve


  1. Chaimani A, Caldwell DM, Li T, Higgins JPT, Salanti G. Additional considerations are required when preparing a protocol for a systematic review with multiple interventions. J Clin Epidemiol. 2017;83:65–74.

    Article  Google Scholar 

  2. Petropoulou M, Nikolakopoulou A, Veroniki A-A, Rios P, Vafaei A, Zarin W, et al. Bibliographic study showed improving statistical methodology of network meta-analyses published between 1999 and 2015. J Clin Epidemiol. 2017;82:20–8.

    Article  Google Scholar 

  3. Salanti G, Ades AE, Ioannidis JPA. Graphical methods and numerical summaries for presenting results from multiple-treatment meta-analysis: an overview and tutorial. J Clin Epidemiol. 2011;64(2):163–71.

    Article  Google Scholar 

  4. Rücker G, Schwarzer G. Ranking treatments in frequentist network meta-analysis works without resampling methods. BMC Med Res Methodol. 2015;15:58.

    Article  Google Scholar 

  5. Rosenberger KJ, Duan R, Chen Y, Lin L. Predictive P-score for treatment ranking in Bayesian network meta-analysis. BMC Med Res Methodol. 2021;21(1):213.

    Article  Google Scholar 

  6. Cipriani A, Higgins JPT, Geddes JR, Salanti G. Conceptual and technical challenges in network meta-analysis. Ann Intern Med. 2013;159(2):130–7.

    Article  Google Scholar 

  7. Rücker G, Schwarzer G. Resolve conflicting rankings of outcomes in network meta-analysis: partial ordering of treatments. Res Synth Methods. 2017;8(4):526–36.

    Article  Google Scholar 

  8. Tervonen T, van Valkenhoef G, Buskens E, Hillege HL, Postmus D. A stochastic multicriteria model for evidence-based decision making in drug benefit-risk analysis. Stat Med. 2011;30(12):1419–28.

    Article  Google Scholar 

  9. Tervonen T, Naci H, van Valkenhoef G, Ades AE, Angelis A, Hillege HL, et al. Applying multiple criteria decision analysis to comparative benefit-risk assessment: choosing among statins in primary prevention. Med Decis Mak Int J Soc Med Decis Mak. 2015 Oct;35(7):859–71.

    Article  Google Scholar 

  10. Mavridis D, Porcher R, Nikolakopoulou A, Salanti G, Ravaud P. Extensions of the probabilistic ranking metrics of competing treatments in network meta-analysis to reflect clinically important relative differences on many outcomes. Biom J Biom Z. 2020;62(2):375–85.

    Article  Google Scholar 

  11. Brignardello-Petersen R, Johnston BC, Jadad AR, Tomlinson G. Using decision thresholds for ranking treatments in network meta-analysis results in more informative rankings. J Clin Epidemiol. 2018;98:62–9.

    Article  Google Scholar 

  12. Veroniki AA, Straus SE, Fyraridis A, Tricco AC. The rank-heat plot is a novel way to present the results from a network meta-analysis including multiple outcomes. J Clin Epidemiol. 2016;76:193–9.

    Article  Google Scholar 

  13. Chaimani A, Porcher R, Sbidian É, Mavridis D. A Markov chain approach for ranking treatments in network meta-analysis. Stat Med. 2021;40(2):451–64.

    Article  Google Scholar 

  14. Salanti G, Nikolakopoulou A, Efthimiou O, Mavridis D, Egger M, White IR. Introducing the treatment hierarchy question in network meta-analysis. Am J Epidemiol. 2021;kwab278.

  15. Chiocchia V, Nikolakopoulou A, Papakonstantinou T, Egger M, Salanti G. Agreement between ranking metrics in network meta-analysis: an empirical study. BMJ Open. 2020;10(8):e037744.

    Article  Google Scholar 

  16. Lu G, Ades AE. Combination of direct and indirect evidence in mixed treatment comparisons. Stat Med. 2004;23(20):3105–24.

    CAS  Article  Google Scholar 

  17. Lu G, Welton NJ, Higgins JPT, White IR, Ades AE. Linear inference for mixed treatment comparison meta-analysis: a two-stage approach. Res Synth Methods. 2011;2(1):43–60.

    Article  Google Scholar 

  18. Salanti G. Indirect and mixed-treatment comparison, network, or multiple-treatments meta-analysis: many names, many benefits, many concerns for the next generation evidence synthesis tool. Res Synth Methods. 2012;3(2):80–97.

    Article  Google Scholar 

  19. Jansen JP, Fleurence R, Devine B, Itzler R, Barrett A, Hawkins N, et al. Interpreting indirect treatment comparisons and network meta-analysis for health-care decision making: report of the ISPOR task force on indirect treatment comparisons good research practices: part 1. Value Health J Int Soc Pharmacoeconomics Outcomes Res. 2011;14(4):417–28.

    Article  Google Scholar 

  20. R: The R Project for Statistical Computing [Internet]. [cited 2021 Jun 25]. Available from:

  21. Nikolakopoulou A, Schwarzer G, Papakonstantinou T. nmarank: Complex Hierarchy Questions in Network Meta-Analysis [Internet]. 2021 [cited 2021 Nov 23]. Available from:

  22. GitHub - esm-ispm-unibe-ch/nmarank at reproducible [Internet]. GitHub. [cited 2021 Nov 23]. Available from:

  23. Kass RE, Raftery AE. Bayes factors. J Am Stat Assoc. 1995;90(430):773–95.

    Article  Google Scholar 

  24. Woods BS, Hawkins N, Scott DA. Network meta-analysis on the log-hazard scale, combining count and hazard ratio statistics accounting for multi-arm trials: a tutorial. BMC Med Res Methodol. 2010;10(1):54.

    Article  Google Scholar 

  25. Jackson D, White IR, Riley RD. Quantifying the impact of between-study heterogeneity in multivariate meta-analyses. Stat Med. 2012;31(29):3805–20.

    Article  Google Scholar 

  26. Cipriani A, Furukawa TA, Salanti G, Chaimani A, Atkinson LZ, Ogawa Y, et al. Comparative efficacy and acceptability of 21 antidepressant drugs for the acute treatment of adults with major depressive disorder: a systematic review and network meta-analysis. Lancet Lond Engl. 2018;391(10128):1357–66.

    CAS  Article  Google Scholar 

  27. Salanti G, Del Giovane C, Chaimani A, Caldwell DM, Higgins JPT. Evaluating the quality of evidence from a network meta-analysis. PLoS One. 2014;9(7):e99682.

    Article  Google Scholar 

  28. Efthimiou O, Mavridis D, Cipriani A, Leucht S, Bagos P, Salanti G. An approach for modelling multiple correlated outcomes in a network of interventions using odds ratios. Stat Med. 2014;33(13):2275–87.

    Article  Google Scholar 

  29. Efthimiou O, Mavridis D, Riley RD, Cipriani A, Salanti G. Joint synthesis of multiple correlated outcomes in networks of interventions. Biostat Oxf Engl. 2015;16(1):84–97.

    Google Scholar 

  30. Kanters S, Ford N, Druyts E, Thorlund K, Mills EJ, Bansback N. Use of network meta-analysis in clinical guidelines. Bull World Health Organ. 2016;94(10):782–4.

    Article  Google Scholar 

Download references


Not applicable.


Open Access funding enabled and organized by Projekt DEAL. TP, GS and AN were supported by project funding (Grant No. 179158) from the Swiss National Science Foundation (SNSF). AN was also supported by a SNSF personal fellowship (P400PM_186723). GR was supported by the German Research Foundation (DFG), grant RU 1747/1–2. DM is funded by the European Union’s Horizon 2020 COMPAR-EU project (No 754936).

Author information




TP conceived the idea, developed the methods, produced the results and wrote the R code and the first draft of the manuscript. AN contributed to the development of the methods, the production of the results, the R code and the writing of the manuscript. GS, DM, GR and GS contributed to the development of the methods, reviewed, commented and edited the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Adriani Nikolakopoulou.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no completing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Papakonstantinou, T., Salanti, G., Mavridis, D. et al. Answering complex hierarchy questions in network meta-analysis. BMC Med Res Methodol 22, 47 (2022).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI:


  • Clinically relevant question
  • Indirect evidence
  • Probabilistic ranking
  • Evidence synthesis