Methods for the inclusion of real-world evidence in network meta-analysis

Background Network Meta-Analysis (NMA) is a key component of submissions to reimbursement agencies world-wide, especially when there is limited direct head-to-head evidence for multiple technologies from randomised controlled trials (RCTs). Many NMAs include only data from RCTs. However, real-world evidence (RWE) is also becoming widely recognised as a valuable source of clinical data. This study aims to investigate methods for the inclusion of RWE in NMA and its impact on the level of uncertainty around the effectiveness estimates, with particular interest in effectiveness of fingolimod. Methods A range of methods for inclusion of RWE in evidence synthesis were investigated by applying them to an illustrative example in relapsing remitting multiple sclerosis (RRMS). A literature search to identify RCTs and RWE evaluating treatments in RRMS was conducted. To assess the impact of inclusion of RWE on the effectiveness estimates, Bayesian hierarchical and adapted power prior models were applied. The effect of the inclusion of RWE was investigated by varying the degree of down weighting of this part of evidence by the use of a power prior. Results Whilst the inclusion of the RWE led to an increase in the level of uncertainty surrounding effect estimates in this example, this depended on the method of inclusion adopted for the RWE. ‘Power prior’ NMA model resulted in stable effect estimates for fingolimod yet increasing the width of the credible intervals with increasing weight given to RWE data. The hierarchical NMA models were effective in allowing for heterogeneity between study designs, however, this also increased the level of uncertainty. Conclusion The ‘power prior’ method for the inclusion of RWE in NMAs indicates that the degree to which RWE is taken into account can have a significant impact on the overall level of uncertainty. The hierarchical modelling approach further allowed for accommodating differences between study types. Consequently, further work investigating both empirical evidence for biases associated with individual RWE studies and methods of elicitation from experts on the extent of such biases is warranted. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-021-01399-3.

the case in rare disease areas or in conditions where RCT design may be less feasible. The inclusion of RWE in a network meta-analysis (NMA) of data from RCTs is not a straightforward issue, as the effectiveness estimates obtained from RWE may be subject to selection bias, due to lack of randomisation, and hence use of randomised evidence may be preferable. However, the potential advantage of RWE, particularly for the purpose of health technology assessment (HTA) decision-making, is that it can be a substantial source of evidence thus increasing the available evidence base as well as better representing "real-life" clinical practice. To this extent, RWE can be used to bridge a gap between efficacy and effectiveness to ensure that the evaluation process reflects what is expected in clinical practice in terms of effectiveness of new health technologies. Therefore, recent methodological developments focus on appropriate methods of using such data.
As part of the IMI GetReal initiative, aiming to incorporate real life clinical data into drug development, methodologies were investigated for including such data in the later stages of the drug development process (i.e. health technology assessment), where data on treatment effectiveness can be included in the meta-analysis to inform HTA decision-making [3].
A number of methods have been used to combine evidence from different sources, which include naïve pooling [4], inclusion of external sources of evidence as prior information [5,6], power transform prior approach [7] and hierarchical modelling [8]. These methods were originally introduced in standard pairwise meta-analysis and later generalised by Schmitz et al. (2013) to network meta-analysis (NMA) to combine direct and indirect evidence from a number of studies investigating effectiveness of a number of treatments [9]. NMA has been used routinely in technology assessments conducted by many HTA agencies world-wide. It is a particularly useful meta-analytic tool when data from head-to-head trials on an intervention of interest are limited. NMA is used to combine evidence from studies of heterogeneous treatment contrasts and is also known as mixed treatment comparisons meta-analysis.
The aim of this paper is to investigate the use of NMA to combine estimates obtained from both RCTs and RWE using methods that differentiate between the study designs to account for the potential inherent biases present in RWE. A range of methods for combining RCT data with RWE in an NMA setting are discussed, which include naïve pooling, hierarchical modelling and power transform prior approach. The hierarchical model has been extended here to include power transform priors. The methodology is applied to an illustrative example in relapsing-remitting multiple sclerosis (RRMS) [10].
A systematic literature review was carried out to identify sources of data, from both RCTs and RWE, on the effectiveness of disease modifying therapies (DMTs) used in RRMS patients. The results from the review and extracted data were subsequently used to illustrate how the three methodologies can be used to combine the data from the two types of sources of evidence and to compare their impact on the treatment effect estimates and resulting uncertainty.

Illustrative example and sources of evidence
In a motivating example, DMTs used in patients with RRMS were considered. A systematic review was carried out to identify studies, both randomised and observational, of different DMTs with a main focus on effectiveness of fingolimod to illustrate how the inclusion of RWE in NMA would impact the estimates of effectiveness of fingolimod in the context of a technology appraisal. The literature search was limited to studies reported prior to January 2010, when fingolimod was given licencing authorisation. Data were extracted on the effect of each treatment on relapse rate. Search terms utilised is available in Additional file 1.

Network meta-analysis
A random-effects NMA model with adjustment for multi-arm trials [11] was used as the base case metaanalytic model. To investigate the effect of fingolimod on relapse rate, the number of relapses r ik in each study i and treatment arm k was modelled as count data following the Poisson distribution [12], where E ik is the exposure time in person years and γ ik is the rate at which events (relapses) occur in arm k for study i. Following a standard generalized linear model approach, the conjugate log link was used with random true treatment effect differences δ ibk between treatments k and b which are assumed to follow a common normal distribution: Assuming consistency in the network (which means that, for example, average treatment effect difference d AC , between treatments A and C, equals the sum of average treatment effect differences d AB, between treatments A and B, and d BC, between treatments B and C) allows us to represent treatment effect for each treatment contrast d bk in the network as a difference of basic parameters which are average treatment effects of each treatment in the network compared to a common reference treatment 1; d bk = d 1k − d 1b . Adopting a Bayesian approach to estimating the parameters of Eqs. (1)-(3) requires that prior distributions are placed on the model parameters: the baseline study effects, μ ib , for example, the uniform distribution μ ib~U niform(−10, 10), on the basic parameters, d 1k~U niform(−10, 10) and on the between-study variance σ~Uniform(0, 2).
For multi-arm studies, correlation between treatment effects relative to a common baseline treatment is taken into account by assuming true treatment effects δ i(bk n ) follow a common multivariate normal distribution which can be represented as series of univariate conditional distributions as follows: where n = 2, …, p in the (p + 1)-arm study of p treatment effect estimates relative to the reference treatment.

Naïve pooling approach
The above NMA model was initially used to combine data from RCTs with RWE by including the observational studies at 'face-value' . Data from all studies, regardless of the study design, were combined in the NMA described above.
This model was then extended to account for the differences between the designs of the studies as described in the following sections.

Power prior approach
To take into account the differences in study design between RCTs and observational studies, a 'power transform prior' approach was adopted [7]. This approach allows down-weighting of the RWE, thus making the data from this type of studies contribute less compared to data obtained from the RCTs. This is achieved by introducing a down-weighting factor, alpha (α), which the likelihood contribution of the RWE studies is raised to the power of. Alpha (α) is then varied between zero and one, with zero meaning that RWE is entirely discounted in the NMA, and with one indicating that all RWE is considered at 'face-value' , which is assumed to be the same for each RWE study included in the network. The impact of different levels of weighting on the results of the NMA is performed by considering a series of values for alpha. The results are then summarised both in terms of the effect estimates (and their associated level of uncertainty) and the rankings that the treatments received (based on these effect estimates).
Considering the annualised relapse rate ratio (ARRR) and assuming δ = log(ARRR ), the overall joint posterior distribution is given by, Assuming a standard random-effects NMA model, we combine the likelihood contribution of RWE, raised to the power of alpha, with the likelihood of the RCT data. Together with the prior distributions for the basic parameters, this gives the overall posterior distribution with RWE discounted by the parameter alpha. Assuming that the number of relapses follow a Poisson distribution, the RWE log likelihood (LL) in (1) becomes where h indexes the different values of α.

Hierarchical model approach
An alternative approach to allowing differentiation between study designs in NMA is introducing another level in a Bayesian hierarchical model, modelling the between-study heterogeneity of treatment effects within each study design (RCT or RWE) and across study designs. The hierarchical model by Schmitz et al. (2012) was adapted to model count data using a Poisson distribution. Assuming j = 1, 2 where 1 represents the RCT data and 2 represents the RWE then Eq. (1) now becomes, And, similarly as in the general NMA model, using the log link function Eq. (2) becomes The data from the two sources of evidence, RCT data and RWE data, are modelled separately at the withinstudy and within-design level. Similarly, as in Schmitz et al. (2013) assuming the treatment effects from RCT and RWE evidence are exchangeable, the study designs pecific estimates are combined to estimate an overall measure of treatment effect using random-effects [9].
Thus, if δ 1 ibk and δ 2 ibk represent the treatment effect of treatment k against a reference treatment b, based on the RCT evidence and RWE respectively, then, where d bk is the mean treatment effect of treatment k compared to a reference treatment b and σ 2 is the variance representing the between-studies heterogeneity. Prior distributions need to be placed on the parameters of the model, for example, the following "vague" prior distributions: This model was further extended by adopting a power prior approach at the within-study level for RWE (level one) by down weighting the likelihood contribution of the RWE by the factor alpha, as in Eq. (6), in the hierarchal model in order to provide a further sensitivity analysis. Combining average underlying study effects δ 1 ibk from RCTs with down-weighted effects δ 2 ibk from RWE produces an overall pooled ARRR combined effects d bk .

Implementation and model fit
All models were implemented in WinBUGS version 1.4.3 [13]. The first 10,000 simulations were discarded for all models as a burn-in. The main analyses were based on additional 20,000 iterations in order to ensure convergence. Convergence was investigated by visually inspecting the trace and history plots. Model fit was evaluated using the total residual deviance and the DIC for each network size [6]. Between-study heterogeneity was assessed using the standard deviation across random-effects models. Inconsistency was assessed by assessing residual deviance and performing node splitting analysis [14].   Table 1 shows the annualised relapse rate ratios (ARRRs) (95% credible intervals) for an NMA of RCTs only (lower triangle) and an NMA of both sources of evidence with no adjustments for study design (upper triangle). As seen in Table 1, the ARRRs comparing all treatments vs. placebo are less than one, indicating a relative reduction in ARRRs for all active treatments compared to placebo. When NMA treatment effect estimates are based on both sources of evidence, the levels of uncertainty can increase. For example, when comparing the effectiveness of fingolimod 0.5 mg with Avonex the 95% credible interval of the ARRR increased from (1.64 to 2.38) when using only RCT data, to (1.44 to 2.52) when combined data from both sources of evidence were used. This is likely to be due to the increased between-study heterogeneity, when the two different sources of evidence were combined.

Power prior
The impact of the 'power transform prior' approach on the estimates of ARRRs (of each treatment compared to placebo) obtained from an NMA including both RCTs and RWE can be seen in Fig. 2. The ARRRs of each active treatment compared to placebo are shown for a range of values of the down-weighting factor (alpha) between zero (maximum down-weighting, i.e. RWE not included) and one (RWE considered at 'face-value') (Additional file 4). It can be seen that for most of the active treatments there is relatively little impact of assigning increasing weight to the RWE in terms of the point estimates for the ARRRs. However, the impact on uncertainty around these estimates was noticeable. For example, considering fingolimod 0.5 mg compared to placebo (Fig. 2) for alpha value of 0.001, ARRR (95% credible interval) estimated was 0.42 (0.36, 0.50) while an alpha value of 1.0 resulted in ARRR of 0.41 (0.32, 0.53). Whilst the point estimate remains fairly stable, the 95% credible interval widen as more weight is given to the RWE. This may seem counter-intuitive, as more evidence is being included in the analysis, and therefore uncertainty levels would be expected to decrease. However, in this random-effects NMA, the between-study heterogeneity increased when including RWE, reflecting the differences observed between RCTs and RWE studies. This is represented by an increased between-study variance and in turn increased uncertainty in specific treatment effect estimates (see the last column of the Table in Additional  file 4). However, because this applies consistently across all treatments the net impact, in terms of treatment rankings, is minimal as can be seen in Fig. 3. Table 2 shows the results of adopting a hierarchical NMA which includes an additional level of hierarchy corresponding to the study design. Although the point estimates from the hierarchical model are in a broad agreement with the results presented above using a simpler 'power transform approach' , it can be seen that the levels of uncertainty (in terms of the width of the credible intervals) are generally greater. For example, in comparison to placebo, natalizumab had an ARRR of 0.41 (0.30, 0.57) when using a power prior approach when alpha is 1 (Fig. 2), while an ARRR of 0.40 (0.26, 0.70) in Table 1 Matrix table of annualised relapse rate ratios (95% credible intervals) for network meta-analysis (NMA) using naïve pooling random-effects models a  Table 2). This is due to the fact that the hierarchical model explicitly takes into account the differences between study designs, thus allowing for additional variability across studies. Extending the Hierarchical model to include 'power transform prior' approach ARRR effect estimates for a range of alpha values are included in Additional file 5. These differences in credible intervals were further observed in comparison to the power prior approach estimates. However, including RWE using the hierarchical model did not have any impact on the estimate of effectiveness for fingolimod (0.5 mg and 1.25 mg), which was due to the lack of RWE for this treatment and the nature of the model allowing for additional variability.

Discussion
As previous research has suggested, there are differences between RCTs and RWE studies [15]. However, the results from this study did not show that including the RWE simply over-or underestimated the treatment effect for each treatment, but rather that there was both overand underestimation for different treatments, supporting previous findings [9].
This study has further extended the methods introduced by Schmitz et al. (2013) by adapting them to model count data with the Poisson likelihood as well as extending the hierarchical model to down-weight the observational studies using a modified power prior approach. Both the hierarchical model and the modified hierarchical model are useful as they account for the heterogeneity between study designs and potential bias in RWE studies in the case of the latter [16]. However, the results of these analyses did not differ significantly from the naïve pooling results or basic power transform prior results for this illustrative example. They also produced wider credible intervals due to the increased between-study design heterogeneity when including RWE. Whilst the hierarchical models may be considered more appropriate (in that they Fig. 2 Annualised relapse rate ratios with 95% credible intervals for all active treatments compared to placebo for values of the down-weighting factor (alpha) using the 'power prior' model account for differences in sources of heterogeneity) care needs to be taken, and it is advised to compare the results with those from the naïve pooling in a sensitivity analysis to assess how results differ in practice.
In our illustrative example the inclusion of RWE increased the overall level of uncertainty in the treatment effects, supporting previous findings [9]. For example, when looking at the effectiveness of fingolimod 0.5 mg in the general population, greater heterogeneity was observed across different RWE studies, resulting in additional uncertainty around the effectiveness of fingolimod in the combined analysis in comparison to the effectiveness based on the carefully selected population in RCTs. The inclusion of RWE may increase the overall level of heterogeneity, and thus the uncertainty in estimated treatment effects -as was the case here. Thus, further   evaluation of such methods in other settings, including the use of simulation studies, is warranted, and extension of the hierarchical modelling approach to allow for different types of RWE, either by inclusion of study-level covariates or by adding an extra level into the hierarchy, may ameliorate any potential increase in uncertainty regarding the treatment effects due to increased heterogeneity due to a broader evidence base [17][18][19].
Implications for decision makers are that the methods can allow them to undertake assessments on a larger evidence base, and which includes a wider range of patient demographics and clinical characteristics. The inclusion of RWE in appraising health technologies can provide a larger (and possibly more representative) evidence base for decision-making; however, HTA analysts and decision-makers will need to consider on case-by-case basis whether or not the available RWE is sufficiently credible, whether this type of analysis is acceptable, and how the results should be interpreted and ultimately used.

Limitations
There are a number of limitations of this study that need to be recognised. First, the sample sizes of RWE studies were smaller compared to the larger RCTs available for RRMS, which may have had an impact on the uncertainty of effect estimates when weighting studies. Second, this study has only utilised one illustrative example and results may differ in other clinical area. While this may be the case, it remains of importance to compare the analysis of combined RCT and RWE data to the traditional NMA of RCT data alone to investigate the degree of effectiveness vs. efficacy gap. Thirdly, a Poisson likelihood was used to analyse this data. It is possible that the increased uncertainty could be reduced by utilising a negative binomial likelihood which can account for potential over dispersion when modelling count data. Fourth, meta-regression was not considered in this study. While meta-regression may explain some of the between-study heterogeneity, it may be limited both by the covariate information available and/or the number of studies in the NMA. Fifth, the NMAs in this particular example included aggregate level data only. Access to individual patient data from RWE would allow for adjustment of the results for potential allocation bias, potentially reducing the between-study heterogeneity and consequently the uncertainty around the pooled effectiveness estimates. However, obtaining IPD from observational studies can often be difficult due to the regulations around sharing such data. Further research would be needed to assess the impact of utilising IPD from observational studies. Finally, extraction of count data analysed with exact Poisson likelihood was considered more appropriate than, for example, extracting data on adjusted ARRRs (with modelling based on the normal approximation). However, this has its limitations as this prevents adjustment of treatment effects for confounding factors, which would only be possible with data at the IPD level.

Conclusions
While the 'power transform prior' NMA as well as hierarchical NMA models had little impact on ARRR effect estimates, the degree of inclusion of RWE in the NMAs impacted the level of uncertainty around these effect estimates, likely as a result of increased between-study heterogeneity. The hierarchical NMA models provided another level of uncertainty, accounting to the differing study types (i.e. RCTs and RWE). Therefore, a comprehensive simulation study is required to investigate the ability of these models to correctly estimate treatment effects whilst also accounting for biases introduced by using RWE in different scenarios.
RWE can provide valuable data for HTA decisionmaking and in this paper we have illustrated a number of formal approaches for incorporating such data in evidence synthesis. Further, RWE can provide additional information, particularly in the case of rare diseases where clinical trial data are limited. Inclusion of RWE in meta-analysis can also be useful in clinical development planning as in Martina et al. (2018), who showed that inclusion of non-randomised data in meta-analysis can help inform the design of a future trial and potentially reduce the number of patients required as part of a drug development programme [20]. However, the added value of RWE should be considered on a case-by-case basis.