Network meta-analysis combining individual patient and aggregate data from a mixture of study designs with an application to pulmonary arterial hypertension

Background Network meta-analysis (NMA) is a methodology for indirectly comparing, and strengthening direct comparisons of two or more treatments for the management of disease by combining evidence from multiple studies. It is sometimes not possible to perform treatment comparisons as evidence networks restricted to randomized controlled trials (RCTs) may be disconnected. We propose a Bayesian NMA model that allows to include single-arm, before-and-after, observational studies to complete these disconnected networks. We illustrate the method with an indirect comparison of treatments for pulmonary arterial hypertension (PAH). Methods Our method uses a random effects model for placebo improvements to include single-arm observational studies into a general NMA. Building on recent research for binary outcomes, we develop a covariate-adjusted continuous-outcome NMA model that combines individual patient data (IPD) and aggregate data from two-arm RCTs with the single-arm observational studies. We apply this model to a complex comparison of therapies for PAH combining IPD from a phase-III RCT of imatinib as add-on therapy for PAH and aggregate data from RCTs and single-arm observational studies, both identified by a systematic review. Results Through the inclusion of observational studies, our method allowed the comparison of imatinib as add-on therapy for PAH with other treatments. This comparison had not been previously possible due to the limited RCT evidence available. However, the credible intervals of our posterior estimates were wide so the overall results were inconclusive. The comparison should be treated as exploratory and should not be used to guide clinical practice. Conclusions Our method for the inclusion of single-arm observational studies allows the performance of indirect comparisons that had previously not been possible due to incomplete networks composed solely of available RCTs. We also built on many recent innovations to enable researchers to use both aggregate data and IPD. This method could be used in similar situations where treatment comparisons have not been possible due to restrictions to RCT evidence and where a mixture of aggregate data and IPD are available. Electronic supplementary material The online version of this article (doi:10.1186/s12874-015-0007-0) contains supplementary material, which is available to authorized users.


Background
Decision making bodies for national health care providers, such as the National Institute for Health and Care Excellence (NICE) for the NHS in England and Wales or the Pharmaceutical Benefits Advisory Committee (PBAC) in Australia, have a need to consider all available treatments when making recommendations for clinical practice. There is rarely a single definitive study comparing these treatments and it is often necessary to synthesise the best available evidence to come to a decision [1].
Network meta-analysis (NMA) for indirect mixed treatment comparisons of multiple treatments is a generalization of standard meta-analysis, which is used to combine the results of multiple studies, to the comparison of two or more than treatments. This has become a well-established methodology for evidence synthesis [2,3] and is routinely used and recommended by NICE [4,5]. The gold-standard of evidence to be included in a NMA are randomized controlled trials (RCTs) which include a control arm and whose populations are randomized to reduce bias and improve precision. The results are usually available from literature as only aggregate data. Access to individual patient data (IPD), when available, can be used to understand the relationship between covariates and outcomes [6,7]. Methods for the inclusion of IPD in pairwise metaanalysis have been developed by Sutton et al. [8] and Riley et al. [9,10] and these were extended to the network metaanalysis of binary outcomes by Saramago et al. [7] and Donegan et al. [6]. This model can easily be adapted to continuous outcomes and provides a covariate-adjusted NMA model combining IPD and aggregate data.
One of the requirements to perform an NMA is to have a connected network [4], which can be challenging when not enough RTCs are available, as illustrated in Figure 1 for the case of a NICE technology assessment follicular lymphoma [11]. This is often a problem in new indications for small populations or orphan diseases [12]. However, a decision on the most appropriate treatment is still needed and including non-randomized studies to complete the network and conduct the comparison is a potential solution [13]. A commonly available type of non-randomized study is the single-arm observational study, or before-andafter study [14], in which outcomes in a group of patients are investigated before and after an intervention.
Several methods have been proposed to incorporate such observational studies [15]. One approach is the three-level hierarchical model which allows the incorporation of evidence from many different study designs [16,17]. An example of such a model consists of an overall effect for each treatment j, which can be labelled d j . Treatment effects for each different type of study, such as an RCT effect φ j1 , a before-and-after study effect φ j2 , and a case-control study effect φ j3 , could then be normally distributed around this overall effect. At the bottom level of the hierarchy are the individual study effects δ jki for each treatment j, study type k, and study i, which could be normally distributed around the study type treatment effects φ jk . This approach has the advantage of keeping the inference from each type of trial separate but is not applicable in cases where the number of studies per study type per treatment is small.
An alternative approach to including observational studies, and thus connecting the network, is that of propensity scores which are the probability that a patient would be given a particular treatment on the basis of their background characteristics [18][19][20]. These probabilities are often estimated using logistic regression. However, different propensity score models are required for each treatment and a great many studies are therefore required for each study. This is a particular drawback if IPD is not available for most of the treatments. Another disadvantage is the difficulty of incorporating propensity scores into the existing covariate-adjusted NMA models.
A final alternative for including observational studies in disconnected networks is the method of constructing empirical priors informed by these observational studies [15,21]. These empirical priors inform parameter estimation via: where the L(θ|RCTs) is the likelihood on the basis of the RCT evidence, L(θ|Obs) is the likelihood on the basis of the observational evidence, P(θ) is the prior, and α is a parameter representing the strength given to the observational evidence. If α = 1, for example, the observational evidence would be given the same weight as the RCT evidence. This approach shares the advantage of the hierarchical method in that it explicitly separates the RCT and observational evidence but also shares the disadvantage of the propensity scores method that it is difficult to merge with existing NMA models. The method we choose to build upon is the construction of control arms for before-and-after studies by matching their baseline characteristics to those of control arms in included RCTs [20,22]. This analysis of covariance method uses regression models to estimate the effect of treatments not included in the study. We adapted this in a natural fashion to covariate-adjusted NMA models through an assumption of exchangeability (random-effects) on the placebo effects of study arms. A similar approach was originally applied to meta-analysis [23] and has recently been proposed for the construction of baseline natural history models in NMA [24]. However, recent work has been critical of placing random-effects on the trial-level baseline improvements, namely the placebo effect [24,25], as it interferes with the randomization of the RCTs and learns across trial information. Despite these concerns, in cases such as the application we will discuss necessitate this approach as it would otherwise not be possible to compare the treatments of interest due to the disconnectedness of the evidence network.
Illustrative example: mixed treatment comparison of combination therapies for pulmonary arterial hypertension We will illustrate our method for the inclusion of beforeand-after studies in a mixed treatment comparisons of therapies for pulmonary arterial hypertension (PAH). PAH is a rare disease characterised by progressive elevation of pulmonary vascular resistance leading to right heart failure and death [26]. Current treatments include endothelin receptor antagonists (ERA), phosphodiesterase-5 inhibitors (PDE5i), and prostacyclin analogues (Pr) [27]. These drugs are often used in combination to try to improve outcomes [28,29]. The anticancer therapy imatinib is an oral therapy which has also recently been studied in PAH and its use is of interest to clinicians. No systematic comparison of available monotherapies and combination therapies for PAH has been conducted and, in particular, imatinib as add-on to other combination therapies has not been investigated. Imatinib was being evaluated as an alternative to prostacyclins as additional therapy for patients on a combination of ERA and PDE5i. The comparison of imatinib with prostacyclins for this patient group was not previously possible on the basis of direct evidence or through indirect NMA comparisons restricted to RCTs and we thus took it as our primary objective for treatment comparison. However, our comparison should be viewed as illustrative and should not be used to guide clinical practice as the evidence is indirect and the analysis relies on a number of model assumptions that were necessary to facilitate the comparison.
Our primary evidence base for the NMA was the IPD from the IMPRES trial [30]. This was a randomized placebo-controlled Phase-III trial to investigate the efficacy and safety of imatinib as an add-on to combination therapy for the treatment of PAH. The study included patients with severe PAH and receiving two or more PAHspecific treatments . . Patients were initially on one of four combination treatments, namely ERA + PDE5i, ERA + Pr, PDE5i + Pr, or ERA + PDE5i + Pr., were randomized within these background treatment groups to either imatinib or placebo and were followed-up for at least 24 weeks. Patient group characteristics are reported in Table 1, where heterogeneity in baseline characteristics between randomized groups is exhibited. The high dropout rates in this trial reflect the severity of the disease and the side-effects of the treatments.
The continuous outcome of short term change in 6 minute walk distance (6MWD) from baseline, in meters, is used in licensing decisions by agencies such as the Food and Drug Administration [31] and as a result is the primary outcome in nearly all Phase-III trials in PAH. Although adjusting final 6MWD for baseline 6MWD is the recommended approach when analysing trial results [32], this was not possible as we had only aggregate data from most of the studies and many only reported the change outcome. We therefore chose change in 6MWD as our efficacy measure. Short term was defined as 12 weeks to 1 year, as clinical opinion was that patients would derive maximal benefit from treatments within 12 weeks.
Six covariates of interest were identified by a mixture of exploratory analysis of the IMPRES data and expert clinical opinion. The covariates identified were: the age at baseline (AGE), an indicator for whether a patient is male (SEX), the 6MWD at baseline (WALK), the World Health Organization New York Health Assessment status (STATUS) and pulmonary vascular resistance (PVR). STATUS categorises the severity of PAH into one of four increasingly severe categories, ranging from no limitation of activity and no symptoms with ordinary physical activity to marked limitation of activity and symptoms with any activity, even at rest. PVR is a measure of the resistance of the pulmonary vasculature calculated from the pressure drop across the pulmonary vascular bed divided by the pulmonary blood flow. Means of the covariates were used for the aggregate data.

Systematic literature review of studies in the literature
The results of a systematic literature review were available and was used to identify a network of studies to be included in the analysis. In this review, the MEDLINE® and EMBASE® databases were searched simultaneously. Patient Intervention Comparator Outcome Study type (PICOS) [33] criteria were followed and the quality assessment was performed according to the NICE checklist for RCTs [34]. Details of the PICOS terms are included in Additional file 1. Search terms included a combination of free-text and thesaurus terms relevant to PAH, ERA, prostacyclins, PDE5i, and RCTs, although case-control and cohort studies were also included. The Cochrane Central Register of Controlled Trials was also searched using a similar strategy. The relevance of each citation identified from the databases was based on title and abstract according to the PICOS criteria. As we wanted to explore the effects of covariates, studies that did not report two or more of the 6 covariates of interest were excluded, while we would use IMPRES data to perform single imputation when only one covariate is missing. From this review, we identified and included 5 monotherapy [35][36][37][38][39] and 4 combination therapy [28,[40][41][42] RCTs, summary statistics for which are provided in Table 2 and Table 3, respectively. Additionally, 6 before-and-after studies investigating monotherapies and combination therapies were included [43][44][45][46][47][48] and their summary statistics are reported in Table 4. PRISMA flowcharts are provided for the systematic searches in Figure 2 and Figure 3 and a PRISMA checklist is provided in Additional file 2 [49]. Although there were substantial differences across trials in the doses of the administered treatments, as recorded in Tables 2, 3 and 4, clinical opinion was such that their effects would be comparable.
Only two-arm RCTs were identified, with each arm involving the addition of some treatment or placebo to a group of patients who were either treatment naïve or on some baseline treatment. In these studies, included patients had been on the baseline treatment for a time period before randomization (eg. ERA for at least 4 months prior to randomization [42] which was assumed sufficient to derive maximal benefit from the baseline treatment. This assumption implies that any improvement was due to the additional treatment or the placebo effect.
The before-and-after studies were single-arm observational studies which reported the 6MWD of a group of patients on a particular background therapy before and

Methods
The final evidence network of observational studies and RCTs for the NMA is shown in Figure 4. The treatment effects are labelled β i and are the expected short term improvement in 6MWD. Arrow directions indicate the  interpretation of these parameters, eg. Positive β 2 means that prostacyclins are more effective than placebo. The network of primary interest is highlighted in bold, and the comparison of primary interest β 8 − β 6 , the effectiveness of imatinib against prostacyclins as an add-on to ERA + PDE5i, is highlighted by a bold, dashed, indirect link. This illustrates the necessity of including observational evidence as this network would be disconnected had it been restricted to RCTs. Although this indirect comparison could have been conducted with evidence from only IMPRES and the Jacobs et al. studies, the inclusion of a wider range of evidence strengthens our estimates of covariate adjustments and the short term placebo improvements in 6MWD in PAH patients. The following sections explain our development of a NMA model to estimate the parameters β i by synthesizing all available evidence. As only two-arm RCTs and singlearm observational studies were identified, the models we develop will not be designed for trials with more than two arms. This model development is summarized in Table 5.
Model M1: network meta-analysis of aggregate data from RCTs and observational studies The first model we considered was a simple network meta-analysis of aggregated data from the IMPRES study and aggregate data from the literature. The mean short term change in 6MWD for each study i and arm j, Y ij , was modelled as: where SE ij is the standard error of the observed change in 6MWD in arm j of study i. It should be noted that this parameterization is slightly different to that used in other network meta-analyses [15,50] as we are using a trial level placebo effect α i in combination with a trial level effect of treatment, θ ij . The placebo effect is the mean improvement in 6MWD that a group of patients would experience if they entered trial i and received only placebo, in addition to their background therapy, and is assumed to be the same for each of the arms of the trial. The effect of treatment in arm j is a linear combination   of the effects of additional treatments initiated in that arm at the start of the trial: θ ij = effect of additional treatment initiated in j th arm of i th study β = vector of treatment effects f ij = linear function with coefficients +1 or -1 In Equation (2), random effects with common variance σ 2 β were placed on the treatment effects θ ij in the i th study and j th arm, as they were assumed to be exchangeable and independent. The entries of the treatment effects vector β are the treatment effect parameters β l , which we assumed to be fixed effects. For arms receiving only a placebo, it was assumed that θ iC = 0 so that the improvement in 6MWD is only the placebo effect α i . Two-arm trials with no placebo arm would have mean improvements of α i + θ i1 and α i + θ i2 , where θ i1 and θ i2 are the effect of the treatment combinations in the first and second arms, respectively.
Observational studies consist of only one arm and their inclusion required an assumption about their α i . We assumed that these α i , the placebo improvement in a trial, which would subsume the placebo effect, would be exchangeable across trials. In the model above, we expressed this by placing a Normal random effect with common mean α and variance σ 2 α on the α i s: This use of random effects enables evidence from all RCT and before-and-after observational studies to estimate the expected change in 6MWD. Note that this assumption possibly interferes with randomization as the α i will be drawn towards the mean α and thus the treatment effects β may be biased. An alternative would be to treat the α i as fixed effects [24,25,51] and thus preserve randomization, but this would not allow the inclusion of before-and-after studies.
The linear functions f ij () were almost always single values, eg. β 6 for Jacobs et al. as the only additional treatment was prostacyclin analogues [43]. In the BREATHE-2 study [40], labelled study i for convenience, arm j = 1 was a treatment naïve group started on bosentan (ERA) and intravenous epoprostenol (Pr) while arm j = 2 was a treatment naïve group started on bosentan (ERA) alone. This was represented by the functions: which could be read from Figure 4. The β i are our analogues of the basic parameters in the standard indirect treatment comparison model described in Dias et al. [5], while f ij (β) are our analogues of the functional parameters. The choice of priors for α and the β l s was based on the assumption that no patient would change their walking distance by more than 400 meters, which implied a standard deviation of 200 meters. Assuming that the smallest study had at least 10 patients, this gave SE ¼ 200 ffiffiffi ffi 10 p . and therefore a prior variance for effects on the mean of SE 2 = 4000. We represented these prior beliefs via Normal distribution, which were judged appropriate in the context of changes in 6MWD through exploratory analysis of the IMPRES data and expert clinical opinion. For σ 2 β and σ 2 α , the vague assumptions that σ β ≤ 50 meters and σ α ≤ 50 meters were used, which expressed the belief that individual patients would not differ from the mean improvement in 6MWD by more than 100 meters. Following the recommendation of Lambert et al. [52], a uniform prior representing this belief was placed on the standard deviation. These considerations gave the priors:  Model M2: network meta-analysis of IPD and aggregate data from RCTs and observational studies We extended the aggregate data model described by Equation (1) in Section 2.1 to include individual patient data through the relation: for the change in 6MWD for patient k of arm j and study i, where σ 2 is a common variance parameter to be fit to the data. Although in general we would use a separate σ 2 , with a subscript, for each IPD trial, we have dropped the subscript to simplify the notation as our application only includes a single IPD trial. The treatment effects and placebo effects were as in the aggregate data model M1: Normal prior distributions were again assumed for the means of the Normal distributions and Uniforms were placed on the standard deviations. As in the specification of priors for α and the β l s in model M1, we reasoned that if a patient was assumed not to have an improvement exceeding 400 meters, their standard deviations should be 200 meters and therefore have variances of 40000. These assumptions resulted in the priors: β l e N 0; 40000 ð Þ σ e U 0; 50 ð Þ: As the evidence for the treatment effects β l came from both individual patient and aggregate (mean) level data, the 'vaguer' prior was used. The prior for the placebo effect α and for the standard deviations σ α , and σ β were kept the same as in the aggregate data models in Section 2.1. This was appropriate as they have the same meaning in both the IPD and aggregate data models.
Model M3: across-study and within-study covariate adjustments on the placebo effect To account for across-study heterogeneity, we extended the model to include covariate adjustments on the placebo effects, the α i s. A further advantage was that these adjustments for differences in the patient populations led to better assessments of the placebo improvement in the single-arm before-and-after studies due to their better explanation of the heterogeneity. We also adjusted for heterogeneity within the studies, which is betweenpatient heterogeneity, for which we had IPD. The model was defined for a mean covariate X ij and individual covariate X ijk as follows: The last two equations are as in models M1 and M2. In this model, φ was the effect of the mean and accounted for across-study differences, while π was the effect of an individual's covariate and accounted for within-study differences.
Note that the difference between π and φ in Equation (8) quantifies ecological bias, a bias that arises when the effect of the mean of a covariate is different from effect of the covariate itself, and that if π = φ then there would be no ecological bias.
Priors for α, β l , σ, σ α , and σ β were as in the model with no covariate adjustments of Section 2.2, while a vague Normal distribution for mean effects was used for φ and a vague Normal distribution for individual effects was used for π: φ e N 0; 4000 ð Þ π e N 0; 40000 ð Þ which completed the NMA model combining IPD and aggregate data with covariate adjustments on the placebo effect.

Model M4: within-study covariate adjustments on treatment effects
Our final extension was to include covariate adjustments for the effect of patient characteristics on the efficacy of treatments, the β l s in the models. Such a model would be useful for predicting efficacy and evaluating costeffectiveness in patient subgroups with specific baseline characteristics. As only a small number of studies were available in our example for each treatment effect, it was not practical to account for across-study heterogeneity. We therefore restricted treatment effect covariate adjustments to the within-study level, and thus to only the treatment effect of imatinib for which IPD was available. The model was defined as θ ij e N f ij β ð Þ; σ 2 β ð12Þ Where Equations (7) and (14) are modifications of Equations (8) and (2) to include patient specific treatment effects. The elements γ l of γ were the effects of the covariate on the treatment effect β l . The linear functions f ij () therefore acted on linear combinations of the treatment effects β and their covariate adjustments γ.
The same priors as before were used for α, β l , σ, σ α , σ β , φ and π, while a Normal distribution for individual patent level effects was used for the γ l , i.e.
γ l e N 0; 40000 ð Þ This completed the specification of an NMA model for combining IPD and aggregate data from RCTs and observational studies with covariate adjustments on the placebo and treatment, of imatinib, effect. The models described in these sections are summarized in Table 5 and we applied them to the PAH example.

Covariate selection via DIC-based forward stepwise selection
Model M4 potentially includes covariates at three different levels and the full model space can be quite large. In our PAH example there are 6 possible covariates, so a total of 2 18 possible models. Although a model that includes all of these covariates would be highly adjustable to populations in which predictions are desired, it is necessary to avoid over fitting to the data. To avoid over fitting and produce robust predictions, we use the Deviance Information Criterion (DIC, [53]). This is a predictive criterion that balances fit and complexity. It is computationally infeasible to investigate the full model space so we instead apply DIC-based forward stepwise selection [54,55]. This allows us to search through the space of models using the following steps: Initially, for the PAH example with 6 covariates of interest, Step 2 involves a search of 18 possible models. The second time through involves 17 possible models, and so on. This leads to a maximum of 171 models to search, which is computationally feasible.

Results
All results presented here are from an implementation of the models described in Section 2, and summarized in Table 5, in the WinBUGS [56] software package. This is a Windows based software for Bayesian inference using Gibbs sampling. The code for these models is provided in Additional file 3 and the authors are happy to respond to any queries about its use. All results were sampled from 250 000 iterations of a single Markov chain Monte Carlo (MCMC) chain following a burn-in of 100 000 iterations. We also sampled a second chain from alternate initial values and confirmed that 250 000 iterations was sufficient for convergence on the basis of the Gelman-Rubin statistic [57].

Results of models M1 and M2: NMA with no covariate adjustments
Summary statistics of the posterior distributions of the placebo and treatment effects, on the scale of change in 6MWD in meters, for the comparison of imatinib against prostacyclin analogues as add-on to ERA and PDE5i from the model M1, described in Section 2.1, are presented in Table 6 and Figure 5. This NMA combined only summary statistics from the IMPRES trial and did not make use of the available IPD. The posterior means and 95% credible intervals are comfortably within the prior ranges specified in Section 2.1. The summary of the placebo effect α implies that a randomly selected group of patients would be expected to have a mean 6MWD improvement of 4.78 meters, and for this mean to lie within the range of -4.8 and 14.6 meters with a probability of 95%, were they to enter a placebo arm of one of the studies. This is not unreasonable on the basis of the means and standard errors of the observed changes in 6MWD in the control arms of the RCTs, reported in Table 2 and  Table 3. The wide and inconclusive 95% credible intervals for the treatment effects and comparison are indicative of the weakness of the evidence. Also provided in Table 6 and Figure 5 are the results of model M2, described in Section 2.2, which combined available aggregate data with IPD from the IMPRES study. The means of the posterior distributions do not change very much but the credible intervals for parameters based on IPD from IMPRES, the α (imatinib to E + P5) and treatment effect β 8 , shrink. This reduction in the width of the credible intervals is due to the complex interaction between the vague priors in the different parameterisation of model M2 from M1 and is not due to any improvement in the use of the evidence. Even vague priors are somewhat informative and this is illustrated by the reduction in the credible intervals.  Table 6 and Figure 5, while further parameter estimates are provided in Additional file 4. We used DIC-based forward stepwise selection to choose the covariate adjustments at across-study and within-study level on the placebo effects and within-study level on the treatment effect of imatinib. It was found that the DIC-minimizing model had no across-study covariate adjustments on the placebo effect but had within-study adjustments for AGE, STATUS and PVR on the placebo effect.
The benefit of including IPD is again indicated by the reduction in the 95% credible interval for the treatment effect of imatinib added to ERA and PDE5i (β 8 ) from that of model M1 of aggregated data. The 95% credible interval for the indirect comparison of imatinib against prostacyclins as add-on to ERA and PDE5i, (-76.65, 85.24) from model M3, remains approximately the same width as in the aggregate data model, (-83.70, 89.27) from model M1, as illustrated in Figure 5. This is because the effect of additional prostacyclins is based on only aggregate data. The direction of the effects of AGE (-0.90), STATUS (-11.89), and PVR (-0.05) on the expected 6MWD improvement of a patient in the IMPRES trial imply that older and sicker patients have a lower expected improvement, which is reasonable. The non-selection of across-study covariates indicates that the imputed values for missing covariates, such as PVR in Jacobs et al., have no effect on the results. That some values were imputed may affect the DIC-based selection but this is unlikely to be a strong effect as the covariate adjustments were generally found to have little impact. We further fit model M4, described in Section 2.4, and applied DIC-based stepwise selection to choose covariate adjustments on the treatment effect of imatinib. However, Table 6 Results of four network meta-analyses: based on only aggregate data; combining IPD and aggregate data with no covariate adjustments; combining IPD and aggregate data with covariate adjustments for individual patient AGE, baseline STATUS and baseline PVR at within-study level; using the covariate adjusted IPD and aggregate data model with the observational studies down-weighted by inflating their standard errors by a factor of 10; using the covariate adjusted IPD and aggregate data model with constructed control arms for observational studies with an assumed deterioration of 25 meters in 6MWD   Results sampled from 250000 iterations following a burn-in of 100000. no such covariate adjustments were included so the chosen model M4 was identical to M3. In addition to applying our NMA methodology to the PAH example, we also tested the impact of its assumptions through sensitivity analyses. Sensitivity analysis, model S1: down-weighting the observational studies In our standard NMA models M1, M2, M3 and M4, we gave equal weight to the results of the before-and-after observational studies and those of two-arm RCTs. An alternative to this assumption is to down-weight the results, recognizing internal bias due to lack of rigor, through a multiplicative adjustment to the standard errors of the results in either or both arms of the aggregate data, i.e.
where δ i is the quality weight of the i th study, based on a subjective assessment. This is similar to the weighting of the empirical priors derived from observational evidence discussed in the background section [15,21]. If δ i = 1, it would represent a study that was judged to be of the highest quality, and its evidence would be given full weight. This was the value we assigned to RCT data. Using the covariate adjusted model of Section 3.2, we repeated the simulations with δ i = 0.1, increasing observed standard errors by a factor of 10, for the observational studies, thus down-weighting them, relative to RCTs, to represent their poorer quality.
For example, the observed change in 6MWD from baseline in Jacobs et al. was 41 meters with a standard error of 38, as reported in Table 4. This sensitivity analysis would assume that this standard error had been 380, substantially larger than any of the observed standard errors reported in Tables 2, 3 or 4 (maximum was about 50). We can therefore conclude that if our analysis is robust to down-weighting by a factor of δ i = 0.1, it is likely to be robust to most levels of uncertainty we could plausibly observe.
The results from this sensitivity analysis, labeled model S1, are presented in Table 5 and Figure 5. The main change from the results of the model without downweighting of the observational studies, models M1, M2 and M3 in Table 5, was the increased range of the 95% credible intervals for treatment effects estimated on the basis of observational studies, such as β 6 . The range of the 95% credible interval of the comparison of imatinib against prostacyclins as add-on to ERA + PDE5i was also increased, by a large factor, illustrated in Figure 5, due to its reliance on the down-weighted observational studies. The magnitude of the comparative effectiveness (β 8 − β 6 ) also increased substantially, but is most likely due to the increased random variation illustrated by the expanded credible intervals.
This decrease in the accuracy of the treatment effect estimates and indirect comparisons indicates the influence of the observational studies. As the effect was largely on the accuracy of these estimates and not on their direction, it could be concluded that the NMA methodology was robust to the down-weighting of the observational studies, although its reliance on possibly weak and biased observational evidence was highlighted. Sensitivity analysis, model S2: constructed control arms in the observational studies The lack of control arms in the observational studies presented a difficulty of not knowing what would have happened had patients not been given additional treatment. The NMA models of Section 2 placed Normally distributed random effects on the expected improvements in patients who had entered a study but only received a placebo, An alternative was to construct a control arm for the observational studies by making an assumption about Y iC , the mean change in 6MWD for patients who did not receive additional therapy. As the Jacobs et al. study [43] looked at patients who were deteriorating on oral therapy, we assumed that patients' 6MWD would decrease during the trial if they were not given any new treatments. This study included patients whose 6MWD had decreased by 58 meters over a mean time of 20.6 months before entering the study. Our short term followup was approximately 24 weeks, which is less than half of 20.6 months, so we assumed a mean change of Y iC ≈−25m would be observed in missing control arms over this short term follow-up. We further assumed that the standard error of the mean in this constructed control arm, SE iC , is the same as that observed in the treatment arm. We used these assumptions to construct control arms for all observational studies.
We repeated the analysis using the covariate-adjusted model from Section 3.2 with the constructed control arms, giving the results, labeled model S2, presented in Table 5 and Figure 5. It was difficult to interpret the direction of the change in placebo and treatment effects, due to the effect of covariate adjustments. The direction of the comparison of imatinib against prostacyclins was shifted in favor of prostacyclins, which was expected due to the conservative assumption about the control arms. However, the wide confidence intervals and overall direction of the comparisons remained so the analysis was judged to be robust to this alternative assumption about the control arms.

Discussion
In this paper we have considered the problem of how to perform a network meta-analysis when the RCT evidence does not form a complete network. Our proposal was to complete the network using single-arm before-and-after observational studies by building a covariate adjusted random effects model on the placebo improvements. We built on recent innovations to construct a model which combines IPD and aggregate data from RCTs and beforeand-after studies and allows for the inclusion of covariate adjustments for heterogeneity at across-and within-study level on the placebo effect and within-study level on the treatment effect. Using this model, we performed a clinically novel comparison of the benefit of imatinib against prostacyclins as add-on therapy to PAH patients on a combination of ERA and PDE5i. This comparison was only possible through inclusion of observational studies as an evidence network restricted to RCTs would be disconnected.
As the credible intervals were very wide, the results of our application to PAH were considered to be inconclusive. This was due to the weakness of the evidence as only a few studies, with small sample sizes, were available for each edge of the network. This data limitation may also be the reason why we found that covariate adjustments had little effect on the NMA results and that no across-study adjustments were included on DIC grounds, although we can also interpret this as evidence that heterogeneity had little effect on the NMA. It is possible that important covariates were not reported by IMPRES or other studies, or that reported covariates were incorrectly considered to be of no importance due to the weakness of the data. It is also possible that the stepwise selection algorithm missed important covariates as it only investigates a small portion of the total 2 18 possible models. A simulation/robustness study could address these concerns but would be computationally intensive as the model selection step, even using stepwise selection to reduce the set of models under consideration, was resource intensive. The best strategy to improve the practical utility of this application of NMA to PAH is to collect further evidence, ideally IPD from a new or existing RCT.
Apart from these data limitations, which are specific to the application, there are a number of limitations and untestable assumptions of the model itself. As in many meta-analysis and NMA models [15], we assumed the effects of particular treatments were the same across studies by placing a fixed effect on each β l . In cases where sufficient data are available, this could be relaxed to a random effects assumption where we assume the β l from different trials follow a common, possibly Normal, distribution. Our model also assumed, in Equation (2), that effects of additional treatments had the same variance σ 2 β in all studies, no matter how many additional treatments were being administered. This is possibly implausible as a the effect of a combination of three new treatments should have a higher variance than the effect of a single new treatment. As in the case of fixed effects, this assumption could be relaxed in cases where sufficient data are available. A further simplification that limits the generalizability of our model is that it is restricted to single-or two-arm trials. To extend the model to trials with three or more arms would require careful consideration of correlation in treatment effects across arms and within studies [5,50].
An assumption of our model that is common to most NMA models is the transitivity or consistency of treatment effects across studies. This is the assumption that studies informing the comparison of treatment A against treatment B and of treatment A against treatment C can be used to inform the comparison of B against C. Our evidence network was sparse and contained only one loop, making it impractical to test for consistency of direct and indirect evidence using node-splitting [58] or other measures of inconsistency [59,60]. If more studies became available, it would be recommended to test that comparing ERA + Pr to ERA using the direct evidence [44,46] gave similar results, within some range of acceptability, to performing the comparison with only indirect evidence.
The principle assumption that allowed the inclusion of single-arm before-and-after observational studies was that the placebo effects, the α i s, were exchangeable or that there was no a priori reason that there would be systematic differences between these effects. This assumption allowed us to model the α i s using a random effects distribution. We recommend the use of this assumption and our model in cases where networks are not densely populated or fully connected when restricted to RCT evidence, such as the PAH example. Decision makers would still need to give a recommendation on which treatment to use in such situations [13] and, indeed, in Australia the PBAC already considers non-randomized observational evidence, particularly in the absence of RCTs [1]. However, this type of evidence is considered to be weak and subject to bias by decision making bodies such as NICE [14]. Additionally, the GRADE scale, which is followed by PBAC, rates the quality of such as evidence as low [61]. In cases where networks can be densely populated and fully connected by RCT evidence, this assumption may shrink placebo effects towards the mean and thus interfere with randomization [24,25,51]. In those cases, we recommend treating the α i as independent fixed effects, or nuisance parameters, and not including observational evidence.
In the PAH application, clinical opinion supported the assumption of exchangeable placebo effects, although this assumption is not testable statistically. That no acrossstudy adjustments were included in the model selection step gave an indication that there were no systematic differences in these expected improvements and that our exchangeability assumption was warranted. The simple alternative of constructing a control arm for observational studies was investigated but was found to have little effect on the results. We also explored down-weighting the observational evidence and found, as expected, a reduction in the accuracy of our findings, but no change in the overall direction of the indirect comparisons results.
Our model can be criticized on the grounds that the single-arm studies contribute to the estimation of the distribution for the placebo effects α i . We considered an alternative formulation of our model where only the RCTs would contribute to this estimation and the α i s for the observational studies would be sampled separately from this distribution. This is the method proposed for baseline natural history models by Dias et al. [24]. However, our model is designed to be applied to cases where data would already be limited, such as the PAH example, so a further reduction of the evidence base would be undesirable, although in practice the contribution of the observational studies to the α i estimation will be limited.
A very simple alternative to a random effects assumption for the placebo effects is to use a single fixed effect α for the α i s. This is an assumption that all patient populations started on placebo have the same short term expected improvement in 6MWD and that any differences are due to the treatment or covariate effects. We repeated the NMA with this assumption and found that the results were similar in magnitude and direction to those of the random effects model and that the DIC was considerably higher, with 1889 for fixed effects versus 1870 for random effects. This DIC gives evidence in favor of our random effects model. The single fixed effect model was also not clinically plausible as there were many inherent differences in the studies so a common placebo effect would be difficult to justify.
Several additional sensitivity analyses were conducted. Firstly, as no across-study covariates were included, we applied our final NMA model to an evidence network which included the studies which were excluded due to non-reporting of covariates. This included one extra RCT [62] and three observational studies [63][64][65]. The results of this sensitivity analysis, not reported in this paper, were almost identical to those of the base case. Prior sensitivity analyses, where we tried prior distributions with greater variances, led us to conclude that the results were not dependent on our choice of prior parameters. Although non-normal priors could be easily implemented if the application required them, normal priors were judged to be appropriate for the continuous outcome of change in 6MWD through expert clinical opinion and exploratory analysis of the IMPRES data.
Aside from the extensions to multi-arm trials, separation of the placebo estimation between RCT and observational studies, and other possibilities so far discussed, there are a variety of directions for future extension of our methodology. One such direction would be to apply the model to non-continuous outcomes such as binary outcomes. NMA models combining IPD and aggregate data for binary outcomes have been discussed in the literature [6,7] and the use of a random effects model for placebo effects to include single-arm studies would be a straightforward extension. Our methods are also readily applicable to pairwise meta-analysis, as it was in this setting that the use of random effects modelling of placebo effects to include single-arm studies was first proposed [23]. Although in pairwise meta-analysis the model would no longer be justified on the grounds of completing evidence networks, it may be useful in cases where there are only a limited number of small RCTs and large, highquality single-arm studies are available. An additional direction for research is the joint network meta-analysis of multivariate outcomes, such as PVR and change in 6MWD in PAH [66]. This approach would treat all covariates as responses and would account for missing values, a reason for exclusion of several studies, through a form of multiple imputation. This would have the advantage of using the evidence more consistently, rather than our approach of singly imputing missing covariates, such as PVR in Jacobs et al. [43]. However, this extension would require a greater evidence base than was available for the PAH example.
All of the limitations we have discussed should be kept in mind if applying our model in order to avoid being misled by the results of an analysis in which observational evidence is included. We would recommend conducting the sensitivity analyses we have described to ensure the model and the implications of its various assumptions are fully understood.

Conclusions
We have developed an extension of existing NMA methodology to allow the completion of disconnected networks of RCT evidence through the inclusion of singlearm before-and-after observational studies. This model also brings together many recent developments in network meta-analysis of IPD and aggregate data. Our application to PAH demonstrated the utility of our methodology as comparisons impossible to conduct on the basis of RCTs alone could be conducted through the inclusion of observational studies. Although IPD and covariate adjustments were found to make little difference to the results, we believe this model could be easily applied to many other disease areas and settings which require the inclusion of observational evidence. Our work therefore furthers the range of evidence synthesis problems that can be approached through NMA.