Two-stage matching-adjusted indirect comparison
BMC Medical Research Methodology volume 22, Article number: 217 (2022)
Abstract
Background
Anchored covariate-adjusted indirect comparisons inform reimbursement decisions where there are no head-to-head trials between the treatments of interest, there is a common comparator arm shared by the studies, and there are patient-level data limitations. Matching-adjusted indirect comparison (MAIC), based on propensity score weighting, is the most widely used covariate-adjusted indirect comparison method in health technology assessment. MAIC has poor precision and is inefficient when the effective sample size after weighting is small.
Methods
A modular extension to MAIC, termed two-stage matching-adjusted indirect comparison (2SMAIC), is proposed. This uses two parametric models. One estimates the treatment assignment mechanism in the study with individual patient data (IPD), the other estimates the trial assignment mechanism. The first model produces inverse probability weights that are combined with the odds weights produced by the second model. The resulting weights seek to balance covariates between treatment arms and across studies. A simulation study provides proof-of-principle in an indirect comparison performed across two randomized trials. Nevertheless, 2SMAIC can be applied in situations where the IPD trial is observational, by including potential confounders in the treatment assignment model. The simulation study also explores the use of weight truncation in combination with MAIC for the first time.
Results
Despite enforcing randomization and knowing the true treatment assignment mechanism in the IPD trial, 2SMAIC yields improved precision and efficiency with respect to MAIC in all scenarios, while maintaining similarly low levels of bias. The two-stage approach is effective when sample sizes in the IPD trial are low, as it controls for chance imbalances in prognostic baseline covariates between study arms. It is not as effective when overlap between the trials’ target populations is poor and the extremity of the weights is high. In these scenarios, truncation leads to substantial precision and efficiency gains but induces considerable bias. The combination of a two-stage approach with truncation produces the highest precision and efficiency improvements.
Conclusions
Two-stage approaches to MAIC can increase precision and efficiency with respect to the standard approach by adjusting for empirical imbalances in prognostic covariates in the IPD trial. Further modules could be incorporated for additional variance reduction or to account for missingness and non-compliance in the IPD trial.
Background
In many countries, health technology assessment (HTA) addresses whether new treatments should be reimbursed by public health care systems [1]. This often requires estimating relative effects for interventions that have not been directly compared in a head-to-head trial [2]. Consider that there are two active treatments of interest, say A and B, that have not been evaluated in the same study, but have been contrasted against a comparator C in different studies. In this situation, an indirect comparison of relative treatment effect estimates is required. The analysis is said to be anchored by the common comparator C.
A typical situation in HTA is that where a pharmaceutical company has individual patient data (IPD) from its own study comparing A versus C, which we shall denote the index trial, but only published aggregate-level data (ALD) from another study comparing B versus C, which we call the competitor trial. In this two-study scenario, cross-trial imbalances in effect measure modifiers, effect modifiers for short, make the standard indirect treatment comparisons [3] vulnerable to bias [4]. Novel covariate-adjusted indirect comparison methods have been introduced to account for these imbalances and provide equipoise to the comparison [5,6,7,8,9].
The most popular methodology [10] in peer-reviewed publications and submissions for reimbursement is matching-adjusted indirect comparison (MAIC) [11,12,13]. MAIC weights the subjects in the index trial to create a “pseudo-sample” with balanced moments with respect to the competitor trial. The standard formulation of MAIC proposed by Signorovitch et al. [11] uses a method of moments to estimate a logistic regression, which models the trial assignment mechanism. The weights are derived from the fitted model and represent the odds of assignment to the competitor trial for the subjects in the IPD, conditional on selected baseline covariates.
When its assumptions hold, MAIC has produced unbiased treatment effect estimation in simulation studies [7, 14,15,16,17,18,19,20]. Nevertheless, there are some concerns about its inefficiency and instability, particularly where covariate overlap is poor and effective sample sizes (ESSs) after weighting are small [21]. These scenarios are pervasive in health technology appraisals [10]. In these cases, weighting methods are sensitive to inordinate influence by a few subjects with extreme weights and are vulnerable to poor precision. A related concern is that feasible numerical solutions may not exist where there is no common covariate support [21, 22]. Where overlap is weak, methods based on modeling the outcome expectation exhibit greater precision and efficiency than MAIC [21, 23,24,25] but are prone to extrapolation, which may lead to severe bias under model misspecification [26, 27].
Consequently, modifications of MAIC that seek to maximize precision have been presented. An alternative implementation estimates the weights using entropy balancing [17, 28]. The proposal is similar to the standard method of moments, with the additional constraint that the weights are as close as possible to unit weights, potentially penalizing extreme weighting schemes. While the approach has appealing computational properties, Phillippo et al. have proved that it is mathematically equivalent to the standard method of moments [29].
More recently, Jackson et al. have developed a distinct weight estimation procedure that satisfies the conventional method of moments while explicitly maximizing the ESS [22]. This translates into minimizing the dispersion of the weights, with more stable weights improving precision at the expense of inducing bias.
Other approaches to limit the undue impact of extreme weights involve truncating or capping the weights. These are common in survey sampling [30] and in many propensity score settings [31, 32] but are yet to be investigated specifically alongside MAIC. Again, a clear trade-off is involved from a bias-variance standpoint. Lower variance comes at the cost of sacrificing balance and accepting bias [33, 34]. Limitations of weight truncation are that it shifts the target population or estimand definition, and that it requires arbitrary ad hoc decisions on cut-off thresholds.
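As an illustration of weight truncation, the following minimal Python sketch caps every weight at an upper empirical percentile of the weights. The function name, the nearest-rank percentile rule and the 95th-percentile default are assumptions made for illustration only; the text does not prescribe a particular threshold.

```python
import math


def truncate_weights(weights, upper_quantile=0.95):
    """Cap weights at an upper empirical quantile (nearest-rank rule).

    Hypothetical helper: the threshold is an arbitrary analyst choice.
    """
    if not weights:
        raise ValueError("weights must be non-empty")
    if not 0.0 < upper_quantile <= 1.0:
        raise ValueError("upper_quantile must lie in (0, 1]")
    ordered = sorted(weights)
    # Nearest-rank quantile: the smallest observed weight such that at
    # least upper_quantile * n of the weights are at or below it.
    rank = max(1, math.ceil(upper_quantile * len(ordered)))
    cap = ordered[rank - 1]
    # Weights above the cap are set to the cap; all others are unchanged.
    return [min(w, cap) for w in weights]
```

Capping leaves most weights untouched and only pulls in the largest ones, which is why it reduces variance while no longer balancing the covariate moments exactly.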
In order to gain efficiency, I propose a modular extension to MAIC which uses two parametric models. One estimates the treatment assignment mechanism in the index study, the other estimates the trial assignment mechanism. The first model produces inverse probability of treatment weights that are combined with the weights produced by the second model. I term this approach two-stage matching-adjusted indirect comparison (2SMAIC).
In the anchored scenario, the conventional version of MAIC relies on randomization in the index trial. In this setting, the treatment assignment mechanism (the true conditional probability of treatment among the subjects) is typically known. In addition, randomization ensures that there is no confounding on expectation. Therefore, it may seem counterintuitive to model the treatment assignment mechanism in this study. Nevertheless, this additional step is beneficial to control for finite-sample imbalances in prognostic baseline covariates. These imbalances often arise due to chance and correcting for them leads to efficiency gains.
An advantage of 2SMAIC is that, due to incorporating a treatment assignment model, it is also applicable where the index study is observational. In this case, within-study randomization is not leveraged and concerns about internal validity must be addressed by including potential confounders of the treatment-outcome association in the treatment assignment model. The estimation procedure for the trial assignment weights does not necessarily need to be that of Signorovitch et al. [11] and alternative methods could be used [16, 22]. Further modules could be incorporated to account for missingness [35] and non-compliance [36], e.g. dropout or treatment switching, in the index trial.
I conduct a proof-of-concept simulation study to examine the finite-sample performance of 2SMAIC with respect to the standard MAIC when the index study is an RCT. The two-stage approach improves the precision and efficiency of MAIC without introducing bias. The results are consistent with previous research on the efficiency of propensity score estimators [37, 38]. Finally, the use of weight truncation in combination with MAIC is explored for the first time. Example code to implement the methodologies in R is provided in Additional file 1.
Methods
Context and data structure
We focus on the following setting, which is common in submissions to HTA agencies. Let S and T denote indicators for the assigned study and the assigned treatment, respectively. There are two separate studies that enrolled distinct sets of participants and have now been completed. The index study (S=1) compares active treatment A (T=1) versus C (T=0), e.g. standard of care or placebo. The competitor study (S=2) evaluates active treatment B (T=2) versus C (T=0). Covariate-adjusted indirect comparisons such as MAIC perform a treatment comparison in the S=2 sample, implicitly assumed to be of policy interest. We ask ourselves the question: what would be the marginal treatment effect for A versus B had these treatments been compared in an RCT conducted in S=2?
The marginal treatment effect for A vs. B is estimated on the linear predictor (e.g. mean difference, log-odds ratio or log hazard ratio) scale as:
\[ \hat{\Delta}_{12}^{(2)} = \hat{\Delta}_{10}^{(2)} - \hat{\Delta}_{20}^{(2)}, \tag{1} \]
where \(\hat {\Delta }_{10}^{(2)}\) is an estimate of the hypothetical marginal treatment effect for A vs. C in the competitor study sample, and \(\hat {\Delta }_{20}^{(2)}\) is an estimate of the marginal treatment effect of B vs. C in the competitor study sample. MAIC uses weighting to transport inferences for the marginal A vs. C treatment effect from S=1 to S=2. The estimate \(\hat {\Delta }_{10}^{(2)}\) is produced, which is then input into Eq. 1. Because the within-trial relative effect estimates are assumed statistically independent, their variances are summed to estimate the variance of the marginal treatment effect for A vs. B.
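The anchored comparison described above amounts to a difference of two independent estimates with summed variances. A minimal Python sketch follows; the function and argument names are hypothetical, and the normal-approximation (Wald) interval with a 1.96 critical value is an assumption added for illustration rather than something stated in the text.

```python
import math


def anchored_indirect_comparison(delta_10, var_10, delta_20, var_20, z_crit=1.96):
    """Combine A vs. C and B vs. C estimates into an A vs. B estimate.

    All inputs are on the linear predictor scale and refer to the
    competitor study sample (S=2).
    """
    # Point estimate: difference of the two relative effects.
    delta_12 = delta_10 - delta_20
    # The two estimates are assumed independent, so variances add.
    var_12 = var_10 + var_20
    se_12 = math.sqrt(var_12)
    # Assumed Wald-type interval; not prescribed by the text.
    interval = (delta_12 - z_crit * se_12, delta_12 + z_crit * se_12)
    return delta_12, var_12, interval
```

For example, with A vs. C estimated as -1.0 (variance 0.04) and B vs. C as -0.4 (variance 0.05), the A vs. B estimate is -0.6 with variance 0.09.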
The manufacturer submitting evidence for reimbursement has access to individual-level data \(\mathcal {D}_{AC}=({\boldsymbol {x},\boldsymbol {t},\boldsymbol {y}})\) on covariates, treatment and outcomes for the participants in its trial. Here, x is a matrix of pre-treatment baseline covariates (e.g. comorbidities, age, gender), of size n×k, where n is the total number of subjects in the study sample and k is the number of covariates. A row vector x_{i}=(x_{i,1},x_{i,2},…,x_{i,k}) of k covariates is recorded for each participant i=1,…,n. We let y=(y_{1},y_{2},…,y_{n}) denote a vector of the clinical outcome of interest and t=(t_{1},t_{2},…,t_{n}) denote a binary treatment indicator vector. We shall assume that there is no loss to follow-up or missing data on covariates, treatment and outcome in \(\mathcal {D}_{AC}\).
We consider all baseline covariates to be prognostic of the clinical outcome and select a subset of these, z⊆x, as marginal effect modifiers for A with respect to C on the linear predictor scale, with a row vector z_{i} recorded for each patient i. In the absence of randomization, the variables in x would induce confounding between the treatment arms in the index study (internal validity bias). On the other hand, cross-trial imbalances in the variables in z induce external validity bias with respect to the competitor study sample.
Neither the manufacturer submitting the evidence nor the HTA agency evaluating it has access to IPD for the competitor trial. We let \(\mathcal {D}_{BC}=[\boldsymbol {\theta }_{\boldsymbol {x}}, \hat {\Delta }_{20}^{(2)}, \hat {V}(\hat {\Delta }_{20}^{(2)})]\) represent the published ALD that is available for this study. No patient-level covariates, treatment or outcomes are available. Here, θ_{x} denotes a vector of means or proportions for the covariates, although higher-order moments such as variances may also be available. An assumption is that a sufficiently rich set of baseline covariates has been measured for the competitor study. Namely, that summaries for the subset θ_{z}⊆θ_{x} of covariates that are marginal effect modifiers are described in the table of baseline characteristics in the study publication.
Also available is an internally valid estimate \(\hat {\Delta }_{20}^{(2)}\) of the marginal treatment effect for B vs. C in the competitor study sample, and an estimate \(\hat {V}(\hat {\Delta }_{20}^{(2)})\) of its variance. These are either directly reported in the publication or, assuming that the competitor study is a well-conducted RCT, derived from crude aggregate outcomes in the literature.
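Matching-adjusted indirect comparison
In MAIC, IPD from the index study are weighted so that the moments of selected covariates are balanced with respect to the published moments of the competitor study. The weight w_{i} for each participant i in the index trial is estimated using a logistic regression:
\[ \ln (w_{i}) = \ln [w(\boldsymbol{z}_{i})] = \ln \left[ \frac{Pr(S=2 \mid \boldsymbol{z}_{i})}{1-Pr(S=2 \mid \boldsymbol{z}_{i})} \right] = \alpha_{0} + \boldsymbol{z}_{i} \boldsymbol{\alpha}_{\boldsymbol{1}}, \tag{2} \]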
where α_{0} is the model intercept and α_{1} is a vector of model coefficients. While most applications of weighting, e.g. to control for confounding in observational studies, construct “inverse probability” weights for treatment assignment, MAIC uses “odds weighting” [39, 40] to model trial assignment. The weight w_{i} represents the conditional odds that an individual i with covariates z_{i}, selected as marginal effect modifiers, is enrolled in the competitor study. Alternatively, the weight represents the inverse conditional odds that the individual is enrolled in the index study.
The logistic regression parameters in Eq. 2 cannot be derived using conventional methods such as maximum-likelihood estimation, due to unavailable IPD for the competitor trial. Signorovitch et al. propose using a method of moments instead to enforce covariate balance across studies [11]. Prior to balancing, the IPD covariates are centered on the means or proportions published for the competitor trial. The centered covariates for subject i in the IPD are defined as \(\boldsymbol {z}^{\boldsymbol {*}}_{i} = \boldsymbol {z}_{i} - \boldsymbol {\theta }_{\boldsymbol {z}}\).
Weight estimation involves minimizing the objective function:
\[ Q(\boldsymbol{\alpha}_{\boldsymbol{1}}) = \sum_{i=1}^{n} \exp \left( \boldsymbol{z}^{\boldsymbol{*}}_{i} \boldsymbol{\alpha}_{\boldsymbol{1}} \right). \tag{3} \]
The function Q(α_{1}) is convex [11] and can be minimized using standard convex optimization algorithms [41]. Provided that there is adequate overlap, minimization yields the unique finite solution: \(\hat {\boldsymbol {\alpha }}_{\boldsymbol {1}}=\text {argmin}[Q(\boldsymbol {\alpha }_{\boldsymbol {1}})]\). Feasible solutions do not exist if all the values observed for a covariate in z are greater than, or all are less than, its corresponding element in θ_{z} [22].
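After minimizing the objective function in Eq. 3, the weight estimated for the i-th participant in the IPD is:
\[ \hat{w}_{i} = \exp \left( \boldsymbol{z}^{\boldsymbol{*}}_{i} \hat{\boldsymbol{\alpha}}_{\boldsymbol{1}} \right). \tag{4} \]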
The estimated weights are relative, in the sense that any weights that are proportional are equally valid [22]. Weighting reduces the ESS of the index trial. The approximate ESS after weighting is typically estimated as \(\left (\sum _{i}^{n}\hat {w}_{i}\right)^{2}/\sum _{i}^{n}\hat {w}_{i}^{2}\) [5, 42]. Low values of the ESS suggest that a few influential participants with disproportionate weights dominate the reweighted sample.
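The weight estimation and ESS calculation can be sketched in Python as follows. This is a minimal illustration, not the implementation in Additional file 1: it assumes the method-of-moments objective is the sum of exponentiated linear predictors of the centered covariates, as in Signorovitch et al. [11], and it minimizes that objective with a damped Newton method chosen here for brevity. The function names and the small linear solver are hypothetical helpers.

```python
import math


def _solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular Hessian: covariates may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def maic_weights(z, theta_z, max_iter=100, tol=1e-10):
    """Estimate odds weights by minimizing Q(alpha) = sum_i exp(z*_i . alpha)."""
    k = len(theta_z)
    # Center the IPD covariates on the published competitor-trial means.
    zc = [[row[j] - theta_z[j] for j in range(k)] for row in z]

    def raw_weights(alpha):
        return [math.exp(sum(r[j] * alpha[j] for j in range(k))) for r in zc]

    alpha = [0.0] * k
    for _ in range(max_iter):
        w = raw_weights(alpha)
        # Gradient of Q: weighted sum of centered covariates (zero at balance).
        grad = [sum(wi * r[j] for wi, r in zip(w, zc)) for j in range(k)]
        if max(abs(g) for g in grad) < tol:
            break
        hess = [[sum(wi * r[j] * r[l] for wi, r in zip(w, zc)) for l in range(k)]
                for j in range(k)]
        direction = _solve(hess, grad)
        q_now, step = sum(w), 1.0
        # Halve the Newton step until the objective does not increase.
        while step > 1e-8:
            cand = [alpha[j] - step * direction[j] for j in range(k)]
            try:
                q_cand = sum(raw_weights(cand))
            except OverflowError:
                q_cand = float("inf")
            if q_cand <= q_now:
                break
            step /= 2
        alpha = cand
    return raw_weights(alpha), alpha


def effective_sample_size(weights):
    """Approximate ESS: (sum of weights)^2 / (sum of squared weights)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)
```

At the minimum the gradient is zero, which is exactly the condition that the weighted covariate means in the IPD equal the published means. If the published mean of any covariate lies outside the range observed in the IPD, no finite minimum exists and the iterations will not converge.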
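Consequently, marginal mean outcomes for treatments A and C in the competitor study sample (S=2) are estimated as the weighted average:
\[ \hat{\mu}_{t}^{(2)} = \frac{\sum_{i=1}^{n_{t}} y_{i,t} \hat{w}_{i,t}}{\sum_{i=1}^{n_{t}} \hat{w}_{i,t}}, \tag{5} \]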
where n_{t} denotes the number of participants assigned to treatment t∈{0,1} of the index trial, y_{i,t} represents the observed clinical outcome for subject i in arm t, and \(\hat {w}_{i,t}\) is the weight assigned to patient i under treatment t. For binary outcomes, \(\hat {\mu }_{t}\) would estimate the expected marginal outcome probability under treatment t. Absolute outcome estimates may be desirable as inputs to health economic models [25] or in unanchored comparisons made in the absence of a common control group.
In anchored comparisons, the objective is to estimate a relative effect for A vs. C, as opposed to absolute outcomes. Indirect treatment comparisons are typically conducted on the linear predictor scale [3, 4, 6]. Consequently, this scale is also used to define effect modification, which is scale specific [5].
One can convert the mean absolute outcome predictions produced by Eq. 5 from the natural scale to the linear predictor scale, and compute the marginal treatment effect for A vs. C in S=2 as the difference between the average linear predictions:
\[ \hat{\Delta}_{10}^{(2)} = g\left( \hat{\mu}_{1}^{(2)} \right) - g\left( \hat{\mu}_{0}^{(2)} \right). \tag{6} \]
Here, g(·) is an appropriate link function, e.g. the identity link produces a mean difference for continuous-valued outcomes, and the \(\text {logit} \left (\hat {\mu }^{(2)}_{t} \right) = \ln \left [\hat {\mu }^{(2)}_{t}/\left (1-\hat {\mu }^{(2)}_{t} \right)\right ]\) generates a log-odds ratio for binary outcomes. Different, potentially more interpretable, choices such as relative risks and risk differences are possible for the marginal contrast. One can map to these scales by manipulating \(\hat {\mu }_{1}^{(2)}\) and \(\hat {\mu }_{0}^{(2)}\) differently.
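The weighted averages and the contrast on the linear predictor scale can be sketched in Python as below. The function names are hypothetical, treatment is assumed to be coded 1 for A and 0 for C, and the weights are taken as given.

```python
import math


def weighted_mean(y, w):
    """Weighted average of outcomes y with weights w."""
    return sum(yi * wi for yi, wi in zip(y, w)) / sum(w)


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def identity(m):
    return m


def marginal_effect(y, t, w, link=identity):
    """Difference in link-transformed weighted mean outcomes, A (t=1) vs. C (t=0)."""
    arm1 = [(yi, wi) for yi, ti, wi in zip(y, t, w) if ti == 1]
    arm0 = [(yi, wi) for yi, ti, wi in zip(y, t, w) if ti == 0]
    mu_1 = weighted_mean([a[0] for a in arm1], [a[1] for a in arm1])
    mu_0 = weighted_mean([a[0] for a in arm0], [a[1] for a in arm0])
    return link(mu_1) - link(mu_0)
```

Passing `link=logit` with binary outcomes gives a marginal log-odds ratio, while the default identity link gives a mean difference (or a risk difference for binary outcomes).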
Alternatively, the weights generated by Eq. 4 can be used to fit a simple regression of outcome on treatment to the IPD [43]. The model can be fitted using maximum-likelihood estimation, weighting the contribution of each individual i to the likelihood by \(\hat {w}_{i}\). In this approach, the treatment coefficient of the fitted weighted model is the estimated marginal treatment effect \(\hat {\Delta }_{10}^{(2)}\) for A vs. C in S=2.
The original approach to MAIC uses a robust sandwich-type variance estimator [44] to compute the standard error of \(\hat {\Delta }_{10}^{(2)}\). This relies on large-sample properties and has understated variability with small ESSs in a previous simulation study investigating MAIC [7] and in other settings [45,46,47,48]. In addition, most implementations of the sandwich estimator, e.g. when fitting the weighted regression [49], ignore the estimation of the trial assignment model, assuming the weights to be fixed quantities. While analytic expressions that incorporate the estimation of the weights could be derived, a practical alternative is to resample via the ordinary nonparametric bootstrap [23, 50, 51], re-estimating the weights and the marginal treatment effect for A vs. C in each bootstrap iteration. Point estimates, standard errors and interval estimates can be directly calculated from the bootstrap replicates.
We briefly describe the assumptions required by MAIC and their implications:

1. Internal validity of the effect estimates derived from the index and competitor studies. This is certainly feasible where the studies are RCTs because randomization ensures exchangeability over treatment assignment on expectation. While internal validity may hold in RCTs, it is a more stringent condition for observational studies. The absence of informative measurement error, missing data, non-adherence, etc. is assumed.

2. Consistency under parallel studies [52]. There is only one well-defined version of each treatment [53] or any variations in the versions of treatment are irrelevant [54, 55]. This applies to the common comparator C in particular.

3. Conditional transportability (exchangeability) of the marginal treatment effect for A vs. C from the index to the competitor study [39]. Namely, trial assignment does not affect this measure, conditional on z. Prior research has referred to this assumption as the conditional constancy of relative effects [5, 6, 9]. It is plausible if z comprises all of the covariates that are considered to modify the marginal treatment effect for A vs. C (i.e., there are no unmeasured effect modifiers) [56,57,58]^{Footnote 1}.

4. Sufficient overlap. The ranges of the selected covariates in S=1 should cover their respective moments in S=2. Overlap violations can be deterministic or random. The former arise structurally, due to non-overlapping trial target populations (eligibility criteria). The latter arise empirically due to chance, particularly where sample sizes are small [60]. Therefore, overlap can be assessed based on absolute sample sizes. The ESS is a convenient one-number diagnostic.

5. Correct specification of the S=2 covariate distribution. Analysts can only approximate the joint distribution because IPD are unavailable for the competitor study. Covariate correlations are rarely published for S=2 and therefore cannot be balanced by MAIC. In that case, they are assumed equal to those in the pseudo-sample formed by weighting the IPD [5].
I make a brief remark on the specification of the parametric trial assignment model in Eq. 2. This does not necessarily need to be correct as long as it balances all the covariates, and potential transformations of these covariates, e.g. polynomial transformations and product terms, that modify the marginal treatment effect for A vs. C [9, 23]. Squared terms are often included to balance variances for continuous covariates [11] but initial simulation studies do not report performance benefits [14, 17]. This is probably due to greater reductions in ESS and precision [25].
The identification of effect modifiers will likely require prior background knowledge and substantive domain expertise. Bias-variance trade-offs are also important. Failing to include an influential effect modifier in z, whether in imbalance or not, leads to bias in S=2 [5, 40, 61]. On the other hand, the inclusion of covariates that are not effect modifiers reduces overlap, thereby increasing the chance of extreme weights. This decreases precision without improving the potential for bias reduction [6, 62], even if the covariates are strongly imbalanced across studies. That is, even if they predict or are associated with trial assignment.
Put simply, as is the case for other weighting-based methods [63, 64], MAIC is potentially unbiased if either the trial assignment mechanism or the outcome-generating mechanism is known, with the latter leading to better performance due to reduced variance and increased efficiency.
Two-stage matching-adjusted indirect comparison
While the standard MAIC models the trial assignment mechanism, two-stage MAIC (2SMAIC) additionally models the treatment assignment mechanism in the index trial. The treatment assignment model is estimated to produce inverse probability of treatment weights. Then, these are combined with the odds weights generated by the standard MAIC. The resulting weights seek to balance covariate moments between the studies and the treatment arms of the index trial.
For the treatment assignment mechanism, a propensity score logistic regression of treatment on the covariates is fitted to the IPD:
\[ \text{logit}[e_{i}] = \text{logit}[e(\boldsymbol{x}_{i})] = \text{logit}[Pr(T=1 \mid \boldsymbol{x}_{i})] = \beta_{0} + \boldsymbol{x}_{i} \boldsymbol{\beta}_{\boldsymbol{1}}, \tag{7} \]
where β_{0} and β_{1} parametrize the logistic regression. The propensity score e_{i} is defined as the conditional probability that participant i is assigned treatment A versus treatment C given measured covariates x_{i} [65].
Having fitted the model in Eq. 7, e.g. using maximum-likelihood estimation, propensity scores for the subjects in the index trial are predicted using:
\[ \hat{e}_{i} = \text{expit} \left[ \hat{\beta}_{0} + \boldsymbol{x}_{i} \hat{\boldsymbol{\beta}}_{\boldsymbol{1}} \right], \tag{8} \]
where \(\text {expit}(\cdot)=\exp (\cdot)/[1+\exp (\cdot)]\), \(\hat {\beta }_{0}\) and \(\hat {\boldsymbol {\beta }}_{\boldsymbol {1}}\) are point estimates of the logistic regression parameters, and \(\hat {e}_{i}\) is an estimate of e_{i}. Inverse probability of treatment weights are constructed by taking the reciprocal of the estimated conditional probability of the treatment assigned in the index study [37]. That would be \(1/\hat {e}_{i}\) for units under treatment A and \(1/(1-\hat {e}_{i})\) for units under treatment C.
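The treatment assignment stage can be sketched in Python as follows. This is an illustrative sketch rather than the code in Additional file 1: it fits the logistic regression by Newton-Raphson on the log-likelihood, which is one standard way to obtain maximum-likelihood estimates, and all function names are hypothetical. Treatment is assumed to be coded 1 for A and 0 for C.

```python
import math


def expit(v):
    """Numerically stable inverse logit."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    ev = math.exp(v)
    return ev / (1.0 + ev)


def _solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular information matrix (e.g. separation)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_propensity_scores(x, t, max_iter=50, tol=1e-10):
    """Fit logit Pr(T=1 | x) = b0 + x b1 and return fitted propensity scores."""
    design = [[1.0] + list(row) for row in x]  # prepend the intercept column
    p = len(design[0])
    beta = [0.0] * p
    for _ in range(max_iter):
        e = [expit(sum(b * d for b, d in zip(beta, row))) for row in design]
        # Score vector and Fisher information of the logistic log-likelihood.
        score = [sum((ti - ei) * row[j] for ti, ei, row in zip(t, e, design))
                 for j in range(p)]
        if max(abs(s) for s in score) < tol:
            break
        info = [[sum(ei * (1 - ei) * row[j] * row[l] for ei, row in zip(e, design))
                 for l in range(p)] for j in range(p)]
        delta = _solve(info, score)
        beta = [b + d for b, d in zip(beta, delta)]
    return [expit(sum(b * d for b, d in zip(beta, row))) for row in design]


def ipt_weights(e_hat, t):
    """Reciprocal of the estimated probability of the treatment actually assigned."""
    return [1.0 / ei if ti == 1 else 1.0 / (1.0 - ei) for ei, ti in zip(e_hat, t)]
```

With a single binary covariate the model is saturated, so the fitted propensity scores equal the observed treated proportions within each covariate level, which gives a quick sanity check of the fit.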
Consequently, the weights produced by the standard MAIC (Eq. 4) are rescaled by the estimated inverse probability of treatment weights. The contribution of each subject i in the IPD is weighted by:
\[ \hat{\omega}_{i} = \frac{t_{i} \hat{w}_{i}}{\hat{e}_{i}} + \frac{(1-t_{i}) \hat{w}_{i}}{1-\hat{e}_{i}}. \tag{9} \]
The weights \(\{ \hat {w}_{i}, i=1,\dots,n \}\) estimated by the standard MAIC are odds, constrained to be positive. These balance the index and competitor studies in terms of the selected effect modifier moments. The estimated propensity scores \(\{ \hat {e}_{i},\, i=1,\dots,n \}\) are probabilities bounded away from zero and one. Therefore, the weights \(\{ \hat {\omega }_{i},\, i=1,\dots,n \}\) produced by 2SMAIC in Eq. 9 are constrained to be positive. These weights achieve balance in effect modifier moments across studies, but also seek to balance covariate moments between the index trial’s treatment groups.
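The rescaling step is a one-line computation once the two sets of estimates are available. A minimal Python sketch, with a hypothetical function name and treatment assumed to be coded 1 for A and 0 for C:

```python
def two_stage_weights(w_hat, e_hat, t):
    """Rescale trial-assignment odds weights by inverse probability of treatment."""
    omega = []
    for wi, ei, ti in zip(w_hat, e_hat, t):
        if not 0.0 < ei < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        # Treated units are divided by e_i, control units by (1 - e_i).
        omega.append(wi / ei if ti == 1 else wi / (1.0 - ei))
    return omega
```

If the propensity scores are fixed at the design value, e.g. 0.5 for every subject in a 1:1 trial, each combined weight is simply twice the MAIC weight. Because the weights are relative, the two-stage estimator then coincides with the standard MAIC; any difference between the two methods comes from using estimated rather than known propensity scores.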
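Marginal mean outcomes for treatments A and C in the competitor study sample are estimated as the weighted average of observed outcomes:
\[ \hat{\mu}_{t}^{(2)} = \frac{\sum_{i=1}^{n_{t}} y_{i,t} \hat{\omega}_{i,t}}{\sum_{i=1}^{n_{t}} \hat{\omega}_{i,t}}, \tag{10} \]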
where \(\hat {\omega }_{i,t}\) is the weight assigned to patient i under treatment t. One can convert the mean absolute outcome predictions generated by Eq. 10 to the linear predictor scale, and compute the marginal treatment effect for A vs. C in S=2 as the difference between the average linear predictions, as per Eq. 6. Alternatively, a weighted regression of outcome on treatment alone can be fitted to the IPD, in which case the treatment coefficient of the fitted model represents the estimated marginal treatment effect \(\hat {\Delta }_{10}^{(2)}\) for A vs. C in S=2.
Inference can be based on a robust sandwich-type variance estimator or on resampling approaches such as the nonparametric bootstrap. As noted previously, the sandwich variance estimator is biased downwards when the ESS after weighting is small, leading to overprecision. In practice, the nonparametric bootstrap is a preferred option, re-estimating both the trial assignment model and the treatment assignment model in each iteration. This approach explicitly accounts for the estimation of the weights and is expected to perform better where the ESS is small.
It may seem counterintuitive to estimate the treatment assignment mechanism when the index trial is an RCT. The randomized design implies that the true propensity scores {e_{i}, i=1,…,n} are fixed and known. For instance, consider a marginally randomized two-arm trial with a 1:1 treatment allocation ratio. The trial investigators have determined in advance that the probability of being assigned to active treatment versus control is e_{i}=0.5 for all i.
The rationale for estimating the propensity scores is the following. Randomization guarantees that there is no confounding on expectation [66]. Nevertheless, covariate balance is a large-sample property, and one may still observe residual covariate imbalances between treatment groups due to chance, especially when the trial sample size is small [67]. As formulated by Senn [66], “over all randomizations the groups are balanced; for a particular randomization they are unbalanced.” The use of estimated propensity scores makes it possible to correct for random finite-sample imbalances in prognostic baseline covariates. In the RCT literature, inverse probability of treatment weighting is an established approach for covariate adjustment [68], and has increased precision, efficiency and power with respect to unadjusted analyses in the estimation of marginal treatment effects [48, 69].
So far, the use of anchored MAIC has been limited to situations where the index trial is an RCT. 2SMAIC can be used when the index study is observational, provided that the baseline covariates in x offer sufficient control for confounding. In non-randomized studies, the true propensity score for each participant in the index study is unknown, and additional conditions are needed to produce internally valid estimates of the marginal treatment effect for A vs. C. These are: (1) conditional exchangeability over treatment assignment [70]; and (2) positivity of treatment assignment [60]. Randomized trials tend to meet these assumptions by design. The assumptions have conceptual parallels with the conditional transportability and overlap conditions previously described for MAIC.
The first assumption indicates that the potential outcomes of subjects in each treatment group are independent of the treatment assigned after conditioning on the selected covariates. It relies on all confounders of the effect of treatment on outcome being measured and accounted for [71]. The second assumption indicates that, for every participant in the index study, the probability of being assigned to either treatment is positive, conditional on the covariates selected to ensure exchangeability [60]. This requires overlap between the joint covariate distributions of the subjects under treatment A and under treatment C. This assumption is threatened if there are few or no individuals from either treatment group in certain covariate subgroups/strata.
Simulation study
Aims
The objectives of the simulation study are to provide proof-of-principle for 2SMAIC and to benchmark its statistical performance against that of MAIC in an anchored setting where the index study is an RCT. We also investigate whether weight truncation can improve the performance of MAIC and 2SMAIC by reducing the variance caused by extreme weights.
Each method is assessed using the following frequentist characteristics [72]: (1) unbiasedness; (2) precision; (3) efficiency (accuracy); and (4) randomization validity (valid confidence interval estimates). The selected performance metrics specifically evaluate these criteria. The ADEMP (Aims, Data-generating mechanisms, Estimands, Methods, Performance measures) framework [72] is used to describe the simulation study design. Example R code implementing the methodologies is provided in Additional file 1. All simulations and analyses have been conducted in R software version 4.1.1 [73]^{Footnote 2}.
Data-generating mechanisms
We consider continuous outcomes using the mean difference as the measure of effect. For the index and competitor studies, outcome y_{i} for participant i is generated as:
\[ y_{i} = \beta_{0} + \boldsymbol{x}_{i} \boldsymbol{\beta}_{\boldsymbol{1}} + \left( \beta_{t} + \boldsymbol{x}_{i} \boldsymbol{\beta}_{\boldsymbol{2}} \right) \mathbb{1}(t_{i}=1) + \epsilon_{i}, \tag{11} \]
using the notation of the index study data. Each x_{i} contains the values of three correlated continuous covariates, which have been simulated from a multivariate normal distribution with pre-specified means and covariance matrix. There is some positive correlation between the three covariates, with pairwise Pearson correlation levels set to 0.2. The covariates have main effects and are prognostic of individual-level outcomes independently of treatment. They also have first-order covariate-treatment product terms, thereby modifying the conditional (and marginal) effects of both A and B versus C on the mean difference scale, i.e., z is equivalent to x. The term ε_{i} is an error term for subject i generated from a standard (zero-mean, unit-variance) normal distribution.
The main “prognostic” coefficient β_{1,k}=2 for each covariate k. This is considered a strong covariate-outcome association. The interaction coefficient β_{2,k}=1 for each covariate k, indicating notable effect modification. We set the intercept β_{0}=5. Active treatments A and B are assumed to have the same set of effect modifiers with respect to the common comparator, and identical interaction coefficients for each effect modifier. Consequently, the shared (conditional) effect modifier assumption holds [5]. The main treatment coefficient β_{t}=−2 is considered a strong conditional treatment effect versus the control at baseline (when the covariate values are zero).
The continuous outcome may represent a biomarker indicating disease severity. The covariates are comorbidities associated with higher values of the biomarker and which interact with the active treatments to hinder their effect versus the control.
It is assumed that the index and competitor studies are simple, marginally randomized, RCTs. The number of participants in the competitor RCT is 300, with a 1:1 allocation ratio for active treatment vs. control. For this study, individual-level covariates are summarized as means. These would be available to the analyst in a table of baseline characteristics in the study publication. Individual-level outcomes are aggregated by fitting a simple linear regression of outcome on treatment to produce an unadjusted estimate of the marginal mean difference for B vs. C, with its corresponding nominal standard error. This information would also be available in the published study.
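The aggregation of the competitor study can be sketched as follows. This is a minimal illustration, assuming patient-level rows stored as dictionaries with keys `t`, `x` and `y` (a hypothetical layout). It relies on the fact that an ordinary least squares regression of outcome on a binary treatment returns the difference in arm means, with a nominal standard error based on the pooled residual variance.

```python
import math

def summarize_competitor(rows):
    """Reduce patient-level data to the summaries a publication would report."""
    n = len(rows)
    k = len(rows[0]["x"])
    covariate_means = [sum(r["x"][j] for r in rows) / n for j in range(k)]
    y1 = [r["y"] for r in rows if r["t"] == 1]
    y0 = [r["y"] for r in rows if r["t"] == 0]
    mean1, mean0 = sum(y1) / len(y1), sum(y0) / len(y0)
    # OLS slope on a binary regressor = difference in arm means; its nominal
    # standard error uses the pooled residual variance with n - 2 df.
    rss = sum((y - mean1) ** 2 for y in y1) + sum((y - mean0) ** 2 for y in y0)
    sigma2 = rss / (n - 2)
    se = math.sqrt(sigma2 * (1.0 / len(y1) + 1.0 / len(y0)))
    return {"covariate_means": covariate_means,
            "estimate": mean1 - mean0, "se": se}

# Toy data, for illustration only.
toy = ([{"t": 1, "x": [0.5], "y": y} for y in (1.0, 2.0, 3.0)]
       + [{"t": 0, "x": [0.7], "y": y} for y in (0.0, 1.0, 2.0)])
summary = summarize_competitor(toy)
```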
We adopt a factorial arrangement using two index trial sample sizes times three overlap settings. This results in a total of six simulation scenarios. The following parameter values are varied:

Sample sizes of n∈{140,200} are considered for the index trial, with an allocation ratio of 1:1 for intervention A vs. C. The sample sizes are small but not unusual in applications of MAIC in HTA submissions [10]. It is anticipated that smaller trials are subject to a greater chance of covariate imbalance than larger trials [74].

The level of (deterministic) covariate overlap. Covariates follow normal marginal distributions in both studies. For the competitor trial, the marginal distribution means are fixed at 0.6. For the index trial, the mean μ_{k}∈{0.5,0.4,0.3} for each covariate k. These settings yield strong, moderate and poor overlap, respectively. The standard deviations in both studies are fixed at 0.4, i.e., a one standard deviation increase in each covariate is associated with a 0.8 unit increase in the outcome. Greater covariate imbalances across studies lead to poorer overlap between the trials’ target populations, which translates into more variable weights and a lower ESS. Unless otherwise stated, when describing the results of the simulation study, “covariate overlap” relates to deterministic overlap between the trials’ target populations and not to random violations arising due to small sample sizes.
Estimands
The target estimand is the marginal mean difference for A vs. B in S=2. The treatment coefficient β_{t}=−2 is the same for both A vs. C and B vs. C, and the shared (conditional) effect modifier assumption holds. Therefore, the true conditional treatment effects for A vs. C and B vs. C in S=2 are the same (−2+3×(0.6×1)=−0.2). Because mean differences are collapsible, the true marginal treatment effects for A vs. C and B vs. C coincide with the corresponding conditional estimands. The true marginal effect for A vs. B in S=2 is a composite of that for A vs. C and B vs. C, which cancel out. Consequently, the true marginal mean difference for A vs. B in S=2 is zero.
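The arithmetic behind these true values can be written out explicitly. In the sketch below, \(\mu_{k}^{(S)}\) denotes the mean of covariate k in the population of study S; this symbol is introduced here for illustration and is an assumption about notation.

```latex
\begin{align*}
\Delta_{10}^{(2)} &= \beta_{t} + \sum_{k=1}^{3} \beta_{2,k}\,\mu_{k}^{(2)}
                   = -2 + 3 \times (1 \times 0.6) = -0.2, \\
\Delta_{20}^{(2)} &= -2 + 3 \times (1 \times 0.6) = -0.2, \\
\Delta_{12}^{(2)} &= \Delta_{10}^{(2)} - \Delta_{20}^{(2)} = 0, \\
\Delta_{10}^{(1)} &= -2 + 3\,\mu^{(1)} \in \{-0.5,\, -0.8,\, -1.1\}
                   \quad \text{for } \mu^{(1)} \in \{0.5,\, 0.4,\, 0.3\}.
\end{align*}
```

The last line shows that the A vs. C effect in the index trial population differs from that in S=2 under every overlap setting, which is why an unadjusted comparison would be biased.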
Note that all the methods being compared conduct the same unadjusted analysis to estimate the marginal treatment effect of B vs. C. Because the competitor study is a randomized trial, this estimate should be unbiased with respect to the corresponding marginal estimand in S=2. Therefore, differences in performance between the methods will arise from the comparison between A and C, for which marginal and conditional estimands are non-null.
Methods
Each simulated dataset is analyzed using the following methods:

Matching-adjusted indirect comparison (MAIC). The trial assignment model in Eq. 2 contains main effect terms for all three effect modifiers, so only covariate means are balanced. The objective function in Eq. 3 is minimized using BFGS [41]. The weights estimated by Eq. 4 are used to fit a weighted simple linear regression of outcome on treatment to the index trial IPD.

Two-stage matching-adjusted indirect comparison (2SMAIC). We follow the same steps as for the standard MAIC. In addition, the treatment assignment model in Eq. 7 is fitted to the index study IPD, including main effect terms for all three baseline covariates. Propensity score estimates are generated by Eq. 8 and combined with the weights generated by Eq. 4 as per Eq. 9. The resulting weights are used to fit a weighted simple linear regression of outcome on treatment to the index trial IPD.

Truncated matching-adjusted indirect comparison (TMAIC). This approach is identical to MAIC but the highest estimated weights (Eq. 4) are truncated using a 95th percentile cutpoint, following Susukida et al. [75, 76], Webster-Clark et al. [77], and Lee et al. [31]. Specifically, all weights above the 95th percentile are replaced by the value of the 95th percentile.

Truncated two-stage matching-adjusted indirect comparison (T2SMAIC). This approach is identical to 2SMAIC but all the estimated weights (Eq. 9) larger than the 95th percentile are set equal to the 95th percentile.
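The weighting steps listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in the simulation study. It minimizes the usual method-of-moments objective with a damped Newton iteration instead of BFGS. It takes the propensity scores as given rather than fitting a logistic regression. It also assumes that the two sets of weights are combined by multiplying each odds weight by the inverse probability of the treatment actually received. All function names are hypothetical.

```python
import math

def solve_linear(a, b):
    """Solve a small dense system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def maic_weights(x, target_means, max_iter=100, tol=1e-9):
    """Odds weights minimizing Q(alpha) = sum_i exp(alpha . (x_i - target))."""
    k = len(target_means)
    xc = [[xi[j] - target_means[j] for j in range(k)] for xi in x]

    def weights(alpha):
        return [math.exp(sum(a * v for a, v in zip(alpha, row))) for row in xc]

    alpha = [0.0] * k
    for _ in range(max_iter):
        w = weights(alpha)
        grad = [sum(wi * row[j] for wi, row in zip(w, xc)) for j in range(k)]
        if max(abs(g) for g in grad) < tol:
            break
        hess = [[sum(wi * row[j] * row[l] for wi, row in zip(w, xc))
                 for l in range(k)] for j in range(k)]
        step = solve_linear(hess, grad)
        # Halve the Newton step until the convex objective does not increase.
        q, scale = sum(w), 1.0
        while scale > 1e-8:
            candidate = [a - scale * s for a, s in zip(alpha, step)]
            if sum(weights(candidate)) <= q * (1.0 + 1e-12):
                break
            scale /= 2.0
        alpha = candidate
    return weights(alpha)

def effective_sample_size(w):
    return sum(w) ** 2 / sum(wi * wi for wi in w)

def two_stage_weights(odds_w, t, propensity):
    """Assumed combination: odds weight times inverse probability of the arm received."""
    return [w / e if ti == 1 else w / (1.0 - e)
            for w, ti, e in zip(odds_w, t, propensity)]

def weighted_mean_difference(y, t, w):
    """Weighted regression of outcome on binary treatment = difference in weighted arm means."""
    def wmean(arm):
        num = sum(wi * yi for yi, ti, wi in zip(y, t, w) if ti == arm)
        return num / sum(wi for ti, wi in zip(t, w) if ti == arm)
    return wmean(1) - wmean(0)

# Toy index-trial data with two covariates, for illustration only.
x = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
t = [1, 0, 1, 0, 1, 0]
y = [1.0, 2.0, 0.5, 3.0, 2.5, 4.0]
w = maic_weights(x, target_means=[0.7, 0.6])
```

With a constant propensity score of 0.5, the combined weights are a rescaling of the odds weights and the point estimate is unchanged. Differences between the one-stage and two-stage estimates arise only when the estimated propensity scores vary with the covariates.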
All approaches use the ordinary nonparametric bootstrap to estimate the variance of the A vs. C marginal treatment effect. 2,000 resamples of each simulated dataset are drawn with replacement [50, 78]. Due to patient-level data limitations for the competitor study, only the IPD of the index trial are resampled in the implementation of the bootstrap. The average marginal mean difference for A vs. C in S=2 is computed as the average across the bootstrap resamples. Its standard error is the standard deviation across these resamples. For the “one-stage” MAIC approaches, each bootstrap iteration re-estimates the trial assignment model. For the “two-stage” MAIC approaches, both the trial assignment and the treatment assignment model are re-estimated in each iteration.
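A minimal sketch of this bootstrap is given below. The `estimator` argument is a hypothetical callable that would re-estimate the weighting model(s) and return the A vs. C estimate for one resample. Here it is replaced by a simple mean so that the example stays self-contained.

```python
import random
import statistics

def bootstrap(rows, estimator, n_resamples=2000, seed=None):
    """Nonparametric bootstrap: resample rows with replacement and re-estimate."""
    rng = random.Random(seed)
    n = len(rows)
    estimates = []
    for _ in range(n_resamples):
        resample = [rows[rng.randrange(n)] for _ in range(n)]
        # In MAIC/2SMAIC, the weights would be re-estimated inside `estimator`.
        estimates.append(estimator(resample))
    # Point estimate = mean across resamples; standard error = their SD.
    return statistics.fmean(estimates), statistics.stdev(estimates)

def mean_outcome(rows):
    """Stand-in estimator used only for this illustration."""
    return statistics.fmean([r["y"] for r in rows])

rows = [{"y": v} for v in (1.0, 2.0, 3.0, 4.0)]
point, se = bootstrap(rows, mean_outcome, n_resamples=500, seed=0)
```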
All methods perform the indirect treatment comparison in a final stage, where the results of the study-specific analyses are combined. The marginal mean difference for A vs. B is obtained by directly substituting the point estimates \(\hat {\Delta }_{10}^{(2)}\) and \(\hat {\Delta }_{20}^{(2)}\) in Eq. 1. Its variance is estimated by adding the point estimates of the variance for the within-study treatment effect estimates. Wald-type 95% confidence interval estimates are constructed using normal distributions.
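The final stage can be sketched as below, assuming that Eq. 1 takes the usual anchored form in which the A vs. B effect is the difference between the A vs. C and B vs. C effects. The function name and the input values are hypothetical.

```python
from statistics import NormalDist

def indirect_comparison(delta_10, var_10, delta_20, var_20, level=0.95):
    """Anchored indirect comparison with a Wald-type confidence interval."""
    estimate = delta_10 - delta_20      # A vs. B = (A vs. C) - (B vs. C)
    variance = var_10 + var_20          # the two studies are independent
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * variance ** 0.5
    return estimate, variance, (estimate - half_width, estimate + half_width)

# Illustrative inputs only; these are not results from the simulation study.
est, var, ci = indirect_comparison(-0.25, 0.16, -0.20, 0.09)
```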
Performance measures
We generate 5,000 simulated datasets per simulation scenario. For each scenario and analysis method, the following performance metrics are computed over the 5,000 replicates: (1) bias in the estimated treatment effect; (2) empirical standard error (ESE); (3) mean square error (MSE); and (4) empirical coverage rate of the 95% confidence interval estimates. These metrics are defined explicitly in prior work [7, 72].
The bias evaluates aim 1 of the simulation study. It is equal to the average treatment effect estimate across the simulations because the true estimand is zero (\(\Delta _{12}^{(2)}=0\)). The ESE targets aim 2 and is the standard deviation of the treatment effect estimates over the 5,000 runs. The MSE represents the average squared bias plus the variance across the simulated replicates. It measures overall efficiency (aim 3), accounting for both bias (aim 1) and precision (aim 2). Coverage assesses aim 4, and is computed as the percentage of estimated 95% confidence intervals that contain the true value of the estimand.
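For concreteness, the four performance measures can be computed from the per-replicate estimates and interval estimates as in the following sketch. The function name and the toy inputs are hypothetical.

```python
import statistics

def performance(estimates, intervals, truth=0.0):
    """Bias, empirical standard error, MSE and coverage over simulation replicates."""
    bias = statistics.fmean(estimates) - truth
    ese = statistics.stdev(estimates)
    mse = statistics.fmean([(e - truth) ** 2 for e in estimates])
    covered = sum(1 for lo, hi in intervals if lo <= truth <= hi)
    return {"bias": bias, "ese": ese, "mse": mse,
            "coverage": covered / len(estimates)}

# Four toy replicates, for illustration only.
estimates = [0.1, -0.1, 0.2, -0.2]
intervals = [(e - 0.15, e + 0.15) for e in estimates]
metrics = performance(estimates, intervals)
```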
We have used 5,000 replicates per scenario based on the analysis method and scenario with the largest long-run variability (standard MAIC with n=140 and poor overlap). Assuming \(\text {SD}(\hat {\Delta }_{12}^{(2)}) \leq 0.53\), the Monte Carlo standard error (MCSE) of the bias is at most \(\sqrt {\text {Var}(\hat {\Delta }_{12}^{(2)})/N_{sim}}=\sqrt {0.28/5000}=0.007\) under 5,000 simulations per scenario, and the MCSE of the coverage, based on an empirical coverage rate of 95% is \(\left (\sqrt {(95 \times 5)/5000}\right)\%=0.31\%\), with the worst case being 0.71% under 50% coverage. These are considered adequate levels of simulation uncertainty.
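The Monte Carlo standard errors quoted above can be checked with a few lines of Python. The helper names are hypothetical.

```python
import math

def mcse_bias(sd, n_sim):
    """Monte Carlo standard error of the bias: SD of the estimates over sqrt(N_sim)."""
    return sd / math.sqrt(n_sim)

def mcse_coverage(coverage, n_sim):
    """Monte Carlo standard error of an empirical coverage proportion."""
    return math.sqrt(coverage * (1.0 - coverage) / n_sim)

n_sim = 5000
bias_mcse = mcse_bias(0.53, n_sim)
coverage_mcse_95 = mcse_coverage(0.95, n_sim)
coverage_mcse_50 = mcse_coverage(0.50, n_sim)
```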
Results
Performance measures for all methods and simulation scenarios are reported in Fig. 1. The strong overlap settings are at the top (in ascending order of index trial sample size), followed by the moderate overlap settings and the poor overlap settings at the bottom. For each data-generating mechanism, there is a ridgeline plot visualizing the spread of point estimates for the marginal A vs. B treatment effect over the 5,000 simulation replicates. Below each plot, a table summarizing the performance metrics of each method is displayed. MCSEs for each metric, used to quantify the simulation uncertainty, have been computed and are presented in parentheses alongside the average of each performance measure. These are considered negligible due to the large number of simulated datasets per scenario. In Fig. 1, Cov denotes the empirical coverage rate of the 95% confidence interval estimates.
In the most extreme scenario (n=140 and poor covariate overlap), weights could not be estimated for 1 of the 5,000 simulated datasets. This was due to total separation: empirically, all the values observed in the index trial for one of the baseline covariates were below the competitor study mean. Therefore, there were no feasible solutions minimizing the objective function in Eq. 3. The affected replicate was discarded, and 4,999 simulated datasets were analyzed in the corresponding scenario. With respect to the treatment assignment model, empirical overlap between treatment arms was always excellent due to randomization in the index trial.
Bias
Even with the small index trial sample sizes, bias is similarly low for MAIC and 2SMAIC without truncation in all simulation scenarios. There is a slight increase in bias as the ESS after weighting decreases, with the bias of highest magnitude occurring with n=140 and poor covariate overlap (the scenario with the lowest ESS after weighting) for MAIC (0.041) and 2SMAIC (0.031). In absolute terms, the bias of 2SMAIC is smaller than that of MAIC in all simulation scenarios. For 2SMAIC, it is within Monte Carlo error of zero in all scenarios except in the most extreme setting, mentioned earlier, and in the setting with n=200 and moderate overlap (0.008). Of all methods, 2SMAIC produces the lowest bias in every simulation scenario.
Weight truncation increases absolute bias in all scenarios. TMAIC and T2SMAIC consistently exhibit greater bias than MAIC and 2SMAIC. When overlap is strong, truncation only induces bias very slightly. As overlap is reduced, the bias induced by truncation is more noticeable, particularly in the n=140 settings. For instance, the bias for TMAIC and T2SMAIC in the scenarios with poor overlap is substantial (for n=140: 0.157 and 0.160, respectively; for n=200, 0.149 and 0.153). For the truncated methods, the magnitude of the bias also appears to increase as the ESS after weighting decreases.
Precision
As expected, all methods incur precision losses as the number of subjects in the index trial and covariate overlap decrease. Despite enforcing randomization in the index trial, 2SMAIC increases precision, as measured by the ESE, with respect to MAIC in every simulation scenario. Reductions in ESE are more dramatic in the n=140 settings than in the n=200 settings. This is attributed to a greater chance of empirical covariate imbalances with smaller sample sizes. Interestingly, reduced covariate overlap seems to minimize the effect of incorporating the second (treatment assignment) stage. This is likely due to precision gains being offset by the presence of extreme weights, which lead to large reductions in ESS and inflate the ESE. The same trends are revealed for T2SMAIC with respect to TMAIC across the simulation scenarios. Both “two-stage” versions have reduced ESEs compared to their “one-stage” counterparts in all scenarios.
Weight truncation decreases the ESE across all simulation scenarios for one-stage and two-stage MAIC. This is to be expected as the influence of outlying weights is reduced. When overlap is strong, truncation offers only a small improvement in precision. This has little impact in comparison to the inclusion of a second stage in MAIC. For instance, under strong overlap and n=140, the ESE for MAIC and 2SMAIC is 0.516 and 0.386, respectively; compared to ESEs of 0.489 and 0.371 for the corresponding truncated versions.
The precision gains of weight truncation become more considerable as overlap weakens and the extremity of the weights increases. When overlap is poor, truncation reduces the ESE more sharply than the incorporation of a second stage in MAIC. For example, under poor overlap and n=140, the ESE of MAIC and 2SMAIC is 0.767 and 0.703, respectively, and that of the truncated versions is 0.563 and 0.519. Unsurprisingly, the combination of incorporating the second stage and truncating the weights is most effective at variance reduction. As n decreases, precision seems to be more markedly reduced for the one-stage approaches than for the two-stage approaches, and for the untruncated approaches than for the truncated ones.
Where covariate overlap is strong, T2SMAIC has the highest precision, followed by 2SMAIC, TMAIC and MAIC. Where covariate overlap is moderate or poor, T2SMAIC has the highest precision, followed by TMAIC, 2SMAIC and MAIC.
Efficiency
As per the ESE, MSE values decrease for all methods as the index trial sample size and covariate overlap increase. In agreement with the trends for precision, the two-stage versions of MAIC increase efficiency with respect to the corresponding one-stage methods in all scenarios, particularly in the n=140 settings. Efficiency gains for the two-stage approaches are stronger where covariate overlap is strong and become less noticeable as covariate overlap weakens, due to extreme weights. For instance, with strong overlap and n=200, MSEs for MAIC and 2SMAIC are 0.205 and 0.127, respectively. With poor overlap and n=200, these are 0.459 and 0.393, respectively.
Differences in MSE between methods are driven more by comparative precision than bias. This is expected in the strong overlap scenarios, where the bias for all methods is negligible, but also occurs in the poor overlap scenarios. The precision gains of truncation more than counterbalance the increase in bias when the variability of the weights is high. As overlap decreases, the relative efficiency of the truncated versus the untruncated approaches is markedly improved. For example, with poor overlap and n=200, the MSE of TMAIC and T2SMAIC is 0.263 and 0.233, respectively (compared to MSEs of 0.459 and 0.393 for MAIC and 2SMAIC).
T2SMAIC is the most efficient method and MAIC is the least efficient method across all simulation scenarios in terms of MSE. Where covariate overlap is strong, T2SMAIC yields the highest efficiency, followed by 2SMAIC, TMAIC and MAIC. Where overlap is poor, T2SMAIC has the highest efficiency, followed by TMAIC, 2SMAIC and MAIC. Where overlap is moderate, 2SMAIC and TMAIC have comparable efficiency.
Coverage
From a frequentist perspective, 95% confidence interval estimates should include the true estimand 95% of the time. Namely, empirical coverage rates should equal the nominal coverage rates to ensure appropriate type I error rates for testing a “no effect” null hypothesis. Theoretically, due to our use of 5,000 Monte Carlo simulations per scenario, empirical coverage rates are statistically significantly different to the desired 0.95 if they are under 0.944 or over 0.956.
Empirical coverage rates for MAIC are statistically significantly different to the nominal coverage rate in all but one scenario: that with strong overlap and n=200. Where covariate overlap is strong or moderate, all other methods exhibit empirical coverage rates that are very close to the advertised nominal values (none of the differences from the nominal rate is statistically significant, except for TMAIC in the scenario with strong overlap and n=140).
There is discernible undercoverage for all methods when overlap is poor. This is particularly the case for the approaches without truncation. For instance, for the smallest sample size (n=140) with poor overlap, the empirical coverage rate is 0.900 for MAIC and 0.917 for 2SMAIC. These anti-conservative inferences could arise from the use of normal distribution-based confidence intervals when the ESS after weighting is small. While the large-sample normal approximation produces asymptotically valid inferences, a reasonable alternative in small ESS scenarios could be the use of a t-distribution. An open question is how to choose the degrees of freedom of the t-distribution.
Interestingly, coverage drops are larger for the untruncated approaches than for the truncated approaches as overlap weakens. This is surprising because the truncated methods induce sizeable bias in the poor overlap settings, and one would have expected coverage rates to be degraded further by this bias. Weight truncation has improved coverage rates in another simulation study in a different context [31]. This warrants further investigation. Overcoverage is not a problem for any of the methods as the empirical coverage rates never rise above 0.956.
Discussion
Limitations of simulation study
In all simulation scenarios, two-stage methods offer enhanced precision and efficiency with respect to one-stage methods. These gains are likely linked to the prognostic strength of the baseline covariates included in the treatment assignment model. We have assumed, as is typically the case in practice, that the baseline covariates are prognostic of outcome. Less notable increases in precision and efficiency are expected when covariate-outcome associations are lower.
All approaches depend on the critical assumption of conditional transportability over trials. Given the somewhat arbitrary and unclear process driving selection into different studies in our context (in reality, there is not a formal assignment process determining whether subjects are in study sample S=1 or S=2), we have not specified a true trial assignment mechanism in the simulation study. Nevertheless, the true outcome-generating mechanism imposes linearity and additivity assumptions in the covariate-outcome associations and the treatment-by-covariate interactions. Conditional transportability holds because the trial assignment model balances means for all the covariates that modify the marginal treatment effect of A vs. C.
In real-life scenarios, it is entirely possible that more complex relationships underlie the outcome-generating process. These would potentially require balancing higher-order moments, covariate-by-covariate interactions and nonlinear transformations of the covariates. In practice, sensitivity analyses will be required to explore whether there are discrepancies in the results produced by different model specifications.
The methods evaluated in this article focus on correcting for imbalances in baseline covariates, i.e., the ‘P’ in the PICO (Population, Intervention, Comparator, Outcome) framework [79]. Nevertheless, there are other kinds of differences which may bias indirect treatment comparisons, e.g. in comparator or endpoint definitions. The methodologies that have been evaluated in this article cannot adjust for these types of differences.
Contributions in light of recent simulation studies
Prior simulation studies in the context of anchored indirect treatment comparisons have concluded that outcome regression is more precise and efficient than weighting when the conditional outcome-generating mechanism is known [23, 24]. This is likely to remain the case despite the performance gains of 2SMAIC and the truncated approaches with respect to MAIC.
Nevertheless, there is one caveat. In these studies, the (one-stage) MAIC trial assignment model only accounts for covariates that are marginal effect modifiers. The reason for this is that including prognostic covariates that are not effect modifiers deteriorates precision without improving the potential for bias reduction. Conversely, the outcome regression approaches have included all prognostic covariates in the outcome model, making use of this prognostic information to increase precision and efficiency. Therefore, the equipoise or fairness in previous comparisons between weighting and outcome regression is debatable.
With 2SMAIC, weighting approaches can now make use of this prognostic information by including the relevant covariates in the treatment assignment model. Future simulation studies comparing weighting and outcome regression should involve 2SMAIC as opposed to its one-stage counterpart, particularly in these “perfect information” scenarios.
Extension to observational studies
Almost invariably, anchored MAIC has been applied in a setting where the index trial is randomized. In this setting, the inclusion of the treatment assignment model leads to efficiency gains by increasing precision. Any reduction in bias will be, at most, modest due to the internal validity of the index trial. Nevertheless, in situations where the index study is observational, the treatment assignment model can be useful to reduce internal validity bias due to confounding.
Transporting the results of a nonrandomized study from S=1 to S=2 requires further untestable assumptions. Additional barriers are: (1) susceptibility to unmeasured confounding; and (2) positivity issues. Due to randomization, there is typically excellent overlap between treatment arms in RCTs. However, theoretical (deterministic) violations of positivity may occur in observational study designs [34, 60, 80], e.g. subjects with certain covariate values may have a contraindication for receiving one of the treatments, resulting in a null probability of treatment assignment.
In addition to these conceptual problems, “chance” violations of positivity may occur with small sample sizes or high-dimensional data due to sampling variability, in both randomized and nonrandomized studies. These have not been observed in this simulation study. Near-violations of positivity between treatment arms may lead to extreme inverse probability of treatment weights [81], further inflating variance in 2SMAIC.
Finally, it is worth noting that observational study designs have traditionally been more prone than RCTs to additional causes of internal validity bias, e.g. missing outcome data, measurement error or protocol deviations [82].
Approaches for variance reduction
Weight truncation is a relatively informal but easily implemented method to improve precision by restricting the contribution of extreme weights. The choice of a 95th percentile cutoff is based on prior literature and is somewhat arbitrary, but worked well in this simulation study. Alternative threshold values could be considered.
Lower thresholds will further reduce variance at the cost of introducing more bias and shifting the target population or estimand definition further [32, 83]. The ideal truncation level will vary on a case-by-case basis and can be set empirically, e.g. by progressively truncating the weights [32, 84]. Density plots are likely helpful to assess the dispersion of the weights and identify an optimal cutoff point. Weight truncation is likely of little utility where there is sufficient overlap and the weights are well-behaved. Efficiency gains are expected to decrease with larger sample sizes, as the induced bias could potentially offset the reduction of variance.
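Progressive truncation can be explored with a short sketch such as the one below. It assumes a linear-interpolation percentile rule (the default “type 7” rule of R's `quantile` function); whether the simulation study used this exact rule is not stated here. The function names and the toy weights are hypothetical.

```python
def percentile(values, p):
    """Linear-interpolation percentile of a list, for 0 <= p <= 1."""
    s = sorted(values)
    h = (len(s) - 1) * p
    lo = int(h)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (h - lo) * (s[hi] - s[lo])

def truncate_weights(w, p=0.95):
    """Replace every weight above the p-th percentile by that percentile."""
    cap = percentile(w, p)
    return [min(wi, cap) for wi in w]

def effective_sample_size(w):
    return sum(w) ** 2 / sum(wi * wi for wi in w)

# Toy weights 1, 2, ..., 21; report the cap and ESS at several cutoffs.
weights = [float(i) for i in range(1, 22)]
for p in (1.0, 0.99, 0.95, 0.90):
    tw = truncate_weights(weights, p)
    print(p, round(max(tw), 3), round(effective_sample_size(tw), 2))
```

Tabulating the largest weight and the ESS at each cutoff gives a simple numerical companion to the density plots mentioned above.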
We have only explored two strategies to improve efficiency: (1) modeling the treatment assignment mechanism; and (2) truncating the weights that are above a certain level. Nevertheless, there are other approaches that could be used in practical applications, either on their own or combined with the procedures explored in this article. Weight trimming [85] is closely related to weight truncation. It involves excluding the subjects with outlying weights, thereby sharing many of the limitations of truncation: setting arbitrary cutoff points, and changing the target population even further. Trimming is unappealing because it directly throws away information, discarding data from some individuals, and likely losing precision with respect to truncation.
The use of stabilized weights is often recommended to gain precision and efficiency [32, 86], particularly when the weights are highly variable. In the implementations of MAIC in this article, the fitted weighted outcome model is considered to be “saturated” (i.e., cannot be misspecified) because it is a marginal model of outcome on a time-fixed binary treatment [87]. For saturated models, stabilized and unstabilized weights give identical results [87]. Nevertheless, weight stabilization is encouraged when the weighted outcome model is unsaturated, e.g. with dynamic (time-varying) or continuous-valued treatment regimens [44, 88].
Another approach that has been used to gain efficiency is overlap weighting [89, 90]. It also changes the target estimand, estimating treatment effects in a subsample with good overlap. While the approach is worth consideration, it is challenging to implement in our context because IPD are unavailable for the competitor study.
In the Background section, we referred to the weight estimation procedure by Jackson et al. [22], which satisfies the method of moments while maximizing the ESS, thereby reducing the dispersion of the weights. 2SMAIC is a modular framework and this approach could be used instead of the standard method of moments to estimate the trial assignment odds weights. Different weighting modules could be incorporated to account for missing outcomes [35], treatment switching [91, 92] and other forms of nonadherence to the protocol [36] in the index trial.
Conclusions
We have introduced 2SMAIC, an extension of MAIC that combines a model for the treatment assignment mechanism in the index trial with a model for the trial assignment mechanism. The first model accounts for covariate differences between treatment arms, producing inverse probability weights that can balance the treatment groups of the index study. The second model accounts for effect modifier differences between studies, generating odds weights that achieve balance across trials and allow us to transport the marginal effect for A vs. C from S=1 to S=2. In 2SMAIC, both weights are combined to attain balance between the treatment arms of the index trial and across the studies.
The statistical performance of 2SMAIC has been investigated in scenarios where the index study is an RCT. We find that the addition of a second (treatment assignment) stage increases precision and efficiency with respect to the standard one-stage MAIC. It does so without inducing bias and while being less prone to undercoverage. Efficiency and precision gains are prominent when the index trial has a small sample size, in which case it is subject to empirical imbalances in prognostic baseline covariates. Two-stage MAIC accounts for these chance imbalances through the treatment assignment model, mitigating the precision loss coming with decreasing sample sizes. Precision and efficiency gains are attenuated when there is poor overlap between the target populations of the studies, due to the high extremity of the estimated weights.
The inclusion of weight truncation approaches has been evaluated for the first time in the context of MAIC. The one-stage and two-stage approaches produced very little bias before truncation was applied. Where covariate overlap was strong and the variability of the weights tolerable, truncation only improved precision and efficiency slightly, while inducing bias. The benefits of truncation become more apparent in situations with weakening overlap, where it diminishes the influence of extreme weights, substantially improving precision and even coverage with respect to the untruncated approaches.
Due to bias-variance trade-offs, the precision improvements from truncation come at the cost of bias. In this simulation study, the trade-off favors variance reduction over the induced bias, with truncation improving efficiency in all scenarios. Nevertheless, truncation is likely unnecessary when the weights are well-behaved and the ESS after weighting is sizeable. The combination of a second stage and weight truncation is most effective in improving precision and efficiency in all simulation scenarios.
When covariate overlap is poor, undercoverage is an issue for all methods, particularly for the untruncated approaches. Novel outcome regression-based techniques [21, 23, 24, 25, 93] may be preferable in these situations. The development of doubly robust approaches that combine outcome modeling with a model for the trial assignment weights is also attractive, as these would give researchers two chances for correct model specification.
In the absence of a common comparator group, unanchored comparisons contrast the outcomes of single treatment arms between studies. Because one of the stages relies on estimating the treatment assignment mechanism in the index study, the two-stage approaches are not applicable in the unanchored case. This is a limitation, as many applications of covariate-adjusted indirect comparisons are in this setting [10], both in published studies and in health technology appraisals.
Finally, we address a misconception that has arisen recently in the literature [25, 94]. It is believed that MAIC replicates the unadjusted analysis that would be performed in a hypothetical “ideal RCT” because it targets a marginal estimand, and that MAIC cannot make use of information on prognostic covariates. While all approaches to MAIC target marginal estimands, they produce covariate-adjusted estimates of the marginal effect. The standard one-stage approach to MAIC accounts for covariate differences across studies. The two-stage approaches introduced in this article generate covariate-adjusted estimates that also account for imbalances between treatment arms in the index trial, as is the case in covariate-adjusted analyses of RCTs.
Availability of data and materials
The files required to generate the data, run the simulations, and reproduce the results are available at http://github.com/remiroazocar/Maic2stage.
Notes
This assumption is strong and untestable. Nevertheless, it is weaker than that required by unanchored comparisons. Unanchored comparisons compare absolute outcome means as opposed to relative effect estimates. Therefore, these rely on the conditional exchangeability of the absolute outcome mean under active treatment (conditional constancy of absolute effects) [5, 6, 40, 59]. This requires capturing all factors that are prognostic of outcome given active treatment.
The files required to run the simulations are available at http://github.com/remiroazocar/Maic2stage.
Abbreviations
2SMAIC: Two-stage matching-adjusted indirect comparison
ALD: Aggregate-level data
ESE: Empirical standard error
ESS: Effective sample size
IPD: Individual patient data
HTA: Health technology assessment
MAIC: Matching-adjusted indirect comparison
MCSE: Monte Carlo standard error
MSE: Mean square error
PICO: Population, Intervention, Comparator, Outcome
RCT: Randomized controlled trial
TMAIC: Truncated matching-adjusted indirect comparison
T2SMAIC: Truncated two-stage matching-adjusted indirect comparison
References
Vreman RA, Naci H, Goettsch WG, MantelTeeuwisse AK, Schneeweiss SG, Leufkens HG, Kesselheim AS. Decision making under uncertainty: comparing regulatory and health technology assessment reviews of medicines in the united states and europe. Clin Pharmacol Ther. 2020; 108(2):350–7.
Sutton A, Ades A, Cooper N, Abrams K. Use of indirect and mixed treatment comparisons for technology assessment. Pharmacoeconomics. 2008; 26(9):753–67.
Bucher HC, Guyatt GH, Griffith LE, Walter SD. The results of direct and indirect treatment comparisons in meta-analysis of randomized controlled trials. J Clin Epidemiol. 1997; 50(6):683–91.
Dias S, Sutton AJ, Ades A, Welton NJ. Evidence synthesis for decision making 2: a generalized linear modeling framework for pairwise and network meta-analysis of randomized controlled trials. Med Dec Making. 2013; 33(5):607–17.
Phillippo D, Ades T, Dias S, Palmer S, Abrams KR, Welton N. NICE DSU Technical Support Document 18: methods for population-adjusted indirect comparisons in submissions to NICE. Sheffield: NICE Decision Support Unit; 2016.
Phillippo DM, Ades AE, Dias S, Palmer S, Abrams KR, Welton NJ. Methods for population-adjusted indirect comparisons in health technology appraisal. Med Dec Making. 2018; 38(2):200–11.
Remiro-Azócar A, Heath A, Baio G. Methods for population adjustment with limited access to individual patient data: A review and simulation study. Res Synth Methods. 2021; 12(6):750–75.
Remiro-Azócar A, Heath A, Baio G. Conflating marginal and conditional treatment effects: Comments on “assessing the performance of population adjustment methods for anchored indirect comparisons: A simulation study”. Stat Med. 2021; 40(11):2753–8.
Remiro-Azócar A, Heath A, Baio G. Effect modification in anchored indirect treatment comparisons: Comments on “matching-adjusted indirect comparisons: Application to time-to-event data”. Stat Med. 2022; 41(8):1541–53.
Phillippo DM, Dias S, Elsada A, Ades A, Welton NJ. Population adjustment methods for indirect comparisons: A review of National Institute for Health and Care Excellence technology appraisals. Int J Technol Assess Health Care. 2019; 35(3):221–8.
Signorovitch JE, Wu EQ, Yu AP, Gerrits CM, Kantor E, Bao Y, Gupta SR, Mulani PM. Comparative effectiveness without head-to-head trials. Pharmacoeconomics. 2010; 28(10):935–45.
Signorovitch J, Erder MH, Xie J, Sikirica V, Lu M, Hodgkins PS, Wu EQ. Comparative effectiveness research using matching-adjusted indirect comparison: an application to treatment with guanfacine extended release or atomoxetine in children with attention-deficit/hyperactivity disorder and comorbid oppositional defiant disorder. Pharmacoepidemiol Drug Saf. 2012; 21:130–7.
Signorovitch JE, Sikirica V, Erder MH, Xie J, Lu M, Hodgkins PS, Betts KA, Wu EQ. Matching-adjusted indirect comparisons: a new tool for timely comparative effectiveness research. Value Health. 2012; 15(6):940–7.
Hatswell AJ, Freemantle N, Baio G. The effects of model misspecification in unanchored matching-adjusted indirect comparison: results of a simulation study. Value Health. 2020; 23(6):751–9.
Cheng D, Ayyagari R, Signorovitch J. The statistical performance of matching-adjusted indirect comparisons: Estimating treatment effects with aggregate external control data. Ann Appl Stat. 2020; 14(4):1806–33.
Wang J. On matching-adjusted indirect comparison and calibration estimation. arXiv preprint arXiv:2107.11687. 2021.
Petto H, Kadziola Z, Brnabic A, Saure D, Belger M. Alternative weighting approaches for anchored matching-adjusted indirect comparisons via a common comparator. Value Health. 2019; 22(1):85–91.
Kühnast S, Schiffner-Rohe J, Rahnenführer J, Leverkus F. Evaluation of adjusted and unadjusted indirect comparison methods in benefit assessment. Methods Inf Med. 2017; 56(03):261–7.
Weber D, Jensen K, Kieser M. Comparison of methods for estimating therapy effects by indirect comparisons: A simulation study. Med Dec Making. 2020; 40(5):644–54.
Jiang Y, Ni W. Performance of unanchored matching-adjusted indirect comparison (MAIC) for the evidence synthesis of single-arm trials with time-to-event outcomes. BMC Med Res Methodol. 2020; 20(1):1–9.
Phillippo DM, Dias S, Ades A, Welton NJ. Assessing the performance of population adjustment methods for anchored indirect comparisons: A simulation study. Stat Med. 2020; 39(30):4885–911.
Jackson D, Rhodes K, Ouwens M. Alternative weighting schemes when performing matchingadjusted indirect comparisons. Res Synth Methods. 2021; 12(3):333–46.
Remiro-Azócar A, Heath A, Baio G. Parametric g-computation for compatible indirect treatment comparisons with limited individual patient data. arXiv preprint arXiv:2108.12208. 2021.
Remiro-Azócar A, Heath A, Baio G. Marginalization of regression-adjusted treatment effects in indirect comparisons with limited patient-level data. arXiv preprint arXiv:2008.05951. 2020.
Phillippo DM, Dias S, Ades AE, Welton NJ. Target estimands for efficient decision making: Response to comments on “assessing the performance of population adjustment methods for anchored indirect comparisons: A simulation study”. Stat Med. 2021; 40(11):2759–63.
Ho DE, Imai K, King G, Stuart EA. Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Polit Anal. 2007; 15(3):199–236.
Rubin DB. Estimating causal effects from large data sets using propensity scores. Ann Intern Med. 1997; 127(8):757–63.
Belger M, Brnabic A, Kadziola Z, Petto H, Faries D. Inclusion of multiple studies in matching adjusted indirect comparisons (MAIC). Value Health. 2015; 18(3):33.
Phillippo DM, Dias S, Ades A, Welton NJ. Equivalence of entropy balancing and the method of moments for matching-adjusted indirect comparison. Res Synth Methods. 2020; 11(4):568–72.
Elliott MR, Little RJ. Model-based alternatives to trimming survey weights. J Off Stat. 2000; 16(3):191–210.
Lee BK, Lessler J, Stuart EA. Weight trimming and propensity score weighting. PloS ONE. 2011; 6(3):18174.
Cole SR, Hernán MA. Constructing inverse probability weights for marginal structural models. Am J Epidemiol. 2008; 168(6):656–64.
Moore KL, Neugebauer R, van der Laan MJ, Tager IB. Causal inference in epidemiological studies with strong confounding. Stat Med. 2012; 31(13):1380–404.
Léger M, Chatton A, Le Borgne F, Pirracchio R, Lasocki S, Foucher Y. Causal inference in case of near-violation of positivity: comparison of methods. Biom J. 2022. In press.
Seaman SR, White IR. Review of inverse probability weighting for dealing with missing data. Stat Methods Med Res. 2013; 22(3):278–95.
Cain LE, Cole SR. Inverse probability-of-censoring weights for the correction of time-varying noncompliance in the effect of randomized highly active antiretroviral therapy on incident AIDS or death. Stat Med. 2009; 28(12):1725–38.
Lunceford JK, Davidian M. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Stat Med. 2004; 23(19):2937–60.
Hahn J. On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica. 1998; 66(2):315–31.
Westreich D, Edwards JK, Lesko CR, Stuart E, Cole SR. Transportability of trial results using inverse odds of sampling weights. Am J Epidemiol. 2017; 186(8):1010–4.
Dahabreh IJ, Robertson SE, Steingrimsson JA, Stuart EA, Hernan MA. Extending inferences from a randomized trial to a new target population. Stat Med. 2020; 39(14):1999–2014.
Nocedal J, Wright S. Numerical optimization. New York: Springer Science and Business Media; 2006.
Kish L. Survey Sampling. New York: Wiley; 1965.
Schafer JL, Kang J. Average causal effects from nonrandomized studies: a practical guide and simulated example. Psychol Methods. 2008; 13(4):279.
Robins JM, Hernan MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000; 11(5):550–60.
Fay MP, Graubard BI. Small-sample adjustments for Wald-type tests using sandwich estimators. Biometrics. 2001; 57(4):1198–206.
Chen Z, Kaizar E. On variance estimation for generalizing from a trial to a target population. arXiv preprint arXiv:1704.07789. 2017.
Tipton E, Hallberg K, Hedges LV, Chan W. Implications of small samples for generalization: Adjustments and rules of thumb. Eval Rev. 2017; 41(5):472–505.
Raad H, Cornelius V, Chan S, Williamson E, Cro S. An evaluation of inverse probability weighting using the propensity score for baseline covariate adjustment in smaller population randomised controlled trials with a continuous outcome. BMC Med Res Methodol. 2020; 20(1):1–12.
Zeileis A. Object-oriented computation of sandwich estimators. J Stat Softw. 2006; 16:1–16.
Efron B, Tibshirani RJ. An introduction to the bootstrap. New York: CRC press; 1994.
Sikirica V, Findling RL, Signorovitch J, Erder MH, Dammerman R, Hodgkins P, Lu M, Xie J, Wu EQ. Comparative efficacy of guanfacine extended release versus atomoxetine for the treatment of attention-deficit/hyperactivity disorder in children and adolescents: applying matching-adjusted indirect comparison methodology. CNS Drugs. 2013; 27(11):943–53.
Hartman E, Grieve R, Ramsahai R, Sekhon JS. From sample average treatment effect to population average treatment effect on the treated: combining experimental with observational studies to estimate population treatment effects. J R Stat Soc Ser A (Stat Soc). 2015; 178(3):757–78.
Rubin DB. Randomization analysis of experimental data: The Fisher randomization test comment. J Am Stat Assoc. 1980; 75(371):591–3.
VanderWeele TJ, Hernan MA. Causal inference under multiple versions of treatment. J Causal Infer. 2013; 1(1):1–20.
VanderWeele TJ. Concerning the consistency assumption in causal inference. Epidemiology. 2009; 20(6):880–3.
Hernán MA, VanderWeele TJ. Compound treatments and transportability of causal inference. Epidemiology (Cambridge, Mass.) 2011; 22(3):368.
O’Muircheartaigh C, Hedges LV. Generalizing from unrepresentative experiments: a stratified propensity score approach. J R Stat Soc Ser C (Appl Stat). 2014; 63(2):195–210.
Zhang Z, Nie L, Soon G, Hu Z. New methods for treatment effect calibration, with applications to noninferiority trials. Biometrics. 2016; 72(1):20–29.
Rudolph KE, van der Laan MJ. Robust estimation of encouragement design intervention effects transported across sites. J R Stat Soc Ser B (Stat Methodol). 2017; 79(5):1509–25.
Westreich D, Cole SR. Invited commentary: positivity in practice. Am J Epidemiol. 2010; 171(6):674–7.
Stuart EA. Matching methods for causal inference: A review and a look forward. Stat Sci Rev J Inst Math Stat. 2010; 25(1):1.
Nie L, Zhang Z, Rubin D, Chu J. Likelihood reweighting methods to reduce potential bias in noninferiority trials which rely on historical data to make inference. Ann Appl Stat. 2013; 7(3):1796–813.
Brookhart MA, Schneeweiss S, Rothman KJ, Glynn RJ, Avorn J, Stürmer T. Variable selection for propensity score models. Am J Epidemiol. 2006; 163(12):1149–56.
Shortreed SM, Ertefaie A. Outcome-adaptive lasso: variable selection for causal inference. Biometrics. 2017; 73(4):1111–22.
Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983; 70(1):41–55.
Senn S. Testing for baseline balance in clinical trials. Stat Med. 1994; 13(17):1715–26.
Li X, Ding P. Rerandomization and regression adjustment. J R Stat Soc Ser B (Stat Methodol). 2020; 82(1):241–68.
Morris TP, Walker AS, Williamson EJ, White IR. Planning a method for covariate adjustment in individually-randomised trials: a practical guide. Trials. 2022; 23:328.
Williamson EJ, Forbes A, White IR. Variance reduction in randomised trials by inverse probability weighting using the propensity score. Stat Med. 2014; 33(5):721–37.
Hernán MA, Robins JM. Estimating causal effects from epidemiological data. J Epidemiol Community Health. 2006; 60(7):578–86.
Holland PW. Statistics and causal inference. J Am Stat Assoc. 1986; 81(396):945–60.
Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med. 2019; 38(11):2074–102.
R Core Team. R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2013.
Thompson DD, Lingsma HF, Whiteley WN, Murray GD, Steyerberg EW. Covariate adjustment had similar benefits in small and large randomized controlled trials. J Clin Epidemiol. 2015; 68(9):1068–75.
Susukida R, Crum RM, Hong H, Stuart EA, Mojtabai R. Comparing pharmacological treatments for cocaine dependence: Incorporation of methods for enhancing generalizability in meta-analytic studies. Int J Methods Psychiatr Res. 2018; 27(4):1609.
Susukida R, Crum RM, Stuart EA, Mojtabai R. Generalizability of the findings from a randomized controlled trial of a web-based substance use disorder intervention. Am J Addict. 2018; 27(3):231–7.
Webster-Clark MA, Sanoff HK, Stürmer T, Peacock Hinton S, Lund JL. Diagnostic assessment of assumptions for external validity: an example using data in metastatic colorectal cancer. Epidemiology (Cambridge, Mass.) 2019; 30(1):103.
Carpenter J, Bithell J. Bootstrap confidence intervals: when, which, what? a practical guide for medical statisticians. Stat Med. 2000; 19(9):1141–64.
Richardson WS, Wilson MC, Nishikawa J, Hayward RS, et al. The well-built clinical question: a key to evidence-based decisions. ACP J Club. 1995; 123(3):12–13.
Petersen ML, Porter KE, Gruber S, Wang Y, Van Der Laan MJ. Diagnosing and responding to violations in the positivity assumption. Stat Methods Med Res. 2012; 21(1):31–54.
Li F, Thomas LE, Li F. Addressing extreme propensity scores via the overlap weights. Am J Epidemiol. 2019; 188(1):250–7.
Deeks JJ, Dinnes J, D’Amico R, Sowden AJ, Sakarovitch C, Song F, Petticrew M, Altman D, et al. Evaluating non-randomised intervention studies. Health Technol Assess (Winchester, England). 2003; 7(27):1–173.
Xiao Y, Moodie EE, Abrahamowicz M. Comparison of approaches to weight truncation for marginal structural Cox models. Epidemiol Methods. 2013; 2(1):1–20.
Kish L. Weighting for unequal pi. J Off Stat. 1992; 8(2):183.
Crump RK, Hotz VJ, Imbens GW, Mitnik OA. Dealing with limited overlap in estimation of average treatment effects. Biometrika. 2009; 96(1):187–99.
Austin PC, Stuart EA. Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Stat Med. 2015; 34(28):3661–79.
Shiba K, Kawahara T. Using propensity scores for causal inference: pitfalls and tips. J Epidemiol. 2021; 31:457–63.
Robins JM, Hernán MA. Estimation of the causal effects of time-varying exposures. Longitudinal Data Anal. 2009; 553:599.
Zeng S, Li F, Wang R, Li F. Propensity score weighting for covariate adjustment in randomized clinical trials. Stat Med. 2021; 40(4):842–58.
Desai RJ, Franklin JM. Alternative approaches for confounding adjustment in observational studies using weighting based on the propensity score: a primer for practitioners. BMJ. 2019; 367:l5657.
Robins JM, Finkelstein DM. Correcting for noncompliance and dependent censoring in an AIDS clinical trial with inverse probability of censoring weighted (IPCW) log-rank tests. Biometrics. 2000; 56(3):779–88.
Latimer NR, Abrams K, Lambert P, Crowther M, Wailoo A, Morden J, Akehurst R, Campbell M. Adjusting for treatment switching in randomised controlled trials–a simulation study and a simplified two-stage method. Stat Methods Med Res. 2017; 26(2):724–51.
Phillippo DM, Dias S, Ades A, Belger M, Brnabic A, Schacht A, Saure D, Kadziola Z, Welton NJ. Multilevel network meta-regression for population-adjusted treatment comparisons. J R Stat Soc Ser A (Stat Soc). 2020; 183(3):1189–210.
Remiro-Azócar A. Target estimands for population-adjusted indirect comparisons. In press, Stat Med. 2022.
Acknowledgments
Not applicable.
Funding
No financial support was provided for this research.
Author information
Contributions
ARA conceived the research idea, developed the methodology, performed the analyses, prepared the figures, and wrote and reviewed the manuscript. The author read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
ARA is employed by Bayer plc. The author declares that he has no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
Supplementary Material.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Remiro-Azócar, A. Two-stage matching-adjusted indirect comparison. BMC Med Res Methodol 22, 217 (2022). https://doi.org/10.1186/s12874-022-01692-9
DOI: https://doi.org/10.1186/s12874-022-01692-9
Keywords
 Health technology assessment
 Indirect treatment comparison
 Matching-adjusted indirect comparison
 Covariate adjustment
 Covariate balance
 Inverse probability of treatment weighting
 Evidence synthesis