 Research article
 Open Access
 Published:
A comparison of the statistical performance of different metaanalysis models for the synthesis of subgroup effects from randomized clinical trials
BMC Medical Research Methodology volume 19, Article number: 198 (2019)
Abstract
Background
When investigating subgroup effects in metaanalysis, it is unclear whether accounting in metaregression for betweentrial variation in treatment effects, but not betweentrial variation in treatment interaction effects when such effects are present, leads to biased estimates, coverage problems, or wrong standard errors, and whether the use of aggregate data (AD) or individualpatientdata (IPD) influences this assessment.
Methods
Seven different models were compared in a simulation study. Models differed regarding the use of AD or IPD, whether they accounted for betweentrial variation in interaction effects, and whether they minimized the risk of ecological fallacy.
Results
Models that used IPD and that allowed for betweentrial variation of the interaction effect had less bias, better coverage, and more accurate standard errors than models that used AD or ignored this variation. The main factor influencing the performance of models was whether they used IPD or AD. The model that used AD had a considerably worse performance than all models that used IPD, especially when a low number of trials was included in the analysis.
Conclusions
The results indicate that IPD models that allow for the betweentrial variation in interaction effects should be given preference whenever investigating subgroup effects within a metaanalysis.
Background
Metaanalysis is an essential tool for evidencebased clinical practice [1]. It allows the combination of treatment effect estimates across two or more trials, which has two main advantages. First, it provides a summary treatment effect estimate for patient, clinicians, or policymakers seeking information about the effectiveness of a treatment. Because these stakeholders are commonly faced with a large number of scientific literature publications on which they need to base their decisions, a summary treatment effect estimate(s) is(are) (or summary distribution of treatment effects, in the presence of heterogeneity when random effect models are used) likely to facilitate the decisionmaking process. Second, it provides a more precise treatment effect estimate, that is, a treatment effect with narrower confidence intervals, minimizing sampling error. Randomized controlled trials (RCTs) are considered the goldstandard study design to investigate the effectiveness of treatment interventions [2]. Metaanalyses of randomized controlled trials offer, therefore, the optimal basis for evidencebased clinical practice.
The data used to conduct a metaanalysis can be of two forms: an average treatment effect from each study, such as a difference in means for continuous outcomes or a risk ratio for binary outcomes, or it can be at the patient level, where the outcome of interest is available individually for each of the patients included in the trial [3]. The vast majority of metaanalyses are based on average treatment effects, which are commonly referred to as aggregate data (AD) metaanalyses, while a small proportion use individual patient data (IPD) or a combination of AD and IPD [3, 4]. The reason why most metaanalyses use AD is because they are readily available in published, publicly available reports of RCTs. Getting access to IPD is usually difficult or even impossible, as in most cases IPD are not publicly available, clinical trialists often choose not to share them, or datasets have been lost or misplaced [5]. However, the use of IPD in metaanalysis offers significant statistical advantages, such as more precise results with narrower confidence intervals in some scenarios or the proper investigation of subgroup treatment effects [6].
Investigation of subgroup effects is an important part of clinical research [7]. While a treatment may be of low clinical relevance in the broader population of patients with a particular disease, we may identify subgroups of patients who may significantly benefit from this treatment. An intuitive example would be weight reduction in patients with knee osteoarthritis (OA) [8]. While weight reduction may have a significant effect on the knee pain of OA patients who are overweight, it would not work or even be detrimental for those patients with normal or below normal body weight. The investigation of the source of variation of treatment effects between patients is therefore of crucial relevance for the use of trial findings in the clinical setting. Otherwise, useful treatments may be discarded solely based on its irrelevant effect at the overall patient population or we may fail to identify subgroups of patients who are more likely to respond to a treatment that only has a negligible effect in the overall patient population of interest.
Subgroup analyses in single trials are often underpowered since RCTs are generally only powered to identify a treatment effect at the overall patient population. Because of this, metaanalyses that combine data from several trials become a powerful tool in conducting subgroup analyses that have enough statistical power to identify minimal clinically relevant treatment effects in specific subgroups of patients. The investigation of subgroup effects within a metaanalysis can be conducted using metaregression. Regression analyses can be used to explain the variation of treatment effects across different trials [9]. Both patient and trial level characteristics can be used in these regression analyses. While the investigation of the influence of trial level characteristics, such as methodological quality, in the variation of treatment effects can be done using either AD or IPD, the investigation of patient level characteristics, such as age or gender, using only AD may be biased due to ecological fallacy (also known as aggregation bias) since average patient characteristics are regressed against average trial outcomes [10]. Thus, regression analyses using IPD is considered the gold standard when investigating subgroup effects of patientlevel characteristics in a metaanalysis since individual patient’s characteristics can be regressed against the individual’s outcome.
In metaanalysis, it is usually considered important to take into account the betweentrial variation in the effect of interest that is being pooled across trials. For instance, when metaanalyzing treatment effects across trials, we may notice a considerable betweentrial variation. This indicates that there is variability around the overall treatment effect estimate, which may be caused by more than just chance (i.e. sampling error). When this is the case, it is important to use methods that incorporate this additional variability in the estimation of confidence intervals around the pooled effect estimate. This is done in metaanalysis by using a randomeffects approach, which accounts for the betweentrial variation in the treatment effect [11, 12] and allows for the distribution of effects to be quantified [13]. Likewise, it may be important to take into account the betweentrial variation in the interaction effect between patientlevel characteristics and treatment effect, when using regression analyses to conduct subgroup analyses in a metaanalyses. However, regression models implemented in statistical software usually account only for betweentrial variation in the treatment effect when pooling interaction effects across trials. It is unclear whether accounting for betweentrial variance only in treatment effects, but not in interaction effects, leads to biased estimates, coverage problems, or under or overestimation of standard errors, when estimating overall mean effects, and whether this could be influenced by the use of AD or IPD data for the analysis.
Thus, the main purpose of this simulation study is to understand whether the current approach for combining interaction effects, which accounts only for betweentrial variation in treatment effects, but ignores the variation of interaction effects across trials, leads to a suboptimal estimation of a pooled interaction effect across trials. In Methods section, the methods used in the current simulation study, including a description of the statistical models and performance measures used in this investigation will be described. In Results section, the results of the simulation analyses will be presented. In Discussion section, the results will be discussed in the context of previous investigations; limitations and strengths will be described, and conclusions will be presented.
Methods
We now describe i) the methods used to generate the simulated datasets used for the performance analysis; ii) the analysis models that will be compared; and iii) the performance measures that will be used to compare these models. The reporting of the methods and results of this simulation study follow the guidelines proposed by Morris et al. and Burton et al. [14, 15].
Datagenerating mechanisms
Linear predictor used for data generation
A framework was developed to investigate how different models for the assessment of treatment interaction effects in metaanalyses with continuous outcomes perform. Thus, we simulated datasets in order to assess the performance of different models to quantify the interaction between treatment effect and patients’ age within an IPD metaanalysis framework. We assume a linear association between age and knee OA pain, and a linear association between age and the treatment effect. The simulated data mimicked trials of high dose Naproxen compared with placebo for the treatment of knee OA pain quantified with a single assessment at the end of treatment using a 10cm visual analogue scale (VAS), where 0 means no pain and 10 means the worst imaginable pain [16].
The following linear predictor was used to simulate this dataset (Table 1):
where y_{ij} is the predicted knee OA pain measured on a 0 to 10 VAS^{Footnote 1} at the end of treatment on the j^{th} patient in the i^{th} trial, treat_{ij} and age_{ij} are, respectively, the treatment received (a binary indicator variable coded 0 for the control group and 1 for the Naproxen group) and age of the j^{th} patient in the i^{th} trial, beta.treat_{i} is the betweengroup difference in mean VAS (i.e. treatment effect) in trial i and comes from a normal distribution with mean − 1 and variance V_{1,} beta.inter_{i} is the interaction between the treatment effect and age in trial i, u_{ij} and v_{ij} are, respectively, the random intercept and the random slope of the treatment effect on the j^{th} patient in the i^{th} trial, and res.var. is the residual variance of knee OA pain in the i^{th} trial. age is a continuous variable, with a mean that was drawn from a normal distribution with mean of 67 and variance of 20.25, and a variance was drawn from a normal distribution with mean 49 and variance of 1, which allowed the variation of mean.age and variance.age across simulated trials. beta.inter comes from a normal distribution with mean δ and variance V_{2}. δ was set as − 0.01 for the main analysis and as 0 for a sensitivity analysis to assess model performance when there is actually no interaction effect. The δ of − 0.01 assumes that the treatment effect improves by − 0.01 (as a negative effect means a larger treatment effect in this case) by every increase of 1 year in age. V_{2} was set as 0.05 (high but plausible heterogeneity) for the main analysis and as 0.5 (implausible high heterogeneity), 0.005 (low heterogeneity) or 0 (no heterogeneity) for sensitivity analyses. u_{ij} and v_{ij} are assumed to follow a multivariate normal distribution with mean 0, a variance–covariance matrix ∑, and correlation ρ of 0.8. The variance of the random intercept represented by σ^{2} and the variance of the random slope of the treatment effect by τ^{2} and they have a correlation ρ. Thus, this model simulates heterogeneity both in treatment effects and in interaction effects, but the latter is usually not modelled in metaanalysis.
Varying parameters for data generation
It was of interest to assess the performance of the models under different scenarios commonly relevant to evidence synthesis. Of main interest was to see how model performance was dependent on: the number of trials included in the metaanalysis; the magnitude of the interaction effect between treatment effect and age; and the magnitude of the betweentrial heterogeneity of the interaction effect (Table 1). Our estimand θ, the interaction effect, was assumed to be − 0.01 or 0 on a 0 to 10 VAS. An interaction effect of − 0.01 means that for every 10 years increase in patients’ age, the treatment effect comparing Naproxen and placebo will increase by − 0.1, which means that Naproxen will have a linear increase in its effect as compared to placebo as patients get older. This is based on an assumption that older patients will have more pain at baseline, and that treatment effects are directly proportional to pain at baseline. A null interaction effect (i.e. 0) assumes that the treatment effect does not vary according to patients’ age.
Thus, model performance was assessed in 64 different simulated scenarios (Table 2). These datasets had 6, 10, 16, 20, 26, 30, 40, or 50 trials, an interaction effect between the treatment effect and age of − 0.01 or 0, and a betweentrial variance of the interaction effect of 0 (no heterogeneity), 0.005 (low heterogeneity), 0.05 (high but plausible heterogeneity), or 0.5 (implausible high heterogeneity). All simulated datasets had an even number of trials, so that datasets could be equally divided into trials with a low or large number of patients included, as explained below.
Fixed parameters for data generation
The simulation scenarios described above all shared the same fixed parameters (Table 1) regarding: number of patients included in each trial; number of trial arms; the ratio of randomization; the magnitude of the treatment effect; the betweentrial variance of the treatment effect; and the effect of age on knee OA pain. Each of these is described in turn below.
Number of patients included in the trial: half of the trials in the simulated dataset were small and half were large. Trials were considered large if they had on average at least 100 patients in each arm [17]. The total number of patients included in each trial varied randomly between 30 and 199 patients in small trials, and 200 and 400 patients in large trials.
Number of trial arms: all trials had two arms.
Ratio of randomization: all trials had a 1:1 randomization.
Magnitude of the treatment effect: the magnitude of the treatment effect was an effect size of − 1 on a 0 to 10 VAS, favoring patients that received high dose Naproxen as compared to those who received placebo (i.e. lower pain on average among patients that received Naproxen than those that received placebo). Assuming a standard deviation of 2.5, this corresponds to a small to moderate, and minimally clinically relevant effect of − 0.4 standard deviation units, which was the reported effect of high dose Naproxen as compared to placebo in a recently published network metaanalysis of RCTs comparing different types of nonsteroidal antiinflammatory drugs to placebo for the treatment of patients with knee OA [16].
Betweentrial variance of the treatment effect: we used a moderate betweentrial variance of 0.063. This assumes that the treatment effect comes from a normal distribution with approximately 95% of the trial effects ranging between − 1.5, which is considered a moderate to large treatment effect in knee OA in favour of Naproxen, to 0.5, indicating a small, clinically irrelevant treatment effect in favour of placebo.
Effect of age on knee OA pain: it was assumed that the slope of age in knee OA pain would correspond to a yearly increase in pain of 0.05 on a 0 to 10 VAS.
Software used to generate simulated data
Data were simulated in Stata 14.2 using the 64bit Mersenne twister for random number generation. The input seed was ‘1234’.
Code for data generation
The code used for data generation is presented in Additional file 1: Appendix A to facilitate reproducibility and understanding of the methods used in this simulation study.
Estimand
The estimand θ is the interaction effect between the treatment effect and age, which is described in the models below as beta.interC. That is, the magnitude by which the treatment effect changes for every unit increase in age.
Models assessed
The performance of seven different models were compared. Models 1, 2, and 3 were based on the proposed model by Riley et al. for IPD metaanalysis, which separates within from betweentrial estimation of treatment effects, as a way to minimize the risk of ecological fallacy (Table 3) [18].
Model 1: Estimation of the betweentrial variance of the interaction between treatment effect and age, estimation of the betweentrial variance of the treatment effect, and separation of within from betweentrial interaction effects, using individual patient data. Separation of within from betweentrial interaction effects was implemented as previously suggested by Riley et al. [18]. Although we are not aware it has ever been fitted in practice, this model is considered the goldstandard in this investigation because it estimates the randomeffects of both interaction effect and treatment effect that exist in the simulated dataset, and also separates the within and betweentrial interaction effects.
In this model, ageC_{ij} is the covariate age_{ij} centered by its mean value m_{i}, in each trial, and it includes an interaction term between the treatment effect and ageC_{ij}, as well as an interaction term between the treatment effect and m_{i}, to separate the within from the betweentrial effects:
where y_{ij} is the predicted knee OA pain measured on a 0 to 10 VAS on the j^{th} patient in the i^{th} trial, α is the average knee OA pain in the placebo group for patients that have an average age, treat_{ij} and ageC_{ij} are, respectively, the treatment received and age centered around the trial’s mean age of the j^{th} patient in the i^{th} trial, ageM_{i} is the mean age of patients in the i^{th} trial, beta.treat is the fixed betweengroup difference in knee OA pain (i.e. treatment effect), beta.ageC is the fixed coefficient of the centered age covariate, beta.interC_{i} is the interaction between the treatment effect and centered age in trial i, beta.ageM is the fixed coefficient of the mean value of the age covariate, beta.interM is the fixed interaction between the treatment effect and the mean value of age, u_{i} are the random trial effects of knee OA pain in the placebo group for patients that have an average age, v1_{i} are the random trial effects in the interaction between the treatment effect and centered age, and v2_{i} are the random trial effects in the treatment effect. This model assumes an exchangeable covariance structure between the randomeffects of intercept and interaction. As previously stated, this model allows the separation of the pooled withintrial treatmentcovariate interaction beta.interC from the betweentrials interaction beta.interM [18]. This model has been shown to completely separate within and betweentrial effects, thus resulting in independent beta.interC and beta.interM, even when the number of subjects per study is small [18, 19]. The betweentrials interaction beta.interM is actually the same as the output of a metaregression, which only uses aggregate data to estimate associations between individuallevel characteristics and the treatment effect (model 7 is a metaregression) [20].
.Model 2: Estimation of the betweentrial variance of the interaction between treatment effect and age, and separation of within from betweentrial interaction effects, using individual patient data.
The parameters in model 2 are as defined previously in model 1, except that now the only random trial effects are in the interaction between the treatment effect and centered age v_{i}. Again, separation of within from betweentrial interaction effects was implemented as previously suggested by Riley et al. [18].
Model 3: Estimation of the betweentrial variance of the treatment effect, and separation of within from betweentrial interaction effects, using individual patient data.
The parameters in model 3 are as defined previously in model 1, except that now the only random trial effects are in the treatment effect v_{i}. Again, separation of within from betweentrial interaction effects was implemented as previously suggested by Riley et al. [18].
Model 4: Estimation of the betweentrial variance of the interaction between treatment effect and age, estimation of the betweentrial variance of the treatment effect, without separation of within from betweentrial interaction effects, using individual patient data
The parameters in model 4 are as defined previously in model 1, except that mean age is no longer part of the model. This means that the association between mean age and the outcome of knee pain, and the interaction effect between mean age and treatment allocation on the outcome of knee pain, are no longer estimated. Because mean age is not included in model 4, it does not separate the within from betweentrial interaction effects.
Model 5: Estimation of the betweentrial variance of the interaction between treatment effect and age, without separation of within from betweentrial interaction effects, using individual patient data.
The parameters in model 5 are as defined previously in model 2, except for the parameters related to the effect of mean age within trials. Because these parameters were removed, model 5 does not separate the within from betweentrial interaction effects.
Model 6: Estimation of the betweentrial variance of the treatment effect, without separation of within from betweentrial interaction effects, using individual patient data
The parameters in model 6 are as defined previously in model 2, except for the parameters related to the effect of mean age within trials. Because these parameters were removed, model 6 does not separate the within from betweentrial interaction effects.
Model 7: Estimation of the betweentrial variance of the treatment effect using aggregate data, which can only estimate the betweentrial interaction effects
where y_{i} is the predicted betweengroup difference in knee OA in the i^{th} trial, α is the fixed betweengroup difference in knee OA pain, ageM_{i} is the mean age of patients in the i^{th} trial, beta.age is the fixed interaction effect between the treatment effect and the mean age of patients in the i^{th} trial, and u_{i} is the random effect of the betweengroup difference in knee OA pain.
Performance measures
Bias, coverage, empirical and modelbased standard errors, and the relative error in the model standard error for the estimand θ were assessed. Estimation of performance measures and their Monte Carlo Standard Error (MCSE) was conducted as described by Morris et al. [15]. Bias was considered acceptable if it was ≤10 times lower than the simulated interaction effect of − 0.01. Coverage of the 95% confidence interval of the estimated interaction effect was assessed. The explanation on how each of the performance measures were calculated are presented in Additional file 1: Appendix B.
Sample size calculation
The simulation was powered to have a precise estimate of coverage. With an expected coverage of 95%, and a desired MCSE of coverage of 0.5%, 1900 repetitions would be needed [15]. It was then decided to use 2000 repetitions for each simulation, which results in a small MCSE for the estimate of all performance measures (Table 4).
Software used to analyse the performance of the simulated data
All analyses were conducted in Stata 14.2.
Results
In this section, we focus on results pertaining to the large but plausible betweentrial variation of 0.05 in the interaction effect since our main purpose is to assess model performance when this variation is taken into account. Results of other scenarios using a between trial variation of different magnitude are presented in the Additional file 1.
Results of simulations assuming a large betweentrial variation of 0.05 in the interaction effect
Figure 1 shows the interaction effects estimated in each of the 2000 repetitions according to all models for analyses with 6, 20, and 50 trials and the large betweentrial variation of 0.05 in the interaction effect. It can be seen from this graph that the interaction effects from models that used IPD have a much narrower scatter than those from model 7, the only model that used AD. Through visual inspection, the mean value of the interaction effects appears similar across all models. Also, the scatter of interaction effects decreases as the number of trial increases, for all of the models.
Figure 2 shows scatter plots of interaction effects of all models plotted against each other. It can be seen that there is a considerable agreement between all models that used IPD, with higher agreement between models that made the same assumptions regarding betweentrial variation of effects. It can also be seen that there is very low correlation between results from models using IPD and the model 7, which used AD. Although increasing the number of trials decreased the variation of estimates for all models, it did not have an important influence in the correlation between the results of different models.
Figure 3 shows the standard error of the interaction effects according to all models for analyses with 6, 20, and 50 trials. As with the interaction effects, this graph shows that the scatter of standard errors is narrower in models that used IPD than model 7, which used AD. Again, as with the interaction effect, the scatter of the standard error decreases as the number of trials increases. However, different than what was observed with the interaction effects, the mean value of the standard error in model 7 was systematically larger than those in models that used IPD.
Figure 4 shows scatter plots of the standard error of interaction effects of all models plotted against each other. As previously observed for interaction effects, there is a higher correlation in standard errors between models that used the same assumptions regarding the estimation of betweentrial variance. Although there was a clear correlation in standard errors between models that used IPD, this correlation is lower than the one observed for interaction effects. As with interaction effects, an increase in the number of trials did not increase the correlation of standard errors between models, and the correlation between models using IPD and model 7 was very low.
Table 5 shows the estimated bias in the interaction effect in each of the models for all investigated number of trials. Bias was most influenced by whether the betweentrial variation was estimated for the interaction effect, the treatment effect, or both. A separate estimation of within and betweentrial effects did not play an important role. It can be seen that bias tends to decrease as the number of trials increases for all of the models, with a more extensive difference in model 7, which used aggregate data. While all other models had a satisfactory bias with a lower number of trials, model 7 had a bias effect of − 0.065, which is 6.5 times the simulated interaction effect of − 0.01. Model 7 only has an acceptable bias of 0.0007 when 50 trials are included in the analysis. Among the models that used IPD (i.e. models 1–6), only models that modelled exclusively the betweentrial variation of the interaction effect (i.e. models 2 and 5) performed well when the number of trials was only 6. Model 2, which modelled the betweentrial variation in the interaction effect and separated within and betweentrial effects, had the lowest bias (− 0.00006). Model 5, which modelled the betweentrial variation in interaction effects but did not separate within from betweentrial effects, had also an acceptable bias of 0.00038. With 10 trials included in the analysis, the magnitude of bias becomes acceptable for models that exclusively modelled the betweentrial variation of the treatment effect (i.e. models 3 and 6). Only when 16 trials were included in the analysis that models 1 and 4, which modelled betweentrial variation for both treatment effects and interaction effects, achieved acceptable levels of bias. Finally, the MCSE was relatively large for all bias estimates.
Table 6 shows the observed coverage of the interaction effect between treatment effect and age in each of the models. With only 6 trials, coverage was best with model 7, with an observed coverage of 94.4%, while all models using IPD had an observed coverage below 90%. Coverage was worst with models that exclusively estimated the betweentrial variance in treatment effects, with observed values around 83.4%. As the number of trials increased, coverage also increased, but models using IPD consistently overestimated coverage, with observed coverage values always lower than the nominal value. For models using IPD, those that estimated the betweentrial variation in the interaction effect had acceptable coverage levels when at least 20 trials were included in the analysis. Models 3 and 5 only had an acceptable coverage when 40 trials were included in the analysis. Among models using IPD, those that estimated the betweentrial variation in the interaction effect had consistently better coverage than models that did not. Coverage was consistently accurate for model 7 across all numbers of trials. Finally, all coverage estimates were precise, with low MCSEs.
Figures 5, 6 and 7 display zip plots of the 2000 confidence intervals for all IPD models for analyses with 6, 20, or 50 trials (see Additional file 1: Appendix B for an explanation on how to interpret zip plots). We can readily see that all models using IPD had insufficient coverage when only 6 trials were included in the analysis, with models 3 and 6, which only addressed the betweentrial variance of the treatment effect, having the worst coverage. With 20 trials included in the analysis, we can see that coverage is appropriate for most models, but that models 3 and 6 are still slightly overestimating the nominal coverage. With 50 trials included in the analysis it becomes clear that all models using IPD have an appropriate coverage.
Table 7 shows how each model performed in the estimation of standard errors, for different numbers of trials included in the analysis. In general, both empirical and model SEs were smaller in models 1 and 4, where the betweentrial variance of treatment effects and interaction effects are estimated and allowed for. Models 2 and 5, which estimate and allow for the betweentrial variance in interaction effects only, had the second smallest empirical and model SEs. Among models using IPD, models 3 and 6, which estimated and allowed for the betweentrial variance in treatment effects, but ignored the betweentrial variance in interaction effects, yielded the largest SEs, and thus have the lowest precision. Model 7, the only model that used AD, yielded clearly higher SEs than the other models. The difference in SEs between models using IPD were small. These findings were not affected by the number of trials included in the analysis, or whether within and betweentrial effects were modelled separately or together. When assessing the error in the model SE in relation to the empirical SE, the model SE was most accurate if the model also estimated and accounted for the betweentrial variance in interaction effects (models 1,2,4,5), while models that ignored this had a consistently higher relative error in the model SE (models 3,6,7). This pattern was consistent regardless of the number of trials included in the analysis, or if within and betweentrial effects were modelled separately or together.
Table 8 shows the mean squared error of each model for different numbers of trials included in the analysis. It can be seen that the mean squared error decreased as the number of trials included in the analysis increased, that models using IPD have lower mean squared errors than the model that used AD, and that models that accounted for the betweentrial variance in the interaction effect had a lower mean squared error than those that did not. The most important factor leading to higher values of the mean squared error was the use of AD for the analysis. The model that used AD had considerably larger mean squared errors than models that used IPD. Although the mean squared error decreased as the as the number of trials included in the analysis increased, it was still considerably higher for the model using AD even when 50 trials were included in the analysis. Among models that used IPD, differences in mean squared error seem negligible, even though models that accounted for the betweentrial variance in the interaction effect had lower mean squared errors than those that did not.
Results of simulations assuming a null, small, or implausible large betweentrial variation in the interaction effect
Results of the simulation with a betweentrial variance of zero, 0.005, or 0.5 are shown in Additional file 1: Appendices C, D and E. Findings of our simulation with a null or small betweentrial variance were similar to the simulations with a large but plausible betweentrial variance, but with much smaller differences between models using IPD. Differences in performance between models using IPD and the model using AD decreased as the betweentrial variation decreased. The difference in all performance measures between the models using IPD were consistently small, with a pattern indicating that models that allowed for the betweentrial variance in the interaction effect had lower bias, and better precision and coverage. Again, similar to the simulations assuming a large but plausible betweentrial variance, the model using AD had a generally worse performance than models using IPD. Although this difference in performance decreased as the number of trials increased, the model using AD still had a worse performance than models using IPD even when 50 trials were included in the analysis, albeit unimportant differences. No relevant difference was observed between models that separated within and betweentrial associations and models that did not separate them.
Findings of our simulation with an implausible large betweentrial variance were similar to those from the simulations with a large but plausible betweentrial variance. However, the difference in performance between models was accentuated with the implausible large betweentrial variance, mainly when the number of trials included in the analysis was low.
Results of simulations assuming a null interaction effect
Results from analyses assuming a null interaction effect were remarkably similar to those from analyses assuming an interaction effect of − 0.01 (results available on request).
Discussion
Main findings
In this simulation study, the performance of seven different metaregression models for the pooling across trials of interaction effects between a patientlevel characteristic and the treatment effect were compared and it was observed that a better performance was obtained in models that used IPD and that allowed for betweentrial variation of the interaction effect. Although models that allowed for betweentrial variation of the interaction effect had a better performance than models that did not allow for this variation, other model or dataset characteristics played a more important role in model performance. The main factor influencing the performance of models was whether they used IPD or AD. Models based on IPD generally had acceptable levels of bias that were ten times smaller than the simulated interaction effect of − 0.01 when ten or more trials were included in the analysis. Biases in the model using AD were approximately two to six times larger than the simulated interaction effect, and was only acceptable when 50 trials were included in the analysis. The model that used AD also had a considerably lower precision than all models that used IPD, especially when a low number of trials was included in the analysis, although coverage of AD models was generally appropriate even with a low number of trials in the analysis. The second most important factor influencing the performance of the models was the number of trials included in the analysis. In general, as the number of trials included in the analysis increased, the performance of all models improved. Another important factor that influenced model performance was the magnitude of heterogeneity. Differences in performance between models was more evident in the presence of high heterogeneity (i.e. large betweentrial variation of the interaction effect), which is not surprising given only certain models acknowledge the possibility of heterogeneity in interaction effects. The performance of models was also influenced, to a lesser degree, by whether the model allowed for betweentrial variation of the interaction effect. Models that used IPD and allowed for betweentrial variation of the interaction effect had the best performance, generally with lower bias, better coverage, and higher precision than all other models. The model with the worst performance among IPD models was the one that allowed for betweentrial variation in treatment effects but ignored the betweentrial variation in interaction effects, and it is likely the most commonly used model in practice for IPD analyses. The goldstandard model, which was a model that allowed for betweentrial variation in both treatment and interaction effects and used IPD, did not perform well when only ten trials were analysed. No relevant difference was observed between models that separated within and betweentrial associations and models that did not separate them, which is unsurprising, as differences in within and betweentrial associations was not simulated in the dataset (i.e. ecological biases were not simulated). Likewise, the presence of an interaction effect did not seem to influence results, since results of analyses assuming a null interaction effect or an interaction effect of − 0.01 were remarkably similar.
Practical implications
This study shows that when investigating interaction effects in metaregressions, using IPD with a model that accounts for the betweentrial variation of interaction effects is likely to result in more accurate results, with less bias, better coverage, and higher precision, especially in the presence of high heterogeneity of interaction effects. Not enough importance has been given to modelling the betweentrial variation of interaction effects when pooling them across trials, with technical papers about regression models mainly focusing on modelling the betweentrial variation of the treatment effect [18, 20, 21]. The reason for this, perhaps, is because although modelling the betweentrial variation in interaction effect improves the accuracy of results, this improvement seems rather irrelevant in the absence of high heterogeneity. Nevertheless, it is important that the betweentrial variation in interaction effects is reported, perhaps together with prediction intervals around the interaction effect, so that readers can better appreciate how high or how low this variation is [13]. In practice, modelling both betweentrial variations, of the interaction effect and treatment effect, may lead to model convergence problems, especially if a low number of trials is included in the analysis. The lowest number of trials used in the present simulation analysis was six, and no convergence problems were observed. Whenever convergence problems are observed, researchers may try modelling only the betweentrial variation of interaction effects, and only if the problem persists, try to model only the betweentrial variation of treatment effects. Researchers should also always give precedence to IPD over AD when conducting regression analyses to pool interaction effects across trials, especially if the number of trials is below 50, which is often the case. Our simulation analysis indicates that if 50 trials with AD are included in a metaregression, results may be comparable to regression analyses using IPD, in terms of bias and coverage. However, even when 50 trials with AD were included in the metaregression, precision was still considerably lower than a regression analysis with 50 trials with IPD. Thus, since most metaregression analyses include only a few trials, every effort should be made to access IPD when the main goal is to pool subgroup effects across trials. Interestingly, the model using AD had a better coverage than models using IPD when the number of trials analysed was low. This is likely due to the use of a KnappHartung approach for estimating the variance of the mean interaction effect estimate in the AD model, which corrects the between study variance to account for its uncertainty [22]. This correction, developed for AD models, has been shown to have a better coverage performance of the 95% confidence interval than other approaches that assume that the between study variance is known [23]. Moreover, using IPD when investigating subgroup effects in a regression analysis will not only lead to results with better statistical properties, but is also necessary to avoid ecological fallacy [3]. Finally, because IPD is often only available for a small proportion of the trials included in a metaanalysis, an alternative approach would be to combine the evidence from IPD and AD when conducting such regression analyses, which is beyond the scope of the present simulation study [24].
Strengths and limitations
The main strength of this simulation analysis is that it mimicked characteristics commonly found in metaanalyses. This simulation analysis allowed for a wide range of number of trials included in the analysis, for a varying number of patients included across trials included in the analysis with a balanced mix of small and large trials, and for scenarios with small and large betweentrial variation in the interaction effect. Because the motivation to conduct this investigation is future research of osteoarthritis treatments, the simulated dataset used representative magnitude and within and betweentrial variations of treatment effects and average patient age observed in osteoarthritis trials [16]. Moreover, the use of simulated datasets allowed us to observe the statistical properties of different metaregression models without the potential confounding influence of factors such as biases due to low methodological quality, which could hinder the proper interpretation of the results.
The main limitation of the present investigation is the definition of small and large heterogeneity. Heterogeneity is of great relevance to this investigation, since its main goal is to compare models that appropriately address it or not. The cutoffs used to define small and large but plausible heterogeneity were arbitrarily chosen to represent real datasets of osteoarthritis treatments. However, one may argue that even the definition of a small heterogeneity may not be so small, and that the large but plausible heterogeneity may be unusually large for datasets of real osteoarthritis clinical trials. This is done consciously to err on the safe side. That is, because the performance of this models can only be different in the presence of heterogeneity, cutoffs of small and large heterogeneity were chosen so that at least some heterogeneity was always present. We also analysed a simulated dataset with an implausible high heterogeneity for a better understanding of how models perform across different values of betweentrial variance.
Another limitation is that results are only generalizable to trials using continuous outcomes, and arguably, only to osteoarthritis trials. However, it is expected that similar results are observed in trials with binary outcomes. In addition, although the performance of models that separate within from betweentrial associations was compared to the performance of models that do not, the dataset was not simulated to have different within and betweentrial associations of age and treatment effect. Thus, it is not surprising that there was no difference between models that separated or not these two types of associations. Likewise, although simulated datasets included small and large trials, we did not simulate smallstudy effects [17]. Discordant results between small and large trials may play a major role in the results of a randomeffects metaanalysis [11, 25]. Finally, IPD metaanalysis of continuous outcomes are less subject to overfitting and may accommodate a larger number of covariates to account for potential confounding than IPD metaanalysis of binary outcomes or metaregression in general [26,27,28]. Since the main purpose was to investigate the practical implications of adding randomeffects to an interaction test in an IPD metaanalysis, we did not extended our simulations to scenarios where multiple confounding variables may be present.
Previous research
This is the first investigation in continuous outcomes to compare the models that used IPD and AD while accounting for betweentrial variation in the interaction effect. Lambert et al. conducted a simulation study to compare fixedeffect regression analyses using IPD or AD to estimate the interaction between treatment effects and patientlevel characteristics [6]. They reported that AD can be used in metaregression if a large number of large trials is available, and concluded that AD can be used to adequately estimate an interaction between the treatment effect and triallevel characteristics, but that IPD would be needed to accurately estimate an interaction between treatment effects and patientlevel characteristics. Similarly to their study, the current simulation study compared IPD and AD for the estimation of an interaction between treatment effects and patientlevel characteristics, but extended the investigation of Lambert et al. by comparing randomeffects models. Similar to their findings based on fixedeffect models, our investigation in randomeffects models concluded that AD can be used to estimate interaction effects if a large number of trials is available, but that IPD is needed when estimating an interaction between treatment effects and patientlevel characteristics, mainly due to problems with ecological fallacy.
Stewart et al. conducted a similar investigation to ours using a real dataset of 24 RCTs evaluating antiplatelet agents for the prevention of preeclampsia in pregnancy, a binary outcome [21]. They compared three different models that were similar to the ones used in the present investigation: one model that had randomeffects for the treatment effect but had a fixedeffect for the interaction effect (corresponding to model 6 in the present study), one model with the same characteristics of the previous one but that separated the within and betweentrials associations (corresponding to model 3 in the present study), and one model that had randomeffects for both the treatment effect and interaction effect and did not separate within and betweentrials associations (corresponding to model 4 in the present study). The authors concluded that results across these models were virtually the same and led to the same conclusions. This is somewhat different from the conclusions of the present study, as it was noticed that albeit small, there were improvements in the accuracy of results when the model allowed for randomeffects of the interaction between treatment effect and covariate. This difference in conclusions may have happened for different reasons. The first reason is that they had a fixed number of 24 trials in their analysis, and as shown in our analysis, differences in performance between models are most evident with a lower number of trials. Another reason is that there was perhaps a low betweentrial variation in the interaction effect in their dataset. It was shown in this simulation that the lower the betweentrial variance, the less evident is the difference in performance between models. Because we conducted a simulation study which allowed us to vary several parameters that could potentially play a role in the results of metaanalyses, as opposed to the study of Stewart et al. which had only fixed parameters, we could observe and compare the performance of all seven models throughout a wide range of values from these parameters. Finally, this simulation analysis was based on continuous outcomes, while the analysis of Stewart et al. was based on a binary outcome. It is unlikely that the difference in results observed between the studies can be explained by this, however.
Future research
Several future research questions could develop from the current investigation. For instance, future research could be conducted to develop and investigate whether a model based on AD and that allows for betweentrial variation in interaction effects has a better performance than the model based on AD that only allows for the betweentrial variation in the treatment effect. Future research could also explore whether having lower or higher number of trials than what was investigated in the present study could play a role in the performance of these models. Future research could also investigate if similar results are observed with binary outcomes and if the performance of the models change in the presence of ecological fallacy, small study effects, or confounding from single or multiple other sources. Future research should also investigate more thoroughly the performance of different models for the estimation of the between trial variation in the interaction effect, including an assessment of how this estimation or lack thereof may influence the estimation of other parameters in the model (e.g. is the betweentrial variation of the treatment effect overestimated if the interaction term is not included in the model?). Also, the current study only investigates a single twoway interaction, but future studies may investigate more complex interactions, such as a threeway interaction. Finally, future research should consider investigating the use of permutation tests to control for false positive rates to compare results of different regression models using these tests [27].
Conclusions
The results of the present investigation indicate that IPD models that allow for the betweentrial variation in interaction effects should be given preference over models that only allow for betweentrial variation in treatment effects when investigating subgroup effects within a metaanalysis. It was shown that, even in the absence of ecological fallacy, there is significantly less bias and more precision when IPD is used over AD. Among models using IPD, there was less bias, better coverage, and higher precision in those that modelled the betweentrial variation in interaction effects (when one existed), especially when a small number of trials was analysed, and thus should be given precedence whenever possible.
Availability of data and materials
Codes used to generate the simulated dataset are presented in the Additional file 1.
Notes
The actual simulated VAS values were not restricted within 0 and 10 to maintain a near normal shape of y_{ij} to ensure our results were generalizable to analyses assuming a Gaussian distribution.
Abbreviations
 AD:

Aggregate data
 IPD:

Individual patient data
 RCT:

Randomized clinical trial
References
Egger M, Smith GD. Metaanalysis. Potentials and promise. BMJ (Clinical research ed). 1997;315(7119):1371–4.
Guyatt GH, Oxman AD, Kunz R, Vist GE, FalckYtter Y, Schunemann HJ. What is “quality of evidence” and why is it important to clinicians? BMJ (Clinical research ed). 2008;336(7651):995–8.
Riley RD, Lambert PC, AboZaid G. Metaanalysis of individual participant data: rationale, conduct, and reporting. BMJ (Clinical research ed). 2010;340:c221.
Stewart LA, Tierney JF, Clarke M. Chapter 18: Reviews of individual patient data. In: Cochrane handbook for systematic reviews of interventions [Internet]. The Cochrane Collaboration. version 5.1.0; 2011. Available from: www.cochranehandbook.org.
Lo B, DeMets DL. Incentives for clinical trialists to share data. N Engl J Med. 2016;375(12):1112–5.
Lambert PC, Sutton AJ, Abrams KR, Jones DR. A comparison of summary patientlevel covariates in metaregression with individual patient data metaanalysis. J Clin Epidemiol. 2002;55(1):86–94.
Wang R, Lagakos SW, Ware JH, Hunter DJ, Drazen JM. Statistics in medicinereporting of subgroup analyses in clinical trials. N Engl J Med. 2007;357(21):2189–94.
Bliddal H, Leeds AR, Christensen R. Osteoarthritis, obesity and weight loss: evidence, hypotheses and horizons  a scoping review. Obes Rev. 2014;15(7):578–86.
Thompson SG, Higgins JP. How should metaregression analyses be undertaken and interpreted? Stat Med. 2002;21(11):1559–73.
Berlin JA, Santanna J, Schmid CH, Szczech LA, Feldman HI. Individual patient versus grouplevel data metaregressions for the investigation of treatment effect modifiers: ecological bias rears its ugly head. Stat Med. 2002;21(3):371–87.
da Costa BR, Juni P. Systematic reviews and metaanalyses of randomized trials: principles and pitfalls. Eur Heart J. 2014;35(47):3336–45.
Sterne JA, Sutton AJ, Ioannidis JP, Terrin N, Jones DR, Lau J, et al. Recommendations for examining and interpreting funnel plot asymmetry in metaanalyses of randomised controlled trials. BMJ (Clinical research ed). 2011;343:d4002.
Riley RD, Higgins JP, Deeks JJ. Interpretation of random effects metaanalyses. BMJ (Clinical research ed). 2011;342(10):d549.
Burton A, Altman DG, Royston P, Holder RL. The design of simulation studies in medical statistics. Stat Med. 2006;25(24):4279–92.
Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med. 2019;38(11):2074–102.
da Costa BR, Reichenbach S, Keller N, Nartey L, Wandel S, Juni P, et al. Effectiveness of nonsteroidal antiinflammatory drugs for the treatment of pain in knee and hip osteoarthritis: a network metaanalysis. Lancet. 2017;390(10090):e21–33.
Nuesch E, Trelle S, Reichenbach S, Rutjes AW, Tschannen B, Altman DG, et al. Small study effects in metaanalyses of osteoarthritis trials: metaepidemiological study. BMJ (Clinical research ed). 2010;341(16):c3515.
Riley RD, Lambert PC, Staessen JA, Wang J, Gueyffier F, Thijs L, et al. Metaanalysis of continuous outcomes combining individual patient data and aggregate data. Stat Med. 2008;27(11):1870–93.
Begg MD, Parides MK. Separation of individuallevel and clusterlevel covariate effects in regression analysis of correlated data. Stat Med. 2003;22(16):2591–602.
Burke DL, Ensor J, Riley RD. Metaanalysis using individual participant data: onestage and twostage approaches, and why they may differ. Stat Med. 2017;36(5):855–75.
Stewart GB, Altman DG, Askie LM, Duley L, Simmonds MC, Stewart LA. Statistical analysis of individual participant data metaanalyses: a comparison of methods and recommendations for practice. PLoS One. 2012;7(10):e46042.
Hartung J, Knapp G. On tests of the overall treatment effect in metaanalysis with normally distributed responses. Stat Med. 2001;20(12):1771–82.
Partlett C, Riley RD. Random effects metaanalysis: coverage performance of 95% confidence and prediction intervals following REML estimation. Stat Med. 2017;36(2):301–17.
Sutton AJ, Kendrick D, Coupland CA. Metaanalysis of individual and aggregatelevel data. Stat Med. 2008;27(5):651–69.
Egger M, Juni P, Bartlett C, Holenstein F, Sterne J. How important are comprehensive literature searches and the assessment of trial quality in systematic reviews? Empirical study. Health Technol Assess. 2003;7(1):1–76.
Austin PC, Steyerberg EW. The number of subjects per variable required in linear regression analyses. J Clin Epidemiol. 2015;68(6):627–36.
Higgins JP, Thompson SG. Controlling the risk of spurious findings from metaregression. Stat Med. 2004;23(11):1663–82.
Deeks JJ, Higgins JPT, Altman DG, (Editors). Chapter: 9 Analysing data and undertaking metaanalyses. In: Cochrane handbook for systematic reviews of interventions  version 510 [Internet]. The Cochrane Collaboration. 2011. Available from: www.handbook.cochrane.org.
Acknowledgements
Not applicable
Funding
This investigation was supported in part by a grant from The Arthritis Society. The Arthritis Society had no role in the design of the study, collection, analysis, and interpretation of data or in writing the manuscript.
Author information
Affiliations
Contributions
BRdC and AJS conceived and designed the study. BRdC conducted the simulated analyses. BRdC and AJS interpreted the results. BRDC wrote the first draft of the manuscript. BRdC and AJS read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1 :
Appendix A. Code used to generate the simulated datasets. Appendix B. Performance measures. Appendix C. Results of the simulation study assuming a betweentrial variance of 0.5 in the interaction effect. Appendix D. Results of the simulation study assuming a betweentrial variance of 0.005 in the interaction effect. Appendix E. Results of the simulation study assuming a null betweentrial variance in the interaction effect.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
da Costa, B.R., Sutton, A.J. A comparison of the statistical performance of different metaanalysis models for the synthesis of subgroup effects from randomized clinical trials. BMC Med Res Methodol 19, 198 (2019). https://doi.org/10.1186/s1287401908318
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1287401908318
Keywords
 Individual patient data
 Metaanalysis
 Randomeffects
 Evidence synthesis
 Interaction effects
 Subgroup analysis