A comparison of the statistical performance of different meta-analysis models for the synthesis of subgroup effects from randomized clinical trials

Background When investigating subgroup effects in meta-analysis, it is unclear whether accounting in meta-regression for between-trial variation in treatment effects, but not between-trial variation in treatment interaction effects when such effects are present, leads to biased estimates, coverage problems, or wrong standard errors, and whether the use of aggregate data (AD) or individual-patient-data (IPD) influences this assessment. Methods Seven different models were compared in a simulation study. Models differed regarding the use of AD or IPD, whether they accounted for between-trial variation in interaction effects, and whether they minimized the risk of ecological fallacy. Results Models that used IPD and that allowed for between-trial variation of the interaction effect had less bias, better coverage, and more accurate standard errors than models that used AD or ignored this variation. The main factor influencing the performance of models was whether they used IPD or AD. The model that used AD had a considerably worse performance than all models that used IPD, especially when a low number of trials was included in the analysis. Conclusions The results indicate that IPD models that allow for the between-trial variation in interaction effects should be given preference whenever investigating subgroup effects within a meta-analysis.


Background
Meta-analysis is an essential tool for evidence-based clinical practice [1]. It allows the combination of treatment effect estimates across two or more trials, which has two main advantages. First, it provides a summary treatment effect estimate for patient, clinicians, or policy-makers seeking information about the effectiveness of a treatment. Because these stakeholders are commonly faced with a large number of scientific literature publications on which they need to base their decisions, a summary treatment effect estimate(s) is(are) (or summary distribution of treatment effects, in the presence of heterogeneity when random effect models are used) likely to facilitate the decision-making process. Second, it provides a more precise treatment effect estimate, that is, a treatment effect with narrower confidence intervals, minimizing sampling error. Randomized controlled trials (RCTs) are considered the gold-standard study design to investigate the effectiveness of treatment interventions [2]. Meta-analyses of randomized controlled trials offer, therefore, the optimal basis for evidence-based clinical practice.
The data used to conduct a meta-analysis can be of two forms: an average treatment effect from each study, such as a difference in means for continuous outcomes or a risk ratio for binary outcomes, or it can be at the patient level, where the outcome of interest is available individually for each of the patients included in the trial [3]. The vast majority of meta-analyses are based on average treatment effects, which are commonly referred to as aggregate data (AD) meta-analyses, while a small proportion use individual patient data (IPD) or a combination of AD and IPD [3,4]. The reason why most meta-analyses use AD is because they are readily available in published, publicly available reports of RCTs. Getting access to IPD is usually difficult or even impossible, as in most cases IPD are not publicly available, clinical trialists often choose not to share them, or datasets have been lost or misplaced [5]. However, the use of IPD in meta-analysis offers significant statistical advantages, such as more precise results with narrower confidence intervals in some scenarios or the proper investigation of subgroup treatment effects [6].
Investigation of subgroup effects is an important part of clinical research [7]. While a treatment may be of low clinical relevance in the broader population of patients with a particular disease, we may identify subgroups of patients who may significantly benefit from this treatment. An intuitive example would be weight reduction in patients with knee osteoarthritis (OA) [8]. While weight reduction may have a significant effect on the knee pain of OA patients who are overweight, it would not work or even be detrimental for those patients with normal or below normal body weight. The investigation of the source of variation of treatment effects between patients is therefore of crucial relevance for the use of trial findings in the clinical setting. Otherwise, useful treatments may be discarded solely based on its irrelevant effect at the overall patient population or we may fail to identify subgroups of patients who are more likely to respond to a treatment that only has a negligible effect in the overall patient population of interest.
Subgroup analyses in single trials are often underpowered since RCTs are generally only powered to identify a treatment effect at the overall patient population. Because of this, meta-analyses that combine data from several trials become a powerful tool in conducting subgroup analyses that have enough statistical power to identify minimal clinically relevant treatment effects in specific subgroups of patients. The investigation of subgroup effects within a meta-analysis can be conducted using meta-regression. Regression analyses can be used to explain the variation of treatment effects across different trials [9]. Both patient and trial level characteristics can be used in these regression analyses. While the investigation of the influence of trial level characteristics, such as methodological quality, in the variation of treatment effects can be done using either AD or IPD, the investigation of patient level characteristics, such as age or gender, using only AD may be biased due to ecological fallacy (also known as aggregation bias) since average patient characteristics are regressed against average trial outcomes [10]. Thus, regression analyses using IPD is considered the gold standard when investigating subgroup effects of patient-level characteristics in a meta-analysis since individual patient's characteristics can be regressed against the individual's outcome.
In meta-analysis, it is usually considered important to take into account the between-trial variation in the effect of interest that is being pooled across trials. For instance, when meta-analyzing treatment effects across trials, we may notice a considerable between-trial variation. This indicates that there is variability around the overall treatment effect estimate, which may be caused by more than just chance (i.e. sampling error). When this is the case, it is important to use methods that incorporate this additional variability in the estimation of confidence intervals around the pooled effect estimate. This is done in meta-analysis by using a random-effects approach, which accounts for the between-trial variation in the treatment effect [11,12] and allows for the distribution of effects to be quantified [13]. Likewise, it may be important to take into account the between-trial variation in the interaction effect between patient-level characteristics and treatment effect, when using regression analyses to conduct subgroup analyses in a meta-analyses. However, regression models implemented in statistical software usually account only for between-trial variation in the treatment effect when pooling interaction effects across trials. It is unclear whether accounting for between-trial variance only in treatment effects, but not in interaction effects, leads to biased estimates, coverage problems, or under or overestimation of standard errors, when estimating overall mean effects, and whether this could be influenced by the use of AD or IPD data for the analysis.
Thus, the main purpose of this simulation study is to understand whether the current approach for combining interaction effects, which accounts only for between-trial variation in treatment effects, but ignores the variation of interaction effects across trials, leads to a suboptimal estimation of a pooled interaction effect across trials. In Methods section, the methods used in the current simulation study, including a description of the statistical models and performance measures used in this investigation will be described. In Results section, the results of the simulation analyses will be presented. In Discussion section, the results will be discussed in the context of previous investigations; limitations and strengths will be described, and conclusions will be presented.

Methods
We now describe i) the methods used to generate the simulated datasets used for the performance analysis; ii) the analysis models that will be compared; and iii) the performance measures that will be used to compare these models. The reporting of the methods and results of this simulation study follow the guidelines proposed by Morris et al. and Burton et al. [14,15].

Data-generating mechanisms
Linear predictor used for data generation A framework was developed to investigate how different models for the assessment of treatment interaction effects in meta-analyses with continuous outcomes perform. Thus, we simulated datasets in order to assess the performance of different models to quantify the interaction between treatment effect and patients' age within an IPD meta-analysis framework. We assume a linear association between age and knee OA pain, and a linear association between age and the treatment effect. The simulated data mimicked trials of high dose Naproxen compared with placebo for the treatment of knee OA pain quantified with a single assessment at the end of treatment using a 10-cm visual analogue scale (VAS), where 0 means no pain and 10 means the worst imaginable pain [16].
The following linear predictor was used to simulate this dataset (Table 1): where y ij is the predicted knee OA pain measured on a 0  19:198 to 10 VAS 1 at the end of treatment on the j th patient in the i th trial, treat ij and age ij are, respectively, the treatment received (a binary indicator variable coded 0 for the control group and 1 for the Naproxen group) and age of the j th patient in the i th trial, beta.treat i is the between-group difference in mean VAS (i.e. treatment effect) in trial i and comes from a normal distribution with mean − 1 and variance V 1, beta.inter i is the interaction between the treatment effect and age in trial i, u ij and v ij are, respectively, the random intercept and the random slope of the treatment effect on the j th patient in the i th trial, and res.var. is the residual variance of knee OA pain in the i th trial. age is a continuous variable, with a mean that was drawn from a normal distribution with mean of 67 and variance of 20.25, and a variance was drawn from a normal distribution with mean 49 and variance of 1, which allowed the variation of mean.age and variance.age across simulated trials. beta.inter comes from a normal distribution with mean δ and variance V 2 . δ was set as − 0.01 for the main analysis and as 0 for a sensitivity analysis to assess model performance when there is actually no interaction effect. The δ of − 0.01 assumes that the treatment effect improves by − 0.01 (as a negative effect means a larger treatment effect in this case) by every increase of 1 year in age. V 2 was set as 0.05 (high but plausible heterogeneity) for the main analysis and as 0.5 (implausible high heterogeneity), 0.005 (low heterogeneity) or 0 (no heterogeneity) for sensitivity analyses. u ij and v ij are assumed to follow a multivariate normal distribution with mean 0, a variance-covariance matrix ∑, and correlation ρ of 0.8. The variance of the random intercept represented by σ 2 and the variance of the random slope of the treatment effect by τ 2 and they have a correlation ρ. Thus, this model simulates heterogeneity both in treatment effects and in interaction effects, but the latter is usually not modelled in meta-analysis.

Varying parameters for data generation
It was of interest to assess the performance of the models under different scenarios commonly relevant to evidence synthesis. Of main interest was to see how model performance was dependent on: the number of trials included in the meta-analysis; the magnitude of the interaction effect between treatment effect and age; and the magnitude of the between-trial heterogeneity of the interaction effect (Table 1). Our estimand θ, the interaction effect, was assumed to be − 0.01 or 0 on a 0 to 10 VAS. An interaction effect of − 0.01 means that for every 10 years increase in patients' age, the treatment effect comparing Naproxen and placebo will increase by − 0.1, which means that Naproxen will have a linear increase in its effect as compared to placebo as patients get older. This is based on an assumption that older patients will have more pain at baseline, and that treatment effects are directly proportional to pain at baseline. A null interaction effect (i.e. 0) assumes that the treatment effect does not vary according to patients' age. 1 The actual simulated VAS values were not restricted within 0 and 10 to maintain a near normal shape of y ij to ensure our results were generalizable to analyses assuming a Gaussian distribution. Thus, model performance was assessed in 64 different simulated scenarios (Table 2). These datasets had 6, 10, 16, 20, 26, 30, 40, or 50 trials, an interaction effect between the treatment effect and age of − 0.01 or 0, and a between-trial variance of the interaction effect of 0 (no heterogeneity), 0.005 (low heterogeneity), 0.05 (high but plausible heterogeneity), or 0.5 (implausible high heterogeneity). All simulated datasets had an even number of trials, so that datasets could be equally divided into trials with a low or large number of patients included, as explained below.
Fixed parameters for data generation The simulation scenarios described above all shared the same fixed parameters (Table 1) regarding: number of patients included in each trial; number of trial arms; the ratio of randomization; the magnitude of the treatment effect; the between-trial variance of the treatment effect; and the effect of age on knee OA pain. Each of these is described in turn below.
Number of patients included in the trial: half of the trials in the simulated dataset were small and half were large. Trials were considered large if they had on average at least 100 patients in each arm [17]. The total number of patients included in each trial varied randomly between 30 and 199 patients in small trials, and 200 and 400 patients in large trials.
Number of trial arms: all trials had two arms. Ratio of randomization: all trials had a 1:1 randomization.
Magnitude of the treatment effect: the magnitude of the treatment effect was an effect size of − 1 on a 0 to 10 VAS, favoring patients that received high dose Naproxen as compared to those who received placebo (i.e. lower pain on average among patients that received Naproxen than those that received placebo). Assuming a standard deviation of 2.5, this corresponds to a small to moderate, and minimally clinically relevant effect of − 0.4 standard deviation units, which was the reported effect of high dose Naproxen as compared to placebo in a recently published network meta-analysis of RCTs comparing different types of non-steroidal anti-inflammatory drugs to placebo for the treatment of patients with knee OA [16].
Between-trial variance of the treatment effect: we used a moderate between-trial variance of 0.063. This assumes that the treatment effect comes from a normal distribution with approximately 95% of the trial effects ranging between − 1.5, which is considered a moderate to large treatment effect in knee OA in favour of Naproxen, to 0.5, indicating a small, clinically irrelevant treatment effect in favour of placebo.
Effect of age on knee OA pain: it was assumed that the slope of age in knee OA pain would correspond to a yearly increase in pain of 0.05 on a 0 to 10 VAS.

Software used to generate simulated data
Data were simulated in Stata 14.2 using the 64-bit Mersenne twister for random number generation. The input seed was '1234'.

Code for data generation
The code used for data generation is presented in Additional file 1: Appendix A to facilitate reproducibility and understanding of the methods used in this simulation study.

Estimand
The estimand θ is the interaction effect between the treatment effect and age, which is described in the models below as beta.interC. That is, the magnitude by which the treatment effect changes for every unit increase in age.

Models assessed
The performance of seven different models were compared. Models 1, 2, and 3 were based on the proposed model by Riley et al. for IPD meta-analysis, which separates within-from between-trial estimation of treatment effects, as a way to minimize the risk of ecological fallacy (Table 3) [18].
Model 1: Estimation of the between-trial variance of the interaction between treatment effect and age, estimation of the between-trial variance of the treatment effect, and separation of within-from between-trial interaction effects, using individual patient data. Separation of within-from between-trial interaction effects was implemented as previously suggested by Riley et al. [18]. Although we are not aware it has ever been fitted in practice, this model is considered the gold-standard in this investigation because it estimates the randomeffects of both interaction effect and treatment effect that exist in the simulated dataset, and also separates the within and between-trial interaction effects.
In this model, ageC ij is the covariate age ij centered by its mean value m i , in each trial, and it includes an interaction term between the treatment effect and ageC ij , as well as an interaction term between the treatment effect and m i , to separate the within-from the between-trial effects: where y ij is the predicted knee OA pain measured on a 0 to 10 VAS on the j th patient in the i th trial, α is the average knee OA pain in the placebo group for patients that have an average age, treat ij and ageC ij are, respectively, the treatment received and age centered around the trial's mean age of the j th patient in the i th trial, ageM i is the mean age of patients in the i th trial, beta.treat is the fixed between-group difference in knee OA pain (i.e. treatment effect), beta.ageC is the fixed coefficient of the centered age covariate, beta.interC i is the interaction between the treatment effect and centered age in trial i, beta.ageM is the fixed coefficient of the mean value of the age covariate, beta.interM is the fixed interaction between the treatment effect and the mean value of age, u i are the random trial effects of knee OA pain in the placebo group for patients that have an average age, v1 i are the random trial effects in the interaction between the treatment effect and centered age, and v2 i are the random trial effects in the treatment effect. This model assumes an exchangeable covariance structure between the random-effects of intercept and interaction.
As previously stated, this model allows the separation of the pooled within-trial treatment-covariate interaction beta.interC from the between-trials interaction beta.in-terM [18]. This model has been shown to completely separate within-and between-trial effects, thus resulting in independent beta.interC and beta.interM, even when the number of subjects per study is small [18,19]. The between-trials interaction beta.interM is actually the same as the output of a meta-regression, which only uses aggregate data to estimate associations between individual-level characteristics and the treatment effect (model 7 is a meta-regression) [20]. .Model 2: Estimation of the between-trial variance of the interaction between treatment effect and age, and separation of within-from between-trial interaction effects, using individual patient data.
The parameters in model 2 are as defined previously in model 1, except that now the only random trial effects are   19:198 in the interaction between the treatment effect and centered age v i . Again, separation of within-from between-trial interaction effects was implemented as previously suggested by Riley et al. [18]. Model 3: Estimation of the between-trial variance of the treatment effect, and separation of within-from betweentrial interaction effects, using individual patient data.
The parameters in model 3 are as defined previously in model 1, except that now the only random trial effects are in the treatment effect v i . Again, separation of within-from between-trial interaction effects was implemented as previously suggested by Riley et al. [18].
Model 4: Estimation of the between-trial variance of the interaction between treatment effect and age, estimation of the between-trial variance of the treatment effect, without separation of within-from between-trial interaction effects, using individual patient data The parameters in model 4 are as defined previously in model 1, except that mean age is no longer part of the model. This means that the association between mean age and the outcome of knee pain, and the interaction effect between mean age and treatment allocation on the outcome of knee pain, are no longer estimated. Because mean age is not included in model 4, it does not separate the within-from between-trial interaction effects.
Model 5: Estimation of the between-trial variance of the interaction between treatment effect and age, without separation of within-from between-trial interaction effects, using individual patient data.
The parameters in model 5 are as defined previously in model 2, except for the parameters related to the effect of mean age within trials. Because these parameters were removed, model 5 does not separate the withinfrom between-trial interaction effects.
Model 6: Estimation of the between-trial variance of the treatment effect, without separation of withinfrom between-trial interaction effects, using individual patient data The parameters in model 6 are as defined previously in model 2, except for the parameters related to the effect of mean age within trials. Because these parameters were removed, model 6 does not separate the withinfrom between-trial interaction effects.
Model 7: Estimation of the between-trial variance of the treatment effect using aggregate data, which can only estimate the between-trial interaction effects where y i is the predicted between-group difference in knee OA in the i th trial, α is the fixed between-group difference in knee OA pain, ageM i is the mean age of patients in the i th trial, beta.age is the fixed interaction effect between the treatment effect and the mean age of patients in the i th trial, and u i is the random effect of the between-group difference in knee OA pain.

Performance measures
Bias, coverage, empirical and model-based standard errors, and the relative error in the model standard error for the estimand θ were assessed. Estimation of performance measures and their Monte Carlo Standard Error (MCSE) was conducted as described by Morris et al. [15]. Bias was considered acceptable if it was ≤10 times lower than the simulated interaction effect of − 0.01. Coverage of the 95% confidence interval of the estimated interaction effect was assessed. The explanation on how each of the performance measures were calculated are presented in Additional file 1: Appendix B.

Sample size calculation
The simulation was powered to have a precise estimate of coverage. With an expected coverage of 95%, and a desired MCSE of coverage of 0.5%, 1900 repetitions would be needed [15]. It was then decided to use 2000 repetitions for each simulation, which results in a small MCSE for the estimate of all performance measures (Table 4).
Software used to analyse the performance of the simulated data All analyses were conducted in Stata 14.2.

Results
In this section, we focus on results pertaining to the large but plausible between-trial variation of 0.05 in the interaction effect since our main purpose is to assess model performance when this variation is taken into account. Results of other scenarios using a between trial variation of different magnitude are presented in the Additional file 1.
Results of simulations assuming a large between-trial variation of 0.05 in the interaction effect Figure 1 shows the interaction effects estimated in each of the 2000 repetitions according to all models for analyses with 6, 20, and 50 trials and the large between-trial variation of 0.05 in the interaction effect. It can be seen from this graph that the interaction effects from models that used IPD have a much narrower scatter than those from model 7, the only model that used AD. Through visual inspection, the mean value of the interaction effects appears similar across all models. Also, the scatter of interaction effects decreases as the number of trial increases, for all of the models. Figure 2 shows scatter plots of interaction effects of all models plotted against each other. It can be seen that there is a considerable agreement between all models  that used IPD, with higher agreement between models that made the same assumptions regarding between-trial variation of effects. It can also be seen that there is very low correlation between results from models using IPD and the model 7, which used AD. Although increasing the number of trials decreased the variation of estimates for all models, it did not have an important influence in the correlation between the results of different models. Figure 3 shows the standard error of the interaction effects according to all models for analyses with 6, 20, and 50 trials. As with the interaction effects, this graph shows that the scatter of standard errors is narrower in models that used IPD than model 7, which used AD. Again, as with the interaction effect, the scatter of the standard error decreases as the number of trials increases. However, different than what was observed with the interaction effects, the mean value of the standard error in model 7 was systematically larger than those in models that used IPD. Figure 4 shows scatter plots of the standard error of interaction effects of all models plotted against each other. As previously observed for interaction effects, there is a higher correlation in standard errors between models that used the same assumptions regarding the estimation of between-trial variance. Although there was a clear correlation in standard errors between models that used IPD, this correlation is lower than the one observed for interaction effects. As with interaction effects, an increase in the number of trials did not increase the correlation of standard errors between models, and the correlation between models using IPD and model 7 was very low. Table 5 shows the estimated bias in the interaction effect in each of the models for all investigated number of trials. Bias was most influenced by whether the betweentrial variation was estimated for the interaction effect, the treatment effect, or both. A separate estimation of withinand between-trial effects did not play an important role. It can be seen that bias tends to decrease as the number of trials increases for all of the models, with a more extensive difference in model 7, which used aggregate data. While all other models had a satisfactory bias with a lower number of trials, model 7 had a bias effect of − 0.065, which is 6.5 times the simulated interaction effect of − 0.01. Model 7 only has an acceptable bias of 0.0007 when 50 trials are included in the analysis. Among the models that used IPD (i.e. models 1-6), only models that modelled exclusively the between-trial variation of the interaction effect (i.e. models 2 and 5) performed well when the number of trials was only 6. Model 2, which modelled the between-trial variation in the interaction effect and separated withinand between-trial effects, had the lowest bias (− 0.00006). Model 5, which modelled the between-trial variation in interaction effects but did not separate within-from between-trial effects, had also an acceptable bias of 0.00038. With 10 trials included in the analysis, the magnitude of bias becomes acceptable for models that exclusively modelled the between-trial variation of the treatment effect (i.e. models 3 and 6). Only when 16 trials were included in the analysis that models 1 and 4, which modelled between-trial variation for both treatment effects and interaction effects, achieved acceptable levels of bias. Finally, the MCSE was relatively large for all bias estimates. Table 6 shows the observed coverage of the interaction effect between treatment effect and age in each of the models. With only 6 trials, coverage was best with model 7, with an observed coverage of 94.4%, while all models using IPD had an observed coverage below 90%. Coverage was worst with models that exclusively estimated the between-trial variance in treatment effects, with observed values around 83.4%. As the number of trials increased, coverage also increased, but models using IPD consistently overestimated coverage, with observed coverage values always lower than the nominal value. For models using IPD, those that estimated the between-trial variation in the interaction effect had acceptable coverage levels when at least 20 trials were included in the analysis. Models 3 and 5 only had an acceptable coverage when 40 trials were included in the analysis. Among models using IPD, those that estimated the between-trial variation in the interaction effect had consistently better coverage than models that did not. Coverage was consistently accurate for model 7 across all numbers of trials. Finally, all coverage estimates were precise, with low MCSEs. Figures 5, 6 and 7 display zip plots of the 2000 confidence intervals for all IPD models for analyses with 6, 20, or 50 trials (see Additional file 1: Appendix B for an explanation on how to interpret zip plots). We can readily see that all models using IPD had insufficient coverage when only 6 trials were included in the analysis, with models 3 and 6, which only addressed the betweentrial variance of the treatment effect, having the worst coverage. With 20 trials included in the analysis, we can see that coverage is appropriate for most models, but that models 3 and 6 are still slightly overestimating the nominal coverage. With 50 trials included in the analysis it becomes clear that all models using IPD have an appropriate coverage.  Table 7 shows how each model performed in the estimation of standard errors, for different numbers of trials included in the analysis. In general, both empirical and model SEs were smaller in models 1 and 4, where the betweentrial variance of treatment effects and interaction effects are estimated and allowed for. Models 2 and 5, which estimate and allow for the between-trial variance in interaction effects only, had the second smallest empirical and model SEs. Among models using IPD, models 3 and 6, which estimated and allowed for the between-trial variance in treatment effects, but ignored the between-trial variance in interaction effects, yielded the largest SEs, and thus have the lowest precision. Model 7, the only model that used AD, yielded clearly higher SEs than the other models. The difference in SEs between models using IPD were small. These findings were not affected by the number of trials included in the analysis, or whether within-and between-trial effects were modelled separately or together. When assessing the error in the model SE in relation to the empirical SE, the model SE was most accurate if the model also estimated and accounted for the between-trial variance in interaction effects (models 1,2,4,5), while models that ignored this had a consistently higher relative error in the model SE (models 3,6,7). This pattern was consistent regardless of the number of trials included in the analysis, or if within-and between-trial effects were modelled separately or together. Table 8 shows the mean squared error of each model for different numbers of trials included in the analysis. It can be seen that the mean squared error decreased as the number of trials included in the analysis increased, that models using IPD have lower mean squared errors than the model that used AD, and that models that accounted for the between-trial variance in the interaction effect had a lower mean squared error than those that did not. The most important factor leading to higher values of the mean squared error was the use of AD for the analysis. The model that used AD had considerably larger mean squared errors than models that used IPD. Although the mean squared error decreased as the as the number of trials included in the analysis increased, it was still considerably higher for the model using AD even when 50 trials were included in the analysis. Among models that used IPD, differences in mean squared error seem negligible, even though models that accounted for the between-trial variance in the interaction effect had lower mean squared errors than those that did not.

Results of simulations assuming a null, small, or implausible large between-trial variation in the interaction effect
Results of the simulation with a between-trial variance of zero, 0.005, or 0.5 are shown in Additional file 1: Appendices C, D and E. Findings of our simulation with a null or  19:198 small between-trial variance were similar to the simulations with a large but plausible between-trial variance, but with much smaller differences between models using IPD. Differences in performance between models using IPD and the model using AD decreased as the between-trial variation decreased. The difference in all performance measures between the models using IPD were consistently small, with a pattern indicating that models that allowed for the between-trial variance in the interaction effect had lower bias, and better precision and coverage. Again, similar to the simulations assuming a large but plausible betweentrial variance, the model using AD had a generally worse performance than models using IPD. Although this difference in performance decreased as the number of trials increased, the model using AD still had a worse performance than models using IPD even when 50 trials were included in the analysis, albeit unimportant differences. No relevant difference was observed between models that separated within-and between-trial associations and models that did not separate them.
Findings of our simulation with an implausible large between-trial variance were similar to those from the simulations with a large but plausible between-trial variance. However, the difference in performance between models was accentuated with the implausible large between-trial variance, mainly when the number of trials included in the analysis was low.

Results of simulations assuming a null interaction effect
Results from analyses assuming a null interaction effect were remarkably similar to those from analyses assuming an interaction effect of − 0.01 (results available on request).

Main findings
In this simulation study, the performance of seven different meta-regression models for the pooling across trials of interaction effects between a patient-level characteristic and the treatment effect were compared and it was     19:198 observed that a better performance was obtained in models that used IPD and that allowed for between-trial variation of the interaction effect. Although models that allowed for between-trial variation of the interaction effect had a better performance than models that did not allow for this variation, other model or dataset characteristics played a more important role in model performance. The main factor influencing the performance of models was whether they used IPD or AD. Models based on IPD generally had acceptable levels of bias that were ten times smaller than the simulated interaction effect of − 0.01 when ten or more trials were included in the analysis. Biases in the model using AD were approximately two to six times larger than the simulated interaction effect, and was only acceptable when 50 trials were included in the analysis. The model that used AD also had a considerably lower precision than all models that used IPD, especially when a low number of trials was included in the analysis, although coverage of AD models was generally appropriate even with a low number of trials in the analysis. The second most important factor influencing the performance of the models was the number of trials included in the analysis. In general, as the number of trials included in the analysis increased, the performance of all models improved. Another important factor that influenced model performance was the magnitude of heterogeneity. Differences in performance between models was more evident in the presence of high heterogeneity (i.e. large betweentrial variation of the interaction effect), which is not surprising given only certain models acknowledge the possibility of heterogeneity in interaction effects. The performance of models was also influenced, to a lesser degree, by whether the model allowed for between-trial variation of the interaction effect. Models that used IPD and allowed for between-trial variation of the interaction effect had the best performance, generally with lower bias, better coverage, and higher precision than all other models. The model with the worst performance among IPD models was the one that allowed for between-trial variation in treatment effects but ignored the between-trial variation in interaction effects, and it is likely the most commonly used model in practice for IPD analyses. The goldstandard model, which was a model that allowed for between-trial variation in both treatment and interaction effects and used IPD, did not perform well when only ten trials were analysed. No relevant difference was observed  19:198 between models that separated within-and between-trial associations and models that did not separate them, which is unsurprising, as differences in within-and between-trial associations was not simulated in the dataset (i.e. ecological biases were not simulated). Likewise, the presence of an interaction effect did not seem to influence results, since results of analyses assuming a null interaction effect or an interaction effect of − 0.01 were remarkably similar.

Practical implications
This study shows that when investigating interaction effects in meta-regressions, using IPD with a model that accounts for the between-trial variation of interaction effects is likely to result in more accurate results, with less bias, better coverage, and higher precision, especially in the presence of high heterogeneity of interaction effects. Not enough importance has been given to modelling the between-trial variation of interaction effects when pooling them across trials, with technical papers about regression models mainly focusing on modelling the between-trial variation of the treatment effect [18,20,21]. The reason for this, perhaps, is because although modelling the between-trial variation in interaction effect improves the accuracy of results, this improvement seems rather irrelevant in the absence of high heterogeneity. Nevertheless, it is important that the between-trial variation in interaction effects is reported, perhaps together with prediction intervals around the interaction effect, so that readers can better appreciate how high or how low this variation is [13]. In practice, modelling both between-trial variations, of the interaction effect and treatment effect, may lead to model convergence problems, especially if a low number of trials is included in the analysis. The lowest number of trials used in the present simulation analysis was six, and no convergence problems were observed. Whenever convergence problems are observed, researchers may try modelling only the between-trial variation of interaction effects, and only if the problem persists, try to model only the between-trial variation of treatment effects. Researchers should also always give precedence to IPD over AD when conducting regression analyses to pool interaction effects across trials, especially if the number of trials is below 50, which is often the case. Our simulation analysis indicates that if 50 trials with AD are included in a meta-regression, results may be comparable  19:198 to regression analyses using IPD, in terms of bias and coverage. However, even when 50 trials with AD were included in the meta-regression, precision was still considerably lower than a regression analysis with 50 trials with IPD. Thus, since most meta-regression analyses include only a few trials, every effort should be made to access IPD when the main goal is to pool subgroup effects across trials. Interestingly, the model using AD had a better coverage than models using IPD when the number of trials analysed was low. This is likely due to the use of a Knapp-Hartung approach for estimating the variance of the mean interaction effect estimate in the AD model, which corrects the between study variance to account for its uncertainty [22]. This correction, developed for AD models, has been shown to have a better coverage performance of the 95% confidence interval than other approaches that assume that the between study variance is known [23]. Moreover, using IPD when investigating subgroup effects in a regression analysis will not only lead to results with better statistical properties, but is also necessary to avoid ecological fallacy [3]. Finally, because IPD is often only available for a small proportion of the trials included in a meta-analysis, an alternative approach would be to combine the evidence from IPD and AD when conducting such regression analyses, which is beyond the scope of the present simulation study [24].

Strengths and limitations
The main strength of this simulation analysis is that it mimicked characteristics commonly found in metaanalyses. This simulation analysis allowed for a wide range of number of trials included in the analysis, for a varying number of patients included across trials included in the analysis with a balanced mix of small and large trials, and for scenarios with small and large between-trial variation in the interaction effect. Because the motivation to conduct this investigation is future research of osteoarthritis treatments, the simulated dataset used representative magnitude and within-and between-trial variations of treatment effects and average patient age observed in osteoarthritis trials [16]. Moreover, the use of simulated datasets allowed us to observe the statistical properties of different meta-regression models without the potential confounding influence of factors such as biases due to low  methodological quality, which could hinder the proper interpretation of the results.
The main limitation of the present investigation is the definition of small and large heterogeneity. Heterogeneity is of great relevance to this investigation, since its main goal is to compare models that appropriately address it or not. The cut-offs used to define small and large but plausible heterogeneity were arbitrarily chosen to represent real datasets of osteoarthritis treatments. However, one may argue that even the definition of a small heterogeneity may not be so small, and that the large but plausible heterogeneity may be unusually large for datasets of real osteoarthritis clinical trials. This is done consciously to err on the safe side. That is, because the performance of this models can only be different in the presence of heterogeneity, cut-offs of small and large heterogeneity were chosen so that at least some heterogeneity was always present. We also analysed a simulated dataset with an implausible high heterogeneity for a better understanding of how models perform across different values of between-trial variance.
Another limitation is that results are only generalizable to trials using continuous outcomes, and arguably, only to osteoarthritis trials. However, it is expected that similar results are observed in trials with binary outcomes. In addition, although the performance of models that separate within-from between-trial associations was compared to the performance of models that do not, the dataset was not simulated to have different within-and between-trial associations of age and treatment effect. Thus, it is not surprising that there was no difference between models that separated or not these two types of associations. Likewise, although simulated datasets included small and large trials, we did not simulate smallstudy effects [17]. Discordant results between small and large trials may play a major role in the results of a random-effects meta-analysis [11,25]. Finally, IPD metaanalysis of continuous outcomes are less subject to overfitting and may accommodate a larger number of covariates to account for potential confounding than IPD meta-analysis of binary outcomes or meta-regression in general [26][27][28]. Since the main purpose was to investigate the practical implications of adding random-effects to an interaction test in an IPD meta-analysis, we did not extended our simulations to scenarios where multiple confounding variables may be present.

Previous research
This is the first investigation in continuous outcomes to compare the models that used IPD and AD while accounting for between-trial variation in the interaction effect. Lambert et al. conducted a simulation study to compare fixed-effect regression analyses using IPD or AD to estimate the interaction between treatment effects and patient-level characteristics [6]. They reported that AD can be used in meta-regression if a large number of large trials is available, and concluded that AD can be used to adequately estimate an interaction between the treatment effect and trial-level characteristics, but that IPD would be needed to accurately estimate an interaction between treatment effects and patient-level characteristics. Similarly to their study, the current simulation study compared IPD and AD for the estimation of an interaction between treatment effects and patient-level characteristics, but extended the investigation of Lambert et al. by comparing random-effects models. Similar to their findings based on fixed-effect models, our investigation in random-effects models concluded that AD can be used to estimate interaction effects if a large number of trials is available, but that IPD is needed when estimating an interaction between treatment effects and patient-level characteristics, mainly due to problems with ecological fallacy.
Stewart et al. conducted a similar investigation to ours using a real dataset of 24 RCTs evaluating antiplatelet agents for the prevention of pre-eclampsia in pregnancy, a binary outcome [21]. They compared three different models that were similar to the ones used in the present investigation: one model that had random-effects for the treatment effect but had a fixed-effect for the interaction   19:198 effect (corresponding to model 6 in the present study), one model with the same characteristics of the previous one but that separated the within-and between-trials associations (corresponding to model 3 in the present study), and one model that had random-effects for both the treatment effect and interaction effect and did not separate within-and between-trials associations (corresponding to model 4 in the present study). The authors concluded that results across these models were virtually the same and led to the same conclusions. This is somewhat different from the conclusions of the present study, as it was noticed that albeit small, there were improvements in the accuracy of results when the model allowed for random-effects of the interaction between treatment effect and covariate. This difference in conclusions may have happened for different reasons. The first reason is that they had a fixed number of 24 trials in their analysis, and as shown in our analysis, differences in performance between models are most evident with a lower number of trials. Another reason is that there was perhaps a low between-trial variation in the interaction effect in their dataset. It was shown in this simulation that the lower the between-trial variance, the less evident is the difference in performance between models. Because we conducted a simulation study which allowed us to vary several parameters that could potentially play a role in the results of meta-analyses, as opposed to the study of Stewart et al. which had only fixed parameters, we could observe and compare the performance of all seven models throughout a wide range of values from these parameters. Finally, this simulation analysis was based on continuous outcomes, while the analysis of Stewart et al. was based on a binary outcome. It is unlikely that the difference in results observed between the studies can be explained by this, however.

Future research
Several future research questions could develop from the current investigation. For instance, future research could be conducted to develop and investigate whether a model based on AD and that allows for between-trial variation in interaction effects has a better performance than the model based on AD that only allows for the between-trial variation in the treatment effect. Future research could also explore whether having lower or higher number of trials than what was investigated in the present study could play a role in the performance of these models. Future research could also investigate if similar results are observed with binary outcomes and if the performance of the models change in the presence of ecological fallacy, small study effects, or confounding from single or multiple other sources. Future research should also investigate more thoroughly the performance of different models for the estimation of the between trial variation in the interaction effect, including an assessment of how this estimation or lack thereof may influence the estimation of other parameters in the model (e.g. is the between-trial variation of the treatment effect overestimated if the interaction term is not included in the model?). Also, the current study only investigates a single two-way interaction, but future studies may investigate more complex interactions, such as a three-way interaction. Finally, future research should consider investigating the use of permutation tests to control for false positive rates to compare results of different regression models using these tests [27].

Conclusions
The results of the present investigation indicate that IPD models that allow for the between-trial variation in interaction effects should be given preference over models that only allow for between-trial variation in treatment effects when investigating subgroup effects within a meta-analysis. It was shown that, even in the absence of ecological fallacy, there is significantly less bias and more precision when IPD is used over AD. Among models using IPD, there was less bias, better coverage, and higher precision in those that modelled the between-trial variation in interaction effects (when one existed), especially when a small number of trials was analysed, and thus should be given precedence whenever possible.
Additional file 1 : Appendix A. Code used to generate the simulated datasets. Appendix B. Performance measures. Appendix C. Results of the simulation study assuming a between-trial variance of 0.5 in the interaction effect. Appendix D. Results of the simulation study assuming a between-trial variance of 0.005 in the interaction effect. Appendix E. Results of the simulation study assuming a null between-trial variance in the interaction effect.
Abbreviations AD: Aggregate data; IPD: Individual patient data; RCT: Randomized clinical trial