 Research article
 Open Access
 Open Peer Review
 Published:
Statistical approaches to identify subgroups in metaanalysis of individual participant data: a simulation study
BMC Medical Research Methodologyvolume 19, Article number: 183 (2019)
Abstract
Background
Individual participant data metaanalysis (IPDMA) is considered the gold standard for investigating subgroup effects. Frequently used regressionbased approaches to detect subgroups in IPDMA are: metaregression, persubgroup metaanalysis (PSMA), metaanalysis of interaction terms (MAIT), naive onestage IPDMA (ignoring potential studylevel confounding), and centred onestage IPDMA (accounting for potential studylevel confounding). Clear guidance on the analyses is lacking and clinical researchers may use approaches with suboptimal efficiency to investigate subgroup effects in an IPD setting. Therefore, our aim is to overview and compare the aforementioned methods, and provide recommendations over which should be preferred.
Methods
We conducted a simulation study where we generated IPD of randomised trials and varied the magnitude of subgroup effect (0, 25, 50% relative reduction), betweenstudy treatment effect heterogeneity (none, medium, large), ecological bias (none, quantitative, qualitative), sample size (50,100,200), and number of trials (5,10) for binary, continuous and timetoevent outcomes. For each scenario, we assessed the power, false positive rate (FPR) and bias of aforementioned five approaches.
Results
Naive and centred IPDMA yielded the highest power, whilst preserving acceptable FPR around the nominal 5% in all scenarios. Centred IPDMA showed slightly less biased estimates than naïve IPDMA. Similar results were obtained for MAIT, except when analysing binary outcomes (where it yielded less power and FPR < 5%). PSMA showed similar power as MAIT in nonheterogeneous scenarios, but power collapsed as heterogeneity increased, and decreased even more in the presence of ecological bias. PSMA suffered from too high FPRs in nonheterogeneous settings and showed biased estimates in all scenarios. Metaregression showed poor power (< 20%) in all scenarios and completely biased results in settings with qualitative ecological bias.
Conclusions
Our results indicate that subgroup detection in IPDMA requires careful modelling. Naive and centred IPDMA performed equally well, but due to less bias of the estimates in the presence of ecological bias, we recommend the latter.
Introduction
Metaanalyses of individual participant data (IPD) provide the best evidence regarding treatment effects and offer unique opportunities and benefits when investigating subgroup effects [1]. Therefore, IPD metaanalyses (IPDMA) are considered the gold standard for detection of subgroup effects and their use has increased over the last decade [2, 3]. The possibility to standardize subgroup definitions and outcomes across studies, the higher validity and credibility of subgroup findings and the increased flexibility to search for subgroups based on combinations of patient and/or disease characteristics and the avoidance of incorrect results due to ecological bias are benefits of using IPD of multiple trials rather than traditional (aggregate) metaanalysis [4]. Nevertheless, Simmonds et al. (2015) have reported that only 1% of the conducted metaanalyses were using IPD [5].
IPDMA may be conducted either in one or two stages. In twostage IPDMA, each trial is first analysed separately, using an appropriate statistical model. For instance, the first stage may estimate the main treatment effect, or the different effects observed per subgroup, or the treatmentcovariate interaction effect. Subsequently, these effects from different trials are combined into a summary estimate in the second stage of the metaanalysis. Although metaanalytic methods are often used to investigate main treatment effects, they can also be used to investigate subgroups. For instance, the presence of subgroup effects can be investigated by modelling the association of the estimated main treatment effects with a triallevel covariate (metaregression). Alternatively, estimates of subgroup effects or interaction terms can directly be summarized using traditional metaanalysis (MA) methods.
In onestage IPDMA, all IPD from every trial are analysed simultaneously whilst accounting for the clustering of participants within studies. Hereby, researchers may model interactions between treatment and patientlevel covariates either directly (naive IPDMA), or after the covariates are meancentred per study in order to account for potential ecological bias (centred IPDMA) [6].
When IPD are available for all studies, it is often unclear which metaanalysis method should be adopted. In 2015, Simmonds et al. reported that all approaches are still being used in IPDMA, even aggregated data metaanalytic methods such as metaregression and PSMA. Although minor differences are usually observed when summarizing main treatment effects, each of aforementioned methods have specific deficiencies when investigating the presence of subgroup effects [7, 8]. In particular, it is well known that metaregression has poor power and is prone to (ecological) bias [6, 9,10,11]. Persubgroup MA (PSMA) has also been criticized of being prone to ecological bias [12]. Further, MA of interaction terms (MAIT) is considered as less precise when limited number of studies or participants are present [2]. Finally, it has been demonstrated that “naive” IPDMA may suffer from limited precision and excessive false positive rates (type I error) in the presence of triallevel confounding [6], which is similar to ecological bias.
So far, comparisons of metaregression, PSMA, MAIT, “naïve” and centred metaanalysis to study subgroup effects have been limited to either empirical studies or simulation studies only comparing a subset of these approaches. Simmonds and Higgins [13] proved that onestage IPDMA is always more powerful than the twostage methods, under the assumption that there is no betweenstudy heterogeneity, all studies have the same residual variance and all studies use balanced randomization. These assumptions were considered too restrictive. Therefore, Simmonds and Higgins also performed a simulation to compare metaregression and MAIT, but their simulations only included datasets with 250 patients and neither onestage methods nor PSMA were included. Other studies ignored the presence of residual (i.e. unrelated to effect modification) betweenstudy heterogeneity in treatment effect, for example Lambert et al. [10] compared metaregression to naïve IPDMA using simulated datasets without betweenstudy heterogeneity. Koopman et al. compared metaregression, naive IPDMA, and MAIT using only empirical studies [14]. Hua et al. compared different types of onestage approaches using simulated timetoevent data [6]. Burke et al. theoretically explained the differences between the results of naive IPDMA and MAIT [7]. Fisher et al. wrote a critical review over PSMA, MAIT, centred and naive IPDMA methods and applied them on empirical studies [8]. In a subsequent paper, comparing twostage methods only, he advocated the use of MAIT over PSMA and metaregression, and applied all three on empirical studies to point out the differences [12]. Simmonds et al. [5, 15] reviewed the aforementioned statistical approaches in two consecutive papers, one in 2005 and one in 2015. They concluded that onestage methods are used more frequently in 2015 than in 2005 and that metaregression and MAIT were not typically preferred, in contrast to PSMA which is the most frequently used twostage method. PSMA, however, is prone to power reduction when heterogeneity is present and overestimation when there is no heterogeneity [12, 16, 17]. These issues may be worsened if ecological bias and betweenstudy heterogeneity are simultaneously present. Finally, Kontopantelis has performed a simulation study comparing naïve IPDMA and MAIT [18]. He has generated data, covering different IPD sizes and different betweenstudy heterogeneity levels on intercept and treatment effect. Nevertheless, centred IPDMA, PSMA, and metaregression were not included, ecological bias was not generated and the number of participants was above 1000, which in some RCTs may be unrealistic.
To our knowledge all above mentioned studies have focused on a limited number of scenarios. Most studies either focused on only a limited set of available approaches, only one type of outcome or one level of subgroup effect magnitude and only in Hua et al. [6] ecological bias has been introduced.
In our simulation paper, we will compare all aforementioned approaches using binary, continuous and survival outcome measures, focussing on differences in power, false positive rates (FPRs) and bias in the estimates. We will vary the amount of betweenstudy heterogeneity in treatment effects, the magnitude of the subgroup effect, the level of ecological bias, and the number of trials and participants.
Methods
In our study we included five common statistical approaches: centred (onestage) IPDMA, naive (onestage) IPDMA, MA of interaction terms (MAIT), persubgroup MA (PSMA), and metaregression. We simulated datasets with binary, continuous, or survival outcomes. We varied the magnitude of the subgroup effects, the presence of betweenstudy heterogeneity in treatment effect, the level of ecological bias, and the size and number of trials. Our paper is organized as follows. We start with a description of the data generation mechanism (Section 2.1), followed by a description of the statistical approaches (Section 2.2), and the assessment of power, FPRs and bias in the estimates (Section 2.3).
Data generation
In this section we describe the datageneration mechanism in general; details on the parameters can be found in Table 1. In short, IPDsets were generated for continuous, binary, and timetoevent outcomes. We generated equal treatment allocation, as in a twoarm randomized clinical trial with a control and an active treatment. We focused on effect modification by a binary covariate, using smoking as an example throughout the paper. We simulated different baseline risk levels for nonsmokers and smokers; for example, mortality rates may be different. We assumed absence of treatment effect in nonsmokers and varied the magnitude of the treatment effect in smokers (absent, medium, or large), reflecting the subgroup effect. We also varied the magnitude of the additional betweenstudy heterogeneity in treatment effect (absent, medium, or large) and the magnitude of ecological bias (none: 0, quantitative: + 100% of the subgroup effect (favouring treatment effect), qualitative: − 200% of the subgroup effect (favouring placebo effect)) in the scenarios with large and medium subgroup effects. Since the subgroup effect in the nosubgroup effect scenarios is by definition equal to 0, we used the medium subgroupeffect settings to define the size of ecological bias. We also varied the number of trials [5 or 10], and the number of participants (50, 100, or 200 per trial). In total, this resulted in 486 scenarios. Per scenario we generated 1000 IPDsets, with equal treatment allocation in each trial. We varied the percentage of smokers over the trials in order to reflect variability in the prevalence of the potential effect modifier across datasets. Specifically, the percentages were 30, 40, 50, 60, and 70%. We generated the individual outcomes using a generalized linear model (GLM) with a normal distribution with a standard deviation of 1 and an identity link for the continuous outcomes, a Bernoulli distribution and logit link for the binary outcomes, and an exponential distribution and log link for the timetoevent outcomes. We assumed a common baseline effect across studies for the intercept term (b_{0}), a common prognostic effect of smoking (b_{S}), and the interaction between treatment status and smoking (b_{x}). The coefficients we used varied per type of outcome and scenario, see Table 1. For the treatment effect (b_{T}), we generated random effects across studies (H_{j}). The linear predictor in the GLM was:
where i denotes the participant and j the study. H_{j} was drawn from a normal distribution with a mean of 0 and a standard deviation (τ) of 0 (no heterogeneity), 0.25 (medium) or 0.5 (large heterogeneity), reflecting values of τ in the Cochrane Database of Systematic Reviews of 2009–2013 [19]. Note that H_{j} reflects additional betweenstudy heterogeneity, on top of variability due to withinstudy sampling (imprecision) or subgroup effects.
For the continuous outcomes, the average outcome in the control group was 0 and 1 for nonsmokers and smokers respectively. For the binary outcomes, in the control group the event rates of the nonsmokers and smokers were respectively 20 and 40%. In logitscale the abovementioned event rates were approximately  1.385 and − 0.4 respectively (see Table 1). For the timetoevent outcome, the hazard rates in the control group were defined as 2 and 4 events per 1000 persondays for nonsmokers and smokers, respectively. Therefore, the increase in the hazard risk of the smokers was 0.7 on the log scale (see Table 1). For all types of outcomes, the treatment reduced the average outcome only in the smoker’s group by 0% for the no subgroup effect scenario, 25% for the medium subgroup effect scenario, and 50% for the large subgroup effect scenario. For the continuous outcome this resulted in average values of 1, 0.75, and 0.5, for the binary outcome in event rates of 40, 30, and 20%, and for the timetoevent outcome in 4, 3, and 2 events per 1000 persondays in smokers, for the no, medium, and large subgroup effect, respectively.
Statistical approaches
Each of the 486 distinct scenarios was generated 1000 times. All 486,000 simulated datasets were analysed using aforementioned five approaches (more details below). All analyses were conducted with the statistical package R, version 3.4.1 [20] using for onestage approaches lmer [21], glmer [21] and coxme [22] while for twostage approaches the packages metafor [23], logistf [24], coxphf [25].
Centred IPDMA
Onestage approaches jointly analyse the IPD from all trials, accounting for the clustering of participants within trials. In line with recent recommendations [5], effect modifiers should be centred by their mean value in each trial, in order to separate the within and acrosstrial information. In our simulations, the percentage of smokers p_{j} is used to adjust for potential betweentrial differences. Therefore, we fitted a mixed effects model as in Section 2.2.4, but now with two interaction terms to separate across and withintrial information, thus accounting for potential ecological bias.
The statistical model is the following:
where β_{A} is the across studies interaction and β_{W} is the withinstudy interaction effect.
We assumed a common (fixed) effect for β_{w}, as we have not generated betweenstudy heterogeneity on the treatmentsmoking interaction term. We extracted \( \hat{\beta} \) _{w,} which gives the interaction effect (free of ecological bias) and its corresponding pvalue for power, estimate bias and false positive rate (FPR) calculations.
Naive IPDMA
For continuous outcomes we applied linear mixedeffect models, for binary outcomes logistic mixedeffect models and for timetoevent outcomes CoxPH mixed effects models. We used all available data in a single model containing a random effect for treatment and fixed effects for intercept, subgroup and treatmentsubgroup interaction. The statistical models are based on the following specification:
We assumed common effects for β_{2} and β_{x} and random effects for β_{0j} and β_{1j}. We extracted the \( \hat{\beta} \) _{x} estimate, which reflects the treatmentsmoking interaction, and its corresponding pvalue for power, estimate bias and false positive rate (FPR) calculations. This onestage approach is characterised as naive [6, 16], as it does not account for potential ecological bias that may come of unadjusted confounders on trial level, like potential agedifferences between trials.
Persubgroup metaanalysis (PSMA)
We separated the IPDset for each trial into smoking and nonsmoking participants. Per trial, we fitted an appropriate model per outcome: linear, logistic or CoxPH regression. In order to account for single monotone likelihood issues due to zero cells in binary and time to event outcomes, we adopted the Firth’s bias correction for logistic [20, 24] and CoxPH [21, 25] regression. This penalization method reduces the bias that occurs when adopting maximum likelihood estimation (MLE) in finite samples, and has therefore been recommended in (relatively) small datasets [26].
The mathematical description of our GLM model per trial j is:
In the formula above, g represents the appropriate link function (identity, logit or log in case of continuous, binary or count outcomes, respectively).
Per subgroup we applied a randomeffects metaanalysis with the extracted treatment coefficients using the EB method, which is equivalent to the PauleMandel method [27], for the τ^{2} estimation [28] and the HKSJ adjustment [29, 30]. The resulting persubgroup pooled estimates were compared with each other with a Wald test. We extracted the differences between the persubgroup pooled estimates and its corresponding pvalue for power, estimate bias and false positive rate (FPR) calculations
Metaanalysis of interaction terms (MAIT)
In metaanalysis of interaction terms (MAIT), the interaction between the potential effect modifier (here smoking status) and treatment is directly modelled per trial. We hereto fitted an appropriate model per trial: linear, logistic or CoxPH regression, including a treatmentsmoking interaction term. Again, we applied Firth’s bias correction for logistic [24, 31] and CoxPH [25, 32] regression.
The statistical model per trial j is as follows:
We applied a fixed effect metaanalysis for pooling the \( \hat{\beta} \) _{xj} estimates of interaction. We extracted the pooled estimate and its corresponding pvalue for power, estimate bias and false positive rate (FPR) calculations.
Metaregression
Metaregression is a twostage approach that uses weighted regression to associate the effect of a triallevel moderator variable (i.e. the percentage of smokers per trial) with the estimated treatment effect in that trial. Hereto, each trial is first analysed separately using either linear, logistic or Cox proportional hazards (CoxPH) regression. Again, we applied Firth’s bias correction for logistic [24, 31] and CoxPH [25, 32] regression. Subsequently, for each trial the estimated treatment effect \( \hat{\beta} \) _{j} and the percentage of smokers p_{j} are extracted. We evaluated the presence of subgroups by fitting a linear mixed model with as dependent variable the extracted treatment coefficients \( \hat{\beta} \) _{j} and as explanatory variable the percentage of smokers p_{j}. Betweenstudy variation in treatment effects was modelled with a random intercept. Weights were based on the inverse variances of \( \hat{\beta} \) _{j}. We applied the empirical Bayes (EB) (PauleMandel) method for the estimation of the betweenstudy heterogeneity τ^{2} and performed the HartungKnappSidikJonkman (HKSJ) adjustment [28, 29].
The mathematical form of our model is:
where γ_{0j},γ_{1} are the metaregression coefficients and ε_{j} the residual error of study j. We extracted \( {\hat{\gamma}}_1 \) and its corresponding pvalue for power, estimate bias and false positive rate (FPR) calculations, see section 2.3.
Methods comparison
To assess the power, FPRs and bias in the subgroup effect estimates of all approaches, each scenario was repeated 1000 times, and we analysed the data as described in sections 2.2.1–2.2.5. See also Fig. 1, which summarises our simulation procedure.
The power of a statistical test is the probability that the test correctly rejects the null hypothesis (H_{0}). As we are comparing the approaches in a simulation setting, we know the direction of the subgroup effect, if any. Therefore, we conducted a onesided test with a significance level of 0.025 in the scenarios with the medium or large subgroup effects to assess the power of the approaches. As we applied Firth’s bias correction, all approaches converged, and we defined power as the percentage of significant results, based on all simulations.
The FPR is the probability of finding a statically significant subgroup effect where there is none. Therefore, we conducted a twosided hypothesis test with a nominal significance level of 0.05 in all scenarios without subgroup effect and calculated the percentage of statistically significant results.
The estimand we are investigating is the treatmenteffect modification term. In our datageneration mechanism that would be the equivalent to the interaction term b_{w} see formula [1]. Each of our aforementioned approaches estimates this treatment effect modification term in a different manner. Per approach, we calculated the bias in the estimate of the resulting coefficient, a difference between the estimand and the coefficient estimate (\( {\mathbf{b}}_{\boldsymbol{w}}\hat{{\boldsymbol{\beta}}_{\boldsymbol{method}}} \)).
Results
For illustrative purposes we show the power and FPR results of our simulations for the scenarios of five trials with each 100 participants in Figs. 2 and 3. Furthermore, we show the bias for each method into Table 2, Table 3, Table 4, Table 5, and Table 6. The above setting was considered most representative for typical IPDMA. The results of other scenarios are shown in the appendices (Additional files 1, 2, 3, 4, 5, 6, 7, 8 and 9). Results were similar to Figs. 2 and 3, but with an increasing trend in power as subgroup effect, number of participants or number of trials increased.
False positive rates
Figure 2 shows that centred and naive IPDMA result in consistent type I error rates (around nominal 5%) for all types of outcome. However, for MAIT we noticed that FPRs were low (around 2.5 to 3.5%) when modelling binary outcomes. PSMA yielded high FPRs in scenarios without heterogeneity, reaching approximately 9% for survival outcomes. But when heterogeneity increased, PSMA’s FPRs decreased even below 5%, reaching 1% for continuous outcomes. Finally, metaregression showed type I error rates approximately at 5% for all types of outcome.
For scenarios with ecological bias, centred and naïve IPDMA showed slightly reduced FPRs for the binary outcomes in scenarios without heterogeneity. In the other scenarios they performed as described. MAIT results were not affected by the addition of ecological bias. PSMA showed mixed results. For binary outcomes the addition of ecological bias resulted in increased FPRs, while for the continuous outcomes the FPRs decreased, and for timetoevent outcomes FPRs remained unaffected. Furthermore, similar to settings without ecological bias, PSMA showed a decreasing trend in the FPRs estimates when heterogeneity increased. Finally, metaregression showed increased FPRs especially for continuous outcomes, reaching 18% in the scenarios favouring placebo (qualitative ecological bias).
Power
In general, naïve and centred IPDMA showed approximately similar results. In scenarios without ecological bias naïve IPDMA showed slightly more power than centred IPDMA. In scenarios with ecological bias centred IPDMA showed more power than naïve, which increased even more in the qualitative compared to the quantitative ecological bias scenarios. Furthermore, the power of centred IPDMA was highest in scenarios with ecological bias. Compared to IPDMA methods MAIT showed decreased levels of statistically significant results. In scenarios with a limited number of trials (n = 5) and small sample sizes (n = 50 participants), we observed the strongest difference. A similar increase in the power of MAIT as seen for the centred IPDMA was observed with the addition of ecological bias. With PSMA, power decreased as heterogeneity increased. For continuous outcomes we observed the strongest decrease: the power dropped from 80% in scenarios without heterogeneity to 25% in highly heterogeneous scenarios. PSMA showed decreased power in scenarios with ecological bias, compared to those without. Finally, metaregression demonstrated low power in all scenarios and often misidentified the direction of the interaction effect (i.e. statistically significant results for negative rather than positive interaction coefficients), especially in scenarios with qualitative ecological bias.
Binary outcomes
Figure 3a shows the results of our simulations for binary outcomes. In scenarios with 5 trials and 100 participants, all five statistical approaches showed less than 70% power to detect large subgroup effects. In scenarios without ecological bias centred and naive IPDMA approaches had similar power results: around 60% power across all three heterogeneity levels. When ecological bias was added, centred IPDMA showed an increase from 60 to 70% in power. The increase in power was less for naïve IPDMA, from 60 to 65%. MAIT showed lower power than the centred and naïve IPDMA (around 50%), and remained unaffected by heterogeneity. The presence of ecological bias increased the power of MAIT to 60%. PSMA showed a notable decrease in power as heterogeneity increased. In scenarios without ecological bias the power dropped from 60% (no heterogeneity) to 40% (large heterogeneity). The addition of qualitative bias had a stronger effect on the power of PSMA than quantitative bias. Specifically, in scenarios without heterogeneity, power dropped from 60 to 35% when qualitative bias was added, while power remained approximately the same with quantitative bias. Similar patterns are observed with heterogeneity. Metaregression resulted consistently in power below 10%, except for scenarios with qualitative ecological bias, where it showed increased percentages of statistically significant results (reaching 30%), but these were mainly related to effects in the opposite direction.
Continuous outcomes
Figure 3b shows the results of our simulations for continuous outcomes. All five statistical approaches showed less than 80% power to detect large subgroup effects. In particular, without ecological bias centred and naïve IPDMA approaches had similar consistent power results (approximately 80%) across all three heterogeneity levels, which slightly decreased in the scenarios with ecological bias. MAIT showed lower power than centred and naïve IPDMA approaches, i.e. approximately 75% and was unaffected by heterogeneity and ecological bias. PSMA showed a decrease in power as heterogeneity increased: in scenarios without ecological bias the power dropped from nearly 80% (no heterogeneity) to around 30% (large heterogeneity). Furthermore, PSMA showed decreased power in scenarios without heterogeneity but with quantitative and qualitative ecological bias, 70 and 45% respectively, which further dropped to 25 and 20% in scenarios with large heterogeneity. Metaregression in scenarios with no and quantitative ecological bias resulted in power less than 10% for all heterogeneity levels. In contrast, power increased in scenarios with qualitative ecological bias, but in the opposite direction of the subgroup effect.
Timetoevent outcomes
Figure 3c shows the results of our simulations for timetoevent outcomes. All five statistical approaches showed less than 70% power to detect large subgroup effects. In particular, in scenarios without ecological bias centred and naïve IPDMA approaches had similar consistent power (approximately 65%) across all three heterogeneity levels, with a slight increase in power in the scenarios with ecological bias. MAIT showed slightly lower power than centred and naïve IPDMA approaches, i.e. approximately 55% and this was unaffected by heterogeneity. Nevertheless, similar to the centred and naïve IPDMA methods, MAIT showed an increase in power in presence of ecological bias from 60 to 70%. PSMA showed a decrease in power as heterogeneity increased: in scenarios without ecological bias the power dropped from nearly 65% (no heterogeneity) to around 45% (large heterogeneity). Furthermore, PSMA showed decreased power in scenarios with qualitative compared to quantitative ecological bias, 35 and 60% respectively, and dropped to 30 and 45% when large heterogeneity was introduced. Metaregression resulted in power less than 10% for all heterogeneity levels in scenarios with no or quantitative ecological bias. In contrast, metaregression showed more power (approximately 15%) in presence of qualitative bias but in the opposite direction of the subgroup effect.
Bias in the estimates of the treatmenteffect modification
Detailed results of the bias are presented in Table 2, Table 3, Table 4, Table 5, and Table 6. The estimate bias of centred and naïve IPDMA was nearly zero and remained unaffected by addition of ecological bias and heterogeneity in all scenarios for all outcomes. The results were approximately zero, with a maximum bias of 0.017 and minimum of − 0.005 for centred and 0.1 and − 0.07 for the naïve IPDMA. MAIT showed slightly increased bias for binary outcomes especially in scenarios with low numbers of participants (50) and trials [5] and a high level of heterogeneity. PSMA amalgamated heterogeneity with ecological bias and showed bias in all scenarios and outcomes, especially on binary and survival outcomes. Finally, metaregression estimated in all scenarios the b_{A} term, see formula 1, rather than the estimand b_{w} we were interested in. Therefore, metaregression showed extreme bias in all scenarios with ecological bias, both quantitative and qualitative.
Discussion
We compared five common statistical approaches for subgroup detection in IPDMA in an extensive simulation study. Our results showed that overall the centred IPDMA described by Hua et al. [6] performed best in terms of power, false positive rates and estimate bias, particularly in scenarios with heterogeneity and ecological bias. Both (naive and centred) one stage IPDMA approaches reached high levels of power, with minimal bias, while retaining nominal and stable false positive rates around 5%. The MAIT approach was less powerful, particularly in scenarios with binary outcomes, small sample sizes (n = 50 participants), and few trials (5 studies). PSMA showed inconsistent power, decreasing as heterogeneity and ecological bias increased, and high FPRs in scenarios without heterogeneity. Furthermore, PSMA showed high levels of estimate bias. Metaregression showed a lack of power and often misidentified the direction of the interaction effect (i.e. negative rather than positive interaction coefficients). Although our findings were based on binary, continuous and survival endpoints, it is highly likely that they will be applicable to other type of outcomes, such as counts (using different link functions such as poisson link function) and rates (using logit or arcsin transformations of the outcome or fitting a betaregression model, which maximises the beta distribution likelihood). This supports the generalizability of our recommendations.
Comparison with literature
Our results are in agreement with the simulation study of Lambert et al. [10], who compared in a setting without betweenstudy heterogeneity metaregression and naive IPD metaanalysis and showed that metaregression is prone to a lack of power even when there is no heterogeneity in treatment effect. Our results confirm that metaregression and PSMA have limited usefulness to investigate subgroup effects when individual participant data are available and that centred and naïve IPDMA approaches or MAIT should be preferred [33]. Our findings are also in line with a previous overview study of Fisher et al., where MAIT, PSMA, naive and centred IPDMA were described and fitted in empirical studies [8]. Furthermore, our results are also in agreement with empirical studies for continuous and binary outcomes [12, 14], which pointed out the need of individual participant data and stressed that different results may emerge with different approaches. In agreement with literature we have detected a lack of power in the MAIT compared to centred and naïve IPDMA approaches in small sample sized binary outcomes scenarios [2]. Finally, we simulated data with ecological bias using a coefficient for the across studies differences in the treatment effect (b_{A}) see formula 1. This implies a linear association between the interaction of percentage of smokers and treatment with the outcome. Our results show that centred and naïve IPDMA have minor differences in power and bias, which seems contradictory to the results of Hua et al. [6]. Nevertheless, that is due to different data generating mechanisms. We generated ecological bias proportional to the percentage of smokers, while Hua generated constant ecological bias for all studies with percentage of smokers above 50%.
Strengths and limitations
The major strength of our simulation study is the variety of scenarios and approaches we investigated. In particular, we varied multiple elements relevant for metaanalysis, such as betweenstudy treatment effect heterogeneity, quantitative and qualitative ecological bias, size and number of trials, and subgroup effect magnitude. To our knowledge, this is the first study comparing power, FPRs and estimate bias of centred and naive IPDMA, MAIT, PSMA and metaregression when detecting subgroup effects. The fact that we applied all approaches and scenarios on three common outcomes allows us to draw conclusions that are widely applicable. The broad setup of the simulations allowed us also to generate similar heterogeneity, ecological bias and subgroup effects across all three types of outcome.
Some potential limitations should also be discussed. First, we only generated betweenstudy heterogeneity for the treatment effect even though we could have generated heterogeneity on other parameters in the model. Nevertheless, intercept and smoking can be considered nuisance terms when the main interest is in investigating treatment effects, thus generating heterogeneity on the intercept and/or the smoker’s effect would not have altered our results and our conclusions. We avoided the introduction of heterogeneity in the interaction term, as that would have combined the statistically significant results resulting from our scenarios with those resulting from random noise. As a consequence, the differences in power and FPR across the methods would have been underestimated. In addition, we assumed that betweenstudy heterogeneity was drawn from a normal distribution, such that modelling assumptions of the various methods were (partially) in agreement with the data generation mechanism. This may have led to optimistic estimates of the power of all evaluated methods. In these regards, researchers should be aware that all methods are likely to perform poorer in practice (further underlining the poor performance of metaregression and persubgroup analysis). Second, we assumed that trials were perfectly balanced with respect to treatment allocation. Because small trials are more likely to be imbalanced by chance, their power to detect subgroup effects and treatmentcovariate interactions would be somewhat lower than is the case in our simulation study. Finally, we simulated a binary effect modifier reflecting either binary variables such as smoking or the common practice where dichotomizing continuous covariates is typically conducted. Nevertheless, effect modification is also commonly present in continuous and multilevel categorical variables, which are not covered in our simulations.
Implications for practice
Centred and naïve IPDMA, MAIT and PSMA are considered well developed and powerful statistical tools for IPD analysis. Nevertheless, we would like to highlight that these methods should be used with caution and applied by a research team involving at least one member with the appropriate statistical expertise. Centred and naïve methods need careful modelling, as some assumptions they inherently make may not be applicable to all data. Furthermore, convergence issues may emerge especially in mixedeffects models with timetoevent outcomes. MAIT may be problematic when investigating subgroup effects across multiple trials because of lower power than the centred and naïve IPDMA approaches. Furthermore, since in practice single trials are rarely sufficiently powered to investigate subgroups or treatmentcovariate interactions, the MAIT approach may also be hampered. For instance, practitioners may omit trials due to single monotone likelihood issues (i.e. all participants in a subgroup having the same outcome), thus reducing the power of the metaanalysis. Firth’s bias correction may fix the sparsedata bias issue but should be used with prudence as this method introduces bias in the estimates shrinking them to zero. The suboptimal performance of the PSMA approach implies that this method should be avoided especially when between study heterogeneity and ecological bias are present, as these effects may be amalgamated within the per subgroup estimates, which may cause biased results and reduction of power.
Metaregression is a statistical approach for aggregated data and should be avoided in IPD. It is known to have low power due to inefficient use of information especially when the subgroup covariate has a small variance across trials. Furthermore, metaregression is severely prone to ecological bias.
MAIT is less complicated than onestage methods and in metaanalyses with large trials it produces similar results. Furthermore, results can be presented in terms of forest plots, in contrast to onestage methods, where this is less obvious. Nevertheless, assuming correctly fitted models for all methods, we advocate that centred IPDMA should be preferred. Our study demonstrates that the centred IPDMA method yields high power and appropriate FPR, with minimal estimate bias. Hence, we recommend this approach for IPDMA aimed at investigating subgroup effects.
Conclusions
Our results confirm the benefit of appropriately specified onestage metaanalysis methods to identify subgroup effects using individual participant data from multiple trials. Naive and centred IPDMAs performed approximately equally well in terms of power and FPR, but since centred IPDMA showed less estimate bias in the presence of ecological bias, we recommend the use of centred IPDMA [5].
Availability of data and materials
The simulated datasets used and analysis described in the current study are available from the corresponding author on reasonable request.
Abbreviations
 CPDMA:

Individual participant data metaanalysis
 IPD:

Individual participant data
 MAPS:

Persubgroup metaanalysis
 MAIT:

Metaanalysis of interaction terms
 GLM:

Generalized linear model
 Cox:

Cox proportional hazards
 MLE:

Maximum likelihood estimation
 EB:

Empirical Ayes
 HKSJ:

Shantung
 FPR:

False positive rate
References
 1.
Chalmers I. The Cochrane collaboration: preparing, maintaining, and disseminating systematic reviews of the effects of health care. Ann N Y Acad Sci. 1993;703:156–63 discussion 635.
 2.
Debray TP, Moons KG, van Valkenhoef G, Efthimiou O, Hummel N, Groenwold RH, et al. Get real in individual participant data (IPD) metaanalysis: a review of the methodology. Res Synth Methods. 2015;6(4):293–309.
 3.
Tierney JF, Vale C, Riley R, Smith CT, Stewart L, Clarke M, et al. Individual participant data (IPD) Metaanalyses of randomised controlled trials: guidance on their use. PLoS Med. 2015;12(7):e1001855.
 4.
Stewart LA, Tierney JF. To IPD or not to IPD? Advantages and disadvantages of systematic reviews using individual patient data. Eval Health Prof. 2002;25(1):76–97.
 5.
Simmonds M, Stewart G, Stewart L. A decade of individual participant data metaanalyses: a review of current practice. Contemp Clin Trials. 2015;45(Pt A:76–83.
 6.
Hua H, Burke DL, Crowther MJ, Ensor J, Tudur Smith C, Riley RD. Onestage individual participant data metaanalysis models: estimation of treatmentcovariate interactions must avoid ecological bias by separating out withintrial and acrosstrial information. Stat Med. 2017;36(5):772–89.
 7.
Burke DL, Ensor J, Riley RD. Metaanalysis using individual participant data: onestage and twostage approaches, and why they may differ. Stat Med. 2017;36(5):855–75.
 8.
Fisher DJ, Copas AJ, Tierney JF, Parmar MK. A critical review of methods for the assessment of patientlevel interactions in individual participant data metaanalysis of randomized trials, and guidance for practitioners. J Clin Epidemiol. 2011;64(9):949–67.
 9.
Berlin JA, Santanna J, Schmid CH, Szczech LA, Feldman HI. Antilymphocyte antibody induction therapy study G. individual patient versus grouplevel data metaregressions for the investigation of treatment effect modifiers: ecological bias rears its ugly head. Stat Med. 2002;21(3):371–87.
 10.
Lambert PC, Sutton AJ, Abrams KR, Jones DR. A comparison of summary patientlevel covariates in metaregression with individual patient data metaanalysis. J Clin Epidemiol. 2002;55(1):86–94.
 11.
Schmid CH, Stark PC, Berlin JA, Landais P, Lau J. Metaregression detected associations between heterogeneous treatment effects and studylevel, but not patientlevel, factors. J Clin Epidemiol. 2004;57(7):683–97.
 12.
Fisher DJ, Carpenter JR, Morris TP, Freeman SC, Tierney JF. Metaanalytical methods to identify who benefits most from treatments: daft, deluded, or deft approach? BMJ. 2017;356:j573.
 13.
Simmonds MC, Higgins JPT. Covariate heterogeneity in metaanalysis: criteria for deciding between metaregression and individual patient data. Stat Med. 2007;26(15):2982–99.
 14.
Koopman L, van der Heijden GJ, Hoes AW, Grobbee DE, Rovers MM. Empirical comparison of subgroup effects in conventional and individual patient data metaanalyses. Int J Technol Assess Health Care. 2008;24(3):358–61.
 15.
Simmonds M. Metaanalysis of individual patient data from randomized trials a review of methods used in practice. Clin Trials. 2005;2(3):209–17.
 16.
AboZaid G, Guo B, Deeks JJ, Debray TP, Steyerberg EW, Moons KG, et al. Individual participant data metaanalyses should not ignore clustering. J Clin Epidemiol. 2013;66(8):865–73 e4.
 17.
Riley RD, Lambert PC, Staessen JA, Wang J, Gueyffier F, Thijs L, et al. Metaanalysis of continuous outcomes combining individual patient data and aggregate data. Stat Med. 2008;27(11):1870–93.
 18.
Kontopantelis E. A comparison of onestage vs twostage individual patient data metaanalysis methods: a simulation study. Res Synth Methods. 2018;9(3):417–30.
 19.
IntHout J, Ioannidis JP, Borm GF, Goeman JJ. Small studies are more heterogeneous than large ones: a metametaanalysis. J Clin Epidemiol. 2015;68(8):860–9.
 20.
Team RC. R: a language and environment for statistical computing. 2016.
 21.
Bates D, Machler M, Bolker BM, Walker SC. Fitting linear mixedeffects models using lme4. J Stat Softw. 2015;67(1):1–48.
 22.
Therneau TM. Coxme: mixed effects cox models; 2018.
 23.
Viechtbauer W. Conducting Metaanalyses in R with the metafor package. J Stat Softw. 2010;36(3):1–48.
 24.
Heinze GMP. Logistf: Firth's Biasreduced logistic regression; 2016.
 25.
Heinze GMP. Coxphf: cox regression with Firth's penalized likelihood; 2016.
 26.
Maiti T, Pradhan V. A comparative study of the bias corrected estimates in logistic regression. Stat Methods Med Res. 2008;17(6):621–34.
 27.
Viechtbauer W, LopezLopez JA, SanchezMeca J, MarinMartinez F. A comparison of procedures to test for moderators in mixedeffects metaregression models. Psychol Methods. 2015;20(3):360–74.
 28.
Stijnen Th, Van Houwelingen JC. “Empirical Bayes Methods in Clinical Trials MetaAnalysis.” Biom J. 2007;32(3):335–46.
 29.
Hartung J, Knapp G. A refined method for the metaanalysis of controlled clinical trials with binary outcome. Stat Med. 2001;20(24):3875–89.
 30.
IntHout J, Ioannidis JP, Borm GF. The HartungKnappSidikJonkman method for random effects metaanalysis is straightforward and considerably outperforms the standard DerSimonianLaird method. BMC Med Res Methodol. 2014;14:25.
 31.
Heinze G, Schemper M. A solution to the problem of separation in logistic regression. Stat Med. 2002;21(16):2409–19.
 32.
Heinze G, Schemper M. A solution to the problem of monotone likelihood in cox regression. Biometrics. 2001;57(1):114–9.
 33.
Huang Y, Tang J, Tam WW, Mao C, Yuan J, Di M, et al. Comparing the overall result and interaction in aggregate data Metaanalysis and individual patient data Metaanalysis. Medicine (Baltimore). 2016;95(14):e3312.
Acknowledgements
Not applicable
Funding
This work was supported by a TOP grant of the Netherlands Organisation for Health Research and Development (grant number: 91215058). Thomas De bray also gratefully acknowledges the Netherlands Organization for Health Research and Development (grant number 91617050). All funding bodies had no role in the design of the study. All funding bodies had no role in the collection, analysis, and interpretation of data and in writing the manuscript.
Author information
Affiliations
Contributions
JI, MR, TD, JBR and MB have developed this simulation study. All authors contributed in the design of the study. MB generated and analysed the simulated data and produced the figures for the illustration of the results. MB worked with MR and JI to develop the first draft of the manuscript. All authors critically revised the manuscript and the final content. All authors approved the final manuscript and all agreed to be accountable for all aspects of the work.
Corresponding author
Correspondence to Michail Belias.
Ethics declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Competing interests
The authors declare that they have no competing interests associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional files
Additional file 1:
Binary outcome (FPR and bias) (XLSX 14 kb)
Additional file 2:
Continuous outcome (FPR and bias) (XLSX 14 kb)
Additional file 3:
Large subgroup effect Binary outcome (power and bias) (XLSX 14 kb)
Additional file 4:
Large subgroup effect Continuous outcome (power and bias) (XLSX 14 kb)
Additional file 5:
Large subgroup effect Survival outcome (power and bias) (XLSX 14 kb)
Additional file 6:
Medium subgroup effect Binary outcome (power and bias) (XLSX 14 kb)
Additional file 7:
Medium subgroup effect Continuous outcome (power and bias) (XLSX 14 kb)
Additional file 8:
Medium subgroup effect Survival outcome (power and bias) (XLSX 14 kb)
Additional file 9:
Survival outcome (FPR and bias) (XLSX 14 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Received
Accepted
Published
DOI
Keywords
 Subgroups
 Simulation
 Individual participant data
 Effect modification
 Metaanalysis
 Statistical approaches
 Comparison
 Ecological bias
 Heterogeneity