Skip to main content


Springer Nature is making SARS-CoV-2 and COVID-19 research free. View research | View latest news | Sign up for updates

A comparison of analytic approaches for individual patient data meta-analyses with binary outcomes



Individual patient data meta-analyses (IPD-MA) are often performed using a one-stage approach-- a form of generalized linear mixed model (GLMM) for binary outcomes. We compare (i) one-stage to two-stage approaches (ii) the performance of two estimation procedures (Penalized Quasi-likelihood-PQL and Adaptive Gaussian Hermite Quadrature-AGHQ) for GLMMs with binary outcomes within the one-stage approach and (iii) using stratified study-effect or random study-effects.


We compare the different approaches via a simulation study, in terms of bias, mean-squared error (MSE), coverage and numerical convergence, of the pooled treatment effect (β 1) and between-study heterogeneity of the treatment effect (τ 1 2 ). We varied the prevalence of the outcome, sample size, number of studies and variances and correlation of the random effects.


The two-stage and one-stage methods produced approximately unbiased β 1 estimates. PQL performed better than AGHQ for estimating τ 1 2 with respect to MSE, but performed comparably with AGHQ in estimating the bias of β 1 and of τ 1 2 . The random study-effects model outperformed the stratified study-effects model in small size MA.


The one-stage approach is recommended over the two-stage method for small size MA. There was no meaningful difference between the PQL and AGHQ procedures. Though the random-intercept and stratified-intercept approaches can suffer from their underlining assumptions, fitting GLMM with a random-intercept are less prone to misfit and has good convergence rate.


Individual Patient Data (IPD) meta-analyses (MA) are regarded as the gold standard in evidence synthesis and are increasingly being used in current practice [1, 2]. However, the implementation of the analysis of IPD-MA requires additional expertise and choices [3], particularly when the outcome is binary. These include (i) should a one- or two-stage model be used [4, 5], (ii) what estimation procedure should be used to estimate the one-stage model [6, 7] and, (iii) should the study effect be fixed or random [8].

Although IPD-MA were conventionally analyzed via a two-stage approach [9], over the last decade, use of the one-stage approach has increased [10]. Recently, some have suggested that the two-stage and one-stage frameworks produce similar results for MA of large randomized controlled trials [5]. The literature suggests the one-stage method is particularly preferable when few studies or few events are available as it uses a more exact statistical approach than relying on a normality approximation [35].

When IPD are available and the outcome is binary, the one-stage approach consists of estimating Generalized Linear Mixed Models (GLMMs) with a random slope for the exposure, to allow the exposure effect to vary across studies. Penalized quasi-likelihood (PQL) introduced by Breslow and Clayton is a popular method for estimating the parameters in GLMMs [11]. However, regression parameters can be badly biased for some GLMMs, especially with binary outcomes with few observations per cluster, low outcome rates, or high between cluster variability [12, 13]. Adaptive Gaussian Hermite quadrature (AGHQ) is the current favored competitor to PQL, which approximates the maximum likelihood by numerical integration [14]. Although estimation becomes more precise as the number of quadrature points increases, it often gives rise to computational difficulties for high-dimension random effects and convergence problems where variances are close to zero or cluster sizes are small [14].

The heterogeneity between studies is an important aspect to consider when carrying out IPD-MA. Such heterogeneity may arise due to differences in study design, treatment protocols or patient populations [8]. When such heterogeneity is present, the convention is to include a random slope in the model as it captures the variability of the exposure across studies. However, there are corresponding assumptions in regards to the study effect being modelled as stratified or random [4, 15].

Few comparisons of GLMMs have been reported in the context of IPD-MA with binary outcomes [4, 15], that is, when the number of studies and the number of subjects within each study is small, study sizes are imbalanced, in the presence of large between-study heterogeneity and small exposure effects and there is an interest in the variance parameter of the random treatment effect. According to previous literature, these factors have all been identified as influencing model performance [6]. While several simulation studies have been published, these have mainly limited their attention to simple models with only random intercepts [13, 16]. Thus, the performance of the random effects models including both a random intercept and a random slope are less well known.

Our objective was to assess and compare via simulation studies, (i) one-stage approaches to conventional two-stage approaches (ii) the performance of different estimation procedures for GLMMs with binary outcomes, and (iii) using stratified study-effect or random study-effects in a randomized trial setting. We use our results to develop guidelines on the choice of methods for analyzing data from IPD-MA with binary outcomes and to understand explicitly the trade-offs between computational and statistical complexity.

Methods section introduces the models we are considering, the design of the simulation study and the assessment criteria. In Results section, results for the different methods under varying conditions are presented and discussed. Discussion section concludes with a discussion.


We conducted a simulation study to compare various analytic approaches to analyze data from IPD-MA with binary outcomes. Hereto, our methods all assume that between-study heterogeneity exists, as it is likely in practice, and so only random treatment-effects IPD meta-analysis models are considered.

Data Generation

The data generation algorithm was developed to generate two-level data sets (e.g. patients grouped into studies). We generated a binary outcome (Y ij ) and a single binary exposure (X ij ). We denote the number of studies j = 1, 2 …, K and i = 1, 2 …, n j denotes the individuals per study. Therefore, Y ij is the outcome observed for the i th individual from the j th study.

The dichotomous exposure variable, X ij , was generated from a Bernoulli distribution with probability = 0.5 and recoded \( \pm {\scriptscriptstyle \raisebox{1ex}{$1$}\!\left/ \!\raisebox{-1ex}{$2$}\right.} \) to indicate control/treatment group [15]. To generate the binary outcome variable Y ij , first the probability of the outcome was calculated from the random-study and –treatment effects logistic regression model (Eq. 1), or the stratified-study effects model (Eq. 2):

$$ logit\left({\pi}_{ij}\right)=\left({\beta}_0+{b}_{0 j}\right)+\left({\beta}_1+{b}_{1 j}\right){x}_{ij} $$
$$ logit\left({\pi}_{ij}\right)={\beta}_j+\left({\beta}_1+{b}_{1 j}\right){x}_{ij} $$

Here π ij is the true probability of the outcome for the i th individual from the j th study, β 0denotes the mean log-odds of the outcome (study-effect) and β 1 the pooled treatment effect (log odds ratio). The random effects (b 0j and b 1j ) were generated from a bivariate normal distribution with mean = 0 and variance-covariance matrix \( \Sigma =\left(\begin{array}{cc}\hfill {\sigma}^2\hfill & \hfill \rho \sigma \tau \hfill \\ {}\hfill \rho \sigma \tau \hfill & \hfill {\tau}^2\hfill \end{array}\right) \) for the random study-effect case. In the stratified study effects case, (i.e. Eq. (2)), β j , were generated from a uniform distribution and b 1j was generated from a normal distribution with zero mean and variance, τ 2.

A Bernoulli distribution with probability π ij from Eqs. (1) and (2) was used to generate the binary outcome Y ij .

The number of studies, study size, total sample size, variances and correlation of the random effects, and average conditional probability were all varied, with levels described in Table 1. For each distinct combination (n = 480) of simulation parameters, 1000 IPD-MA were generated from each Eqs. (1) and (2), allowing us to investigate a wide range of scenarios. The heterogeneity was set at I2 = 0.01, 0.23 and 0.55 as defined by τ 2/(τ 2 + π 2/3) for a binary outcome using an odds ratio [17]. The levels correspond to: little or no, low and moderate heterogeneity respectively [18].

Table 1 Summary of Simulation Parametersa

A sensitivity analysis was also considered to explore the performance of different methods when just 5% of observation had a positive outcome.


Two-stage IPD methods

In the two-stage approach, each study in the IPD was analyzed separately via logistic regression

$$ {y}_i\sim Bernoulli\left({p}_i\right) $$
$$ logit\left({p}_{i j}\right)={\gamma}_0+{\gamma}_1{x}_i $$

The first step estimated the study-specific intercept and slope and their associated within-study covariance matrix (consisting of the variances of the intercept and slope, as well as the covariance) for each study. This model reduces the IPD to its relative treatment effect estimate and variance for each study then at the second stage these aggregate data (AD) are synthesized (described below).

Model 1- Bivariate meta-analysis

The AD were combined via a bivariate random-effects model that simultaneously synthesized the estimates whilst accounting for their correlation, and the within-study correlation [4]. The model assumes that the true effects follow a bivariate normal distribution and is estimated via restricted maximum likelihood with the following marginal distributions of the estimates [19]:

$$ \left[\begin{array}{c}\hfill \widehat{\gamma_{0 J}}\hfill \\ {}\hfill \widehat{\gamma_{1 J}}\hfill \end{array}\right]\sim N\left(\left(\begin{array}{c}\hfill {\gamma}_0\hfill \\ {}\hfill {\gamma}_1\hfill \end{array}\right),\varSigma +{C}_j\right),\varSigma =\left(\begin{array}{c}\hfill {\tau}_0^2\hfill \\ {}\hfill {\tau}_{01}^2\hfill \end{array}\begin{array}{c}\hfill {\tau}_{01}^2\hfill \\ {}\hfill {\tau}_1^2\hfill \end{array}\right) $$

where Σ is the unknown between-study variance-covariance matrix of the true effects (γ 0 and γ 1) and C j  (j = 1, …, K) the with-in study variance-covariance matrix with the variances of the estimates.

Model 2: Conventional DerSimonian and Laird approach

The with-in study and between-study covariance estimates are often times not estimated since most researchers assumed that studies are independent, and instead a univariate meta-analysis of the logit of the odds ratios is performed [20]. The marginal distribution of the pooled estimated treatment effect under this approach is easily obtained as:

$$ \widehat{\gamma_{1 J}}\sim N\left({\gamma}_1,{\tau}_1^2+ var\left(\widehat{\gamma_J}\right)\right) $$

with unknown parameters γ 1 and τ 1 2 , estimated via the inverse variance weighted non-iterative method (method-of-moments) [21].

One-stage IPD methods

The one-stage approach analyzes the IPD from all studies simultaneously, while accounting for clustering of subjects within studies [4]. The one-stage model is a form of GLMM. Two different specifications are considered.

Model 3- Random intercept and random slope

We estimated a GLMM with a random study effect u 0j and a random treatment effect u 1j via PQL and AGHQ, and allowed the random effects to be correlated, which implies that the between-study covariance between u 0j and u 1j is fully estimated.

$$ \begin{array}{l} logit\left({p}_{ij}\right) = {\gamma}_0+{u}_{0 j}+\left({\gamma}_1+{u}_{1 j}\right){x}_{ij}\\ {}\left[\begin{array}{c}\hfill {u}_{0 j}\hfill \\ {}\hfill {u}_{1 j}\hfill \end{array}\right]\sim N\left(\left(\begin{array}{c}\hfill 0\hfill \\ {}\hfill 0\hfill \end{array}\right),{\Sigma}_j\right),\ {\Sigma}_j=\left(\begin{array}{cc}\hfill {\tau}_0^2\hfill & \hfill {\tau}_{01}^2\hfill \\ {}\hfill {\tau}_{01}^2\hfill & \hfill {\tau}_1^2\hfill \end{array}\right)\end{array} $$

Model 4-Stratified intercept one-stage

Finally, the stratified one-stage approach estimates a separate intercept for each study rather than constraining the intercepts to follow a normal or other distribution. Therefore, there is no need for the normality assumption for the study membership, hence, the between-study covariance term is no longer estimated. The model is defined as follows:

$$ logit\left({p}_{ij}\right) = {\displaystyle \sum_{k=1}^K}\left({\gamma}_k{I}_{k= j}\right)+\left({\gamma}_1+{u}_{1 j}\right){x}_{ij} $$

where I k = j indicates that a separate intercept should be estimated for each study j = 1, …, K and u 1j  ~ N(0, τ 1 2 ). Parameters of both Models 3 and 4 were estimated via PQL and AGHQ.

Estimation Procedures and Approximations

The parameters of the one-stage models were estimated using PQL and AGHQ. For the two-stage approach, a logistic regression was first estimated for each study via maximum likelihood. The parameters of the two-stage model were estimated via method-of-moments (MOM) (Model 2) and restricted maximum likelihood (REML) (Model 1) [2123] at the second stage.

Both likelihood-based methods (PQL and AGHQ) were implemented on SAS version 9.4 using PROC GLIMMIX with default options [24]. The number of quadrature points in AGHQ was selected automatically [25], the absolute value for parameter convergence criterion was 10–8 and the maximum number of iterations was 100.

Therefore, for each generated data set the following models were fit.

  • Two-stage approach (Models 1 and 2)

  • One-stage approach via GLMMs (Models 3 and 4) estimated with PQL.

  • One-stage approach via GLMMs (Models 3 and 4) estimated with AGHQ.

Assessment criteria

The performance of the estimation methods was evaluated using: a) numerical convergence, b) absolute bias; c) root mean square error (RMSE); and d) coverage probability - of the pooled treatment effect and its between-study variability.

Numerical convergence

The convergence rate was estimated for all models fit, as the number of simulation repetitions that did converge (without returning a warning message) divided by the total attempted (M = 1000). Models that returned a warning message specifying that the estimated variance-covariance matrix was not positive definite or that the optimality condition was violated were considered not to have converged.


The Monte Carlo bias of the pooled treatment effect and its between-study heterogeneity is defined as the average of the bias in the estimates provided by each method as compared to the truth, across the 1000 IPD-MA in each scenario. The Monte Carlo estimate of the bias is computed as

$$ bias=\frac{1}{1000}{\displaystyle \sum_{j=1}^{1000}}\ {\widehat{\theta}}_j-\theta, $$

where \( {\hat{\theta}}_j \) were the parameter estimates and θ was the true parameter of the pooled treatment effect or its between-study variance. We also reported the mean absolute bias (AB).

Mean square error

The mean square error (MSE) is a useful measure of the overall accuracy, because it penalizes an estimate for both bias and inefficiency. The Monte Carlo estimate of the MSE is:

$$ M S E\left(\widehat{\theta}\right)=\frac{1}{1000}{\displaystyle \sum_{j=1}^{1000}}{\widehat{\Big(\theta}}_j-\theta \Big){}^2, $$

For each scenario, the RMSE of the pooled treatment effect and its between-study heterogeneity was reported, as this measure is on the same scale as the parameter.

Coverage probability

We estimated coverage for the pooled treatment effect and its between-study heterogeneity for the various methods. Gaussian coverage was estimated, where if \( \left|\hat{\theta}-\theta \right|\le 1.96\times S E\left(\hat{\theta}\right) \) the true value was covered, and if \( \left|\hat{\theta}-\theta \right|>1.96\times S E\left(\hat{\theta}\right) \) it was not.

We reported the median, the 25th and 75th percentiles of the AB and RMSE of the pooled treatment effect and its between-study heterogeneity but reported percentages for the numerical convergence and coverage rate.


Tables 2, 3, 4, 5, 6 and 7 present the median and interquartile range of the AB, RMSE, coverage and convergence of the pooled treatment effect and its between-study variance, respectively, as estimated via two- and one-stage; AGHQ and PQL; random-intercept and stratified-intercept methods. We reported results for data generated with imbalances in study sizes (different sample size in all studies) for both the random-intercept and stratified-intercept data generation (Eqs. 1 and 2) with correlated random effects (ρ = 0.5), as this scenario is likely the closest to real-life.

Table 2 Performance of the one- and two-stage approaches in small data setsa with greater (Top panel) and lesser (Bottom panel) heterogeneity of random effectsb
Table 3 Performance of the one- and two-stage approaches in large data setsa with greater (Top panel) and lesser (Bottom panel) heterogeneity of random effectsb
Table 4 Performance of Penalized Quasi-likelihood and Adaptive Gaussian Hermite Quadrature estimation approaches in small data setsa with greater (Top panel) and lesser (Bottom panel) heterogeneity of random effectsb
Table 5 Performance of Penalized Quasi-likelihood and Adaptive Gaussian Hermite Quadrature estimation approaches in large data setsa with greater (Top panel) and lesser (Bottom panel) heterogeneity of random effectsb
Table 6 Performance of the stratified- and random-intercepta models approaches in small data setsb with greater (Top panel) and lesser (Bottom panel) heterogeneity of random effectsc
Table 7 Performance of the stratified- and random-intercepta models approaches in large data setsb with greater (Top panel) and lesser (Bottom panel) heterogeneity of random effectsc

We did not exclude results from meta-analyses that returned a warning message (imperfect convergence). These meta-analyses were included as non-convergence and although these models failed to produce proper parameter estimates, these estimates were included in the calculation of the bias and the MSE.

One- versus Two-stage

In Tables 2 and 3, results for the absolute bias (AB) of the estimates for the pooled treatment effect β 1 are given. Recalling that the true parameter value was 0.18, we see that the biases were identical and under 0.05 in the one-stage and the two-stage approaches in both small and large data sets. Results were very comparable when the outcome rate was reduced from 30 to 5% (Additional file 1: Table S1). For both the one- and the two-stage, results depended on the true τ 2, and the sample size.

For the larger sample size, root mean square error (RMSE) in β 1 was generally slightly larger when the one-stage method was used than when the two-stage was used. The picture was similar across all heterogeneity levels (Tables 2 and 3) and when the outcome rate was reduced (Additional file 2: Table S3).

Neither one-stage nor two-stage methods yielded coverage of β 1 close to nominal levels (Tables 2 and 3). Increasing sample size had a positive effect on percent coverage, and increasing the true heterogeneity made estimation more difficult, hence decreasing the coverage (Table 3).

Absolute bias of the between-study heterogeneity, τ 1 2 was usually slightly lower when the one-stage approach was used than when the two-stage approach was (Tables 2 and 3), particularly, when the sample size was small (Table 2) and when greater amount of heterogeneity exist in the random effects (Bottom panel of Table 2). Regarding the effects of the simulation parameters, AB decreased when data was generated with equal study sizes and increased when the rate of occurrence was reduced (Additional file 3: Table S2). In these cases, the one-stage approach was most biased.

The RMSE of τ 1 2 for the one-stage estimates was mostly smaller than the RMSE of the two-stage method estimates. For increased sample size or reduction in the level of heterogeneity in the random effects, RMSE of τ 1 decreased at least by a factor of three across both methods. While the RMSE of τ 1 was inflated when the outcome rate was reduced, the one-stage method continued to outperform that of the two-stage method (Additional file 4: Table S4).

Convergence was not a problem for the two-stage approach while convergence of the one-stage method varied from 90 to 100% (Tables 2 and 3).

AGHQ versus PQL

One-stage models estimated via PQL and AGHQ methods often yielded similar AB in β 1. There was no observed difference in the AB (β 1) between the methods when the outcome rate was reduced (Additional file 1: Table S1).

RMSE of β 1 were generally greater when AGHQ was used than when PQL was used (Tables 4 and 5). Decreasing sample size, increasing the variances of the random effects or reducing the event rate (Additional file 2: Table S3) made precise estimation more difficult, hence RMSE increased.

When the true heterogeneity was large and total sample was small (Top panel of Table 4), AGHQ provided coverage for β 1closer to nominal levels than PQL, while both methods provided comparable coverage when the sample size was increased (Table 5). Note that across both methods, levels of coverage were higher as heterogeneity increased and similar coverage was observed when the outcome rate was reduced (Additional file 5: Table S5).

AB in τ 1 2 , was very comparable but slightly lower when PQL was used rather than AGHQ (Tables 4 and 5). The AB decreased with increasing sample size, particularly, when PQL was used (Table 5). There was substantial bias in τ 1 2 estimates when the event rate was reduced (Additional file 3: Table S2).

On account of a better overall performance of PQL with regards to AB, RMSE of τ 1 was generally lower with PQL than with AGHQ (Tables 4 and 5). RMSE decreased with decreased variability in the random effects, and with increased sample size. In addition, PQL-estimates continued to yield smaller RMSE than AGHQ-estimates when the outcome rate was reduced (Additional file 4: Table S4).

We found important under-coverage of the estimates for τ 1 for both estimation methods, particularly when PQL was used (Tables 4 and 5). The percent coverage was usually fair for both estimation methods when sample size increased, but was poor when the outcome rate was reduced (Additional file 6: Table S6).

Convergence occurred more often when AGHQ was used than when PQL was used (Tables 4 and 5). Convergence was problematic for PQL, particularly when true heterogeneity was low and sample size was small (Bottom panel of Table 4). Comparable convergence was seen when the event rate was reduced (Additional file 5: Table S5).

Random- intercept versus stratified-intercept

The results of the simulation studies, modeling the intercept as random or fixed (random slope was always considered) via PQL estimation are summarized in the Tables 6 and 7.

The convergence was markedly low (14-97%) for the fixed intercept & random slope method (Tables 6 and 7). Convergence was only reasonable for the approach when the sample size was large and heterogeneity was small, whereas convergence was always greater than 80% for the random intercept and slope approach.

In general, AB in β 1 was similar for both stratified-intercept (random-slope only) and random intercept & slope methods. Regarding the simulation parameters, sample size and variability of the random effects, were not influential in reducing the AB in β 1.

The RMSE in β 1 was smaller when estimated via the random intercept and slope model than when only a random slope was fit (Tables 6 and 7).

Increased sample size and level of heterogeneity in the random effect was most influential in determining coverage probability.

Absolute bias in τ 1 2 was clearly comparable when fit with a random intercept & slope approach or a random slope only (Tables 6 and 7). For lower outcome rate, there was a trend towards less pronounced bias when a random slope only was fit (Additional file 3: Table S2).

We observed lower RMSE of τ 1 2 when a random intercept was fit, especially when the true heterogeneity was large (Top panels of Tables 6 and 7). Comparable results were seen when both models were fit in large sample and the true heterogeneity was small (Bottom panel of Table 7)- also when outcome rate was reduced (Additional file 4: Table S4).

We found significant under coverage of τ 1 when both models were fit, however, this was more severe when a random slope only model was fit (Tables 6 and 7). When the generated values of τ 0 2  or τ 1 2 were low (i.e. low variability in the random effects) and sample size was increased, we had less difficulty to estimate the coverage of τ 1 2 when both models were fit. The coverage probability continued to be an issue when the rate of occurrence was reduced (Additional file 6: Table S6).



Our simulation results indicate that when the number of subjects per study is large, the one- and two-stage methods yield very similar results. Our results also confirm the finding of previous empirical studies [5, 26, 27] that in some cases, the one-stage and two-stage IPD-MA results coincide. However, we found discrepancies between these methods, with a slight preference towards the one-stage method when the number of subjects per study is small. In these situations, neither method produced accurate estimates for the between-study heterogeneity associated with the treatment-effect; however, the biases were larger for the two-stage approach. Furthermore, one-stage methods produced less biased and more precise estimates of the variance parameter and had slightly higher coverage probabilities, though these differences may be due to using the REML estimate of τ 1 2 instead of the der Simonian and Laird estimator used in the two-stage approach.

Estimation of GLMMs with binary outcomes continues to pose challenges, with many methods producing biased regression coefficients and variance components [7]. AGHQ has been shown to overestimate the variance component with few clusters or few subjects [17]. On the contrary, PQL has been found to underestimate the variance component while the standard errors are overestimated [12]. In the context of IPD-MA, we found similar absolute bias of the PQL- and AGHQ-estimated pooled treatment effect, while the PQL-estimates of the between-study variance had greater precision when study sizes were small and random effects were correlated. This somewhat confirms previous results, which found that PQL suffers from large biases but performs better in terms of MSE than AGHQ [6]. Both estimation methods experienced difficulty in attaining nominal coverage of the between-study heterogeneity associated with the treatment effect in two situations: (i) when the number of studies included was small and/or (ii) the true variances of the random effects were small. We also found that convergence was not an important problem for AGHQ when meta-analyses included studies with less than 50 individuals per study. However, convergence was poor when the prevalence of the outcome was reduced to 5% and the true heterogeneity was close to zero.

Stratification of the intercept in one-stage models avoids the need to estimate the random effect for the intercept and the correlation between the random effects. This approach may be preferable in situations not investigated in this work (e.g. when the distribution of the random effects is skewed). However, this approach suffered from marked convergence rates when fit to small data sets (15 studies and on average 500 subjects).

Strengths and Limitations

We used simulation studies to compare various analytic strategies to analyze data arising from IPD-MA across a wide range of data generation scenarios but made some simplifications. We only considered binary outcomes, one dichotomous treatment variable, a two-level data structure, and no confounders. Moreover, we estimated GLMMs via PQL and AGHQ, but did not compare Bayesian or other estimation methods, which might be particularly useful in sparse scenarios [28]. We have made the assumption throughout that IPD were available. Certainly, the time and cost associated with collecting IPD are considerable. However, once such data is in hand, we have addressed several open questions relating to the best way to analyze it. We should also note that methods exist for combining IPD and aggregated data [7]. Further study is needed to investigate alternative confidence intervals (or coverage probability) for the between-study heterogeneity that can be used to remedy the under-coverage of Gaussian intervals. The normality-based intervals (coverage rate) we studied greatly underperformed in most scenarios because the constructions of the confidence interval are likely to be invalid. A further simplification that limits the generalizability of this work is that it is restricted to only two-arm trials. The extension to three or more arms would require careful consideration of more complicated correlation structures in treatment effects across arms and within studies [29].

One important comparison we have not addressed is, computational speed where the two-stage method had a distinct advantage over the one-stage; PQL was faster than AGHQ and the stratified-intercept model run-time was quicker than the random-intercept model.

As far as we know, this simulation study is the first to simultaneously generate data with normally distributed and stratified random intercepts. This study also compares approaches that include a random intercept for study membership to those that do not. Furthermore, the use of simulation - to systematically investigate the robustness of the approaches to variation in sample size, study number, outcome rate, magnitude of correlation and variances. As a result, our scenarios have allowed us to assess performance without being too exhaustive.

Guidelines for Best Practice

On the basis of these findings, we can make several recommendations. When the IPD-MA included many studies and the outcome rate was not too low, this work supports the conclusion of a previous study [5] that the conventional two-stage method by DerSimonian and Laird [21] is a good choice under the data conditions simulated here. Cornell et al. found that the DL method produced too-narrow confidence bounds and p values that were too small when the number of studies was small or there was high between-study heterogeneity [30]. In such cases, a modification such as the Hartung-Knapp approach may be preferable [31]. Further, while the bivariate two-stage approach is very rarely used in practice, we found that it tended to yield good overall model performance, comparable with that of the one-stage models when study sizes are small. In addition, our results also suggest that the one-stage method can be used in IPD-MA where study sizes are less than 50 subjects per study or few events were recorded in most studies (outcome rate of 5%). In these cases, the one-stage approach is more appropriate as it models the exact binomial distribution of the data and offers more flexibility in model specification over the two-stage approach [32].

If interest lies in estimation of the pooled treatment effect or the between-study heterogeneity of the treatment effect, estimation using PQL appeared to be a better choice due to its lower bias and mean square error for the settings considered. On the contrary, computational issues such as convergence occurred less with this technique than with AGHQ. However, it is important to note that convergence and coverage in τ 2 was an issue in small and large total sample sizes and also, when level of true heterogeneity was large.

For these simulated data, the results of both the random-intercept and stratified-intercept models were not importantly different. However, under both data generations, fitting a GLMM with the random-intercept was overall less sensitive to misspecification in small sample sizes with large between-study heterogeneity than the stratified-intercept GLMM since we have observed high rates of non-convergence via the stratified-intercept model.

There are four important caveats to these recommendations. First, our simulations show greater accuracy of the pooled odds ratio as the number of studies increase. Therefore, an IPD-MA with more studies will provide more accurate estimates. Secondly, our results show that the estimation of the between-study heterogeneity of the treatment effect is highly biased regardless of the sample size and number of studies. Therefore, we should always expect that the variance parameter be estimated with some error. Thirdly, small overall samples mark the trade-off under which a meta-analyst might consistently choose precision over bias and our simulations show that PQL estimation may be preferred in these situations. Finally, large overall sample size can eliminate the lack of statistical power present in small overall samples. In such cases, comparable results are seen for one- and two-stage methods and fitting a two-stage analysis as a first step may be advisable. This could aid as a quick and efficient investigation of heterogeneity and treatment-outcome association.


To summarize, the one- and two-stage methods consistently produced similar results when the number of studies and overall sample are large. Although the PQL and AGHQ estimation procedures produced similar bias of the pooled log odds ratios, PQL-estimates had lower RMSE than the AGHQ-estimates. Both the random-intercept and stratified-intercept models yielded precise and similar estimates for the pooled log odds ratios. However, the random-intercept models gave good coverage probabilities of the between-study heterogeneity in small sample sizes and yielded overall good convergence rate as compared to the random slope only model.



Absolute bias


Adaptive Gaussian hermite quadrature


Generalized linear mixed model


Individual patient data meta-analysis




Mean-squared error


Penalized quasi-likelihood


Restricted maximum likelihood


  1. 1.

    Riley RD, Simmonds MC, Look MP. Evidence synthesis combining individual patient data and aggregate data: a systematic review identified current practice and possible methods. J Clin Epidemiol. 2007;60(5):431–9. doi:10.1016/j.jclinepi.2006.09.009. [published Online First: Epub Date]|.

  2. 2.

    Stewart LA, Parmar MK. Meta-analysis of the literature or of individual patient data: is there a difference? Lancet. 1993;341(8842):418–22.

  3. 3.

    Debray T, Moons K, Valkenhoef G, et al. Get real in individual participant data (IPD) meta‐analysis: a review of the methodology. Res Synth Methods. 2015;6(4):293–309.

  4. 4.

    Debray TPA, Moons KGM, Abo-Zaid GMA, et al. Individual participant data meta-analysis for a binary outcome: One-stage or Two-stage? PLoS ONE. 2013;8(4):e60650. doi:10.1371/journal.pone.0060650. [published Online First: Epub Date]|.

  5. 5.

    Stewart GB, Altman DG, Askie LM, et al. Statistical analysis of individual participant data meta-analyses: a comparison of methods and recommendations for practice. PLoS ONE. 2012;7(10):e46042. doi:10.1371/journal.pone.0046042. [published Online First: Epub Date]|.

  6. 6.

    Callens M, Croux C. Performance of likelihood-based estimation methods for multilevel binary regression models. J Stat Comput Simul. 2005;75(12):1003–17. doi:10.1080/00949650412331321070. [published Online First: Epub Date]|.

  7. 7.

    Capanu M, Gönen M, Begg CB. An assessment of estimation methods for generalized linear mixed models with binary outcomes. Stat Med. 2013;32(26):4550–66. doi:10.1002/sim.5866. [published Online First: Epub Date]|.

  8. 8.

    Rondeau V, Michiels S, Liquet B, et al. Investigating trial and treatment heterogeneity in an individual patient data meta-analysis of survival data by means of the penalized maximum likelihood approach. Stat Med. 2008;27(11):1894–910. doi:10.1002/sim.3161. [published Online First: Epub Date]|.

  9. 9.

    Simmonds MC, Higgins JP, Stewart LA, et al. Meta-analysis of individual patient data from randomized trials: a review of methods used in practice. Clinical trials (London, England). 2005;2(3):209–17.

  10. 10.

    Thomas D, Radji S, Benedetti A. Systematic review of methods for individual patient data meta-analysis with binary outcomes. BMC Med Res Methodol. 2014;14:79.

  11. 11.

    Breslow NE, Clayton DG. Approximate inference in generalized linear mixed models. J Am Stat Assoc. 1993;88(421):9–25. doi:10.2307/2290687. [published Online First: Epub Date]|.

  12. 12.

    Breslow NE, Lin X. Bias correction in generalised linear mixed models with a single component of dispersion. Biometrika. 1995;82(1):81–91. doi:10.2307/2337629. [published Online First: Epub Date]|.

  13. 13.

    Jang W, Lim J. A numerical study of PQL estimation biases in generalized linear mixed models under heterogeneity of random effects. Commun Stat Simul Comput. 2009;38(4):692–702. doi:10.1080/03610910802627055. [published Online First: Epub Date]|.

  14. 14.

    Pinheiro JC, Bates DM. Approximations to the Log-likelihood function in the nonlinear mixed-effects model. J Comput Graph Stat. 1995;4(1):12–35. doi:10.2307/1390625. [published Online First: Epub Date]|.

  15. 15.

    Turner RM, Omar RZ, Yang M, et al. A multilevel model framework for meta-analysis of clinical trials with binary outcomes. Stat Med. 2000;19(24):3417–32.

  16. 16.

    Benedetti A, Platt R, Atherton J. Generalized linear mixed models for binary data: Are matching results from penalized quasi-likelihood and numerical integration less biased? PLoS ONE. 2014;9(1):e84601. doi:10.1371/journal.pone.0084601. [published Online First: Epub Date]|.

  17. 17.

    Moineddin R, Matheson FI, Glazier RH. A simulation study of sample size for multilevel logistic regression models. BMC Med Res Methodol. 2007;7:34. doi:10.1186/1471-2288-7-34. [published Online First: Epub Date]|.

  18. 18.

    Higgins JP, Thompson SG, Deeks JJ, et al. Measuring inconsistency in meta-analyses. BMJ (Clin Res Ed). 2003;327(7414):557–60. doi:10.1136/bmj.327.7414.557. [published Online First: Epub Date]|.

  19. 19.

    van Houwelingen HC, Arends LR, Stijnen T. Advanced methods in meta-analysis: multivariate approach and meta-regression. Stat Med. 2002;21(4):589–624. doi:10.1002/sim.1040. [published Online First: Epub Date].

  20. 20.

    Riley RD. Multivariate meta-analysis: the effect of ignoring within-study correlation. J R Stat Soc A Stat Soc. 2009;172(4):789–811. doi:10.1111/j.1467-985X.2008.00593.x. [published Online First: Epub Date]|.

  21. 21.

    DerSimonian R, Laird N. Meta-analysis in clinical trials. Control Clin Trials. 1986;7(3):177–88.

  22. 22.

    Chen H, Manning AK, Dupuis J. A method of moments estimator for random effect multivariate meta-analysis. Biometrics. 2012;68(4):1278–84. doi:10.1111/j.1541-0420.2012.01761.x. [published Online First: Epub Date]|.

  23. 23.

    Hardy RJ, Thompson SG. A Likelihood approach to meta-analysis with random effects. Stat Med. 1996;15(6):619–29. doi:10.1002/(SICI)1097-0258(19960330)15:6<619::AID-SIM188>3.0.CO;2-A. [published Online First: Epub Date]|.

  24. 24.

    Littell RC, Milliken GA, Stroup WW, Wolfinger DR. SAS system for mixed models. Cary: SAS Institute, Inc.; 1996.

  25. 25.

    Proc Glimmix. Maximum Likelihood Estimation Based on Adaptive Quadrature, SAS Institute Inc., SAS 9.4 Help and Documentation. Cary: SAS Institute Inc; 2002–2004.

  26. 26.

    Abo-Zaid G, Guo B, Deeks JJ, et al. Individual participant data meta-analyses should not ignore clustering. J Clin Epidemiol. 2013;66(8):865–73.e4. doi:10.1016/j.jclinepi.2012.12.017. [published Online First: Epub Date].

  27. 27.

    Mathew T, Nordström K. Comparison of One-step and Two-step meta-analysis models using individual patient data. Biom J. 2010;52(2):271–87. doi:10.1002/bimj.200900143. [published Online First: Epub Date]|.

  28. 28.

    Gelman A, Hill J. Data analysis using regression and multilevel/hierarchical models. Cambridge: Cambridge University Press; 2007.

  29. 29.

    Lu G, Ades AE. Combination of direct and indirect evidence in mixed treatment comparisons. Stat Med. 2004;23(20):3105–24. doi:10.1002/sim.1875. [published Online First: Epub Date]|.

  30. 30.

    Cornell JE, Mulrow CD, Localio R, Stack CB, Meibohm AR, Guallar E, et al. Random-effects meta-analysis of inconsistent effects: a time for change. Ann Intern Med. 2014;160(4):267–70.

  31. 31.

    IntHout J, Iaonnidis JPA, Borm GF. The the hartung-knapp-sidik-jonkman method for random effects meta-analysis is straightforward and considerably outperforms the standard DerSimonian-laird method. BMC Med Res Methodol. 2014;14:25.

  32. 32.

    Noh M, Lee Y. REML estimation for binary data in GLMMs. J Multivar Anal. 2007;98(5):896–915. [published Online First: Epub Date].

Download references


We have no acknowledgements.


This work was supported by an operating grant from the Canadian Institutes of Health Research. Andrea Benedetti is supported by the FRQ-S.

Availability of data and materials

Data are available upon request.

Authors’ contributions

DT led this project in the study design, performed simulation of data and statistical analyses, and also led the writing of the manuscripts. AB participated in the study design, guided statistical analyses and edited the final draft. RP helped draft and revised the manuscript. All authors read and approved the final manuscript.

Competing interest

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethics approval and consent to participate

Not applicable. This article reports a simulation study and does not involve human participants.

Author information

Correspondence to Andrea Benedetti.

Additional files

Additional file 1:

Median (Interquartile range (IQR)) absolute bias (%) for treatment effect, β1 for different approach, by number of studies, total average sample size, mixture of studies sizes and degree of random effects variances - data generated from random study- and treatment effect: Eq. 1 with 5% outcome rate. (DOC 72 kb)

Additional file 2:

Median (Interquartile range (IQR)) (%) root mean square error for treatment effect, β1 for different approach, by number of studies, total average sample size, mixture of studies sizes and degree of random effects variances - data generated from random study- and treatment effect: Eq. 1 with 5% outcome rate. (DOC 70 kb)

Additional file 3:

Median (Interquartile range (IQR)) absolute bias (%) for random treatment-effect variance, τ2 1 for different approach, by number of studies, total average sample size, mixture of studies sizes and degree of random effects variances - data generated from random study- and treatment effect: Eq. 1 with 5% outcome rate. (DOC 73 kb)

Additional file 4:

Median (Interquartile range (IQR)) (%) root mean square error for random treatment-effect variance, τ2 1 for different approach, by number of studies, total average sample size, mixture of studies sizes and degree of random effects variances - data generated from random study- and treatment effect: Eq. 1 with 5% outcome rate. (DOC 63 kb)

Additional file 5:

Percent Coverage (percent convergence rate) for treatment effect, β1 for different approach, by number of studies, total average sample size, mixture of studies sizes and degree of random effects variances - data generated from random study- and treatment effect: Eq. 1 with 5% outcome rate. (DOC 62 kb)

Additional file 6:

Percent Coverage for random treatment-effect variance, τ2 1 for different approach, by number of studies, total average sample size, mixture of studies sizes and degree of random effects variances - data generated from random study- and treatment effect: Eq. 1 with 5% outcome rate. (DOC 63 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Thomas, D., Platt, R. & Benedetti, A. A comparison of analytic approaches for individual patient data meta-analyses with binary outcomes. BMC Med Res Methodol 17, 28 (2017).

Download citation


  • Individual patient data meta-analyses
  • One- and two-stage models
  • Generalized linear mixed models
  • Penalized quasi-likelihood
  • Adaptive gauss-hermite quadrature
  • Fixed and random study-effects