A comparison of analytic approaches for individual patient data metaanalyses with binary outcomes
 Doneal Thomas^{1},
 Robert Platt^{1} and
 Andrea Benedetti^{1, 2, 3}Email author
DOI: 10.1186/s1287401703077
© The Author(s). 2017
Received: 29 March 2016
Accepted: 2 February 2017
Published: 16 February 2017
Abstract
Background
Individual patient data metaanalyses (IPDMA) are often performed using a onestage approach a form of generalized linear mixed model (GLMM) for binary outcomes. We compare (i) onestage to twostage approaches (ii) the performance of two estimation procedures (Penalized QuasilikelihoodPQL and Adaptive Gaussian Hermite QuadratureAGHQ) for GLMMs with binary outcomes within the onestage approach and (iii) using stratified studyeffect or random studyeffects.
Methods
We compare the different approaches via a simulation study, in terms of bias, meansquared error (MSE), coverage and numerical convergence, of the pooled treatment effect (β _{1}) and betweenstudy heterogeneity of the treatment effect (τ _{1} ^{2} ). We varied the prevalence of the outcome, sample size, number of studies and variances and correlation of the random effects.
Results
The twostage and onestage methods produced approximately unbiased β _{1} estimates. PQL performed better than AGHQ for estimating τ _{1} ^{2} with respect to MSE, but performed comparably with AGHQ in estimating the bias of β _{1} and of τ _{1} ^{2} . The random studyeffects model outperformed the stratified studyeffects model in small size MA.
Conclusion
The onestage approach is recommended over the twostage method for small size MA. There was no meaningful difference between the PQL and AGHQ procedures. Though the randomintercept and stratifiedintercept approaches can suffer from their underlining assumptions, fitting GLMM with a randomintercept are less prone to misfit and has good convergence rate.
Keywords
Individual patient data metaanalyses One and twostage models Generalized linear mixed models Penalized quasilikelihood Adaptive gausshermite quadrature Fixed and random studyeffectsBackground
Individual Patient Data (IPD) metaanalyses (MA) are regarded as the gold standard in evidence synthesis and are increasingly being used in current practice [1, 2]. However, the implementation of the analysis of IPDMA requires additional expertise and choices [3], particularly when the outcome is binary. These include (i) should a one or twostage model be used [4, 5], (ii) what estimation procedure should be used to estimate the onestage model [6, 7] and, (iii) should the study effect be fixed or random [8].
Although IPDMA were conventionally analyzed via a twostage approach [9], over the last decade, use of the onestage approach has increased [10]. Recently, some have suggested that the twostage and onestage frameworks produce similar results for MA of large randomized controlled trials [5]. The literature suggests the onestage method is particularly preferable when few studies or few events are available as it uses a more exact statistical approach than relying on a normality approximation [3–5].
When IPD are available and the outcome is binary, the onestage approach consists of estimating Generalized Linear Mixed Models (GLMMs) with a random slope for the exposure, to allow the exposure effect to vary across studies. Penalized quasilikelihood (PQL) introduced by Breslow and Clayton is a popular method for estimating the parameters in GLMMs [11]. However, regression parameters can be badly biased for some GLMMs, especially with binary outcomes with few observations per cluster, low outcome rates, or high between cluster variability [12, 13]. Adaptive Gaussian Hermite quadrature (AGHQ) is the current favored competitor to PQL, which approximates the maximum likelihood by numerical integration [14]. Although estimation becomes more precise as the number of quadrature points increases, it often gives rise to computational difficulties for highdimension random effects and convergence problems where variances are close to zero or cluster sizes are small [14].
The heterogeneity between studies is an important aspect to consider when carrying out IPDMA. Such heterogeneity may arise due to differences in study design, treatment protocols or patient populations [8]. When such heterogeneity is present, the convention is to include a random slope in the model as it captures the variability of the exposure across studies. However, there are corresponding assumptions in regards to the study effect being modelled as stratified or random [4, 15].
Few comparisons of GLMMs have been reported in the context of IPDMA with binary outcomes [4, 15], that is, when the number of studies and the number of subjects within each study is small, study sizes are imbalanced, in the presence of large betweenstudy heterogeneity and small exposure effects and there is an interest in the variance parameter of the random treatment effect. According to previous literature, these factors have all been identified as influencing model performance [6]. While several simulation studies have been published, these have mainly limited their attention to simple models with only random intercepts [13, 16]. Thus, the performance of the random effects models including both a random intercept and a random slope are less well known.
Our objective was to assess and compare via simulation studies, (i) onestage approaches to conventional twostage approaches (ii) the performance of different estimation procedures for GLMMs with binary outcomes, and (iii) using stratified studyeffect or random studyeffects in a randomized trial setting. We use our results to develop guidelines on the choice of methods for analyzing data from IPDMA with binary outcomes and to understand explicitly the tradeoffs between computational and statistical complexity.
Methods section introduces the models we are considering, the design of the simulation study and the assessment criteria. In Results section, results for the different methods under varying conditions are presented and discussed. Discussion section concludes with a discussion.
Methods
We conducted a simulation study to compare various analytic approaches to analyze data from IPDMA with binary outcomes. Hereto, our methods all assume that betweenstudy heterogeneity exists, as it is likely in practice, and so only random treatmenteffects IPD metaanalysis models are considered.
Data Generation
The data generation algorithm was developed to generate twolevel data sets (e.g. patients grouped into studies). We generated a binary outcome (Y _{ ij }) and a single binary exposure (X _{ ij }). We denote the number of studies j = 1, 2 …, K and i = 1, 2 …, n _{ j } denotes the individuals per study. Therefore, Y _{ ij } is the outcome observed for the i ^{ th } individual from the j ^{ th } study.
Here π _{ ij } is the true probability of the outcome for the i ^{ th } individual from the j ^{ th } study, β _{0}denotes the mean logodds of the outcome (studyeffect) and β _{1} the pooled treatment effect (log odds ratio). The random effects (b _{0j } and b _{1j }) were generated from a bivariate normal distribution with mean = 0 and variancecovariance matrix \( \Sigma =\left(\begin{array}{cc}\hfill {\sigma}^2\hfill & \hfill \rho \sigma \tau \hfill \\ {}\hfill \rho \sigma \tau \hfill & \hfill {\tau}^2\hfill \end{array}\right) \) for the random studyeffect case. In the stratified study effects case, (i.e. Eq. (2)), β _{ j }, were generated from a uniform distribution and b _{1j } was generated from a normal distribution with zero mean and variance, τ ^{2}.
A Bernoulli distribution with probability π _{ ij } from Eqs. (1) and (2) was used to generate the binary outcome Y _{ ij }.
Summary of Simulation Parameters^{a}
Parameters  Values 

IPDMetaanalyses generated:  M = 1000 
(Number of studies, number of subjects per study, total average sample sizes)^{b}:  (K, n _{ i }, N) ∈ {(5,100,500), (15,33,500), (15,200,3000), (5,357,500), (15,98,500), (15,588,3000)} 
Fixed effects (intercepts):  β _{0} = − 0.85 
Prevalence of the outcome  π = 30% 
Fixed effects (Slopes):  β _{1} = 0.18 
Random effects distribution:  Normal 
Random effects variances:  {τ _{0} ^{2} , τ _{1} ^{2} } ∈ (0.05, 1, 4) 
Correlation between random effects:  ρ ∈ (0,0.5) 
A sensitivity analysis was also considered to explore the performance of different methods when just 5% of observation had a positive outcome.
Models
Twostage IPD methods
The first step estimated the studyspecific intercept and slope and their associated withinstudy covariance matrix (consisting of the variances of the intercept and slope, as well as the covariance) for each study. This model reduces the IPD to its relative treatment effect estimate and variance for each study then at the second stage these aggregate data (AD) are synthesized (described below).
Model 1 Bivariate metaanalysis
where Σ is the unknown betweenstudy variancecovariance matrix of the true effects (γ _{0} and γ _{1}) and C _{ j } (j = 1, …, K) the within study variancecovariance matrix with the variances of the estimates.
Model 2: Conventional DerSimonian and Laird approach
with unknown parameters γ _{1} and τ _{1} ^{2} , estimated via the inverse variance weighted noniterative method (methodofmoments) [21].
Onestage IPD methods
The onestage approach analyzes the IPD from all studies simultaneously, while accounting for clustering of subjects within studies [4]. The onestage model is a form of GLMM. Two different specifications are considered.
Model 3 Random intercept and random slope
Model 4Stratified intercept onestage
where I _{ k = j } indicates that a separate intercept should be estimated for each study j = 1, …, K and u _{1j } ~ N(0, τ _{1} ^{2} ). Parameters of both Models 3 and 4 were estimated via PQL and AGHQ.
Estimation Procedures and Approximations
The parameters of the onestage models were estimated using PQL and AGHQ. For the twostage approach, a logistic regression was first estimated for each study via maximum likelihood. The parameters of the twostage model were estimated via methodofmoments (MOM) (Model 2) and restricted maximum likelihood (REML) (Model 1) [21–23] at the second stage.
Both likelihoodbased methods (PQL and AGHQ) were implemented on SAS version 9.4 using PROC GLIMMIX with default options [24]. The number of quadrature points in AGHQ was selected automatically [25], the absolute value for parameter convergence criterion was 10–8 and the maximum number of iterations was 100.

Twostage approach (Models 1 and 2)

Onestage approach via GLMMs (Models 3 and 4) estimated with PQL.

Onestage approach via GLMMs (Models 3 and 4) estimated with AGHQ.
Assessment criteria
The performance of the estimation methods was evaluated using: a) numerical convergence, b) absolute bias; c) root mean square error (RMSE); and d) coverage probability  of the pooled treatment effect and its betweenstudy variability.
Numerical convergence
The convergence rate was estimated for all models fit, as the number of simulation repetitions that did converge (without returning a warning message) divided by the total attempted (M = 1000). Models that returned a warning message specifying that the estimated variancecovariance matrix was not positive definite or that the optimality condition was violated were considered not to have converged.
Bias
where \( {\hat{\theta}}_j \) were the parameter estimates and θ was the true parameter of the pooled treatment effect or its betweenstudy variance. We also reported the mean absolute bias (AB).
Mean square error
For each scenario, the RMSE of the pooled treatment effect and its betweenstudy heterogeneity was reported, as this measure is on the same scale as the parameter.
Coverage probability
We estimated coverage for the pooled treatment effect and its betweenstudy heterogeneity for the various methods. Gaussian coverage was estimated, where if \( \left\hat{\theta}\theta \right\le 1.96\times S E\left(\hat{\theta}\right) \) the true value was covered, and if \( \left\hat{\theta}\theta \right>1.96\times S E\left(\hat{\theta}\right) \) it was not.
We reported the median, the 25^{th} and 75^{th} percentiles of the AB and RMSE of the pooled treatment effect and its betweenstudy heterogeneity but reported percentages for the numerical convergence and coverage rate.
Results
Performance of the one and twostage approaches in small data sets^{a} with greater (Top panel) and lesser (Bottom panel) heterogeneity of random effects^{b}
Data generation  

Performance measures^{c}  Randomstudy and treatment effect (Eq. 1)  Stratifiedstudy effect (Eq. 2)  
Twostage^{d}  Onestage  Twostage  Onestage  
(τ _{0} ^{2} , τ _{1} ^{2} ) = (4, 4) ^{e}  AB (β _{1})  0.04 (0.02 0.06)  0.04 (0.02, 0.07)  0.04 (0.02, 0.06)  0.04 (0.01, 0.07) 
RMSE (β _{1})  1.11 (0.49, 1.94)  1.19 (0.53, 2.12)  1.18 (0.59, 1.96)  1.23 (0.61, 2.14)  
Coverage (β _{1})  89.3  91.8  92  92.6  
AB (τ _{1} ^{2} )  0.23 (0.14,0.30)  0.16 (0.08, 0.24)  0.15 (0.08,0.24)  0.24 (0.20, 0.27)  
RMSE (τ _{1} ^{2} )  7.26 (4.39,7.51)  4.93 (2.56, 7.51)  4.81 (2.38,7.42)  7.47 (6.28, 8.64)  
Coverage (τ _{1} ^{2} ) ^{f}  NA  NA  NA  NA  
Convergence  100  97.7  100  99.8  
(τ _{0} ^{2} , τ _{1} ^{2} ) = (1, 1)  AB (β _{1})  0.02 (0.01, 0.04)  0.02 (0.01, 0.04)  0.03 (0.01, 0.04)  0.03 (0.01, 0.04) 
RMSE (β _{1})  0.73 (0.35, 1.29)  0.75 (0.37, 1.33)  0.80 (0.37, 1.30)  0.79 (0.39, 1.34)  
Coverage (β _{1})  89.1  90.6  91.1  91.6  
AB (τ _{1} ^{2} )  0.06 (0.03, 0.08)  0.04 (0.02, 0.1)  0.05 (0.03, 0.08)  0.03 (0.01, 0.07)  
RMSE (τ _{1} ^{2} )  1.73 (0.85, 2.65)  1.22 (0.53, 3.16)  1.59 (0.80, 2.46)  1.06 (0.46, 2.06)  
Coverage (τ _{1} ^{2} )  NA  NA  NA  NA  
Convergence  100  90.4  100  100 
Performance of the one and twostage approaches in large data sets^{a} with greater (Top panel) and lesser (Bottom panel) heterogeneity of random effects^{b}
Data generation  

Performance measures^{c}  Randomstudy and treatment effect (Eq. 1)  Stratifiedstudy effect (Eq. 2)  
Twostage^{d}  Onestage  Twostage  Onestage  
(τ _{0} ^{2} , τ _{1} ^{2} ) = (4, 4) ^{e}  AB (β _{1})  0.03 (0.02 0.06)  0.03 (0.02, 0.06)  0.04 (0.01, 0.06)  0.04 (0.01, 0.06) 
RMSE (β _{1})  1.02 (0.50, 1.85)  1.07 (0.49, 1.84)  1.15 (0.57, 1.93)  1.11 (0.58, 1.87)  
Coverage (β _{1})  91.9  92.3  92.4  93.6  
AB (τ _{1} ^{2} )  0.14 (0.07,0.22)  0.12 (0.06, 0.20)  0.11 (0.05,0.17)  0.22 (0.20, 0.25)  
RMSE (τ _{1} ^{2} )  4.36 (2.22,6.80)  3.87 (1.81, 6.21)  3.40 (1.70,5.47)  6.99 (6.20, 7.80)  
Coverage (τ _{1} ^{2} ) ^{f}  NA  NA  NA  NA  
Convergence  100  98.3  100  89.9  
(τ _{0} ^{2} , τ _{1} ^{2} ) = (1, 1)  AB (β _{1})  0.02 (0.01, 0.03)  0.02 (0.01, 0.03)  0.02 (0.01, 0.03)  0.02 (0.01, 0.03) 
RMSE (β _{1})  0.61 (0.30, 1.04)  0.59 (0.29, 1.05)  0.61 (0.30, 1.02)  0.63 (0.30, 1.03)  
Coverage (β _{1})  91.2  91.9  93  93.3  
AB (τ _{1} ^{2} )  0.03 (0.02, 0.06)  0.03 (0.02, 0.05)  0.03 (0.02, 0.05)  0.02 (0.01, 0.03)  
RMSE (τ _{1} ^{2} )  1.08 (0.53, 1.73)  1.03 (0.49, 1.68)  1.00 (0.51, 1.68)  0.57 (0.27, 1.00)  
Coverage (τ _{1} ^{2} )  NA  NA  NA  NA  
Convergence  100  96.5  100  88.8 
Performance of Penalized Quasilikelihood and Adaptive Gaussian Hermite Quadrature estimation approaches in small data sets^{a} with greater (Top panel) and lesser (Bottom panel) heterogeneity of random effects^{b}
Performance measures^{c}  Data generation  

Randomstudy and treatment effect (Eq. 1)  Stratifiedstudy effect (Eq. 2)  
AGHQ^{d}  PQL^{d}  AGHQ  PQL  
(τ _{0} ^{2} , τ _{1} ^{2} ) = (4, 4)^{e}  AB (\( \beta \) _{1})  0.05 (0.02, 0.08)  0.04 (0.02, 0.07)  0.04 (0.02, 0.07)  0.04 (0.02, 0.07) 
RMSE (\( \beta \) _{1})  1.42 (0.64, 2.52)  1.19 (0.53, 2.12)  1.35 (0.65, 2.27)  1.23 (0.61, 2.14)  
Coverage (\( \beta \) _{1})  93.2  91.8  91.7  92.6  
AB (τ _{1} ^{2} )  0.18 (0.09,0.29)  0.16 (0.08, 0.24)  0.15 (0.08,0.24)  0.24 (0.20, 0.27)  
RMSE (τ _{1} ^{2} )  5.76 (2.80,9.07)  4.93 (2.56, 7.51)  4.79 (2.37,7.62)  7.47 (6.28, 8.64)  
Coverage (τ _{1} ^{2} )  85.5  76.2  81.4  4.6  
Convergence  99  97.7  96.7  99.8  
(τ _{0} ^{2} , τ _{1} ^{2} ) = (1, 1)  AB (\( \beta \) _{1})  0.03 (0.01, 0.05)  0.02 (0.01, 0.04)  0.03 (0.01, 0.04)  0.03 (0.01, 0.04) 
RMSE (\( \beta \) _{1})  0.79 (0.41, 1.42)  0.75 (0.37, 1.33)  0.84 (0.42, 1.38)  0.79 (0.39, 1.34)  
Coverage (\( \beta \) _{1})  92.3  90.6  93.4  91.6  
AB (τ _{1} ^{2} )  0.06 (0.03, 0.09)  0.04 (0.02, 0.1)  0.05 (0.02, 0.08)  0.03 (0.02, 0.07)  
RMSE (τ _{1} ^{2} )  1.76 (0.84, 2.70)  1.22 (0.53, 3.16)  1.54 (0.72, 2.40)  1.06 (0.46, 2.06)  
Coverage (τ _{1} ^{2} )  74.5  81.1  71.6  77.2  
Convergence  96.8  90.4  85.8  100 
Performance of Penalized Quasilikelihood and Adaptive Gaussian Hermite Quadrature estimation approaches in large data sets^{a} with greater (Top panel) and lesser (Bottom panel) heterogeneity of random effects^{b}
Performance measures^{c}  Data generation  

Randomstudy and treatment effect (Eq. 1)  Stratifiedstudy effect (Eq. 2)  
AGHQ^{d}  PQL^{d}  AGHQ  PQL  
(τ _{0} ^{2} , τ _{1} ^{2} ) = (4, 4)^{e}  AB (\( \beta \) _{1})  0.04 (0.02, 0.06)  0.03 (0.02, 0.06)  0.04 (0.01, 0.06)  0.04 (0.01, 0.06) 
RMSE (\( \beta \) _{1})  1.20 (0.55, 1.99)  1.07 (0.49, 1.84)  1.16 (0.58, 1.95)  1.11 (0.58, 1.87)  
Coverage (\( \beta \) _{1})  92.2  92.3  92.1  93.6  
AB (τ _{1} ^{2} )  0.13 (0.07,0.21)  0.12 (0.06, 0.20)  0.11 (0.05,0.18)  0.22 (0.20, 0.25)  
RMSE (τ _{1} ^{2} )  4.12 (2.06,6.77)  3.87 (1.81, 6.21)  3.42 (1.71,5.77)  6.99 (6.20, 7.80)  
Coverage (τ _{1} ^{2} )  81.9  78.9  84.7  1.0  
Convergence  100  98.3  100  89.9  
(τ _{0} ^{2} , τ _{1} ^{2} ) = (1, 1)  AB (\( \beta \) _{1})  0.02 (0.01, 0.03)  0.02 (0.01, 0.03)  0.02 (0.01, 0.03)  0.02 (0.01, 0.03) 
RMSE (\( \beta \) _{1})  0.60 (0.30, 1.06)  0.59 (0.29, 1.05)  0.63 (0.30, 1.05)  0.63 (0.30, 1.03)  
Coverage (\( \beta \) _{1})  91.7  91.9  92.4  93.3  
AB (τ _{1} ^{2} )  0.04 (0.02, 0.06)  0.03 (0.02, 0.05)  0.03 (0.02, 0.05)  0.02 (0.01, 0.03)  
RMSE (τ _{1} ^{2} )  1.09 (0.52, 1.75)  1.03 (0.49, 1.68)  1.01 (0.49, 1.69)  0.57 (0.27, 1.00)  
Coverage (τ _{1} ^{2} )  83.6  82.5  82.5  76.5  
Convergence  99.5  96.5  99.1  88.8 
Performance of the stratified and randomintercept^{a} models approaches in small data sets^{b} with greater (Top panel) and lesser (Bottom panel) heterogeneity of random effects^{c}
Performance measures^{d}  Data generation  

Randomstudy and treatment effect (Eq. 1)  Stratifiedstudy effect (Eq. 2)  
Stratifiedintercept  Randomintercept  Stratifiedintercept  Randomintercept  
(τ _{0} ^{2} , τ _{1} ^{2} ) = (4, 4)^{e}  AB (\( \beta \) _{1})  0.04 (0.02, 0.08)  0.04 (0.02, 0.07)  0.05 (0.02, 0.07)  0.04 (0.01, 0.07) 
RMSE (\( \beta \) _{1})  1.24 (0.49, 2.44)  1.19 (0.53, 2.12)  1.43 (0.70, 2.32)  1.23 (0.61, 2.14)  
Coverage (\( \beta \) _{1})  99.1  91.8  97.4  92.6  
AB (τ _{1} ^{2} )  0.16 (0.07, 0.25)  0.16 (0.08, 0.24)  0.15 (0.07,0.24)  0.24 (0.20, 0.27)  
RMSE (τ _{1} ^{2} )  5.01 (2.35,7.95)  4.93 (2.56, 7.51)  4..75 (2.23,7.64)  7.47 (6.28, 8.64)  
Coverage (τ _{1} ^{2} )  11.6  76.2  28.4  4.6  
Convergence  13.8  97.7  32.3  99.8  
(τ _{0} ^{2} , τ _{1} ^{2} ) = (1, 1)  AB (\( \beta \) _{1})  0.03 (0.01, 0.04)  0.02 (0.01, 0.04)  0.03 (0.01, 0.05)  0.03 (0.01, 0.04) 
RMSE (\( \beta \) _{1})  0.83 (0.41, 1.38)  0.75 (0.37, 1.33)  0.90 (0.42, 1.47)  0.79 (0.39, 1.34)  
Coverage (\( \beta \) _{1})  96.4  90.6  94  91.6  
AB (τ _{1} ^{2} )  0.05 (0.03, 0.09)  0.04 (0.02, 0.1)  0.05 (0.02, 0.08)  0.03 (0.01, 0.07)  
RMSE (τ _{1} ^{2} )  1.72 (0.85, 2.78)  1.22 (0.53, 3.16)  1.55 (0.75, 1.61)  1.06 (0.46, 2.06)  
Coverage (τ _{1} ^{2} )  37.3  81.1  54.4  77.2  
Convergence  42.6  90.4  62.3  100 
Performance of the stratified and randomintercept^{a} models approaches in large data sets^{b} with greater (Top panel) and lesser (Bottom panel) heterogeneity of random effects^{c}
Performance measures^{d}  Data generation  

Randomstudy and treatment effect (Eq. 1)  Stratifiedstudy effect (Eq. 2)  
Stratifiedintercept  Randomintercept  Stratifiedintercept  Randomintercept  
(τ _{0} ^{2} , τ _{1} ^{2} ) = (4, 4)^{e}  AB (\( \beta \) _{1})  0.04 (0.02, 0.06)  0.03 (0.02, 0.06)  0.04 (0.02, 0.06)  0.04 (0.01, 0.06) 
RMSE (\( \beta \) _{1})  1.11 (0.55, 1.94)  1.07 (0.49, 1.84)  1.15 (0.58, 1.98)  1.11 (0.58, 1.87)  
Coverage (\( \beta \) _{1})  95.2  92.3  92.3  93.6  
AB (τ _{1} ^{2} )  0.13 (0.06, 0.20)  0.12 (0.06, 0.20)  0.11 (0.05,0.18)  0.22 (0.20, 0.25)  
RMSE (τ _{1} ^{2} )  4.05 (1.85,6.25)  3.87 (1.81, 6.21)  3.39 (1.57,5.56)  6.99 (6.20, 7.80)  
Coverage (τ _{1} ^{2} )  53.7  78.9  89.8  1.0  
Convergence  63.8  98.3  99.7  89.9  
(τ _{0} ^{2} , τ _{1} ^{2} ) = (1, 1)  AB (\( \beta \) _{1})  0.02 (0.01, 0.03)  0.02 (0.01, 0.03)  0.02 (0.01, 0.03)  0.02 (0.01, 0.03) 
RMSE (\( \beta \) _{1})  0.63 (0.29, 1.07)  0.59 (0.29, 1.05)  0.63 (0.30, 1.05)  0.63 (0.30, 1.03)  
Coverage (\( \beta \) _{1})  91.8  91.9  93.1  93.3  
AB (τ _{1} ^{2} )  0.03 (0.02, 0.06)  0.03 (0.02, 0.05)  0.03 (0.02, 0.05)  0.02 (0.01, 0.03)  
RMSE (τ _{1} ^{2} )  1.06 (0.52, 1.74)  1.03 (0.49, 1.68)  0.98 (0.48, 1.69)  0.57 (0.27, 1.00)  
Coverage (τ _{1} ^{2} )  86.3  82.5  87.9  76.5  
Convergence  95.3  96.5  99.2  88.8 
We did not exclude results from metaanalyses that returned a warning message (imperfect convergence). These metaanalyses were included as nonconvergence and although these models failed to produce proper parameter estimates, these estimates were included in the calculation of the bias and the MSE.
One versus Twostage
In Tables 2 and 3, results for the absolute bias (AB) of the estimates for the pooled treatment effect β _{1} are given. Recalling that the true parameter value was 0.18, we see that the biases were identical and under 0.05 in the onestage and the twostage approaches in both small and large data sets. Results were very comparable when the outcome rate was reduced from 30 to 5% (Additional file 1: Table S1). For both the one and the twostage, results depended on the true τ ^{2}, and the sample size.
For the larger sample size, root mean square error (RMSE) in β _{1} was generally slightly larger when the onestage method was used than when the twostage was used. The picture was similar across all heterogeneity levels (Tables 2 and 3) and when the outcome rate was reduced (Additional file 2: Table S3).
Neither onestage nor twostage methods yielded coverage of β _{1} close to nominal levels (Tables 2 and 3). Increasing sample size had a positive effect on percent coverage, and increasing the true heterogeneity made estimation more difficult, hence decreasing the coverage (Table 3).
Absolute bias of the betweenstudy heterogeneity, τ _{1} ^{2} was usually slightly lower when the onestage approach was used than when the twostage approach was (Tables 2 and 3), particularly, when the sample size was small (Table 2) and when greater amount of heterogeneity exist in the random effects (Bottom panel of Table 2). Regarding the effects of the simulation parameters, AB decreased when data was generated with equal study sizes and increased when the rate of occurrence was reduced (Additional file 3: Table S2). In these cases, the onestage approach was most biased.
The RMSE of τ _{1} ^{2} for the onestage estimates was mostly smaller than the RMSE of the twostage method estimates. For increased sample size or reduction in the level of heterogeneity in the random effects, RMSE of τ _{1} ^{2 } decreased at least by a factor of three across both methods. While the RMSE of τ _{1} ^{2 } was inflated when the outcome rate was reduced, the onestage method continued to outperform that of the twostage method (Additional file 4: Table S4).
Convergence was not a problem for the twostage approach while convergence of the onestage method varied from 90 to 100% (Tables 2 and 3).
AGHQ versus PQL
Onestage models estimated via PQL and AGHQ methods often yielded similar AB in β _{1}. There was no observed difference in the AB (β _{1}) between the methods when the outcome rate was reduced (Additional file 1: Table S1).
RMSE of β _{1} were generally greater when AGHQ was used than when PQL was used (Tables 4 and 5). Decreasing sample size, increasing the variances of the random effects or reducing the event rate (Additional file 2: Table S3) made precise estimation more difficult, hence RMSE increased.
When the true heterogeneity was large and total sample was small (Top panel of Table 4), AGHQ provided coverage for β _{1}closer to nominal levels than PQL, while both methods provided comparable coverage when the sample size was increased (Table 5). Note that across both methods, levels of coverage were higher as heterogeneity increased and similar coverage was observed when the outcome rate was reduced (Additional file 5: Table S5).
AB in τ _{1} ^{2} , was very comparable but slightly lower when PQL was used rather than AGHQ (Tables 4 and 5). The AB decreased with increasing sample size, particularly, when PQL was used (Table 5). There was substantial bias in τ _{1} ^{2} estimates when the event rate was reduced (Additional file 3: Table S2).
On account of a better overall performance of PQL with regards to AB, RMSE of τ _{1} ^{2 } was generally lower with PQL than with AGHQ (Tables 4 and 5). RMSE decreased with decreased variability in the random effects, and with increased sample size. In addition, PQLestimates continued to yield smaller RMSE than AGHQestimates when the outcome rate was reduced (Additional file 4: Table S4).
We found important undercoverage of the estimates for τ _{1} ^{2 } for both estimation methods, particularly when PQL was used (Tables 4 and 5). The percent coverage was usually fair for both estimation methods when sample size increased, but was poor when the outcome rate was reduced (Additional file 6: Table S6).
Convergence occurred more often when AGHQ was used than when PQL was used (Tables 4 and 5). Convergence was problematic for PQL, particularly when true heterogeneity was low and sample size was small (Bottom panel of Table 4). Comparable convergence was seen when the event rate was reduced (Additional file 5: Table S5).
Random intercept versus stratifiedintercept
The results of the simulation studies, modeling the intercept as random or fixed (random slope was always considered) via PQL estimation are summarized in the Tables 6 and 7.
The convergence was markedly low (1497%) for the fixed intercept & random slope method (Tables 6 and 7). Convergence was only reasonable for the approach when the sample size was large and heterogeneity was small, whereas convergence was always greater than 80% for the random intercept and slope approach.
In general, AB in β _{1} was similar for both stratifiedintercept (randomslope only) and random intercept & slope methods. Regarding the simulation parameters, sample size and variability of the random effects, were not influential in reducing the AB in β _{1}.
The RMSE in β _{1} was smaller when estimated via the random intercept and slope model than when only a random slope was fit (Tables 6 and 7).
Increased sample size and level of heterogeneity in the random effect was most influential in determining coverage probability.
Absolute bias in τ _{1} ^{2} was clearly comparable when fit with a random intercept & slope approach or a random slope only (Tables 6 and 7). For lower outcome rate, there was a trend towards less pronounced bias when a random slope only was fit (Additional file 3: Table S2).
We observed lower RMSE of τ _{1} ^{2} when a random intercept was fit, especially when the true heterogeneity was large (Top panels of Tables 6 and 7). Comparable results were seen when both models were fit in large sample and the true heterogeneity was small (Bottom panel of Table 7) also when outcome rate was reduced (Additional file 4: Table S4).
We found significant under coverage of τ _{1} ^{2 } when both models were fit, however, this was more severe when a random slope only model was fit (Tables 6 and 7). When the generated values of τ _{0} ^{2} or τ _{1} ^{2} were low (i.e. low variability in the random effects) and sample size was increased, we had less difficulty to estimate the coverage of τ _{1} ^{2} when both models were fit. The coverage probability continued to be an issue when the rate of occurrence was reduced (Additional file 6: Table S6).
Discussion
Findings
Our simulation results indicate that when the number of subjects per study is large, the one and twostage methods yield very similar results. Our results also confirm the finding of previous empirical studies [5, 26, 27] that in some cases, the onestage and twostage IPDMA results coincide. However, we found discrepancies between these methods, with a slight preference towards the onestage method when the number of subjects per study is small. In these situations, neither method produced accurate estimates for the betweenstudy heterogeneity associated with the treatmenteffect; however, the biases were larger for the twostage approach. Furthermore, onestage methods produced less biased and more precise estimates of the variance parameter and had slightly higher coverage probabilities, though these differences may be due to using the REML estimate of τ _{1} ^{2} instead of the der Simonian and Laird estimator used in the twostage approach.
Estimation of GLMMs with binary outcomes continues to pose challenges, with many methods producing biased regression coefficients and variance components [7]. AGHQ has been shown to overestimate the variance component with few clusters or few subjects [17]. On the contrary, PQL has been found to underestimate the variance component while the standard errors are overestimated [12]. In the context of IPDMA, we found similar absolute bias of the PQL and AGHQestimated pooled treatment effect, while the PQLestimates of the betweenstudy variance had greater precision when study sizes were small and random effects were correlated. This somewhat confirms previous results, which found that PQL suffers from large biases but performs better in terms of MSE than AGHQ [6]. Both estimation methods experienced difficulty in attaining nominal coverage of the betweenstudy heterogeneity associated with the treatment effect in two situations: (i) when the number of studies included was small and/or (ii) the true variances of the random effects were small. We also found that convergence was not an important problem for AGHQ when metaanalyses included studies with less than 50 individuals per study. However, convergence was poor when the prevalence of the outcome was reduced to 5% and the true heterogeneity was close to zero.
Stratification of the intercept in onestage models avoids the need to estimate the random effect for the intercept and the correlation between the random effects. This approach may be preferable in situations not investigated in this work (e.g. when the distribution of the random effects is skewed). However, this approach suffered from marked convergence rates when fit to small data sets (15 studies and on average 500 subjects).
Strengths and Limitations
We used simulation studies to compare various analytic strategies to analyze data arising from IPDMA across a wide range of data generation scenarios but made some simplifications. We only considered binary outcomes, one dichotomous treatment variable, a twolevel data structure, and no confounders. Moreover, we estimated GLMMs via PQL and AGHQ, but did not compare Bayesian or other estimation methods, which might be particularly useful in sparse scenarios [28]. We have made the assumption throughout that IPD were available. Certainly, the time and cost associated with collecting IPD are considerable. However, once such data is in hand, we have addressed several open questions relating to the best way to analyze it. We should also note that methods exist for combining IPD and aggregated data [7]. Further study is needed to investigate alternative confidence intervals (or coverage probability) for the betweenstudy heterogeneity that can be used to remedy the undercoverage of Gaussian intervals. The normalitybased intervals (coverage rate) we studied greatly underperformed in most scenarios because the constructions of the confidence interval are likely to be invalid. A further simplification that limits the generalizability of this work is that it is restricted to only twoarm trials. The extension to three or more arms would require careful consideration of more complicated correlation structures in treatment effects across arms and within studies [29].
One important comparison we have not addressed is, computational speed where the twostage method had a distinct advantage over the onestage; PQL was faster than AGHQ and the stratifiedintercept model runtime was quicker than the randomintercept model.
As far as we know, this simulation study is the first to simultaneously generate data with normally distributed and stratified random intercepts. This study also compares approaches that include a random intercept for study membership to those that do not. Furthermore, the use of simulation  to systematically investigate the robustness of the approaches to variation in sample size, study number, outcome rate, magnitude of correlation and variances. As a result, our scenarios have allowed us to assess performance without being too exhaustive.
Guidelines for Best Practice
On the basis of these findings, we can make several recommendations. When the IPDMA included many studies and the outcome rate was not too low, this work supports the conclusion of a previous study [5] that the conventional twostage method by DerSimonian and Laird [21] is a good choice under the data conditions simulated here. Cornell et al. found that the DL method produced toonarrow confidence bounds and p values that were too small when the number of studies was small or there was high betweenstudy heterogeneity [30]. In such cases, a modification such as the HartungKnapp approach may be preferable [31]. Further, while the bivariate twostage approach is very rarely used in practice, we found that it tended to yield good overall model performance, comparable with that of the onestage models when study sizes are small. In addition, our results also suggest that the onestage method can be used in IPDMA where study sizes are less than 50 subjects per study or few events were recorded in most studies (outcome rate of 5%). In these cases, the onestage approach is more appropriate as it models the exact binomial distribution of the data and offers more flexibility in model specification over the twostage approach [32].
If interest lies in estimation of the pooled treatment effect or the betweenstudy heterogeneity of the treatment effect, estimation using PQL appeared to be a better choice due to its lower bias and mean square error for the settings considered. On the contrary, computational issues such as convergence occurred less with this technique than with AGHQ. However, it is important to note that convergence and coverage in τ ^{2} was an issue in small and large total sample sizes and also, when level of true heterogeneity was large.
For these simulated data, the results of both the randomintercept and stratifiedintercept models were not importantly different. However, under both data generations, fitting a GLMM with the randomintercept was overall less sensitive to misspecification in small sample sizes with large betweenstudy heterogeneity than the stratifiedintercept GLMM since we have observed high rates of nonconvergence via the stratifiedintercept model.
There are four important caveats to these recommendations. First, our simulations show greater accuracy of the pooled odds ratio as the number of studies increase. Therefore, an IPDMA with more studies will provide more accurate estimates. Secondly, our results show that the estimation of the betweenstudy heterogeneity of the treatment effect is highly biased regardless of the sample size and number of studies. Therefore, we should always expect that the variance parameter be estimated with some error. Thirdly, small overall samples mark the tradeoff under which a metaanalyst might consistently choose precision over bias and our simulations show that PQL estimation may be preferred in these situations. Finally, large overall sample size can eliminate the lack of statistical power present in small overall samples. In such cases, comparable results are seen for one and twostage methods and fitting a twostage analysis as a first step may be advisable. This could aid as a quick and efficient investigation of heterogeneity and treatmentoutcome association.
Conclusion
To summarize, the one and twostage methods consistently produced similar results when the number of studies and overall sample are large. Although the PQL and AGHQ estimation procedures produced similar bias of the pooled log odds ratios, PQLestimates had lower RMSE than the AGHQestimates. Both the randomintercept and stratifiedintercept models yielded precise and similar estimates for the pooled log odds ratios. However, the randomintercept models gave good coverage probabilities of the betweenstudy heterogeneity in small sample sizes and yielded overall good convergence rate as compared to the random slope only model.
Abbreviations
 AB:

Absolute bias
 AGHQ:

Adaptive Gaussian hermite quadrature
 GLMM:

Generalized linear mixed model
 IPDMA:

Individual patient data metaanalysis
 MOM:

Methodofmoments
 MSE:

Meansquared error
 PQL:

Penalized quasilikelihood
 REML:

Restricted maximum likelihood
Declarations
Acknowledgement
We have no acknowledgements.
Funding
This work was supported by an operating grant from the Canadian Institutes of Health Research. Andrea Benedetti is supported by the FRQS.
Availability of data and materials
Data are available upon request.
Authors’ contributions
DT led this project in the study design, performed simulation of data and statistical analyses, and also led the writing of the manuscripts. AB participated in the study design, guided statistical analyses and edited the final draft. RP helped draft and revised the manuscript. All authors read and approved the final manuscript.
Competing interest
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable. This article reports a simulation study and does not involve human participants.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Riley RD, Simmonds MC, Look MP. Evidence synthesis combining individual patient data and aggregate data: a systematic review identified current practice and possible methods. J Clin Epidemiol. 2007;60(5):431–9. doi:10.1016/j.jclinepi.2006.09.009. [published Online First: Epub Date].View ArticlePubMedGoogle Scholar
 Stewart LA, Parmar MK. Metaanalysis of the literature or of individual patient data: is there a difference? Lancet. 1993;341(8842):418–22.View ArticlePubMedGoogle Scholar
 Debray T, Moons K, Valkenhoef G, et al. Get real in individual participant data (IPD) meta‐analysis: a review of the methodology. Res Synth Methods. 2015;6(4):293–309.
 Debray TPA, Moons KGM, AboZaid GMA, et al. Individual participant data metaanalysis for a binary outcome: Onestage or Twostage? PLoS ONE. 2013;8(4):e60650. doi:10.1371/journal.pone.0060650. [published Online First: Epub Date].View ArticlePubMedPubMed CentralGoogle Scholar
 Stewart GB, Altman DG, Askie LM, et al. Statistical analysis of individual participant data metaanalyses: a comparison of methods and recommendations for practice. PLoS ONE. 2012;7(10):e46042. doi:10.1371/journal.pone.0046042. [published Online First: Epub Date].View ArticlePubMedPubMed CentralGoogle Scholar
 Callens M, Croux C. Performance of likelihoodbased estimation methods for multilevel binary regression models. J Stat Comput Simul. 2005;75(12):1003–17. doi:10.1080/00949650412331321070. [published Online First: Epub Date].View ArticleGoogle Scholar
 Capanu M, Gönen M, Begg CB. An assessment of estimation methods for generalized linear mixed models with binary outcomes. Stat Med. 2013;32(26):4550–66. doi:10.1002/sim.5866. [published Online First: Epub Date].View ArticlePubMedGoogle Scholar
 Rondeau V, Michiels S, Liquet B, et al. Investigating trial and treatment heterogeneity in an individual patient data metaanalysis of survival data by means of the penalized maximum likelihood approach. Stat Med. 2008;27(11):1894–910. doi:10.1002/sim.3161. [published Online First: Epub Date].View ArticlePubMedGoogle Scholar
 Simmonds MC, Higgins JP, Stewart LA, et al. Metaanalysis of individual patient data from randomized trials: a review of methods used in practice. Clinical trials (London, England). 2005;2(3):209–17.View ArticleGoogle Scholar
 Thomas D, Radji S, Benedetti A. Systematic review of methods for individual patient data metaanalysis with binary outcomes. BMC Med Res Methodol. 2014;14:79.View ArticlePubMedPubMed CentralGoogle Scholar
 Breslow NE, Clayton DG. Approximate inference in generalized linear mixed models. J Am Stat Assoc. 1993;88(421):9–25. doi:10.2307/2290687. [published Online First: Epub Date].Google Scholar
 Breslow NE, Lin X. Bias correction in generalised linear mixed models with a single component of dispersion. Biometrika. 1995;82(1):81–91. doi:10.2307/2337629. [published Online First: Epub Date].View ArticleGoogle Scholar
 Jang W, Lim J. A numerical study of PQL estimation biases in generalized linear mixed models under heterogeneity of random effects. Commun Stat Simul Comput. 2009;38(4):692–702. doi:10.1080/03610910802627055. [published Online First: Epub Date].View ArticleGoogle Scholar
 Pinheiro JC, Bates DM. Approximations to the Loglikelihood function in the nonlinear mixedeffects model. J Comput Graph Stat. 1995;4(1):12–35. doi:10.2307/1390625. [published Online First: Epub Date].Google Scholar
 Turner RM, Omar RZ, Yang M, et al. A multilevel model framework for metaanalysis of clinical trials with binary outcomes. Stat Med. 2000;19(24):3417–32.View ArticlePubMedGoogle Scholar
 Benedetti A, Platt R, Atherton J. Generalized linear mixed models for binary data: Are matching results from penalized quasilikelihood and numerical integration less biased? PLoS ONE. 2014;9(1):e84601. doi:10.1371/journal.pone.0084601. [published Online First: Epub Date].View ArticlePubMedPubMed CentralGoogle Scholar
 Moineddin R, Matheson FI, Glazier RH. A simulation study of sample size for multilevel logistic regression models. BMC Med Res Methodol. 2007;7:34. doi:10.1186/14712288734. [published Online First: Epub Date].View ArticlePubMedPubMed CentralGoogle Scholar
 Higgins JP, Thompson SG, Deeks JJ, et al. Measuring inconsistency in metaanalyses. BMJ (Clin Res Ed). 2003;327(7414):557–60. doi:10.1136/bmj.327.7414.557. [published Online First: Epub Date].View ArticleGoogle Scholar
 van Houwelingen HC, Arends LR, Stijnen T. Advanced methods in metaanalysis: multivariate approach and metaregression. Stat Med. 2002;21(4):589–624. doi:10.1002/sim.1040. [published Online First: Epub Date].View ArticlePubMedGoogle Scholar
 Riley RD. Multivariate metaanalysis: the effect of ignoring withinstudy correlation. J R Stat Soc A Stat Soc. 2009;172(4):789–811. doi:10.1111/j.1467985X.2008.00593.x. [published Online First: Epub Date].View ArticleGoogle Scholar
 DerSimonian R, Laird N. Metaanalysis in clinical trials. Control Clin Trials. 1986;7(3):177–88.View ArticlePubMedGoogle Scholar
 Chen H, Manning AK, Dupuis J. A method of moments estimator for random effect multivariate metaanalysis. Biometrics. 2012;68(4):1278–84. doi:10.1111/j.15410420.2012.01761.x. [published Online First: Epub Date].View ArticlePubMedPubMed CentralGoogle Scholar
 Hardy RJ, Thompson SG. A Likelihood approach to metaanalysis with random effects. Stat Med. 1996;15(6):619–29. doi:10.1002/(SICI)10970258(19960330)15:6<619::AIDSIM188>3.0.CO;2A. [published Online First: Epub Date].View ArticlePubMedGoogle Scholar
 Littell RC, Milliken GA, Stroup WW, Wolfinger DR. SAS system for mixed models. Cary: SAS Institute, Inc.; 1996.
 Proc Glimmix. Maximum Likelihood Estimation Based on Adaptive Quadrature, SAS Institute Inc., SAS 9.4 Help and Documentation. Cary: SAS Institute Inc; 2002–2004.
 AboZaid G, Guo B, Deeks JJ, et al. Individual participant data metaanalyses should not ignore clustering. J Clin Epidemiol. 2013;66(8):865–73.e4. doi:10.1016/j.jclinepi.2012.12.017. [published Online First: Epub Date].
 Mathew T, Nordström K. Comparison of Onestep and Twostep metaanalysis models using individual patient data. Biom J. 2010;52(2):271–87. doi:10.1002/bimj.200900143. [published Online First: Epub Date].PubMedGoogle Scholar
 Gelman A, Hill J. Data analysis using regression and multilevel/hierarchical models. Cambridge: Cambridge University Press; 2007.Google Scholar
 Lu G, Ades AE. Combination of direct and indirect evidence in mixed treatment comparisons. Stat Med. 2004;23(20):3105–24. doi:10.1002/sim.1875. [published Online First: Epub Date].View ArticlePubMedGoogle Scholar
 Cornell JE, Mulrow CD, Localio R, Stack CB, Meibohm AR, Guallar E, et al. Randomeffects metaanalysis of inconsistent effects: a time for change. Ann Intern Med. 2014;160(4):267–70.View ArticlePubMedGoogle Scholar
 IntHout J, Iaonnidis JPA, Borm GF. The the hartungknappsidikjonkman method for random effects metaanalysis is straightforward and considerably outperforms the standard DerSimonianlaird method. BMC Med Res Methodol. 2014;14:25.View ArticlePubMedPubMed CentralGoogle Scholar
 Noh M, Lee Y. REML estimation for binary data in GLMMs. J Multivar Anal. 2007;98(5):896–915. http://dx.doi.org/10.1016/j.jmva.2006.11.009. [published Online First: Epub Date].