Skip to main content

Power difference in a χ2 test vs generalized linear mixed model in the presence of missing data – a simulation study



Longitudinal randomized controlled trials (RCTs) often aim to test and measure the effect of treatment between arms at a single time point. A two-sample χ2 test is a common statistical approach when outcome data are binary. However, only complete outcomes are used in the analysis. Missing responses are common in longitudinal RCTs and by only analyzing complete data, power may be reduced and estimates could be biased. Generalized linear mixed models (GLMM) with a random intercept can be used to test and estimate the treatment effect, which may increase power and reduce bias.


We simulated longitudinal binary RCT data to compare the performance of a complete case χ2 test to a GLMM in terms of power, type I error, relative bias, and coverage under different missing data mechanisms (missing completely at random and missing at random). We considered how the baseline probability of the event, within subject correlation, and dropout rates under various missing mechanisms impacted each performance measure.


When outcome data were missing completely at random, both χ2 and GLMM produced unbiased estimates; however, the GLMM returned an absolute power gain up to from 12.0% as compared to the χ2 test. When outcome data were missing at random, the GLMM yielded an absolute power gain up to 42.7% and estimates were unbiased or less biased compared to the χ2 test.


Investigators wishing to test for a treatment effect between treatment arms in longitudinal RCTs with binary outcome data in the presence of missing data should use a GLMM to gain power and produce minimally unbiased estimates instead of a complete case χ2 test.

Peer Review reports


Complete binary outcomes

Association of response and treatment at a single time point in a randomized clinical trial (RCT) with binary outcomes can be analyzed by using a χ2test of association, methods of moments generalized estimating equations (GEE), or likelihood based generalized linear mixed models (GLMM). When the data are complete, estimates of the treatment effect are unbiased [1] for any of these methods. However, in longitudinal RCTs, complete data rarely exist. A recent review found that 95% of RCTs in the top 4 medical journals had reported some amount of missing outcome data [2]. Potential consequences of missing data include decreased precision, lower power, and biased estimates [3], which can lead to improper inferences of between and within arm effects.

Mechanisms for missing data

Rubin [4] has defined the probability mechanisms for missing data into three categories: missing completely at random (MCAR), missing at random (MAR), and not missing at random (MNAR). Under MCAR, the probability of missingness does not depend on the unobserved or observed data. An example of MCAR data would be if a researcher lost part of their data set to a computer crash. For MAR, the probability of data being missing depends on the set of prior observed responses but is unrelated to the responses that would have been obtained. An example of MAR data would be that a participant misses an appointment because he or she was too sick (or healthy) at the previous appointment and didn’t want to attend the current one. MNAR data are data where the probability of missingness is related to the missing data itself, even after taking account of observed data.

Statistical approaches to longitudinal binary data with missing data

Weighted GEE (wGEE) or multiple imputation (MI) with GEE are valid options to analyze incomplete longitudinal binary data, as long as models are correctly specified. The unbiasedness of a χ2test and GEE is based on the assumption that missing outcome data are MCAR [1], but this strong assumption is often unrealistic in longitudinal RCTs. A more appropriate assumption for missingness is MAR, where unbiased estimates can be obtained by GLMM, wGEE, and MI with GEE [1].

Researchers have studied the effects of MCAR and MAR missing data on the analyses of binary data from longitudinal RCTs for GEE, extensions to GEE including MI with GEE, and GLMM. A simulation study done by Beunckens et al. showed that in small to moderate sample size (n = 50 per treatment arm) with MAR data, MI with GEE was less biased and more precise compared to wGEE [5]. Lipkovich et al. demonstrated that under MAR, MI with GEE resulted in less biased estimates, higher power, and smaller Type I error rates compared to unweighted GEE and GLMM [6]. A more recent paper by Liu and Zhan simulated repeated binary responses with missing data under MCAR and MAR, and compared the Type I error rate and power obtained from GLMM (full and pseudo restricted likelihood), unweighted GEE, and several MI approaches. Under MAR, pseudo-likelihood GLMM performed better than GEE and MI with logistic regression in terms of controlling Type I error rate and power [7]. Additionally, bias under GLMM and MI with logistic regression was well controlled.

Although the effects of missing data on power and between arm estimates in longitudinal RCTs with binary outcomes have been investigated with statistical analyses such as MI with GEE, GEE, wGEE, and GLMM, little research has been done to compare power and bias of treatment effects between the χ2 method and GLMM. We chose χ2for our analysis because it is the most common statistical analysis for between arm treatment effect in RCTs with binary outcomes [2]. The χ2 is the most commonly used statistical analysis of longitudinal binary RCT outcome data because we desire the unadjusted between arm treatment effect at time point j. However, the χ2 test ignores the within subject correlation that is inherent to longitudinal RCT data and excludes any outcome data that are missing from the analysis. In contrast, generalized linear mixed models (GLMM), which are an extension of generalized linear models, account for the within-subject association by introduction a subset of regression coefficients (random effects) that vary randomly across individuals. Inferences of fixed effects are conditional on random effects and have subject specific interpretations. We used GLMM (which produces subject-specific treatment effects by default) as a comparative method instead of GEE (which produces marginal treatment effects) because it is known that under MAR, non-likelihood based GEE will yield biased estimates of the mean response [1].

We hypothesized that power would be higher and estimates of the between arm treatment effect would be less biased for pseudo-likelihood GLMM compared to the χ2 test when missing data are present. Our rationale comes from a simulation study done by Ashbeck et al., where the effects of missing data at a fixed time point in a longitudinal RCT with continuous outcomes was studied. Power and bias were compared between a complete case two sample t-test and mixed model for repeated measures (MMRM) in the presence of missing data. When data were MCAR, estimates remained unbiased for both tests; however, MMRM had an absolute power gain up to 12%. MMRM outperformed t-test when data were MAR in terms of less biased estimates and higher power [8]. We suspect that power and bias obtained from complete case χ2 and GLMM will behave in a similar manner as presented by Ashbeck et al.

This paper follows best practices for simulation studies as outlined by Morris et al. [9] and is organized as follows. Section 2 describes the aims, data generation, statistical analysis, and performance measures used in this paper. Analytic results of performance measures from each simulation scenario under each missing mechanism are presented in Section 3. Section 4 discusses results and concluding remarks are presented in Section 5.



The primary aims of this simulation study were to evaluate the impact of MCAR and MAR data on the a) power to detect a treatment effect between two treatment arms at the final time point and b) estimated treatment effect between arms obtained from χ2 and GLMM analyses. We investigated the sensitivity of our results to the control arm proportion and correlation between repeated measures.

Data generation

A two-arm randomized controlled trial with a 1:1 treatment allocation ratio was used to generate three complete repeated binary outcomes for each individual, where the third measurement was the primary time point for analysis. Let Yij be an observed binary outcome for the ith (i = 1, …,N) subject at the j = 1,..,3 occasion in a longitudinal RCT with treatment variable Xi = 0 (control) or 1 (treatment), where the number of visits was fixed and the same for all participants. Responses between subjects were assumed to be independent, but responses within subjects were correlated. Let pij be the probability of the event for the ith person at the jth time point and Yij = 1 represent that a positive outcome has occurred; conversely, Yij = 0 indicates the participant experienced a negative outcome.

At baseline, a random pi1 (baseline probability of event-also known as prevalence rate) for each individual was generated from a Beta(a, b) distribution. While pi1 varied between participants, average pi1 across all participants was 0.1 or 0.5, regardless of treatment arm (i.e. pi1 = 0.1 or 0.5). Next, a random yi1 was generated from a Binomial(1, pi1) distribution. The expected value of Yi1 is \( {p}_{i1}=\frac{a}{a+b} \) and correlation between two repeated measures is ρ \( =\frac{1}{1+a+b} \). The values for a and b were fixed to produce a prevalence rate of 0.1 or 0.5 and ρ of 0.3 or 0.7. The first N/2 participants were assigned to the treatment and the latter half were assigned to the control arm.

For each subsequent time point j (j = 2, 3), the probability of Yij = 1 was derived from the following formula:

$$ {p}_{ij}={p}_{i\left(j-1\right)}+0.05\ast {X}_i $$

The fixed coefficient 0.05 induced a linear increase of pi1 between arms at j. A random yij was then generated from a binomial (1, pij) distribution. The simulation produced a fixed end-of-study (j = 3) risk difference of 0.10. Probabilities generated from [1] for the treatment arm increased linearly but remained constant for the control arm (see Fig. 1). In order for the χ2 test to have 80% power to detect an end-of-study risk difference of 0.1, a total sample size of 398 (pi1 = 0.1) or 776 (pi1 = 0.5) was needed, assuming no continuity correction. Additionally, we generated data under the null hypothesis of no risk difference at any time point. We used the same data generation technique as described above to generate Yij, except pij was the same for each time point. Therefore, there was no difference in the within and between arm treatment effect at any time point.

Fig. 1
figure 1

Simulated trajectories of probability of Yij for pi1 = 0.1 (solid line) and pi1 = 0.5 (dashed line)

Missing data generation

In order to investigate the effects of missing data on power and bias of estimates obtained from χ2 and GLMM analyses, observations from complete data sets at end-of-study were deleted under MCAR and MAR mechanisms. Rates of missingness were broken into two categories: equal and unequal. Equal dropout rates had 30% of observations deleted per arm. Unequal dropout rates were further separated into two subcategories – unequal dropouts 1 (20% missing from control arm and 40% missing from treatment arm) and unequal dropouts 2 (40% missing from control arm and 20% missing from treatment arm). Data sets under each pi1 and ρ combination were constructed under the following scenarios:

  1. 1)

    No missing data. One thousand complete data sets were generated under each pi1 (0.1 and 0.5) with corresponding ρ (0.3 and 0.7) using the data generation method described in the Data Generation section.

  2. 2)

    Missing completely at random (MCAR). The probability of missingness was unrelated to previous or current values of Y and treatment arm. Observations from each complete dataset at the third time point were selected for deletion via simple random sampling.

  3. 3)

    Missing at random (MAR). Outcomes at time point three were deleted based on the previous value of Y (Yi2) and treatment arm. The MAR mechanism was modeled as

    $$ logit\left(\Pr \left({y}_{i3}=.|{y}_{i2},{X}_i\right)\right)={\gamma}_1+{\gamma}_2{y}_{i2}+{\gamma}_3{X}_i $$

where Pr(yi3 = .| yi2, Xi) is the probability of an observation being deleted at the third time point. Therefore, the probability of missingness depended on the previous observation (yi2 = 0 or 1) and treatment assignment (Xi = 0 or 1). The coefficient γ2 was fixed prior to data generation while γ1 and γ3 were varied to achieve desired rates of missingness. After obtaining Pr(yi3 = .| yi2, Xi), a Bernoulli random variable was generated to determine if the participant had dropped out at the third time point. We varied the values of γ2 to determine its impact on power and bias of treatment effect.

  1. a.

    MAR1 - The value of γ2 was set to 1.5. Individuals who responded positively at time point 2 (yi2 = 1) would have 4.5 times the odds of having a missing observation at time point 3 compared to those who responded poorly (yi2 = 0). Therefore, a positive response at time point 2 indicates that the participant’s third observation was more likely to be deleted.

  2. b.

    MAR2 - The value of γ2 was set to − 1.5. Individuals who responded with yi2 = 1 have 0.22 times the odds of having a missing response at time point 3 compared to individuals who responded yi2 = 0. Thus, participants who responded negatively at the prior time point were more likely to have their third observation deleted from the data set.

MNAR data were excluded from the data generating process. When data are assumed to be MNAR, focus shifts from primary analysis (χ2 and GLMM) to a sensitivity analysis, which can require complex models such as selection or pattern mixture models [10]. This is not the focus of this paper and examples of the analysis for longitudinal binary data with MNAR missing outcomes can be found elsewhere [11].

Statistical analyses

We tested the null hypothesis of no treatment effect between treatment arms at the final time point and estimated the treatment effect through a log-odds ratio (log-OR), which captured the likelihood of Yi3 = 1 in the treatment arm compared to the control. The log-ORs associated with χ2 were calculated from a 2 × 2 contingency table and through GLMM with the use of a contrast. Estimates obtained from the complete data sets were defined as the “true” treatment effect and used for comparison purposes.

χ2 test for association

Let Yij = 1 represent the occurrence of event for the ith patient at time point j and Yij = 0 otherwise. For each replicate, observed counts of responses for each treatment arm at time point j = 3 were organized into a 2 × 2 contingency table, which was used to calculate a Pearson χ2 test statistic and other measures (explained below in the Performance Measures section).

Generalized linear mixed model

Let Yij be a binary response for the ith individual at time point j. Then the response probability Pr(Yij = 1|bi) can be modeled as follows:

$$ logit\left(\Pr \left({Y}_{ij}=1|{b}_i\right)\right)={\beta}_1+{\beta}_2{X}_i+{\beta}_3{t}_{ij}+{\beta}_4{t}_{ij}{X}_i+{b}_i $$

where tij represent the continuous time measurement for the ith individual (i = 1, …,N) at the jth time point (j = 1,..,3), bi is the between-subject random effect, and β4 is the log-OR between arms at time point j. It is assumed that \( {b}_i\sim N\Big(0,{\sigma}_b^2 \)) and denotes the variability between individuals in the baseline probability.

Performance measures

We ran 1000 simulations for all combinations of pi1 and ρ, under each missing mechanism and dropout rate. From each analysis, we estimated the treatment effect (log-OR), the Wald statistic and its associated p-value, standard error, and 95% confidence intervals. We then calculated the mean treatment effect as the average log-OR (sum of log-OR/1000). We used this information to construct the following performance measures for each statistical test.

  • Power and Type I error. Given that the data were simulated under the alternative (risk difference = 0.10) and null (risk difference = 0) hypotheses, we assessed power and type I error rate, which was calculated as the percentage of p-values < 0.05 (# of p-values < 0.05/1000 * 100%) under each hypothesis.

  • Treatment effect and relative bias. Treatment effects from χ2 test were calculated from 2 × 2 contingency tables while estimates from the GLMM were obtained by the use of a contrast. Relative bias was defined as \( \frac{\overline{\hat{\theta}}-{\theta}_T}{\theta_T}\ast 100\% \), where θT is the mean log-odds ratio derived from the complete data set and \( \overline{\hat{\theta}} \) is the mean log-odds ratio obtained from the incomplete data sets.

  • Coverage was defined as the percentage of confidence intervals that contained θT, the mean log-OR obtained from the complete data set.

  • Model-based and empirical standard error, which measure the precision of the treatment effect. Model SE was calculated as the square root of the average of the variances from all 1000 replicates and empirical SE was derived by the traditional standard deviation calculation with no distinction between replicates. We expect these standard errors to be equivalent.

The main performance measures were relative bias and power for the log-odds comparing treatment versus control. We consider a statistical method superior if it has higher power (compared to other method) and minimal bias. We will tolerate small biases if power gained is substantial. While there is no definite range of bias that is acceptable, we will consider an estimate minimally biased if relative bias falls between ±10% [7]. If a situation arises where power gained is moderate but estimates are clearly biased, preference of test will default to the one that reduces bias.

All simulations and analyses were performed in SAS version 9.4 and are available through the online Appendix. The χ2 method and its associated performance measures were implemented via PROC FREQ. Model parameters (covariance matrix and fixed effects) were estimated using restricted pseudo-likelihood (RPL) GLMM with random intercept and compound symmetric variance-covariance matrix through PROC GLIMMIX with the ESTIMATE statement.

We chose a compound symmetric model as it is appropriate when the mean response derived from a generalized linear mixed model depends on the population parameters β and a single random effect bi. We believe that the use of an unstructured model would not substantially change the results and would unnecessarily increase computation time due to the increase in matrix parameters.

Other likelihood estimation techniques, such as Laplace and adaptive quadrature, are available in SAS. However, we chose RPL because it is the default estimation method used in PROC GLIMMIX [12]. Although other estimation techniques may be used that maximize the full-likelihood, these techniques can be computationally intensive because the integrals are approximated numerically. Since we are not maximizing a true likelihood function, a Wald statistic was used to test for treatment effect instead of the likelihood ratio test.


For data simulated under the alternative hypothesis, summary of performance measures for χ2 and GLMM methods for each prevalence and correlation combination under each missing mechanism and rates of missingness are presented in Tables 1234. We present the Type I error rates in Table 5.

Table 1 Summary of empirical estimates of log-OR and relative bias (%), power (%), coverage (%), model standard error and empirical standard error derived from χ2 method and GLMM test with compound symmetric variance-covariance matrix from 1000 simulations with a total sample size N = 398
Table 2 Summary of empirical estimates of log-OR and relative bias (%), power (%) and MC standard error, coverage (%), model standard error and empirical standard error derived from χ2 method and GLMM test with compound symmetric variance-covariance matrix from 1000 simulations with a total sample size N = 398
Table 3 Summary of empirical estimates of log-OR and relative bias (%), power (%) and MC standard error, coverage (%), model standard error and empirical standard error derived from χ2 method and GLMM test with compound symmetric variance-covariance matrix from 1000 simulations with a total sample size N = 776
Table 4 Summary of empirical estimates of log-OR and relative bias (%), power (%) and MC standard error, coverage (%), model standard error and empirical standard error derived from χ2 method and GLMM test with compound symmetric variance-covariance matrix from 1000 simulations with a total sample size N = 776
Table 5 Summary of empirical Type I error rates (%) derived from χ2 method and GLMM test with compound symmetric variance-covariance matrix from 1000 simulations with a total sample size N = 398 (pi1 = 0.1) and N = 776 (pi1 = 0.5)


The GLMM method, on average, performed better than the χ2 test in terms of power gain and reduction of bias of the estimates under MCAR. In most missing scenarios, coverage was at or above the expected 95%, and empirical and model-based standard errors were roughly equivalent for both methods. Type I error rates ranged from 1.7 to 6.2%, where all rates obtained from the GLMM analyses fell under the anticipated value of 5% (1.7–3.8%). While estimates derived from both 2 × 2 tables and GLMM under MCAR were unbiased, a maximum absolute power gain of 12.0% was obtained from GLMM method over the χ2 test.


When outcome data were MAR, the GLMM outperformed χ2 in terms of absolute power gain, controlling the Type I error rate, and reduction of bias. We saw absolute power gain peak at 42.7% (pi1 = 0.5, ρ = 0.7, MAR1, unequal dropout 1). While not quite as high, the absolute power gained by using the GLMM test over χ2 when correlation was lower (ρ = 0.3) was still substantial at 25.5% (pi1 = 0.5, MAR1, unequal dropout 1). The GLMM analysis had the lowest Type I error rate (1.8%) and none of the values were above 5%. The χ2 test inflated the Type I error rate under unequal dropout rates, with the most severe cases (maximum 27.6%) occurring when pi1 = 0.5. Biased estimates were obtained from the GLMM, ranging from − 13.3 to 18.3%. However, the χ2 test was unable to control bias as well as the GLMM, with relative bias ranging from − 58.5 to 64.6%.

MAR, low prevalence

Under MAR simulation where the prevalence rate was low (pi1 = 0.1), the advantage of the GLMM test over the χ2 was apparent when ρ = 0.7, with an absolute power gain of 21.1%, while only a 1.0% power gain occurred when ρ = 0.3. In addition to higher power, the GLMM estimates were typically unbiased while log-ORs calculated from the 2 × 2 tables were not. The magnitude of the relative bias increased as ρ did, while the directionality of the bias (over or underestimated treatment effect) was affected by the missing mechanisms (MAR1 vs MAR2) and dropout rates, with relative bias ranging from − 12.0 to 18.0% from the GLMM and − 27.2 to 50.3% from the χ2 method. Treatment effects from both methods were overestimated when ρ = 0.7 under MAR1, equal and unequal dropout1 and underestimated under MAR2, equal and unequal dropout2. Biased estimates affected the percentage of confidence intervals that contained the true log-ORs, specifically when the analysis was performed with the χ2 method, with coverage as low as 89.1%. Standard errors were approximately equal under both methods. The GLMM better controlled the Type I error rates (1.8 to 3.0%) compared to the χ2 test (4.7 to 8.3%). See Tables 12 and 5.

MAR, high prevalence

As baseline probability of event increased from 0.1 to 0.5, so did the need to use GLMM to control for bias and increase power. Relative bias was near acceptable limits, ranging from − 13.3 to 13.7%, while relative bias from χ2 analysis ranged from − 58.5 to 64.3%. The bounds of these intervals occurred when ρ = 0.7 and dropout rates were unequal. The power gained from the GLMM analysis over χ2 was substantial, with an absolute power difference of 42.7% when ρ = 0.7 (MAR1, unequal dropouts 1). In general, power increased by using the GLMM over χ2. However, there were scenarios where power was lost, specifically under MAR1 unequal dropouts 2 and MAR2 unequal dropouts 1. We caution the reader to not focus solely on power gain and instead look at the whole picture, as treatment effects obtained from the χ2 analysis are severely biased, coverage is not adequate, and type I error are severely inflated for these scenarios. See Tables 34, and 5.


We compared bias and power of the GLMM and χ2 methods with binary outcomes from longitudinal RCT data in the presence of missing data. We found that the GLMM, on average, produced the least biased estimates, had higher power to detect a treatment effect compared to the χ2 test without increasing Type I error rate, and coverage was at or above the anticipated 95%. We saw similar conclusions presented by Liu et al. [7] when discussing relative bias and coverage between a GLMM analysis compared to GEE for longitudinal binary outcomes with MCAR and MAR data.

Standard errors within and between each method were roughly equivalent when the prevalence rate was low, but the GLMM standard errors were higher than the χ2 standard errors when the data were simulated under a high prevalence rate. We would expect the standard errors from the GLMM analysis to be smaller than the standard errors calculated from the χ2 method, as the GLMM accounts for the within-subject correlation. However, estimates derived from the GLMM analysis have a subject-specific interpretation, which are larger in absolute value than the marginal estimates [1] obtained from the χ2 test. The standard error is a function of the estimate; therefore it is not unexpected that the standard errors associated with the GLMM were larger than those obtained from the χ2 analysis. That said, it is not appropriate to compare standard errors between the two tests due to the interpretation of these estimates.

It should be noted that the GLMM model is a more flexible model that can include covariates; therefore, we might expect it to perform better than the χ2 method, which does not include covariates. However, the primary analysis of the between arm treatment effect in a longitudinal RCT is typically unadjusted. We felt it was appropriate to consider the χ2 model as the standard to compare the GLMM method to, as it is the most commonly used statistical analysis in RCTs [2].


As expected, unbiased estimates of treatment effect were produced from both GLMM and χ2 tests for all parameter combinations of pi1 and ρ when missing outcome data were MCAR. Given that the observations selected for deletion were chosen at random, both complete case and mixed model methods provided accurate log-ORs [1]. GLMM was able to incorporate correlation between measurements (via random effects), resulting in higher power compared to the χ2 test.


Estimates of log-ORs at the final time point calculated from a 2 × 2 table were biased under both MAR1 and MAR2. Treatment effects obtained from the GLMM analysis were less biased and few exceeded the relative bias threshold of ±10%.Extreme bias occurred when correlation between measurements was high. We note that the directionality of bias was not consistent in MAR1 and MAR2 when prevalence rate fluctuated between 0.1 and 0.5.However, this does not change our recommendation that GLMM be used as the primary statistical method to estimate treatment effect. While in practice we cannot adequately state why an observation is missing unless detailed follow up occurred, we can clearly see that GLMM minimizes bias under MAR for all dropout scenarios.

Substantial power was gained by using GLMM over χ2 in most missing scenarios, with values ranging from 1.0 to 42.7%. However, GLMM lost power compared to χ2 when pi1 = 0.5 and a) MAR1 and dropouts unequal 1 and b) MAR2 and dropouts unequal 2. This increase in power from the χ2 test directly corresponds to the inflated treatment effect. We cannot recommend a test that overestimates the effect of a treatment, no matter how powerful the test may be.

Strengths and limitations

Liu and Zhan [7] reported on the bias and power gained comparing a variety of statistical methods, such as logistic regression, GEE, and GLMM, for the analysis of repeated binary data in the presence of missing data. However, they did not consider the impact of different directions of dropout under MAR, which we consider an important strength to our simulation study.

A limitation to this study is the interpretation of treatment effects from each method. Odds ratios associated with χ2 analysis have population-specific interpretations while those obtained from GLMM are subject-specific. It is known that subject-specific estimates will be larger in absolute value than population-specific estimates [1]. Relative bias of population-specific estimates from GLMM may not be equivalent to what was obtained from subject-specific analysis. As is true for all simulation studies, we could not investigate every possible scenario. We investigated one possible trajectory for the treatment and control arms. We did not consider how other estimation methods (e.g., Laplace and adaptive quadrature) and tests for treatment effects (e.g., likelihood ratio test) impact bias and other performance measures. Finally, we did not investigate covariance patterns beyond compound symmetry.

A possible extension to this area of study would be to include the outcome of event at baseline (Yi1) as a covariate in the analysis of repeated binary data with missing outcomes, sometimes referred to as ANCOVA. Jiang et al [13]. evaluated the effect of baseline adjusted and unadjusted analyses on longitudinal binary data in the presence of missing data. Analyses included logistic regression with last observation carried forward, GLMM, GEE, wGEE, and MI with GEE. On average, adjusted analyses yielded less biased estimates and increased power compared to their unadjusted counterparts. Future analyses of incomplete repeated binary data may include a logistic regression model and a GLMM, both with adjustment for baseline outcome.

A limitation, raised by a reviewer, is that we did not consider Bayesian models for the analysis of longitudinal binary RCT outcome data. We acknowledge that while Bayesian analyses may result in higher power and less bias of estimates compared to frequentist methods, they are less commonly used in analysis of longitudinal RCT. We leave the investigation of this and other potential factors to future research.


In this paper, we considered the impact of missing outcome data on power, bias, coverage, and standard errors obtained from the χ2 test and generalized linear mixed model. Based on the results from these simulations, we recommend the use of a GLMM for the primary analysis of longitudinal binary data in the presence of missing data, which was robust against missing data and provided less bias and higher power than χ2 methods.

Availability of data and materials

This was a simulation study, therefore there are no real data. However, simulation code can be found in the appendix (see Additional file 1).



Generalized estimating equations


Generalized linear mixed models


Log-odds ratios


Missing at random


Missing completely at random


Multiple imputation


Mixed model for repeated measures


Not missing at random


Randomized clinical trial


Restricted pseudo-likelihood


weighted GEE


  1. Fitzmaurice GM, Laird NM, Ware JH. Applied Longitudinal Analysis. Hoboken: Wiley; 2011.

  2. Bell ML, Fiero M, Horton NJ, Hsu C-H. Handling missing data in RCTs; a review of the top medical journals. BMC Med Res Methodol. 2014;14(1):118.

    Article  PubMed  PubMed Central  Google Scholar 

  3. Bell ML, Fairclough DL. Practical and statistical issues in missing data for longitudinal patient-reported outcomes. Stat Methods Med Res. 2014;23(5):440–59.

    Article  PubMed  Google Scholar 

  4. Rubin DB. Biometrika Trust Inference and Missing Data Author ( s ): Donald B . Rubin Published by : Oxford University Press on behalf of Biometrika Trust Stable URL : Accessed : 12-06-2016 21 : 34 UTC. Biometrika. 1976;63(3):581–592.

  5. Beunckens C, Sotto C, Molenberghs G. A simulation study comparing weighted estimating equations with multiple imputation based estimating equations for longitudinal binary data. Comput Stat Data Anal. 2008;52(3):1533–48.

    Article  Google Scholar 

  6. Lipkovich I, Duan Y, Ahmed S. Multiple imputation compared with restricted pseudo-likelihood and generalized estimating equations for analysis of binary repeated measures in clinical studies. Pharm Stat. 2005;4(4):267–85.

    Article  Google Scholar 

  7. Frank Liu G, Zhan X. Comparisons of methods for analysis of repeated binary responses with missing data. J Biopharm Stat. 2011;21(3):371–92.

    Article  CAS  PubMed  Google Scholar 

  8. Ashbeck EL, Bell ML. Single time point comparisons in longitudinal randomized controlled trials: power and bias in the presence of missing data. BMC Med Res Methodol. 2016;16(1):1–8.

    Article  Google Scholar 

  9. Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med. 2019;(November 2017).

  10. Molenberghs G, Kenward MG. Missing data in clinical studies. Chichester: Wiley; 2007.

    Book  Google Scholar 

  11. Parzen M, Lipsitz SR, Fitzmaurice GM, Ibrahim JG, Troxel A. Pseudo-likelihood methods for longitudinal binary data with non-ignorable missing responses and covariates. Stat Med. 2006;25(16):2784–96.

    Article  PubMed  Google Scholar 

  12. PROC GLIMMIX: Default Estimation Techniques :: SAS/STAT(R) 9.2 User’s Guide, Second Edition Accessed 24 Mar 2019.

  13. Jiang H, Kulkarni PM, Mallinckrodt CH, Shurzinske L, Molenberghs G, Lipkovich I. Adjusting for baseline on the analysis of repeated binary responses with missing data. Stat Biopharm Res. 2015;7(3):238–50.

    Article  Google Scholar 

Download references





Author information

Authors and Affiliations



MLB conceived of the idea, supervised simulations, and edited the manuscript. MLM undertook the simulation studies and wrote the majority of the manuscript. DJR, CH provided contextual advice and edited the manuscript. All authors read and approved of the final manuscript.

Authors’ information

MLM was a graduate student in biostatistics, at the time of submission.

MLB, DJR, CH are Professors of Biostatistics

Corresponding author

Correspondence to Mary L. Miller.

Ethics declarations

Ethics approval and consent to participate

Not applicable

Consent for publication

Not applicable

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1.

Supplemental SAS simulation and analysis.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Miller, M.L., Roe, D.J., Hu, C. et al. Power difference in a χ2 test vs generalized linear mixed model in the presence of missing data – a simulation study. BMC Med Res Methodol 20, 50 (2020).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI:


  • Complete-case
  • Missing data
  • Binary data
  • Generalized linear mixed model
  • Chi-squared test
  • Power
  • Relative bias
  • Longitudinal