Separation and Firth’s logistic regression
Logistic regression models the probability that Y_{i} = 1 for independent binary outcomes Y_{i}, i = 1, …, N, with values y_{i} ∈ {0, 1}, given (p + 1)-dimensional row vectors of covariate values x_{i} = (1, x_{i1}, …, x_{ip}), by assuming
$$P\left({Y}_i=1\mid {x}_i\right)={\left(1+\exp \left(-{x}_i\beta \right)\right)}^{-1}={\pi}_i\left(\beta \right),$$
where β = (β_{0}, β_{1}, …, β_{p})^{′} is a vector of regression coefficients. Maximum likelihood estimates for β can be obtained by solving the score equations. Albert and Anderson found that finite maximum likelihood estimates exist if and only if the data are not separable, i.e. if there exists no hyperplane defined by a linear combination of covariates that separates events and non-events [1]. There are two main drivers for the occurrence of separation: the presence of strong effects and a low amount of information available in the data, manifested, e.g., by a small sample size or rare occurrence of one of the two levels of the outcome variable or of a binary covariate. FL, which was originally proposed to reduce the bias of maximum likelihood estimates [2], has been propagated as a reliable alternative to maximum likelihood estimation [5] as it always provides finite coefficient estimates. With FL, the likelihood is penalized by the Jeffreys invariant prior, i.e. by the square root of the determinant of the Fisher information matrix, det(I(β))^{1/2}. On the level of the score equations, FL adds 1/2 trace(I(β)^{−1} (∂I(β)/∂β)) to the first derivative of the log-likelihood. By proving that the logarithm of the Jeffreys prior \(\frac{1}{2}\ \log \det \left(I\left(\beta \right)\right)\) tends to −∞ if one of the components of β approaches ±∞, Kosmidis and Firth showed that the penalized log-likelihood is always maximized by finite β, i.e. FL always provides finite coefficient estimates [6]. Another convenient property of FL, which is, for instance, not shared by other proposals for penalized likelihood logistic regression, is that FL is transformation invariant. This means that if G is an invertible (p + 1) × (p + 1) matrix, \({\hat{\beta}}_X\) the FL coefficient estimate for the design matrix X with rows x_{i}, i = 1, …, N, and \({\hat{\beta}}_{XG}\) the FL coefficient estimate for the design matrix X · G, then we have \({\hat{\beta}}_{XG}={G}^{-1}\cdot {\hat{\beta}}_X\).
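As an illustration of the modified score, the following minimal sketch fits FL for ordinary logistic regression by a plain Newton iteration, using the well-known hat-diagonal form of the penalized score that holds for logistic models. This is our own illustrative code, not the implementation of any package discussed later.

```python
import numpy as np

def firth_logistic(X, y, max_iter=100, tol=1e-8):
    """Minimal sketch of Firth-penalized logistic regression.

    For logistic models the penalized score has the hat-diagonal form
    U*_r = sum_i (y_i - pi_i + h_i (1/2 - pi_i)) x_ir, where h_i is the
    i-th diagonal element of W^{1/2} X (X'WX)^{-1} X' W^{1/2}.
    X must contain the intercept column.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = pi * (1.0 - pi)                    # diagonal of W
        xtwx_inv = np.linalg.inv(X.T @ (w[:, None] * X))
        # h_i = w_i * x_i' (X'WX)^{-1} x_i
        h = w * np.einsum('ij,jk,ik->i', X, xtwx_inv, X)
        step = xtwx_inv @ (X.T @ (y - pi + h * (0.5 - pi)))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Completely separated toy data: maximum likelihood diverges,
# Firth's penalization still yields finite estimates.
X = np.column_stack([np.ones(6), [-3., -2., -1., 1., 2., 3.]])
y = np.array([0., 0., 0., 1., 1., 1.])
beta_hat = firth_logistic(X, y)
```

On such separated data an unpenalized fit would drive the slope towards infinity, whereas the penalized iteration settles at a finite value.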
FL is equivalent to maximum likelihood estimation with an appropriately augmented and weighted data set \(\left(\overset{\sim }{y},\overset{\sim }{X}\right)\) containing 3N observations, basically obtained by stacking three copies of the original data set [7]. For j ∈ {1, …, 3N} we denote by i_{j} the integer in {1, …, N} such that j ∈ {i_{j}, N + i_{j}, 2N + i_{j}}. The covariate row vectors \(\tilde{x}_j,\) outcomes \(\tilde{y}_j\) and the weights \(\tilde{w}_j\), j ∈ {1, …, 3N}, of the augmented data set \(\left(\overset{\sim }{y},\overset{\sim }{X}\right)\) are defined by \(\tilde{x}_j:= {x}_{i_j},\)
$$\tilde{y}_j:= \left\{\begin{array}{c}{y}_{i_j},\kern0.5em \mathrm{if}\ j={i}_j\ \mathrm{or}\ j=N+{i}_j,\\ {}1-{y}_{i_j},\kern0.5em \mathrm{if}\ j=2N+{i}_j\end{array}\right.$$
and
$$\tilde{w}_j:= \left\{\begin{array}{c}1,\kern0.5em \mathrm{if}\ j={i}_j,\\ {}{h}_{i_j}/2,\kern0.5em \mathrm{if}\ j=N+{i}_j\ \mathrm{or}\ j=2N+{i}_j,\end{array}\right.$$
where h_{i} denotes the ith diagonal element of the hat matrix \(\overset{\sim }{H}={W}^{1/2}X\ {\left({X}^{\prime } WX\right)}^{-1}{X}^{\prime }{W}^{1/2}\), with X the N × (p + 1) design matrix and W the diagonal matrix with elements π_{i}(1 − π_{i}).
Since the contribution of the data augmentation is asymptotically negligible, approximate standard errors of the estimates can be obtained as the square roots of the diagonal elements of (X^{′}WX)^{−1}, where the elements \({\hat{\pi}}_i\left(1{\hat{\pi}}_i\right)\) in the N × N matrix W are obtained using the coefficient estimates from the final iteration.
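The equivalence can be checked numerically: at any fixed β, the weighted score of the augmented data set equals Firth's modified score on the original data. The following sketch demonstrates this identity (all variable names and simulated values are our own):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 8, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = rng.integers(0, 2, size=N).astype(float)
beta = rng.normal(size=p + 1)          # an arbitrary parameter value

pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
w = pi * (1.0 - pi)
# hat-matrix diagonals h_i at this beta
h = w * np.einsum('ij,jk,ik->i', X,
                  np.linalg.inv(X.T @ (w[:, None] * X)), X)

# Augmented data: three stacked copies; the third copy has flipped
# outcomes, and copies two and three carry weight h_i / 2.
X_aug = np.vstack([X, X, X])
y_aug = np.concatenate([y, y, 1.0 - y])
w_aug = np.concatenate([np.ones(N), h / 2.0, h / 2.0])

# Weighted ML score on the augmented data ...
pi_aug = 1.0 / (1.0 + np.exp(-(X_aug @ beta)))
score_aug = X_aug.T @ (w_aug * (y_aug - pi_aug))

# ... coincides with the Firth-modified score on the original data.
score_firth = X.T @ (y - pi + h * (0.5 - pi))
```

Per observation, the three weighted contributions sum to (y_i − π_i) + (h_i/2)(y_i − π_i) + (h_i/2)(1 − y_i − π_i) = y_i − π_i + h_i(1/2 − π_i), which is exactly the modified score term.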
Generalized estimating equations
We will now slightly adapt our notation to the situation of clustered data. Let N be the number of clusters and let n_{i} be the number of observations in the ith cluster. We denote by \({y}_i={\left({y}_{i1},\dots, {y}_{in_i}\right)}^{\prime }\) the n_{i}-dimensional vector of binary outcomes in the ith cluster, by x_{ij} = (1, x_{ij1}, …, x_{ijp})^{′} the (p + 1)-dimensional vector of covariates for the jth observation in the ith cluster and by X_{i} the (n_{i} × (p + 1))-dimensional matrix with rows x_{ij}^{′}, i = 1, …, N and j = 1, …, n_{i}. The marginal logistic regression model relates the conditional expectation P(Y_{ij} = 1 ∣ x_{ij}) to the vector of covariates by assuming P(Y_{ij} = 1 ∣ x_{ij}) = (1 + exp(−x_{ij}^{′}β))^{−1} for a (p + 1)-dimensional vector of parameters β = (β_{0}, β_{1}, …, β_{p})^{′} [8]. Using a working correlation structure expressed as an (n_{i} × n_{i})-dimensional matrix R_{i}(α) with some parameter vector α, Liang and Zeger [9] showed that the parameters β can be consistently estimated by solving the generalized estimating equations
$$\sum_{i=1}^N{X_i}^{\prime }{W}_i\ {\left({W}_i^{1/2}{R}_i\left(\alpha \right){W}_i^{1/2}\right)}^{-1}\left({y}_i-{\pi}_i\right)=0,\kern0.5em (1)$$
where π_{i} is the n_{i}-dimensional vector with components π_{ij} = (1 + exp(−x_{ij}^{′}β))^{−1} and W_{i} is the (n_{i} × n_{i})-dimensional diagonal matrix with elements π_{ij}(1 − π_{ij}). Assuming an independent working correlation, i.e. setting R_{i}(α) to the identity matrix, the estimating equations (1) reduce to the score equations for logistic regression with independent observations.
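The standard GEE fitting scheme, iterating between a moment-based update of α and a Newton step for β, can be sketched as follows for an exchangeable working correlation. This is illustrative code without the safeguards of production GEE software; the simulated data are our own.

```python
import numpy as np

def fit_gee_exchangeable(X_list, y_list, max_iter=50, tol=1e-6):
    """Sketch of GEE for a marginal logistic model with an
    exchangeable working correlation structure."""
    p = X_list[0].shape[1]
    beta = np.zeros(p)
    alpha = 0.0
    for _ in range(max_iter):
        # moment-based estimate of alpha from Pearson residuals
        terms = []
        for X, y in zip(X_list, y_list):
            pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
            e = (y - pi) / np.sqrt(pi * (1.0 - pi))
            n_i = len(y)
            if n_i > 1:       # singleton clusters carry no pair information
                terms.append((np.sum(e) ** 2 - np.sum(e ** 2))
                             / (n_i * (n_i - 1)))
        alpha = np.mean(terms)
        # one Newton step for beta given alpha
        A = np.zeros((p, p))
        U = np.zeros(p)
        for X, y in zip(X_list, y_list):
            pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
            sw = np.sqrt(pi * (1.0 - pi))        # diagonal of W^{1/2}
            n_i = len(y)
            R = np.full((n_i, n_i), alpha)
            np.fill_diagonal(R, 1.0)
            R_inv = np.linalg.inv(R)
            B = sw[:, None] * X                  # W^{1/2} X
            A += B.T @ R_inv @ B                 # X' W^{1/2} R^{-1} W^{1/2} X
            U += B.T @ R_inv @ ((y - pi) / sw)   # estimating function (1)
        step = np.linalg.solve(A, U)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta, alpha

# toy clustered data with a shared cluster effect inducing correlation
rng = np.random.default_rng(7)
X_list, y_list = [], []
for _ in range(30):
    x = rng.normal(size=3)
    u = rng.normal()                 # cluster-level effect
    pr = 1.0 / (1.0 + np.exp(-(0.5 * x + 0.5 * u)))
    X_list.append(np.column_stack([np.ones(3), x]))
    y_list.append((rng.random(3) < pr).astype(float))
beta_hat, alpha_hat = fit_gee_exchangeable(X_list, y_list)
```

With R_{i}(α) set to the identity (α = 0), the inner loop reduces to ordinary iteratively reweighted least squares for logistic regression, matching the remark above.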
Transferring Firth’s likelihood penalization to generalized estimating equations
The considerations on FL above imply three possible extensions of FL to GEE.
Penalized GEE (penGEE)
First, considering GEE as score equations leads to the following penalized GEE
$${\displaystyle \begin{array}{c}{U}^{\ast}\left(\beta, \alpha \right)=U\left(\beta, \alpha \right)+\frac{1}{2}\mathrm{trace}\ \left(I{\left(\beta, \alpha \right)}^{-1}\left(\frac{\partial I\left(\beta, \alpha \right)}{\partial \beta}\right)\right)=0,\kern0.5em (2)\end{array}}$$
where U(β, α) is the generalized estimating function given by the left-hand side of eq. (1) and I(β, α) = \(-E\left(\frac{\partial U\left(\beta, \alpha \right)}{\partial \beta}\right)\) [3, 4]. As with ordinary GEE, a solution to eq. (2) can be obtained by iterating between solving for β by the first step of a Newton-Raphson fitting algorithm given a value of α, and updating α using method-of-moments-based estimators. This motivates the following ‘penalized GEE’ algorithm:

1.
As initial estimate \({\hat{\beta}}^0\) for β use the coefficient estimate obtained by applying FL to the data ignoring the clustering structure.

2.
Given the estimate \({\hat{\beta}}^k\) for β, k ∈ ℕ, calculate the moment-based estimate \({\hat{\alpha}}^{k+1}\) for α. For example, in the setting of an exchangeable working correlation structure set
$${\hat{\alpha}}^{k+1}=\frac{1}{N}\sum_{i=1}^N\frac{1}{n_i\left({n}_i-1\right)}\sum_{j\ne l}^{n_i}{\hat{e}}_{ij}{\hat{e}}_{il},$$
where \({\hat{e}}_{ij}=\left({y}_{ij}-{\hat{\pi}}_{ij}\right)/\sqrt{{\hat{\pi}}_{ij}\left(1-{\hat{\pi}}_{ij}\right)}\). See Molenberghs and Verbeke [10] (p. 157) for the formulas for the independent, AR(1) and unstructured working correlation structures.

3.
Update the coefficient estimate by
$${\hat{\beta}}^{k+1}={\hat{\beta}}^k+I{\left({\hat{\beta}}^k,{\hat{\alpha}}^{k+1}\right)}^{-1}\cdot {U}^{\ast}\left({\hat{\beta}}^k,{\hat{\alpha}}^{k+1}\right).$$

4.
Repeat steps 2 and 3 until convergence is achieved.
Note that penalized GEE with independent working correlation structure results in the same coefficient estimates as FL.
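As a sketch of eq. (2) and the algorithm above, the following code performs penalized-GEE Newton steps for an exchangeable working correlation with α held fixed. Purely for illustration, the derivative ∂I/∂β in the trace penalty is approximated by central finite differences rather than computed analytically; the data, step limiting and names are our own.

```python
import numpy as np

def info_matrix(X_list, y_list, beta, alpha):
    """I(beta, alpha) = sum_i X_i' W_i^{1/2} R_i^{-1} W_i^{1/2} X_i
    for an exchangeable working correlation."""
    p = len(beta)
    I = np.zeros((p, p))
    for X, y in zip(X_list, y_list):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        sw = np.sqrt(pi * (1.0 - pi))
        n_i = X.shape[0]
        R = np.full((n_i, n_i), alpha)
        np.fill_diagonal(R, 1.0)
        B = sw[:, None] * X
        I += B.T @ np.linalg.solve(R, B)
    return I

def pen_gee_step(X_list, y_list, beta, alpha, eps=1e-5):
    """One Newton step of penalized GEE (eq. 2); the trace penalty
    1/2 trace(I^{-1} dI/d beta_m) uses central finite differences."""
    p = len(beta)
    U = np.zeros(p)
    for X, y in zip(X_list, y_list):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        sw = np.sqrt(pi * (1.0 - pi))
        n_i = X.shape[0]
        R = np.full((n_i, n_i), alpha)
        np.fill_diagonal(R, 1.0)
        B = sw[:, None] * X
        U += B.T @ np.linalg.solve(R, (y - pi) / sw)
    I_inv = np.linalg.inv(info_matrix(X_list, y_list, beta, alpha))
    pen = np.zeros(p)
    for m in range(p):
        d = np.zeros(p)
        d[m] = eps
        dI = (info_matrix(X_list, y_list, beta + d, alpha)
              - info_matrix(X_list, y_list, beta - d, alpha)) / (2 * eps)
        pen[m] = 0.5 * np.trace(I_inv @ dI)
    return beta + I_inv @ (U + pen)

# completely separated clustered toy data (events iff x > 0):
# ordinary GEE diverges, penalized GEE stays finite
X_list, y_list = [], []
for a, b in [(-3., -2.), (-2., -1.), (-1., -.5), (.5, 1.), (1., 2.), (2., 3.)]:
    X_list.append(np.column_stack([np.ones(2), [a, b]]))
    y_list.append(np.array([float(a > 0), float(b > 0)]))

beta = np.zeros(2)
for _ in range(50):
    beta_new = pen_gee_step(X_list, y_list, beta, alpha=0.2)
    # crude step limiting to stabilize this bare-bones illustration
    beta = beta + np.clip(beta_new - beta, -1.0, 1.0)
```

A full implementation would additionally update α between Newton steps, as in step 2 of the algorithm.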
Iterated augmented GEE (augGEE)
Second, we suggest extending FL to GEE by mimicking the data augmentation approach. The idea is to iterate between augmenting the data given the current estimates of β and α, and resolving the GEE using the augmented data to obtain new estimates. This gives rise to the following ‘iterated augmented GEE’ algorithm:

1.
As initial estimate \({\hat{\beta}}^0\) for β use the coefficient estimate obtained by applying FL to the data ignoring the clustering structure. As initial estimate \({\hat{\alpha}}^0\) for α use the value corresponding to a working correlation structure \(R\left({\hat{\alpha}}^0\right)\) equal to the identity matrix.

2.
Given the current estimates \({\hat{\beta}}^k\) of β and \({\hat{\alpha}}^k\) of α for k ∈ ℕ, calculate the block diagonal matrix \({H}^k=\operatorname{diag}\left({H}_1^k,\dots, {H}_N^k\right)\) with blocks \({H}_i^k={\Omega}_i^{1/2}{X}_i{\left({\sum}_{m=1}^N{X}_m^{\prime }{\Omega}_m{X}_m\right)}^{-1}{X}_i^{\prime }{\Omega}_i^{1/2}\), where \({\Omega}_i={W}_i^{1/2}{R}_i{\left({\hat{\alpha}}^k\right)}^{-1}{W}_i^{1/2}\) [11]. This block diagonal matrix generalizes the hat matrix \(\overset{\sim }{H}\) defined for independent data in Section 2.1.

3.
Similarly as in Section 2.1 create an augmented data set \(\left(\overset{\sim }{y},\overset{\sim }{X}\right)\), consisting of \(\overset{\sim }{N}=3N\) clusters with \(\tilde{n}_l={n}_{i_l}\) observations in the lth cluster, where i_{l} is the integer in {1, …, N } such that l ∈ {i_{l}, N + i_{l}, 2N + i_{l}}. The vector of covariates \(\tilde{x}_{lj},l=1,\dots, \overset{\sim }{N},j=1,\dots, \tilde{n}_l,\) for the jth observation in the lth cluster in the augmented data is set to \({x}_{i_lj}\). The outcome \(\tilde{y}_{lj}\) and the weights \(\tilde{w}_{lj}\) for this augmented data set are given by
$$\tilde{y}_{lj}=\left\{\begin{array}{c}{y}_{i_lj},\kern0.5em \mathrm{if}\ l={i}_l\ \mathrm{or}\ l=N+{i}_l,\\ {}1-{y}_{i_lj},\kern0.5em \mathrm{if}\ l=2N+{i}_l\end{array}\right.$$
and
$$\tilde{w}_{lj}=\left\{\begin{array}{c}1,\kern0.5em \mathrm{if}\ l={i}_l,\\ {}{h}_{i_lj}/2,\kern0.5em \mathrm{if}\ l=N+{i}_l\ \mathrm{or}\ l=2N+{i}_l,\end{array}\right.$$
where h_{ij} denotes the jth diagonal element of the ith block \({H}_i^k\) of H^{k}.

4.
Solve the GEE on the augmented data set \(\left(\overset{\sim }{y},\overset{\sim }{X}\right)\). Set \({\hat{\beta}}^{k+1}\) to the coefficient estimate and \({\hat{\alpha}}^{k+1}\) to the estimated parameter for the working correlation structure.

5.
Repeat steps 2, 3 and 4 until convergence is achieved. The final coefficient estimate \(\hat{\beta}\) and correlation parameter estimate \(\hat{\alpha}\) are the estimates from the last iteration.
In step 3 of the iterated augmented GEE algorithm one might think of alternative ways of defining the clustering structure for the augmented data set \(\left(\overset{\sim }{y},\overset{\sim }{X}\ \right)\). For instance, one could combine the ith, (N + i)th and (2N + i)th clusters, i ∈ {1, …, N}, resulting in an augmented data set consisting of only N instead of 3N clusters. However, creating two ‘pseudo clusters’ for each original cluster as suggested in the iterated augmented GEE algorithm best reflects the data augmentation approach in the situation of independent data, where the pseudo observations are also treated as independent observations representing the Jeffreys prior.
As the trace of the generalized hat matrix H calculated in step 2 of the iterated augmented GEE algorithm is always equal to p + 1 [11], i.e. the total weight of the ‘pseudo observations’ is p + 1, we conclude that the relative contribution of the pseudo observations becomes negligible with increasing sample size and that the algorithm yields consistent estimates.
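This trace property is easy to verify numerically. The following sketch builds the blocks \({H}_i\) of the generalized hat matrix for a random clustered design under an exchangeable working correlation (arbitrary simulated values, names our own):

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, alpha = 10, 2, 0.3
beta = rng.normal(size=p + 1)

blocks = []
XtOX = np.zeros((p + 1, p + 1))        # sum_i X_i' Omega_i X_i
for _ in range(N):
    n_i = int(rng.integers(2, 5))
    X = np.column_stack([np.ones(n_i), rng.normal(size=(n_i, p))])
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
    sw = np.diag(np.sqrt(pi * (1.0 - pi)))        # W_i^{1/2}
    R = np.full((n_i, n_i), alpha)
    np.fill_diagonal(R, 1.0)
    Omega = sw @ np.linalg.inv(R) @ sw            # W^{1/2} R^{-1} W^{1/2}
    # symmetric square root of Omega via its eigendecomposition
    lam, V = np.linalg.eigh(Omega)
    Omega_sqrt = V @ np.diag(np.sqrt(lam)) @ V.T
    blocks.append((X, Omega_sqrt))
    XtOX += X.T @ Omega @ X

XtOX_inv = np.linalg.inv(XtOX)
# trace of the block diagonal generalized hat matrix
trace_H = sum(np.trace(Os @ X @ XtOX_inv @ X.T @ Os)
              for X, Os in blocks)
```

By the cyclic property of the trace, each block contributes trace((∑_m X_m′Ω_mX_m)^{−1} X_i′Ω_iX_i), so the blocks sum to the trace of the (p + 1)-dimensional identity.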
Single-step augmented GEE (augGEE1)
Third, we investigated a simpler version of the augmented GEE algorithm which can be easily implemented whenever fitting algorithms for FL and weighted GEE are available. This ‘single-step augmented GEE’ algorithm consists of the following steps:

1.
Apply FL to the data ignoring the clustering and construct the corresponding augmented data set \(\left(\overset{\sim }{y},\overset{\sim }{X}\right)\) as described in section 2.1, i.e. using the ordinary hat matrix \(\overset{\sim }{H}\) in defining the weights, not the hat matrix H^{k} as in the iterated augmented GEE algorithm.
2. Define the clustering on \(\left(\overset{\sim }{y},\overset{\sim }{X}\right)\) by treating each of the three copies of the original data as independent data sets, similarly as in the iterated augmented GEE algorithm. In this way, the augmented data set \(\left(\overset{\sim }{y},\overset{\sim }{X}\right)\) contains 3N clusters.
3. Solve the GEE on the augmented data set \(\left(\overset{\sim }{y},\overset{\sim }{X}\right)\); this yields the final regression coefficient estimate \(\hat{\beta}\) and correlation parameter estimate \(\hat{\alpha}\).
In other words, single-step augmented GEE can be performed by stopping the iterated augmented GEE algorithm after the first outer iteration when using the ordinary hat matrix \(\overset{\sim }{H}\) instead of H^{0}. Note that applying iterated or single-step augmented GEE with an independent working correlation structure gives the same coefficient estimates as FL. The consistency of single-step augmented GEE can be shown analogously to the consistency of the iterated algorithm.
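Under an independence working correlation, step 3 reduces to weighted ML logistic regression on the augmented data, and the algorithm reproduces the FL estimate exactly. The following self-contained sketch demonstrates this (illustrative code, not the logistf/geem2 implementation):

```python
import numpy as np

def weighted_logistic(X, y, w_obs, max_iter=200, tol=1e-10):
    """Weighted ML logistic regression; with observation weights used
    as scale factors this equals GEE with independence correlation."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = w_obs * pi * (1.0 - pi)
        step = np.linalg.solve(X.T @ (W[:, None] * X),
                               X.T @ (w_obs * (y - pi)))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def firth_logistic(X, y, max_iter=100, tol=1e-10):
    """Firth-penalized logistic regression (modified-score iteration)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = pi * (1.0 - pi)
        xtwx_inv = np.linalg.inv(X.T @ (w[:, None] * X))
        h = w * np.einsum('ij,jk,ik->i', X, xtwx_inv, X)
        step = xtwx_inv @ (X.T @ (y - pi + h * (0.5 - pi)))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# step 1: FL fit on separated toy data and the augmented data set
X = np.column_stack([np.ones(6), [-3., -2., -1., 1., 2., 3.]])
y = np.array([0., 0., 0., 1., 1., 1.])
beta_fl = firth_logistic(X, y)
pi = 1.0 / (1.0 + np.exp(-(X @ beta_fl)))
w = pi * (1.0 - pi)
h = w * np.einsum('ij,jk,ik->i', X,
                  np.linalg.inv(X.T @ (w[:, None] * X)), X)
X_aug = np.vstack([X, X, X])
y_aug = np.concatenate([y, y, 1.0 - y])
w_aug = np.concatenate([np.ones(6), h / 2.0, h / 2.0])

# steps 2-3: solving the (independence) GEE on the augmented data,
# started from zero, recovers the FL coefficient estimate
beta_aug = weighted_logistic(X_aug, y_aug, w_aug)
```

Because every covariate row appears in the augmented data with both outcome values and positive weight, the augmented data are never separated, so the GEE fit in step 3 always has a finite solution.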
Estimating the variance-covariance matrix
With GEE, a consistent estimate of the asymptotic variance-covariance matrix of the regression coefficients is given by the so-called sandwich estimate
$$S\left(\hat{\beta}\right)={I}_0{\left(\hat{\beta}\right)}^{-1}\ {I}_1\left(\hat{\beta}\right)\ {I}_0{\left(\hat{\beta}\right)}^{-1},$$
where
$${I}_0\left(\hat{\beta}\right)=\sum_{i=1}^N{X}_i^{\prime }{\hat{W}}_i\ {\hat{V}}_i^{-1}{\hat{W}}_i{X}_i,\kern7.75em {I}_1\left(\hat{\beta}\right)=\sum_{i=1}^N{d}_i{d}_i^{\prime },$$
\({d}_i={X}_i^{\prime }{\hat{W}}_i{\hat{V}}_i^{-1}\left({y}_i-{\hat{\uppi}}_i\right)\) and \({V}_i={W}_i^{1/2}{R}_i\left(\alpha \right){W}_i^{1/2}\) is the working variance-covariance matrix [10]. With small samples, the sandwich estimator is known to underestimate the variance [12]. We will use the small-sample correction proposed by Morel et al. [13] to correct for this underestimation. The corrected variance-covariance matrix \({S}^{\ast}\left(\hat{\beta}\right)\) is defined as
$${S}^{\ast}\left(\hat{\beta}\right)={I}_0{\left(\hat{\beta}\right)}^{-1}\ {I}_1^{\ast}\left(\hat{\beta}\right)\ {I}_0{\left(\hat{\beta}\right)}^{-1}+\delta \cdot \phi \cdot {I}_0{\left(\hat{\beta}\right)}^{-1}$$
with
$${I}_1^{\ast}\left(\hat{\beta}\right)=\frac{N^{\ast }-1}{N^{\ast }-p-1}\cdot \frac{N}{N-1}\cdot \sum_{i=1}^N\left({d}_i-\overline{d}\right){\left({d}_i-\overline{d}\right)}^{\prime },$$
$$\overline{d}=\frac{1}{N}\sum_{i=1}^N{d}_i,$$
N* = ∑_{i}n_{i} is the total number of observations, \(\delta =\min \left(0.5,\frac{p+1}{N-p-1}\right)\) and \(\phi =\max \left(1,\frac{\mathrm{trace}\left(\alpha \cdot {I}_0{\left(\hat{\beta}\right)}^{-1}{I}_1^{\ast}\left(\hat{\beta}\right)\right)}{p+1}\right)\). We recommend calculating the variance-covariance matrices for the two augmented GEE approaches by applying the sandwich estimator with small-sample correction to the original data (y, X) using the parameter estimates \(\hat{\beta}\) and \(\hat{\alpha}\). Note that this differs from the sandwich estimates of the variance-covariance matrix returned by the last GEE fits in the two augmented GEE algorithms, as there the sandwich estimator is applied to the augmented data set \(\left(\overset{\sim }{y},\overset{\sim }{X}\right)\).
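The uncorrected sandwich estimate S(β̂) can be sketched as follows for an exchangeable working correlation; the small-sample correction described above then only rescales I₁ and adds the extra term to the result. Illustrative code with toy data of our own:

```python
import numpy as np

def sandwich(X_list, y_list, beta, alpha):
    """Uncorrected sandwich estimate S = I0^{-1} I1 I0^{-1} for a
    marginal logistic model with exchangeable working correlation."""
    p = len(beta)
    I0 = np.zeros((p, p))
    I1 = np.zeros((p, p))
    for X, y in zip(X_list, y_list):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        sw = np.sqrt(pi * (1.0 - pi))
        n_i = X.shape[0]
        R = np.full((n_i, n_i), alpha)
        np.fill_diagonal(R, 1.0)
        B = sw[:, None] * X                      # W^{1/2} X
        # X' W V^{-1} W X  with  V = W^{1/2} R W^{1/2}
        I0 += B.T @ np.linalg.solve(R, B)
        # d_i = X' W V^{-1} (y - pi)
        d = B.T @ np.linalg.solve(R, (y - pi) / sw)
        I1 += np.outer(d, d)
    I0_inv = np.linalg.inv(I0)
    return I0_inv @ I1 @ I0_inv

# toy data: 20 clusters of size 2, evaluated at an arbitrary beta
rng = np.random.default_rng(5)
X_list, y_list = [], []
for _ in range(20):
    x = rng.normal(size=2)
    X_list.append(np.column_stack([np.ones(2), x]))
    y_list.append((rng.random(2) < 1.0 / (1.0 + np.exp(-x))).astype(float))
S = sandwich(X_list, y_list, np.array([0.0, 1.0]), alpha=0.2)
```

The square roots of the diagonal of S (or of its corrected counterpart S*) give the standard errors used below.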
Any 95% confidence intervals reported in the following for ordinary GEE, penalized GEE, and single-step and iterated augmented GEE were calculated by multiplying the standard errors derived from the small-sample corrected sandwich estimate of the variance-covariance matrix with the 0.025 and 0.975 quantiles of the t-distribution with degrees of freedom equal to the number of clusters.
Implementation
For penalized GEE, we modified the R package ‘geefirthr’ available on GitHub at https://github.com/mhmondol/geefirthr. The version of ‘geefirthr’ available on GitHub cannot handle clusters consisting of single observations and additionally estimates a scale parameter. Our modified version of ‘geefirthr’, which can also deal with clusters containing one observation and sets the scale parameter to 1, can be found at https://github.com/heogden/geefirthr. To implement the single-step and the iterated augmented GEE algorithms, we combined the function logistf in the R package ‘logistf’ version 1.24.1 [14], which implements FL, and the function geem2 in the R package ‘mmmgee’ version 1.20 [15], which implements weighted GEE. One should be aware that there are different ways of implementing weighted GEE. In the function geem2 the weights act as a scale factor for each observation, similarly as implemented in PROC GEE in SAS© [16]. For the ‘outer loop’ of the iterated augmented GEE approach we used the same convergence criterion as implemented in the R package ‘geefirthr’, i.e. we declared the algorithm converged if \(\underset{m=0,\dots, p}{\max}\mid {\hat{\beta}}_m^{k+1}-{\hat{\beta}}_m^k\mid <0.001\) within k ≤ 20 iterations. We also used this convergence criterion for all GEE fits performed within the two augmented GEE approaches, allowing for 30 iterations. To make the single-step augmented GEE algorithm faster and to facilitate its convergence, we used the FL coefficient estimate obtained in the first step of the algorithm as starting value for the GEE fit performed in the third step. Similarly, with iterated augmented GEE we used the coefficient estimate from the previous iteration as starting value for the GEE fit in the current iteration. R code for the single-step and iterated augmented GEE can be found at https://github.com/AngelikaGeroldinger/augGEE.