Separation and Firth’s logistic regression
Logistic regression models the probability that Yi = 1 for independent binary outcomes Yi, i = 1, …, N, with observed values yi ∈ {0, 1}, given (p + 1)-dimensional row vectors of covariate values xi = (1, xi1, …, xip), by assuming
$$P\left({Y}_i=1|{x}_i\right)={\left(1+\exp \left(-{x}_i\beta \right)\right)}^{-1}={\pi}_i\left(\beta \right),$$
where β = (β0, β1, …, βp)′ is a vector of regression coefficients. Maximum likelihood estimates for β can be obtained by solving the score equations. Albert and Anderson showed that finite maximum likelihood estimates exist if and only if the data are not separable, i.e. if there is no hyperplane defined by a linear combination of covariates that separates events from non-events [1]. There are two main drivers for the occurrence of separation: the presence of strong effects and a low amount of information in the data, manifested, e.g., by a small sample size or by rare occurrence of one of the two levels of the outcome variable or of a binary covariate. FL, which was originally proposed to reduce the bias of maximum likelihood estimates [2], has been advocated as a reliable alternative to maximum likelihood estimation [5] as it always provides finite coefficient estimates. With FL, the likelihood is penalized by the Jeffreys invariant prior, i.e. by the square root of the determinant of the Fisher information matrix, \({\left|I\left(\beta \right)\right|}^{1/2}\). At the level of the score equations, FL adds \(\frac{1}{2}\operatorname{trace}\left(I{\left(\beta \right)}^{-1}\ \partial I\left(\beta \right)/\partial \beta \right)\) to the first derivative of the log-likelihood. By proving that the logarithm of the Jeffreys prior, \(\frac{1}{2}\log \left|I\left(\beta \right)\right|\), tends to −∞ if any of the components of β tends to ±∞, Kosmidis and Firth showed that the penalized log-likelihood is always maximized by a finite β, i.e. FL always provides finite coefficient estimates [6]. Another convenient property of FL, which is, for instance, not shared by some other proposals for penalized likelihood logistic regression, is that FL is transformation invariant. This means that if G is an invertible (p + 1) × (p + 1) matrix, \({\hat{\beta}}_X\) the FL coefficient estimate for the design matrix X with rows xi, i = 1, …, N, and \({\hat{\beta}}_{XG}\) the FL coefficient estimate for the design matrix X · G, then \({\hat{\beta}}_{XG}={G}^{-1}\cdot {\hat{\beta}}_X\).
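As a brief illustration (a minimal sketch using an invented toy data set; the variable names are not from any study data), FL can be fitted with the logistf function from the ‘logistf’ package even when the data are completely separated:

```r
# Minimal illustration of FL with the 'logistf' package on a small data set
# with complete separation (the binary covariate x separates events and
# non-events perfectly, so finite maximum likelihood estimates do not exist);
# the data are invented for illustration only.
library(logistf)

dat <- data.frame(y = c(0, 0, 0, 0, 1, 1, 1, 1),
                  x = c(0, 0, 0, 0, 1, 1, 1, 1))

# glm(y ~ x, family = binomial, data = dat) would report a separation warning
# and return diverging coefficients; logistf gives finite penalized estimates
fit <- logistf(y ~ x, data = dat)
summary(fit)
```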
FL is equivalent to maximum likelihood estimation with an appropriately augmented and weighted data set \(\left(\overset{\sim }{y},\overset{\sim }{X}\right)\) containing 3N observations, basically obtained by stacking three copies of the original data set [7]. For j ∈ {1, …, 3N} we denote by ij the integer in {1, …, N} such that j ∈ {ij, N + ij, 2N + ij}. The covariate row vectors \(\tilde{x}_j,\) outcomes \(\tilde{y}_j\) and the weights \(\tilde{w}_j\), j ∈ {1, …, 3N}, of the augmented data set \(\left(\overset{\sim }{y},\overset{\sim }{X}\right)\) are defined by \(\tilde{x}_j:= {x}_{i_j},\)
$$\tilde{y}_j:= \left\{\begin{array}{c}{y}_{i_j},\kern0.5em \mathrm{if}\ j={i}_j\ \mathrm{or}\ j=N+{i}_j,\\ {}1-{y}_{i_j},\kern0.5em \mathrm{if}\ j=2N+{i}_j\end{array}\right.$$
and
$$\tilde{w}_j:= \left\{\begin{array}{c}1,\kern0.5em \mathrm{if}\ j={i}_j,\\ {}{h}_{i_j}/2,\kern0.5em \mathrm{if}\ j=N+{i}_j\ \mathrm{or}\ j=2N+{i}_j,\end{array}\right.$$
where hi denotes the i-th diagonal element of the hat matrix \(\overset{\sim }{H}={W}^{1/2}X\ {\left({X}^{\prime } WX\right)}^{-1}{X}^{\prime }{W}^{1/2}\), with X the N × (p + 1) design matrix and W the diagonal matrix with elements πi(1 − πi).
Since the contribution of the data augmentation is asymptotically negligible, approximate standard errors of the estimates can be obtained as the square roots of the diagonal elements of (X′WX)−1, where the elements \({\hat{\pi}}_i\left(1-{\hat{\pi}}_i\right)\) in the N × N matrix W are obtained using the coefficient estimates from the final iteration.
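The data augmentation representation can be illustrated directly in R. The following minimal sketch reuses dat and fit from the example above; it holds the hat diagonals fixed at the FL solution and refits by weighted maximum likelihood, which therefore reproduces (rather than re-derives) the FL estimates. All object names are illustrative.

```r
# Minimal sketch of the data augmentation representation of FL: stack three
# copies of the data, give the original rows weight 1 and the two extra copies
# weight h_i/2, reverse the outcome in the third copy, and refit by weighted
# maximum likelihood with the weights held fixed at the FL solution.
h <- fit$hat.diag                         # diagonal elements h_i of the hat matrix

aug <- rbind(transform(dat, w = 1),
             transform(dat, w = h / 2),
             transform(dat, w = h / 2, y = 1 - y))

# glm() warns about non-integer weights with the binomial family; this warning
# is expected here and can be ignored
fit_aug <- glm(y ~ x, family = binomial, data = aug, weights = w)
cbind(FL = fit$coefficients, augmented_ML = coef(fit_aug))
```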
Generalized estimating equations
We will now slightly adapt our notation to the situation of clustered data. Let N be the number of clusters and let ni be the number of observations in the i-th cluster. We denote by \({y}_i={\left({y}_{i1},\dots, {y}_{in_i}\right)}^{\prime }\) the ni-dimensional vector of binary outcomes in the i-th cluster, by xij = (1, xij1, …, xijp)′ the (p + 1)-dimensional vector of covariates for the j-th observation in the i-th cluster and by Xi the (ni × (p + 1))-dimensional matrix with rows \({x}_{ij}^{\prime }\), j = 1, …, ni, for i = 1, …, N. The marginal logistic regression model relates the conditional expectation P(Yij = 1| xij) to the vector of covariates by assuming \(P\left({Y}_{ij}=1|{x}_{ij}\right)={\left(1+\exp \left(-{x}_{ij}^{\prime}\beta \right)\right)}^{-1}\) for a (p + 1)-dimensional vector of parameters β = (β0, β1, …, βp)′ [8]. Using a working correlation structure expressed as an (ni × ni)-dimensional matrix Ri(α) with some parameter vector α, Liang and Zeger [9] showed that the parameters β can be consistently estimated by solving the generalized estimating equations
$$\sum_{i=1}^N{X_i}^{\prime }{W}_i\ {\left({W}_i^{1/2}{R}_i\left(\alpha \right){W}_i^{1/2}\right)}^{-1}\left({y}_i-{\pi}_i\right)=0,\kern0.5em (1)$$
where πi is the ni-dimensional vector with components \({\pi}_{ij}={\left(1+\exp \left(-{x}_{ij}^{\prime}\beta \right)\right)}^{-1}\) and Wi is the (ni × ni)-dimensional diagonal matrix with elements πij(1 − πij). Assuming an independent working correlation, i.e. setting Ri(α) to the identity matrix, the estimating equations (1) reduce to the score equations for logistic regression with independent observations.
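For illustration, a marginal model of this form can be fitted by ordinary GEE with standard software. The following minimal sketch uses the ‘geepack’ package (not the package used in our implementation below) and assumes a data frame cdat with binary outcome y, covariates x1 and x2 and a cluster identifier id; all names are illustrative.

```r
# Minimal sketch: ordinary GEE fit of the marginal logistic model with an
# exchangeable working correlation, using the 'geepack' package.
library(geepack)

# geeglm expects the rows of cdat to be grouped by cluster
fit_gee <- geeglm(y ~ x1 + x2, family = binomial, data = cdat,
                  id = id, corstr = "exchangeable")
summary(fit_gee)
```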
Transferring Firth’s likelihood penalization to generalized estimating equations
The considerations on FL above suggest three possible ways of extending FL to GEE.
Penalized GEE (penGEE)
First, viewing the generalized estimating equations as score equations leads to the following penalized GEE:
$${\displaystyle \begin{array}{c}{U}^{\ast}\left(\beta, \alpha \right)=U\left(\beta, \alpha \right)+\frac{1}{2}\mathrm{trace}\ \left(I{\left(\beta, \alpha \right)}^{-1}\left(\frac{\partial I\left(\beta, \alpha \right)}{\partial \beta}\right)\right)=0, (2)\end{array}}$$
where U(β, α) is the generalized estimating function given by the left-hand side of eq. (1) and I(β, α)= \(-E\left(\frac{\partial U\left(\beta, \alpha \right)}{\partial \beta}\right)\) [3, 4]. As with ordinary GEE, a solution to eq. (2) can be obtained by iterating between a single Newton-Raphson step for β given the current value of α and a moment-based update of α. This motivates the following ‘penalized GEE’ algorithm:
1. As initial estimate \({\hat{\beta}}^0\) for β use the coefficient estimate obtained by applying FL to the data ignoring the clustering structure.
2. Given the estimate \({\hat{\beta}}^k\) for β, k ∈ ℕ, calculate the moment-based estimate \({\hat{\alpha}}^{k+1}\) for α. For example, in the setting of an exchangeable working correlation structure set
$${\hat{\alpha}}^{k+1}=\frac{1}{N}\sum_{i=1}^N\frac{1}{n_i\left({n}_i-1\right)}\sum_{j\ne k}^{n_i}{\hat{e}}_{ij}{\hat{e}}_{ik},$$
where \({\hat{e}}_{ij}=\left({y}_{ij}-{\hat{\pi}}_{ij}\right)/\sqrt{{\hat{\pi}}_{ij}\left(1-{\hat{\pi}}_{ij}\right)}\). See Molenberghs and Verbeke [10] (p. 157) for the corresponding formulas for the independent, AR(1) and unstructured working correlation structures; a minimal R sketch of this update for the exchangeable case is given after the algorithm.
3. Update the coefficient estimate by
$${\hat{\beta}}^{k+1}={\hat{\beta}}^k+I{\left({\hat{\beta}}^k,{\hat{\alpha}}^{k+1}\right)}^{-1}\cdot {U}^{\ast}\left({\hat{\beta}}^k,{\hat{\alpha}}^{k+1}\right).$$
4. Repeat steps 2 and 3 until convergence is achieved.
Note that penalized GEE with independent working correlation structure results in the same coefficient estimates as FL.
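To make step 2 concrete, the following minimal R sketch implements the moment-based update of α for an exchangeable working correlation structure. It assumes that every cluster contains at least two observations and that the Pearson residuals \({\hat{e}}_{ij}\) have already been computed from the current estimate \({\hat{\beta}}^k\); all object names are illustrative.

```r
# Moment-based update of alpha for an exchangeable working correlation.
# resid_list: list of Pearson residual vectors e_hat_i, one per cluster
# (every cluster is assumed to contain at least two observations).
update_alpha_exch <- function(resid_list) {
  per_cluster <- vapply(resid_list, function(e) {
    n_i <- length(e)
    (sum(e)^2 - sum(e^2)) / (n_i * (n_i - 1))  # sum over j != k of e_j * e_k, normalized
  }, numeric(1))
  mean(per_cluster)                            # average over the N clusters
}
```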
Iterated augmented GEE (augGEE)
Second, we suggest extending FL to GEE by mimicking the data augmentation approach. The idea is to iterate between augmenting the data given the current estimates of β and α, and re-solving the GEE using the augmented data to obtain new estimates. This gives rise to the following ‘iterated augmented GEE’ algorithm:
1. As initial estimate \({\hat{\beta}}^0\) for β use the coefficient estimate obtained by applying FL to the data ignoring the clustering structure. As initial estimate \({\hat{\alpha}}^0\) for α use the value corresponding to a working correlation structure \(R\left({\hat{\alpha}}^0\right)\) equal to the identity matrix.
2. Given the current estimates \({\hat{\beta}}^k\) of β and \({\hat{\alpha}}^k\) of α, k ∈ ℕ, calculate the block diagonal matrix \({H}^k=\operatorname{diag}\left({H}_1^k,\dots, {H}_N^k\right)\) with blocks \({H}_i^k={\Omega}_i^{1/2}{X}_i{\left({X}^{\prime}\Omega X\right)}^{-1}{X}_i^{\prime }{\Omega}_i^{1/2}\), where \({\Omega}_i={W}_i^{1/2}{R}_i{\left({\hat{\alpha}}^k\right)}^{-1}{W}_i^{1/2}\), \(\Omega =\operatorname{diag}\left({\Omega}_1,\dots, {\Omega}_N\right)\) and X is the design matrix obtained by stacking \({X}_1,\dots, {X}_N\) [11]. This block diagonal matrix generalizes the hat matrix \(\overset{\sim }{H}\) defined for independent data in Section 2.1.
3. As in Section 2.1, create an augmented data set \(\left(\overset{\sim }{y},\overset{\sim }{X}\right)\) consisting of \(\overset{\sim }{N}=3N\) clusters with \(\tilde{n}_l={n}_{i_l}\) observations in the l-th cluster, where il is the integer in {1, …, N} such that l ∈ {il, N + il, 2N + il}. The vector of covariates \(\tilde{x}_{lj},l=1,\dots, \overset{\sim }{N},j=1,\dots, \tilde{n}_l,\) for the j-th observation in the l-th cluster of the augmented data is set to \({x}_{i_lj}\). The outcomes \(\tilde{y}_{lj}\) and the weights \(\tilde{w}_{lj}\) of this augmented data set are given by
$$\tilde{y}_{lj}=\left\{\begin{array}{c}{y}_{i_lj},\kern0.5em \mathrm{if}\ l={i}_l\ \mathrm{or}\ l=N+{i}_l,\\ {}1-{y}_{i_lj},\kern0.5em \mathrm{if}\ l=2N+{i}_l\end{array}\right.$$
and
$$\tilde{w}_{lj}=\left\{\begin{array}{c}1,\kern0.5em \mathrm{if}\ l={i}_l,\\ {}{h}_{i_lj}/2,\kern0.5em \mathrm{if}\ l=N+{i}_l\ \mathrm{or}\ l=2N+{i}_l,\end{array}\right.$$
where hij denotes the j-th diagonal element of the i-th block \({H}_i^k\) of \({H}^k\).
4. Solve the weighted GEE on the augmented data set \(\left(\overset{\sim }{y},\overset{\sim }{X}\right)\) with weights \(\tilde{w}_{lj}\). Set \({\hat{\beta}}^{k+1}\) to the resulting coefficient estimate and \({\hat{\alpha}}^{k+1}\) to the estimated parameter of the working correlation structure.
5. Repeat steps 2, 3 and 4 until convergence is achieved. The final coefficient estimate \(\hat{\beta}\) and correlation parameter estimate \(\hat{\alpha}\) are the estimates from the last iteration.
In step 3 of the iterated augmented GEE algorithm one might think of alternative ways of defining the clustering structure for the augmented data set \(\left(\overset{\sim }{y},\overset{\sim }{X}\ \right)\). For instance, one could combine the i-th, (N + i)-th and (2N + i)-th clusters, i ∈ {1, …, N}, resulting in an augmented data set consisting of only N instead of 3N clusters. However, creating two ‘pseudo clusters’ for each original cluster as suggested in the iterated augmented GEE algorithm best reflects the data augmentation approach in the situation of independent data, where the pseudo observations are also treated as independent observations representing the Jeffreys prior.
As the trace of the generalized hat matrix \({H}^k\) calculated in step 2 of the iterated augmented GEE algorithm is always equal to p + 1 [11], i.e. the total weight of the ‘pseudo observations’ is p + 1, we conclude that the relative contribution of the pseudo observations becomes negligible with increasing sample size and that the algorithm yields consistent estimates.
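Steps 2 and 3 of the iterated augmented GEE algorithm can be sketched in R as follows, here for an exchangeable working correlation. X_list and y_list hold the per-cluster design matrices (first column the intercept) and outcome vectors, beta_k and alpha_k are the current estimates; all object names are illustrative rather than taken from any package.

```r
# Minimal sketch of steps 2 and 3: compute the generalized leverages h_ij and
# build the augmented, weighted, clustered data set.
augment_clusters <- function(X_list, y_list, beta_k, alpha_k) {
  N <- length(X_list)
  Omega_list <- vector("list", N)
  info <- 0                                      # X' Omega X summed over clusters
  for (i in seq_len(N)) {
    Xi <- X_list[[i]]
    ni <- nrow(Xi)
    pii <- 1 / (1 + exp(-as.vector(Xi %*% beta_k)))
    Wi_half <- diag(sqrt(pii * (1 - pii)), nrow = ni)
    Ri <- matrix(alpha_k, ni, ni); diag(Ri) <- 1 # exchangeable R_i(alpha)
    Omega_list[[i]] <- Wi_half %*% solve(Ri) %*% Wi_half
    info <- info + t(Xi) %*% Omega_list[[i]] %*% Xi
  }
  info_inv <- solve(info)
  rows <- vector("list", 3 * N)
  for (i in seq_len(N)) {
    Xi <- X_list[[i]]; yi <- y_list[[i]]; ni <- nrow(Xi)
    # symmetric square root of Omega_i via its eigendecomposition
    eo <- eigen(Omega_list[[i]], symmetric = TRUE)
    Oi_half <- eo$vectors %*% diag(sqrt(pmax(eo$values, 0)), nrow = ni) %*% t(eo$vectors)
    h_i <- diag(Oi_half %*% Xi %*% info_inv %*% t(Xi) %*% Oi_half)  # leverages h_ij
    covs <- as.data.frame(Xi[, -1, drop = FALSE])                   # drop intercept column
    rows[[i]]         <- cbind(id = i,         y = yi,     w = 1,       covs)
    rows[[N + i]]     <- cbind(id = N + i,     y = yi,     w = h_i / 2, covs)
    rows[[2 * N + i]] <- cbind(id = 2 * N + i, y = 1 - yi, w = h_i / 2, covs)
  }
  do.call(rbind, rows)                           # augmented data with 3N clusters
}
```

The returned data frame (cluster identifier id, outcome y, weight w and the covariates) would then be passed to a weighted GEE fit in step 4, and the whole procedure iterated until convergence.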
Single-step augmented GEE (augGEE1)
Third, we investigated a simpler version of the augmented GEE algorithm which can be easily implemented whenever fitting algorithms for FL and weighted GEE are available. This ‘single-step augmented GEE’ algorithm consists of the following steps:
1. Apply FL to the data ignoring the clustering and construct the corresponding augmented data set \(\left(\overset{\sim }{y},\overset{\sim }{X}\right)\) as described in Section 2.1, i.e. using the ordinary hat matrix \(\overset{\sim }{H}\) to define the weights, not the generalized hat matrix \({H}^k\) as in the iterated augmented GEE algorithm.
2. Define the clustering on \(\left(\overset{\sim }{y},\overset{\sim }{X}\right)\) by keeping the original clustering within each of the three copies of the original data and treating the copies as independent of each other, as in the iterated augmented GEE algorithm. In this way, the augmented data set \(\left(\overset{\sim }{y},\overset{\sim }{X}\right)\) contains 3N clusters.
3. Solve the weighted GEE on the augmented data set \(\left(\overset{\sim }{y},\overset{\sim }{X}\right)\); this gives the final regression coefficient estimate \(\hat{\beta}\) and correlation parameter estimate \(\hat{\alpha}\).
In other words, single-step augmented GEE amounts to stopping the iterated augmented GEE algorithm after the first outer iteration while using the ordinary hat matrix \(\overset{\sim }{H}\) instead of \({H}^0\). Note that applying iterated or single-step augmented GEE with an independent working correlation structure gives the same coefficient estimates as FL. The consistency of single-step augmented GEE can be shown analogously to that of the iterated algorithm.
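A minimal sketch of the single-step augmented GEE algorithm, combining the ‘logistf’ and ‘mmmgee’ packages, is given below. It assumes a data frame cdat with binary outcome y, covariates x1 and x2 and numeric cluster ids 1, …, N in a column id, and it assumes that geem2 accepts the geeM-style arguments id, weights and init.beta; these argument names are stated as assumptions for illustration, not as a verified interface description.

```r
library(logistf)
library(mmmgee)

# Step 1: FL ignoring the clustering; h holds the hat matrix diagonal
fl <- logistf(y ~ x1 + x2, data = cdat)
h  <- fl$hat.diag

# Step 2: augmented data with 3N clusters; the original rows keep weight 1,
# the two extra copies get weight h/2, with the outcome reversed in the third
N_cl <- max(cdat$id)
aug <- rbind(
  transform(cdat, w = 1),
  transform(cdat, w = h / 2, id = id + N_cl),
  transform(cdat, w = h / 2, id = id + 2 * N_cl, y = 1 - y)
)

# Step 3: weighted GEE on the augmented data, started at the FL estimate
fit <- geem2(y ~ x1 + x2, id = id, data = aug, family = binomial,
             corstr = "exchangeable", weights = w, init.beta = fl$coefficients)
```

The iterated variant would recompute the leverages from the generalized hat matrix \({H}^k\) at the updated estimates and repeat the augmentation and the weighted GEE fit until convergence.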
Estimating the variance-covariance matrix
With GEE, a consistent estimate of the asymptotic variance-covariance matrix of the regression coefficients is given by the so-called sandwich estimator
$$S\left(\hat{\beta}\right)={I}_0{\left(\hat{\beta}\right)}^{-1}\ {I}_1\left(\hat{\beta}\right){I}_0{\left(\hat{\beta}\right)}^{-1},$$
where
$${I}_0\left(\hat{\beta}\right)=\sum_{i=1}^N{X}_i^{\prime }{\hat{W}}_i\ {\hat{V}}_i^{-1}{\hat{W}}_i{X}_i,\qquad {I}_1\left(\hat{\beta}\right)=\sum_{i=1}^N{d}_i{d}_i^{\prime },$$
\({d}_i={X}_i^{\prime }{\hat{W}}_i{\hat{V}}_i^{-1}\left({y}_i-{\hat{\pi}}_i\right)\) and \({V}_i={W}_i^{1/2}{R}_i\left(\alpha \right){W}_i^{1/2}\) is the working variance-covariance matrix [10]. With small samples, the sandwich estimator is known to underestimate the variance [12]. We therefore use the small-sample correction proposed by Morel et al. [13]. The corrected variance-covariance matrix \({S}^{\ast}\left(\hat{\beta}\right)\) is defined as
$${S}^{\ast}\left(\hat{\beta}\right)={I}_0{\left(\hat{\beta}\right)}^{-1}\ {I}_1^{\ast}\left(\hat{\beta}\right){I}_0{\left(\hat{\beta}\right)}^{-1}+\delta \cdot \phi \cdot {I}_0{\left(\hat{\beta}\right)}^{-1}$$
with
$${I}_1^{\ast}\left(\hat{\beta}\right)=\frac{N^{\ast }-1}{N^{\ast }-p-1}\cdot \frac{N}{N-1}\cdot \sum_{i=1}^N\left({d}_i-\overline{d}\right){\left({d}_i-\overline{d}\right)}^{\prime },$$
$$\overline{d}=\frac{1}{N}\sum_{i=1}^N{d}_i,$$
\({N}^{\ast }=\sum_i{n}_i\) is the total number of observations, \(\delta =\min \left(0.5,\frac{p+1}{N-p-1}\right)\) and \(\phi =\max \left(1,\frac{\mathrm{trace}\left(\alpha \cdot {I}_0{\left(\hat{\beta}\right)}^{-1}{I}_1^{\ast}\left(\hat{\beta}\right)\right)}{p+1}\right)\). We recommend calculating the variance-covariance matrices for the two augmented GEE approaches by applying the sandwich estimator with small-sample correction to the original data (y, X) using the parameter estimates \(\hat{\beta}\) and \(\hat{\alpha}\). Note that this differs from the sandwich estimates of the variance-covariance matrix which accompany the final GEE fits in the two augmented GEE algorithms, as there the sandwich estimator is applied to the augmented data set \(\left(\overset{\sim }{y},\overset{\sim }{X}\right)\).
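For illustration, the following minimal R sketch transcribes the above formulas for an exchangeable working correlation with scalar parameter α; X_list and y_list hold the per-cluster design matrices and outcome vectors, beta_hat and alpha are the final parameter estimates, and all object names are illustrative rather than taken from any package.

```r
# Minimal sketch: sandwich estimate and small-sample corrected version,
# computed directly from the formulas given in the text.
corrected_sandwich <- function(X_list, y_list, beta_hat, alpha) {
  N     <- length(X_list)                          # number of clusters
  p1    <- length(beta_hat)                        # p + 1 regression parameters
  Nstar <- sum(vapply(y_list, length, integer(1))) # total number of observations
  I0 <- matrix(0, p1, p1)
  d  <- matrix(0, p1, N)                           # columns hold d_i
  for (i in seq_len(N)) {
    Xi <- X_list[[i]]; yi <- y_list[[i]]; ni <- length(yi)
    pii <- 1 / (1 + exp(-as.vector(Xi %*% beta_hat)))
    Wi <- diag(pii * (1 - pii), nrow = ni)
    Ri <- matrix(alpha, ni, ni); diag(Ri) <- 1     # exchangeable R_i(alpha)
    Vi_inv <- solve(sqrt(Wi) %*% Ri %*% sqrt(Wi))  # inverse working covariance
    I0 <- I0 + t(Xi) %*% Wi %*% Vi_inv %*% Wi %*% Xi
    d[, i] <- t(Xi) %*% Wi %*% Vi_inv %*% (yi - pii)
  }
  I0_inv <- solve(I0)
  S <- I0_inv %*% (d %*% t(d)) %*% I0_inv          # uncorrected sandwich estimate
  dc <- sweep(d, 1, rowMeans(d))                   # centred d_i
  I1_star <- (Nstar - 1) / (Nstar - p1) * N / (N - 1) * dc %*% t(dc)
  delta <- min(0.5, p1 / (N - p1))
  phi <- max(1, sum(diag(alpha * I0_inv %*% I1_star)) / p1)  # phi as defined above
  S_star <- I0_inv %*% I1_star %*% I0_inv + delta * phi * I0_inv
  list(sandwich = S, corrected = S_star)
}
```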
All 95% confidence intervals reported in the following for ordinary GEE, penalized GEE, and single-step and iterated augmented GEE were calculated as the coefficient estimate plus the standard error derived from the small-sample corrected sandwich estimate of the variance-covariance matrix multiplied by the 0.025 and 0.975 quantiles of the t-distribution with degrees of freedom equal to the number of clusters.
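A minimal sketch of this confidence interval computation, with illustrative argument names, is:

```r
# Wald-type 95% limits based on the small-sample corrected sandwich standard
# errors and a t-distribution with the number of clusters as degrees of freedom.
ci_t <- function(beta_hat, se_corr, n_clusters, level = 0.95) {
  q <- qt(c((1 - level) / 2, (1 + level) / 2), df = n_clusters)
  cbind(lower = beta_hat + q[1] * se_corr,
        upper = beta_hat + q[2] * se_corr)
}
```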
Implementation
For penalized GEE, we modified the R package ‘geefirthr’ available on GitHub at https://github.com/mhmondol/geefirthr. The version of ‘geefirthr’ available on GitHub cannot handle clusters consisting of a single observation and additionally estimates a scale parameter. Our modified version, which can also deal with clusters containing a single observation and fixes the scale parameter at 1, can be found at https://github.com/heogden/geefirthr. To implement the single-step and the iterated augmented GEE algorithms, we combined the function logistf in the R package ‘logistf’ version 1.24.1 [14], which implements FL, with the function geem2 in the R package ‘mmmgee’ version 1.20 [15], which implements weighted GEE. One should be aware that there are different ways of implementing weighted GEE: in the function geem2 the weights act as a scale factor for each observation, similar to the implementation in PROC GEE in SAS© [16]. For the ‘outer loop’ of the iterated augmented GEE approach we used the same convergence criterion as implemented in ‘geefirthr’, i.e. we declared the algorithm converged if \(\underset{m=0,\dots, p}{\max}\mid {\hat{\beta}}_m^{k+1}-{\hat{\beta}}_m^k\mid <0.001\) within k ≤ 20 iterations. We also used this convergence criterion for all GEE fits performed within the two augmented GEE approaches, allowing for up to 30 iterations. To make the single-step augmented GEE algorithm faster and to facilitate its convergence, we used the FL coefficient estimate obtained in the first step as the starting value for the GEE fit in the third step. Similarly, with iterated augmented GEE we used the coefficient estimate from the previous iteration as the starting value for the GEE fit in the current iteration. R code for the single-step and iterated augmented GEE algorithms can be found at https://github.com/AngelikaGeroldinger/augGEE.