 Research article
 Open Access
The alarming problems of confounding equivalence using logistic regression models in the perspective of causal diagrams
BMC Medical Research Methodology volume 17, Article number: 177 (2017)
Abstract
Background
Confounders can produce spurious associations between exposure and outcome in observational studies. For the majority of epidemiologists, adjusting for confounders using a logistic regression model is the habitual method, although it has problems with accuracy and precision. It is therefore important to highlight the problems of logistic regression and to search for an alternative method.
Methods
Four causal diagram models were defined to summarize confounding equivalence. Both theoretical proofs and simulation studies were performed to verify whether conditioning on different confounding equivalence sets had the same bias-reducing potential and then to select the optimum adjusting strategy, in which the logistic regression model and the inverse probability weighting based marginal structural model (IPW-based-MSM) were compared. The “do-calculus” was used to calculate the true causal effect of exposure on outcome; the bias and standard error were then used to evaluate the performance of the different strategies.
Results
Adjusting for different sets of confounding equivalence, as judged by identical Markov boundaries, produced different bias-reducing potential in the logistic regression model. For the sets that satisfied G-admissibility, adjusting for the set including all the confounders reduced bias to the same extent as adjusting for the set containing the parent nodes of the outcome, while the bias after adjusting for the parent nodes of exposure was not equivalent to them. In addition, all causal effect estimations through logistic regression were biased, although the estimation after adjusting for the parent nodes of exposure was nearest to the true causal effect. However, conditioning on different confounding equivalence sets had the same bias-reducing potential under IPW-based-MSM. Compared with logistic regression, the IPW-based-MSM could obtain unbiased causal effect estimation when the adjusted confounders satisfied G-admissibility, and the optimal strategy was to adjust for the parent nodes of outcome, which obtained the highest precision.
Conclusions
All adjustment strategies through logistic regression were biased for causal effect estimation, while IPW-based-MSM could always obtain unbiased estimation when the adjusted set satisfied G-admissibility. Thus, IPW-based-MSM is recommended for adjusting for confounder sets.
Background
Causal inference is a key task in epidemiology that seeks to establish causality between exposure and outcome. Theoretically, causality is the difference in outcome caused by a change in exposure, which can be obtained by ‘do-calculus’ in observational studies [1]. In practice, however, because the exposure cannot be intervened on in analytic epidemiology, confounders inevitably distort the causal effect of exposure on outcome [2,3,4,5]. For the majority of epidemiologists, adjusting for confounders using a logistic regression model for dichotomous outcomes is the routine method [6,7,8,9,10]. Although some studies have verified that different adjustment strategies in logistic regression models might lead to different magnitudes of bias (the estimate minus the true causal effect) and precision [8, 11], it is still the most commonly used strategy in analytic epidemiologic studies. This phenomenon is mainly attributed to vague knowledge about the behaviour of the logistic regression model. For causal inference in observational studies, the inverse probability weighting based marginal structural model (IPW-based-MSM) has been confirmed as an unbiased causal effect estimation approach for adjusting for measured confounders [12,13,14]. Unfortunately, the advantages of IPW-based-MSM are not recognized by most epidemiologists. Furthermore, for both logistic regression and IPW-based-MSM, the selection of adjustment variable sets remains a major challenge. Fortunately, the concept of confounding equivalence (c-equivalence) proposed by Judea Pearl might help in selecting adjustment strategies [15].
C-equivalence is used to determine whether two variable sets are equally valuable for adjustment, namely, whether adjustment for one set is guaranteed to have the same asymptotic bias as adjustment for the other [15]. Tests for c-equivalence are fairly easy to perform through a necessary and sufficient condition [15, 16], and they can also be implemented by propensity score methods [17]. This provides a strategy for selecting adjustment variable sets when using logistic regression models and IPW-based-MSMs, and helps to clarify whether adjusting for different c-equivalent sets has the same bias-reducing potential.
In this paper, we focused on 4 typical causal diagrams (Fig. 1), which summarize the generalization of c-equivalence, to examine the performance of logistic regression models and IPW-based-MSMs under the framework of c-equivalence. Both theoretical proofs and simulation studies were performed to determine whether adjusting for c-equivalent sets had the same bias-reducing potential and to observe their precision in logistic regression models and IPW-based-MSMs, respectively. We further compared the performance of c-equivalence between these two models by assessing their accuracy (bias) and precision (standard error). Our aim was to highlight the problems of c-equivalence using the logistic regression model as well as the advantages of IPW-based-MSM.
Methods
C-equivalence and its test
Let X, Y and Z be three disjoint subsets of discrete variables, and let P(x, y, z) be their joint distribution. The causal effect of X on Y can be defined as \( P\left(y\mid do(x)\right)=\sum \limits_zP\left(y\mid x,z\right)P(z) \) [5, 18, 19], where a sufficient set Z is chosen to include variables judged as “confounders” [16, 20, 21]. In this framework, two confounder sets T and Z are c-equivalent if \( \sum \limits_tP\left(y\mid x,t\right)P(t)=\sum \limits_zP\left(y\mid x,z\right)P(z) \) ∀x, y. This means that adjustment for set T or Z would produce the same asymptotic bias relative to the target causal effect quantity [15]. To state the necessary and sufficient condition of c-equivalence, it is first necessary to define the G-admissibility of a variable set S, which means that S satisfies the back-door criterion [19]: 1) no element of S is a descendant of X; 2) the elements of S block every path between X and Y that contains an arrow into X. The other ingredient of c-equivalence is the Markov boundary [15], defined as follows: let \( {S}_m \) be the minimal subset of S that satisfies the condition \( {\left(X\perp S\mid {S}_m\right)}_G \). This means that measurement of \( {S}_m \) renders X independent of all other members of S, and no proper subset of \( {S}_m \) has this property. The necessary and sufficient condition for T and Z to be c-equivalent is that at least one of the following holds: 1) \( {T}_m={Z}_m \); or 2) T and Z are G-admissible [15].
As an example, Fig. 1 illustrates the four typical causal diagrams with simple and complex confounding paths for the target causal path X→Y [22]. For instance, Fig. 1d contains three confounding paths X←W→Y, X←Z→W→Y and X←W←T→Y, with three corresponding confounders Z, W and T [22, 23]. Theoretically, conditioning on {Z, W}, {T, W} or {Z, T, W} achieves the same bias-reducing potential [23]. Thus, they are c-equivalent.
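The c-equivalence of {Z, W}, {T, W} and {Z, T, W} in Fig. 1d can be checked numerically at the population level. The sketch below is only an illustration: the edges are read off the three confounding paths listed above, but the logistic structural equations and every numeric coefficient are assumptions made for this example and are not taken from the study. The helper names (`joint`, `prob`, `adjusted_risk`) are likewise hypothetical.

```python
import itertools
import math

NAMES = ("Z", "T", "W", "X", "Y")


def expit(v):
    return 1.0 / (1.0 + math.exp(-v))


def bern(p, value):
    """P(V = value) for a Bernoulli(p) variable V."""
    return p if value == 1 else 1.0 - p


def joint(z, t, w, x, y):
    """Joint probability under hypothetical logistic structural equations.

    Edges follow Fig. 1d: Z->X, Z->W, T->W, T->Y, W->X, W->Y, X->Y.
    Every numeric coefficient below is an illustrative assumption.
    """
    return (bern(0.5, z)
            * bern(0.4, t)
            * bern(expit(-1.0 + 0.8 * z + 0.6 * t), w)
            * bern(expit(-0.5 + 0.7 * z + 0.9 * w), x)
            * bern(expit(-1.0 + 0.5 * x + 0.8 * w + 0.6 * t), y))


# Full joint distribution over the 32 configurations of (Z, T, W, X, Y).
TABLE = {v: joint(*v) for v in itertools.product((0, 1), repeat=5)}


def prob(**fixed):
    """Marginal probability of the event given as keyword arguments."""
    return sum(p for v, p in TABLE.items()
               if all(v[NAMES.index(k)] == val for k, val in fixed.items()))


def adjusted_risk(x, adjust):
    """sum_s P(Y=1 | X=x, S=s) P(S=s) for the adjustment set S = adjust."""
    total = 0.0
    for s in itertools.product((0, 1), repeat=len(adjust)):
        cond = dict(zip(adjust, s))
        total += prob(Y=1, X=x, **cond) / prob(X=x, **cond) * prob(**cond)
    return total


for x in (0, 1):
    print(x, [round(adjusted_risk(x, s), 6)
              for s in (("Z", "W"), ("T", "W"), ("Z", "T", "W"))])
```

For each value of x the three adjusted risks agree up to floating-point error, which is what c-equivalence of G-admissible sets requires of the adjustment formula itself, independently of any regression model.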
Statistical methods for adjusting for confounders
Besides the well-known logistic regression model, which is the habitual method for most epidemiologists, IPW-based-MSM is an alternative approach that can obtain an unbiased causal effect estimate [24, 25]. In IPW-based-MSM, the causal effect is estimated by inverse probability weighting, which can correct for confounding bias [26]. In this paper, the following stabilized weights, which have been recommended to increase statistical efficiency and to achieve better coverage of the confidence intervals, were used [13, 27],
\( {sw}_i=\frac{P\left(X={x}_i\right)}{P\left(X={x}_i\mid Z={z}_i\right)} \),
where Z is a set of variables which are considered to be confounders. By weighting the original observations using the stabilized weights (\( {sw}_i \)), we can fit the following marginal structural model to estimate the causal effect of X on Y,
\( \mathrm{logit}\kern0.1em P\left({Y}_x=1\right)={\beta}_0^{MSM}+{\beta}_1^{MSM}x \),
where the causal effect estimation of X on Y is \( {\beta}_1^{MSM} \).
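The weighting step can be sketched in a few lines of Python (the study itself used R; Python is used here only for illustration). The sketch assumes a single binary confounder and estimates P(X = 1 | Z) by stratum frequencies, which is a simplification rather than the authors' implementation. Because the marginal structural model with a binary exposure is saturated, its weighted maximum-likelihood fit reduces to a contrast of weighted risks, so no regression routine is needed. The function name `ipw_msm_log_or` is hypothetical.

```python
import numpy as np


def ipw_msm_log_or(x, y, z):
    """Stabilized-weight IPW estimate of the marginal log odds ratio of X on Y.

    x, y, z are 1-D arrays of 0/1 values; z is one binary confounder.
    Returns (beta1_msm, stabilized_weights).
    """
    x, y, z = (np.asarray(a) for a in (x, y, z))

    # Numerator: marginal P(X = x_i).
    p_x1 = x.mean()
    numerator = np.where(x == 1, p_x1, 1.0 - p_x1)

    # Denominator: P(X = x_i | Z = z_i), estimated within each stratum of Z.
    p_x1_given_z = np.where(z == 1, x[z == 1].mean(), x[z == 0].mean())
    denominator = np.where(x == 1, p_x1_given_z, 1.0 - p_x1_given_z)

    sw = numerator / denominator

    # Saturated MSM: the weighted MLE equals the weighted risk in each arm.
    risk1 = np.average(y[x == 1], weights=sw[x == 1])
    risk0 = np.average(y[x == 0], weights=sw[x == 0])
    beta1 = np.log(risk1 / (1 - risk1)) - np.log(risk0 / (1 - risk0))
    return beta1, sw
```

With stratum-frequency propensities and no empty exposure-by-confounder cell, the stabilized weights average exactly 1 in the sample, which is a convenient sanity check before fitting the weighted model.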
Theoretical derivation for bias-reducing potential of c-equivalence under logistic regression model
Taking Fig. 1a as an example, we deduced whether adjusting for different c-equivalence sets had the same bias-reducing potential under logistic regression by the following procedure.
1) Based on the necessary and sufficient condition, \( {A}_1=\{Z\} \), \( {A}_2=\{T\} \) and \( {A}_3=\{Z,T\} \) satisfied G-admissibility; thus they were c-equivalent, denoted by \( {A}_1\approx {A}_2\approx {A}_3 \).
2) Calculate the true causal effect (\( {ACE}^{\log (OR)} \)) of X on Y through the average causal effect (ACE) on the scale of the logarithm of the odds ratio (OR).
3) Calculate the effect (\( {\beta}_X^{set} \)) of X on Y by logistic regression.
4) Calculate the biases \( {\beta}_X^{A_1}-{ACE}^{\log (OR)} \), \( {\beta}_X^{A_2}-{ACE}^{\log (OR)} \) and \( {\beta}_X^{A_3}-{ACE}^{\log (OR)} \), and then deduce whether \( {\beta}_X^{A_1}-{ACE}^{\log (OR)}={\beta}_X^{A_2}-{ACE}^{\log (OR)}={\beta}_X^{A_3}-{ACE}^{\log (OR)} \).
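One way to write the quantities used in steps 2) to 4) is given below. This is a sketch that assumes the usual definition of a log odds ratio built from interventional risks, with S standing for any G-admissible set; the symbol \( \mathrm{bias}^{A_k} \) is introduced here only for illustration and does not appear in the original text.

```latex
\[
P\bigl(Y=1 \mid do(X=x)\bigr) \;=\; \sum_{s} P\bigl(Y=1 \mid X=x,\, S=s\bigr)\, P(S=s)
\]
\[
{ACE}^{\log(OR)} \;=\;
\log\frac{P\bigl(Y=1 \mid do(X=1)\bigr)}{1-P\bigl(Y=1 \mid do(X=1)\bigr)}
\;-\;
\log\frac{P\bigl(Y=1 \mid do(X=0)\bigr)}{1-P\bigl(Y=1 \mid do(X=0)\bigr)}
\]
\[
\mathrm{bias}^{A_k} \;=\; \beta_X^{A_k} - {ACE}^{\log(OR)}, \qquad k = 1, 2, 3
\]
```

Under this reading, equal bias-reducing potential across the three sets is the statement that the three values of \( \mathrm{bias}^{A_k} \) coincide, which is equivalent to the three coefficients \( {\beta}_X^{A_k} \) coinciding because the true effect is common to all of them.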
Simulation
Taking the four typical causal diagrams, which cover the generalization of c-equivalence (Fig. 1), as examples, a series of simulation studies was performed to determine whether adjusting for c-equivalent sets had the same bias-reducing potential and to observe their precision in logistic regression models and IPW-based-MSMs, respectively, and further to compare the performance of c-equivalence between these two models through assessing their accuracy and precision.
Four simulation scenarios were considered under the assumptions that: 1) all variables were binary and followed Bernoulli distributions; and 2) the effects of parent nodes on their child nodes were positive and log-linearly additive. Logistic regression models were used to simulate child nodes from their corresponding parent nodes.
For scenario 1 (Fig. 1a), the simulated data were generated as follows. Let \( P(Z=1)=\pi \). Then, \( P(T=1\mid Z)=\exp ({c}_0Z+{\alpha}_1)/(1+\exp ({c}_0Z+{\alpha}_1)) \) was used to derive the probability of child node T from its parent node Z. Similarly, \( P(X=1\mid Z)=\exp ({c}_1Z+{\alpha}_2)/(1+\exp ({c}_1Z+{\alpha}_2)) \) and \( P(Y=1\mid X,T)=\exp ({c}_3X+{c}_2T+{\alpha}_0)/(1+\exp ({c}_3X+{c}_2T+{\alpha}_0)) \) were used to obtain the probabilities of X = 1 and Y = 1, respectively, where the parameters \( {\alpha}_0 \), \( {\alpha}_1 \) and \( {\alpha}_2 \) denote the intercepts of Y, T and X, respectively, and each effect parameter (\( {c}_0 \), \( {c}_1 \), \( {c}_2 \), \( {c}_3 \)) refers to the effect of the parent node on its corresponding child node. Simulated data were generated for 1000 subjects by the above procedure.
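The data-generating procedure for scenario 1 translates directly into code. The following is a minimal Python sketch (the study used R); the function name `simulate_scenario1`, the argument names, and the parameter values in the usage line are assumptions for illustration, since the text does not list the values used.

```python
import numpy as np


def expit(v):
    return 1.0 / (1.0 + np.exp(-v))


def simulate_scenario1(n, pi, a0, a1, a2, c0, c1, c2, c3, rng):
    """Draw n subjects from the scenario 1 diagram (Z->T, Z->X, X->Y, T->Y).

    a0, a1, a2 are the intercepts of Y, T and X; c0..c3 are the edge effects
    Z->T, Z->X, T->Y and X->Y, matching the notation of the text.
    """
    z = rng.binomial(1, pi, size=n)
    t = rng.binomial(1, expit(c0 * z + a1))           # P(T=1 | Z)
    x = rng.binomial(1, expit(c1 * z + a2))           # P(X=1 | Z)
    y = rng.binomial(1, expit(c3 * x + c2 * t + a0))  # P(Y=1 | X, T)
    return z, t, x, y


# Hypothetical parameter values, chosen only to make the example run.
rng = np.random.default_rng(2017)
z, t, x, y = simulate_scenario1(1000, 0.5, -1.0, -0.5, -0.5,
                                1.0, 1.0, 1.0, 1.0, rng)
```

Passing the random generator explicitly keeps each of the repeated simulations reproducible from a single seed.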
In this scenario (Fig. 1a), the variable sets \( {A}_1=\{Z\} \), \( {A}_2=\{T\} \) and \( {A}_3=\{Z,T\} \) satisfied the necessary and sufficient conditions of c-equivalence; thus, \( {A}_1\approx {A}_2\approx {A}_3 \). Therefore, we compared three adjustment strategies with the following six models,
model 1: \( \mathrm{logit}\left(p\left(Y=1\mid X,{A}_1\right)\right)={\widehat{\beta}}_0^{A_1}+{\widehat{\beta}}_X^{A_1}X+{\widehat{\beta}}_Z^{A_1}Z \).
model 2: \( \mathrm{logit}\left(p\left(Y=1\mid X,{A}_2\right)\right)={\widehat{\beta}}_0^{A_2}+{\widehat{\beta}}_X^{A_2}X+{\widehat{\beta}}_T^{A_2}T \).
model 3: \( \mathrm{logit}\left(p\left(Y=1\mid X,{A}_3\right)\right)={\widehat{\beta}}_0^{A_3}+{\widehat{\beta}}_X^{A_3}X+{\widehat{\beta}}_T^{A_3}T+{\widehat{\beta}}_Z^{A_3}Z \).
model 4: \( \mathrm{logit}\kern0.1em P\left({Y}_x^{A_1}=1\right)={\widehat{\beta}}_0^{MSM\_{A}_1}+{\widehat{\beta}}_x^{MSM\_{A}_1}x \), with \( {sw}_i^{A_1}=\frac{P\left(X={x}_i\right)}{P\left(X={x}_i\mid {A}_1={A}_{1i}\right)} \).
model 5: \( \mathrm{logit}\kern0.1em P\left({Y}_x^{A_2}=1\right)={\widehat{\beta}}_0^{MSM\_{A}_2}+{\widehat{\beta}}_x^{MSM\_{A}_2}x \), with \( {sw}_i^{A_2}=\frac{P\left(X={x}_i\right)}{P\left(X={x}_i\mid {A}_2={A}_{2i}\right)} \).
model 6: \( \mathrm{logit}\kern0.1em P\left({Y}_x^{A_3}=1\right)={\widehat{\beta}}_0^{MSM\_{A}_3}+{\widehat{\beta}}_x^{MSM\_{A}_3}x \), with \( {sw}_i^{A_3}=\frac{P\left(X={x}_i\right)}{P\left(X={x}_i\mid {A}_3={A}_{3i}\right)} \).
where \( {\widehat{\beta}}_X^{A_1} \), \( {\widehat{\beta}}_X^{A_2} \), \( {\widehat{\beta}}_X^{A_3} \), \( {\widehat{\beta}}_x^{MSM\_{A}_1} \), \( {\widehat{\beta}}_x^{MSM\_{A}_2} \) and \( {\widehat{\beta}}_x^{MSM\_{A}_3} \) denote the causal effect estimates after conditioning on \( {A}_1 \), \( {A}_2 \) and \( {A}_3 \) by logistic regression and IPW-based-MSM, respectively. Given the true causal effect \( A\widehat{C}{E}^{\log (OR)} \) calculated by do-calculus, both the biases (\( {\widehat{\beta}}_X^{A_1}-A\widehat{C}{E}^{\log (OR)} \), \( {\widehat{\beta}}_X^{A_2}-A\widehat{C}{E}^{\log (OR)} \), \( {\widehat{\beta}}_X^{A_3}-A\widehat{C}{E}^{\log (OR)} \), \( {\widehat{\beta}}_x^{MSM\_{A}_1}-A\widehat{C}{E}^{\log (OR)} \), \( {\widehat{\beta}}_x^{MSM\_{A}_2}-A\widehat{C}{E}^{\log (OR)} \), \( {\widehat{\beta}}_x^{MSM\_{A}_3}-A\widehat{C}{E}^{\log (OR)} \)) and their corresponding standard errors (\( \mathrm{SE}\left({\widehat{\beta}}_X^{A_1}\right) \), \( \mathrm{SE}\left({\widehat{\beta}}_X^{A_2}\right) \), \( \mathrm{SE}\left({\widehat{\beta}}_X^{A_3}\right) \), \( \mathrm{SE}\left({\widehat{\beta}}_x^{MSM\_{A}_1}\right) \), \( \mathrm{SE}\left({\widehat{\beta}}_x^{MSM\_{A}_2}\right) \) and \( \mathrm{SE}\left({\widehat{\beta}}_x^{MSM\_{A}_3}\right) \)) were used to identify whether adjusting for the different c-equivalent sets \( {A}_1 \), \( {A}_2 \) or \( {A}_3 \) still produced the same bias reduction under the logistic regression model and IPW-based-MSM, and further to evaluate their accuracy and precision.
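The two performance measures can be summarized across simulation repetitions as sketched below. The text does not state how the standard error was aggregated over repetitions, so this example assumes the empirical standard deviation of the replicate estimates; the function name `bias_and_se` is hypothetical.

```python
import numpy as np


def bias_and_se(estimates, true_effect):
    """Summarize replicate estimates of one model against the true effect.

    bias: mean estimate minus the true causal effect (log OR scale).
    se:   empirical standard deviation of the estimates (an assumption;
          the averaged model-based standard error is another option).
    """
    est = np.asarray(estimates, dtype=float)
    return est.mean() - true_effect, est.std(ddof=1)


# Toy numbers, not study results.
bias, se = bias_and_se([1.0, 2.0, 3.0], true_effect=1.5)
```

Applying the same function to each of models 1 to 6 gives the six bias and precision values that are compared within a scenario.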
For scenario 2 (Fig. 1b), simulated data sets were created in the same way as in scenario 1. In this scenario, \( {A}_1=\{Z,W\} \), \( {A}_2=\{T,W\} \) and \( {A}_3=\{Z,T,W\} \) satisfied G-admissibility; thus, \( {A}_1\approx {A}_2\approx {A}_3 \). Therefore, three corresponding logistic regression models and three corresponding IPW-based-MSMs conditional on \( {A}_1 \), \( {A}_2 \) or \( {A}_3 \) were constructed to identify whether the c-equivalent sets had identical biases and to evaluate their precision. In addition, \( {B}_1=\{Z\} \) was c-equivalent to \( {B}_2=\{Z,T\} \), namely \( {B}_1\approx {B}_2 \), due to their identical Markov boundary, written as \( {B}_{1m}={B}_{2m}=\{Z\} \). Therefore, four corresponding models conditioning on \( {B}_1 \) or \( {B}_2 \) were used to calculate the biases and standard errors.
In scenario 3 (Fig. 1c), the simulated data were generated in the same way as in scenario 1. The sets \( {A}_1=\{Z\}\approx {A}_2=\{Z,T\} \) and \( {B}_1=\{W\}\approx {B}_2=\{W,V\} \) were separately c-equivalent due to \( {A}_{1m}={A}_{2m}=\{Z\} \) and \( {B}_{1m}={B}_{2m}=\{W\} \). As \( {A}_1\approx {A}_2 \) and \( {B}_1\approx {B}_2 \) were identical in the c-equivalence mechanism, it was sufficient to analyze one group to explore the c-equivalence mechanism of the identical Markov boundary. Thus, we constructed two logistic regression models and two IPW-based-MSMs conditioning on \( {A}_1 \) or \( {A}_2 \) to explore their c-equivalence and to evaluate their precision. Furthermore, as the variable sets \( {C}_1=\{Z,W\} \), \( {C}_2=\{T,V\} \) and \( {C}_3=\{Z,W,T,V\} \) blocked all back-door paths from X to Y, they were admissible and equivalent, \( {C}_1\approx {C}_2\approx {C}_3 \). Therefore, the six corresponding models conditional on \( {C}_1 \), \( {C}_2 \) or \( {C}_3 \) were fitted to identify biases and precision.
For scenario 4 (Fig. 1d), following the path directions, simulated data sets were created in the same way as in scenario 1. \( {A}_1=\{Z,W\} \), \( {A}_2=\{T,W\} \) and \( {A}_3=\{Z,T,W\} \) satisfied G-admissibility; thus, \( {A}_1\approx {A}_2\approx {A}_3 \). Their corresponding three logistic regression models and three IPW-based-MSMs conditional on \( {A}_1 \), \( {A}_2 \) or \( {A}_3 \) were used to observe the biases and precision.
For each of the 4 simulation scenarios, we varied the effect of a specific edge while keeping the others fixed, with 1000 simulation repetitions. The R programming language (http://cran.r-project.org/) was used to conduct the statistical simulations.
Results
Theoretical results for bias-reducing potential of c-equivalence under logistic regression model
We considered scenario 1 (Fig. 1a) as a typical diagram for deducing whether adjusting for different c-equivalence sets resulted in the same bias reduction under the logistic regression models. In this causal diagram, \( {A}_1=\{Z\} \), \( {A}_2=\{T\} \) and \( {A}_3=\{Z,T\} \) composed the c-equivalence group, which satisfied G-admissibility.
For \( {A}_1\approx {A}_2\approx {A}_3 \) of c-equivalence, the true causal effect of X on Y was calculated as
By conditioning on \( {A}_1=\{Z\} \), the effect of X on Y was equal to
Similarly, the effect of X on Y when conditioning on \( {A}_2=\{T\} \) was equal to
Additionally, the effect of X on Y when conditioning on \( {A}_3=\{T,Z\} \) was equal to
After a series of derivations (Additional file 1: Appendix), we obtained \( {\beta}_X^{A_2}={\beta}_X^{A_3} \) under any condition, suggesting that the bias reduction after adjusting for the c-equivalent sets \( {A}_2\approx {A}_3 \) was equivalent under the logistic regression model. \( {\beta}_X^{A_1}={\beta}_X^{A_2}={\beta}_X^{A_3} \) held only if \( {c}_2=0 \) or \( {c}_3=0 \), indicating that the bias reduction after adjusting for the c-equivalent sets \( {A}_1\approx {A}_2\approx {A}_3 \) was equivalent only in this situation. However, \( {\beta}_X^{A_1}<{\beta}_X^{A_2}={\beta}_X^{A_3} \) if \( {c}_2\ne 0 \) and \( {c}_3>0 \), and \( {\beta}_X^{A_1}>{\beta}_X^{A_2}={\beta}_X^{A_3} \) if \( {c}_2\ne 0 \) and \( {c}_3<0 \), indicating unequal bias reduction after adjusting for the c-equivalent sets \( {A}_1\approx {A}_2\approx {A}_3 \) when both \( {c}_2 \) and \( {c}_3 \) were not equal to zero (for more details, see Appendix).
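These inequalities can be illustrated numerically at the population level for scenario 1. The sketch below is not the derivation in the Appendix: it uses hypothetical parameter values, computes the true effect from the adjustment formula, and obtains the large-sample limit of the coefficient adjusted for Z by solving the population score equations with Newton steps. It also relies on the observation that, because Y was generated from a logistic model in X and T, the model adjusted for T (or for T and Z) is correctly specified, so its X coefficient equals \( {c}_3 \).

```python
import itertools
import math

import numpy as np

# Hypothetical parameter values (not from the study).
PI, A0, A1, A2 = 0.5, -1.0, -0.5, -0.5   # P(Z=1); intercepts of Y, T, X
C0, C1, C2, C3 = 1.0, 1.0, 1.0, 1.0      # effects Z->T, Z->X, T->Y, X->Y


def expit(v):
    return 1.0 / (1.0 + np.exp(-v))


def logit(p):
    return math.log(p / (1.0 - p))


def p_t1(z):
    return expit(C0 * z + A1)


def p_x1(z):
    return expit(C1 * z + A2)


def p_y1(x, t):
    return expit(C3 * x + C2 * t + A0)


def true_ace_log_or():
    """True effect on the log OR scale via adjustment for T."""
    p_t = PI * p_t1(1) + (1 - PI) * p_t1(0)

    def risk(x):
        return p_y1(x, 1) * p_t + p_y1(x, 0) * (1 - p_t)

    return logit(risk(1)) - logit(risk(0))


def limit_beta_x_adjusting_z(iters=50):
    """Population limit of beta_X in logit P(Y=1|X,Z) = b0 + bX*X + bZ*Z."""
    rows, weights, means = [], [], []
    for x, z in itertools.product((0, 1), repeat=2):
        p_z = PI if z else 1 - PI
        p_x = p_x1(z) if x else 1 - p_x1(z)
        rows.append([1.0, x, z])
        weights.append(p_z * p_x)                       # P(X=x, Z=z)
        means.append(p_y1(x, 1) * p_t1(z)
                     + p_y1(x, 0) * (1 - p_t1(z)))      # P(Y=1 | x, z)
    design, w, m = np.array(rows), np.array(weights), np.array(means)
    b = np.zeros(3)
    for _ in range(iters):                              # Newton-Raphson
        p = expit(design @ b)
        grad = design.T @ (w * (m - p))
        hess = design.T @ (design * (w * p * (1 - p))[:, None])
        b = b + np.linalg.solve(hess, grad)
    return b[1]


ace = true_ace_log_or()
beta_a1 = limit_beta_x_adjusting_z()
print(ace, beta_a1, C3)
```

With these values both the true effect and the Z-adjusted coefficient fall strictly between 0 and \( {c}_3 \), in line with \( {\beta}_X^{A_1}<{\beta}_X^{A_2}={\beta}_X^{A_3} \) for \( {c}_2\ne 0 \) and \( {c}_3>0 \).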
Simulation results
Scenario 1
For Fig. 1a, various simulation strategies were performed. From panels a and b of Fig. 2 and Additional file 2: Figure S1, for the logistic regression models, we observed that adjusting for the c-equivalent set \( {A}_2 \) or \( {A}_3 \) resulted in approximately equal biases, whereas the bias after adjusting for set \( {A}_1 \) was not equal to them. Moreover, the strategy of adjusting for \( {A}_1 \) achieved the minimum bias. When adjusting for confounders by IPW-based-MSM, the estimates of all the strategies were approximately equal and unbiased. Panels c and d of Fig. 2 and Additional file 2: Figure S1 showed that adjusting for \( {A}_2 \) by IPW-based-MSM achieved the highest precision in all situations. Thus, compared with logistic regression models, the IPW-based-MSM produced an unbiased causal effect estimate and the highest precision in this scenario. The optimal adjustment strategy was conditioning on \( {A}_2 \). Although the estimates from the logistic regression model were biased, adjusting for \( {A}_1 \) produced a result nearest to the true causal effect.
When varying the effect of Z on T with the other parameters fixed, the simulation results indicated that the biases of all six models (models 1–6) tended to be stable (Fig. 2a). Similar performance was observed when varying the effect of Z on X (Additional file 2: Figure S1a). However, when varying the effect of T on Y and keeping the other parameters constant, the bias showed a linearly increasing trend after adjusting for set \( {A}_2 \) or \( {A}_3 \) under the logistic regression model, but was approximately zero after adjusting for set \( {A}_1 \), whereas the biases remained stable under IPW-based-MSM (Fig. 2b). We observed similar trends as the effect of X on Y increased (Additional file 2: Figure S1b).
Scenario 2
In Fig. 1b, for the first group of c-equivalent sets \( {A}_1=\{Z,W\} \), \( {A}_2=\{T,W\} \) and \( {A}_3=\{Z,T,W\} \), we observed that the bias after adjusting for set \( {A}_2 \) was similar to that of \( {A}_3 \) but not to that of \( {A}_1 \), and that the strategy of adjusting for \( {A}_1 \) achieved the minimum bias under the logistic regression models, as shown in panels a and b of Fig. 3, Additional file 3: Figure S2 and Additional file 4: Figure S3. Adjustment for any of these c-equivalent sets through IPW-based-MSM had the same bias-reducing potential, and the estimates were unbiased. Panels c and d of these figures showed that adjusting for \( {A}_2 \) under IPW-based-MSM achieved the highest precision in all situations. Thus, conditioning on any c-equivalent set that satisfied G-admissibility through IPW-based-MSM produced an unbiased causal effect estimate, and adjustment for \( {A}_2 \) was the best strategy. When using logistic regression models to adjust for confounders, the optimal adjustment strategy was adjusting for variable set \( {A}_1 \).
In the logistic regression models, when keeping the other parameters constant, the bias increased as the effect of T on Y increased when adjusting for \( {A}_2 \) or \( {A}_3 \), whereas it increased in the opposite direction when adjusting for \( {A}_1 \) (Fig. 3a). All three models revealed increased biases as the effect of W on Y increased (Fig. 3b). Similar performance was observed when varying the effect of X on Y (Additional file 3: Figure S2b). When varying the effect of Z on T with the other parameters fixed, the simulation results indicated that the biases of all three adjustment strategies tended to be stable (Additional file 3: Figure S2b). We observed similar trends with the increase of the effect of Z on X (Additional file 4: Figure S3a) or the effect of W on X (Additional file 4: Figure S3b). When adjusting for confounders through IPW-based-MSM, the biases of all three adjustment strategies tended to be stable in all situations.
For the other group of c-equivalent sets, \( {B}_1=\{Z\} \) and \( {B}_2=\{Z,T\} \), panels a and b of Fig. 4, Additional file 5: Figure S4 and Additional file 6: Figure S5 showed that adjusting for c-equivalent set \( {B}_1 \) or \( {B}_2 \) yielded different bias reduction, and the bias after adjusting for \( {B}_1 \) was less than that after adjusting for \( {B}_2 \) under the logistic regression models. For IPW-based-MSM, the biases were equivalent after adjusting for \( {B}_1 \) or \( {B}_2 \). Panels c and d of these figures showed that adjusting for \( {B}_2 \) through IPW-based-MSM resulted in higher precision.
Keeping the other parameters constant, the bias increased as the effect of T on Y increased when adjusting for set \( {B}_2 \), whereas it was stable after adjusting for \( {B}_1 \) under logistic regression. A stable trend also appeared after adjusting for either set through IPW-based-MSM (Fig. 4a). Similar performance was observed when varying the effect of X on Y (Additional file 5: Figure S4b). When varying the effect of W on Y with the other parameters fixed, the simulation results indicated that the biases of the four models revealed an increasing trend (Fig. 4b). Similar trends were observed as the effect of W on X increased (Additional file 6: Figure S5b). When varying the effect of Z on T with the other parameters fixed, the biases of the four models were stable (Additional file 5: Figure S4a). Similar performance was observed when varying the effect of Z on X (Additional file 6: Figure S5a).
Scenario 3
In Fig. 1c, for the first group of c-equivalent sets, \( {A}_1=\{Z\} \) and \( {A}_2=\{Z,T\} \), Fig. 5, Additional file 7: Figure S6 and Additional file 8: Figure S7 showed that adjusting for c-equivalent set \( {A}_1 \) or \( {A}_2 \) resulted in different bias reduction, and the bias after adjusting for \( {A}_1 \) was less than that after adjusting for \( {A}_2 \) under logistic regression models. In contrast, the biases were equal after conditioning on \( {A}_1 \) and \( {A}_2 \) via IPW-based-MSM. With regard to the standard error, adjusting for \( {A}_2 \) through IPW-based-MSM resulted in higher precision.
For the other group of c-equivalent sets, \( {C}_1=\{Z,W\} \), \( {C}_2=\{T,V\} \) and \( {C}_3=\{Z,W,T,V\} \), the simulation results (Fig. 6, Additional file 9: Figure S8 and Additional file 10: Figure S9) showed that adjusting for the variable set \( {C}_2 \) resulted in a bias similar to that of set \( {C}_3 \) but not to that of \( {C}_1 \), and the strategy of adjusting for \( {C}_1 \) resulted in the minimum bias under the logistic regression models. However, the estimates of all strategies under IPW-based-MSM were approximately equivalent and unbiased. For the standard error, conditioning on \( {C}_2 \) by IPW-based-MSM resulted in the minimum standard error in all situations. Thus, IPW-based-MSM was a better method than logistic regression for controlling for confounders. The optimal adjustment strategy was conditioning on \( {C}_2 \) by IPW-based-MSM in this scenario. In addition, adjusting for \( {A}_1 \) produced the result that was nearest to the true causal effect under the logistic regression model.
Scenario 4
For Fig. 1d, simulation results (Fig. 7, Additional file 11: Figure S10 and Additional file 12: Figure S11) showed that adjusting for c-equivalent set \( {A}_2 \) or \( {A}_3 \) yielded similar bias reduction, whereas adjusting for \( {A}_1 \) was not equal to them, and the strategy of adjusting for \( {A}_1 \) achieved the minimum bias under logistic regression models. Conditioning on any confounding set through IPW-based-MSM had the same bias-reducing potential and produced unbiased estimates. With regard to the standard error, we observed that adjusting for \( {A}_2 \) by IPW-based-MSM resulted in higher precision in all situations. Thus, IPW-based-MSM produced unbiased causal effect estimates after conditioning on any c-equivalent set, and the strategy of adjusting for \( {A}_2 \) achieved the highest precision in this scenario. When using logistic regression models to adjust for confounders, adjusting for variable set \( {A}_1 \) produced the minimum bias.
Discussion
In this paper, we focused on the 4 typical causal diagrams shown in Fig. 1 to assess the performance of logistic regression models and IPW-based-MSMs with respect to c-equivalence. The necessary and sufficient condition for T and Z to be c-equivalent proposed by Pearl is that at least one of the following holds [15]: 1) \( {T}_m={Z}_m \); or 2) T and Z are G-admissible. Our results revealed that c-equivalent sets satisfying condition 1) (e.g., \( {A}_2 \) (T) and \( {A}_3 \) (Z, T) in scenario 2) had different bias-reducing potential under logistic regression. For condition 2), adjusting for the set including all confounders had approximately the same bias-reducing potential as adjusting for the set containing the parent nodes of Y, while adjusting for the set containing the parent nodes of X was not equivalent to adjusting for the two above sets. However, under the framework of IPW-based-MSM, conditioning on any c-equivalent set, as judged by the necessary and sufficient condition, still had the same bias-reducing potential. In summary, adjusting for different c-equivalent sets under logistic regression always produced different bias reduction, whereas when using IPW-based-MSM, the estimates of all strategies were approximately equivalent.
Adjusting for more confounders would improve the accuracy and precision of estimation in classic linear regression [28, 29]. Nevertheless, including more confounders in a logistic regression model usually leads to less bias and lower precision [30]. Our studies showed that adjusting for the set containing the parent nodes of X had the minimum bias in logistic regression. With regard to the standard error, adjusting for a set with fewer confounders would improve precision. Under the framework of IPW-based-MSM, we observed that adjusting for any set satisfying condition 2) gave unbiased estimates, and conditioning on the set containing all parent nodes of Y achieved the highest precision in all situations. In summary, compared with logistic regression, the IPW-based-MSM produced unbiased causal effect estimates when the adjusted variable sets satisfied condition 2), and the optimal adjustment strategy was conditioning on the parent nodes of outcome Y, which achieved the highest precision. Although the estimates obtained by logistic regression were biased, the estimate after adjusting for the parent nodes of the exposure X was nearest to the true causal effect.
The true causal effect of exposure on outcome calculated by “do-calculus” is defined in terms of marginal probability distributions. However, the conditional treatment effects estimated from a logistic regression model differ from the true causal effect [31, 32]. Logistic regression estimates do not behave like linear regression estimates. They are affected by omitted variables, even when those variables are unrelated to the independent variables in the model [11]. The use of IPW-based-MSM could lead to a more precise estimation of causal effects.
The discrepancy between the marginal OR and the conditional OR, even in the absence of confounders, is known as the non-collapsibility of the OR [4, 33]. The non-collapsibility effect depends on a variety of parameters, e.g., the effect of the exposure, and the prevalence and effect of the covariate [4, 33]. According to our results, the differences in estimates between the logistic regression model and IPW-based-MSM were numerically equal to the non-collapsibility effect. However, the discrepancy in estimates between these two models differed after adjusting for different c-equivalent sets, possibly because these sets contain different variables.
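Non-collapsibility in the absence of confounding can be seen in a small closed-form calculation. The sketch below assumes a binary covariate C that is independent of the exposure and affects the outcome through a logistic model; the function name `marginal_log_or` and all numeric values are illustrative assumptions, not quantities from the study.

```python
import math


def expit(v):
    return 1.0 / (1.0 + math.exp(-v))


def logit(p):
    return math.log(p / (1.0 - p))


def marginal_log_or(beta_x, beta_c, intercept, p_c):
    """Marginal log OR of X on Y after averaging over an independent C.

    The conditional model is logit P(Y=1 | X, C) = intercept + beta_x*X
    + beta_c*C, with P(C=1) = p_c and C independent of X (no confounding).
    """
    def risk(x):
        return (p_c * expit(intercept + beta_x * x + beta_c)
                + (1 - p_c) * expit(intercept + beta_x * x))

    return logit(risk(1)) - logit(risk(0))


conditional = 1.0
marginal = marginal_log_or(conditional, beta_c=2.0, intercept=-1.0, p_c=0.5)
print(conditional, marginal)
```

With these values the marginal log OR is smaller than the conditional log OR of 1.0 even though C is not a confounder, and the gap vanishes when the covariate has no effect on the outcome (beta_c = 0).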
Conclusions
In conclusion, the bias reduction differed after adjusting for c-equivalent sets under the logistic regression model, whereas it was approximately equivalent when using IPW-based-MSM. All adjustment strategies through logistic regression were biased, while IPW-based-MSM could always obtain unbiased estimation when the adjusted set satisfied G-admissibility. Thus, for adjusting for confounder sets, we recommend IPW-based-MSM rather than the logistic regression model.
Abbreviations
ACE: Average causal effect
c-equivalence: Confounding equivalence
IPW-based-MSM: Inverse probability weight based marginal structural model
OR: Odds ratio
References
 1.
Pearl J. The docalculus revisited. In: Proceedings of the twentyeighth conference on uncertainty in artificial intelligence (UAI12); 2012. p. 4–11.
 2.
Weinberg CR. Toward a clearer definition of confounding. Am J Epidemiol. 1993;137(1):1–8.
 3.
Howards PP, Schisterman EF, Poole C, et al. “Toward a clearer definition of confounding” revisited with directed acyclic graphs. Am J Epidemiol. 2012;176(6):506–11.
 4.
Greenland S, Robins JM, Pearl J. Confounding and collapsibility in causal inference. Stat Sci. 1999;14(1):29–46.
 5.
Grimes DA, Schulz KF. Bias and causal associations in observational research. Lancet. 2002;359(9302):248–52.
 6.
MacKenzie TA, Tosteson TD, Morden NE, et al. Using instrumental variables to estimate a Cox’s proportional hazards regression subject to additive confounding. Health Serv Outcomes Res Methodol. 2014;14(1–2):54–68.
 7.
Liu W, Brookhart MA, Schneeweiss S, et al. Implications of M bias in epidemiologic studies: a simulation study. Am J Epidemiol. 2012;176(10):938–48.
 8.
Robinson LD, Jewell NP. Some surprising results about covariate adjustment in logistic regression models. Int Stat Rev. 1991;59(2):227–40.
 9.
Smolle C, Tuca A, Wurzer P, et al. Complications in tissue expansion: a logistic regression analysis for risk factors. Burns. 2017;
 10.
Gong X, Cui J, Jiang Z, et al. Risk factors for pedicled flap necrosis in hand soft tissue reconstruction: a multivariate logistic regression analysis. ANZ J Surg. 2017. doi:10.1111/ans.13977.
 11.
Mood C. Logistic regression: why we cannot do what we think we can do, and what we can do about it. Eur Sociol Rev. 2010;26(1):67–82.
 12.
Cole SR, Hernán MA. Constructing inverse probability weights for marginal structural models. Am J Epidemiol. 2008;168(6):656–64.
 13.
Hernán MA, Robins JM. Estimating causal effects from epidemiological data. J Epidemiol Community Health. 2006;60(7):578–86.
 14.
Robins JM, Hernán MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000;11(5):550–60.
 15.
Pearl J, Paz A. Confounding equivalence in causal inference. J Causal Inference. 2014;2(1):75–93.
 16.
Pearl J. Invited commentary: understanding bias amplification. Am J Epidemiol. 2011;174(11):1228–9.
 17.
Pearl J. Causal inference in statistics: an overview. Stat Surv. 2009;3:96–146.
 18.
Pearl J. The deductive approach to causal inference. J Causal Inference. 2014;2(2):115–29.
 19.
Pearl J. Causal diagrams and the identification of causal effects. In: Causality. Cambridge: Cambridge university press; 2009.
 20.
Knüppel S, Stang A. DAG program: identifying minimal sufficient adjustment sets. Epidemiology. 2010;21(1):159.
 21.
Evans D, Chaix B, Lobbedez T, et al. Combining directed acyclic graphs and the change-in-estimate procedure as a novel approach to adjustment-variable selection in epidemiology. BMC Med Res Methodol. 2012;12(1):156.
 22.
Greenland S, Pearl J, Robins JM. Causal diagrams for epidemiologic research. Epidemiology. 1999;10(1):37–48.
 23.
VanderWeele TJ. On the relative nature of over adjustment and unnecessary adjustment. Epidemiology. 2009;20(4):496–9.
 24.
Robins JM. Causal inference from complex longitudinal data. Latent variable modeling and applications to causality. 120th ed; 1997. p. 69–117.
 25.
Robins JM, Greenland S, Hu FC. Estimation of the causal effect of a time-varying exposure on the marginal mean of a repeated binary outcome. J Am Stat Assoc. 1999;94(447):687–700.
 26.
Robins JM. Marginal structural models. 1997 Proc Am Stat Assoc. 1998;1998:1–10.
 27.
Hernán MA, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. Epidemiology. 2000;11(5):561.
 28.
McNamee R. Regression modelling and other methods to control confounding. Occup Environ Med. 2005;62(7):500–6.
 29.
Hosman C, Hansen B, Holland P. The sensitivity of linear regression coefficient confidence limits to the omission of a confounder. Ann Appl Stat. 2010;4(2):849–70.
 30.
Li H, Yuan Z, Su P, et al. A simulation study on matched case-control designs in the perspective of causal diagrams. BMC Med Res Methodol. 2016;16(1):102.
 31.
Moffitt R. Estimating marginal treatment effects in heterogeneous populations. Ann d'Econ Stat. 2008;91(91):239–61.
 32.
Heckman JJ, Vytlacil E. Structural equations, treatment effects, and econometric policy evaluation. Econometrica. 2005;73(3):669–738.
 33.
Pang M, Kaufman JS, Platt RW. Studying noncollapsibility of the odds ratio with marginal structural and logistic regression models. Stat Methods Med Res. 2016;25(5):1925–37.
Acknowledgements
We thank the reviewers and academic editors for their constructive comments and suggestions, and we acknowledge our colleagues for their invaluable work. We have also benefited greatly from suggestions provided by the biostatistics group at Shandong University, and we are grateful for the support of the National Natural Science Foundation of China.
Funding
This work was supported by grants from the National Natural Science Foundation of China (grant numbers 81773547 and 81573259).
Availability of data and materials
Not applicable
Author information
Contributions
YYY, HKL, YXL and FZX conceived and designed the study. YYY performed the simulation and the theoretical proof, and HKL refined the theoretical proof. YYY, HKL, XRS, PS, TTW, YL and ZSY drafted and revised the manuscript. All authors read and approved the final manuscript.
Corresponding authors
Correspondence to Yanxun Liu or Fuzhong Xue.
Ethics declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional files
Additional file 1:
Appendix: Deducing whether c-equivalence sets have the same bias-reducing potential under the logistic regression model. (DOCX 107 kb)
Additional file 2: Figure S1.
Scenario 1 (Fig. 1a): simulation results for the bias and standard error of the c-equivalence sets A₁ ≈ A₂ ≈ A₃ as the log-transformed odds ratio effects of Z on X and X on Y vary. (PDF 25 kb)
Additional file 3: Figure S2.
Scenario 2 (Fig. 1b): simulation results for the bias and standard error of the c-equivalence sets A₁ ≈ A₂ ≈ A₃ as the log-transformed odds ratio effects of Z on T and X on Y vary. (PDF 25 kb)
Additional file 4: Figure S3.
Scenario 2 (Fig. 1b): simulation results for the bias and standard error of the c-equivalence sets A₁ ≈ A₂ ≈ A₃ as the log-transformed odds ratio effects of Z on X and W on X vary. (PDF 25 kb)
Additional file 5: Figure S4.
Scenario 2 (Fig. 1b): simulation results for the bias and standard error of the c-equivalence sets B₁ ≈ B₂ as the log-transformed odds ratio effects of Z on T and X on Y vary. (PDF 19 kb)
Additional file 6: Figure S5.
Scenario 2 (Fig. 1b): simulation results for the bias and standard error of the c-equivalence sets B₁ ≈ B₂ as the log-transformed odds ratio effects of Z on X and W on X vary. (PDF 19 kb)
Additional file 7: Figure S6.
Scenario 3 (Fig. 1c): simulation results for the bias and standard error of the c-equivalence sets A₁ ≈ A₂ as the log-transformed odds ratio effects of Z on X and W on V vary. (PDF 19 kb)
Additional file 8: Figure S7.
Scenario 3 (Fig. 1c): simulation results for the bias and standard error of the c-equivalence sets A₁ ≈ A₂ as the log-transformed odds ratio effects of T on Y, W on X and X on Y vary. (PDF 27 kb)
Additional file 9: Figure S8.
Scenario 3 (Fig. 1c): simulation results for the bias and standard error of the c-equivalence sets C₁ ≈ C₂ ≈ C₃ as the log-transformed odds ratio effects of Z on X and W on V vary. (PDF 25 kb)
Additional file 10: Figure S9.
Scenario 3 (Fig. 1c): simulation results for the bias and standard error of the c-equivalence sets C₁ ≈ C₂ ≈ C₃ as the log-transformed odds ratio effects of T on Y, W on X and X on Y vary. (PDF 35 kb)
Additional file 11: Figure S10.
Scenario 4 (Fig. 1d): simulation results for the bias and standard error of the c-equivalence sets A₁ ≈ A₂ ≈ A₃ as the log-transformed odds ratio effects of Z on X and X on Y vary. (PDF 25 kb)
Additional file 12: Figure S11.
Scenario 4 (Fig. 1d): simulation results for the bias and standard error of the c-equivalence sets A₁ ≈ A₂ ≈ A₃ as the log-transformed odds ratio effects of Z on W, T on W and W on Y vary. (PDF 35 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Yu, Y., Li, H., Sun, X. et al. The alarming problems of confounding equivalence using logistic regression models in the perspective of causal diagrams. BMC Med Res Methodol 17, 177 (2017). https://doi.org/10.1186/s12874-017-0449-7
Keywords
 Confounding equivalence
 Logistic regression model
 Inverse probability weighting based marginal structural model
 Simulation study
 Causal diagrams