A novel method for controlling unobserved confounding using double confounders

Background Controlling unobserved confounding still remains a great challenge in observational studies, and a series of strict assumptions of the existing methods usually may be violated in practice. Therefore, it is urgent to put forward a novel method. Methods We are interested in the causal effect of an exposure on the outcome, which is always confounded by unobserved confounding. We show that, the causal effect of an exposure on a continuous or categorical outcome is nonparametrically identified through only two independent or correlated available confounders satisfying a non-linear condition on the exposure. Asymptotic theory and variance estimators are developed for each case. We also discuss an extension for more than two binary confounders. Results The simulations show better estimation performance by our approach in contrast to the traditional regression approach adjusting for observed confounders. A real application is separately applied to assess the effects of Body Mass Index (BMI) on Systolic Blood Pressure (SBP), Diastolic Blood Pressure (DBP), Fasting Blood Glucose (FBG), Triglyceride (TG), Total Cholesterol (TC), High Density Lipoprotein (HDL) and Low Density Lipoprotein (LDL) with individuals in Shandong Province, China. Our results suggest that SBP increased 1.60 (95% CI: 0.99–2.93) mmol/L with per 1- kg/m2 higher BMI and DBP increased 0.37 (95% CI: 0.03–0.76) mmol/L with per 1- kg/m2 higher BMI. Moreover, 1- kg/m2 increase in BMI was causally associated with a 1.61 (95% CI: 0.96–2.97) mmol/L increase in TC, a 1.66 (95% CI: 0.91–55.30) mmol/L increase in TG and a 2.01 (95% CI: 1.09–4.31) mmol/L increase in LDL. However, BMI was not causally associated with HDL with effect value − 0.20 (95% CI: − 1.71–1.44). And, the effect value of FBG per 1- kg/m2 higher BMI was 0.56 (95% CI: − 0.24–2.18). Conclusions We propose a novel method to control unobserved confounders through double binary confounders satisfying a non-linear condition on the exposure which is easy to access.


Background
Controlling unobserved confounding is a great challenge when estimating the causal effect of an exposure on an outcome of interest in observational studies [1][2][3][4]. Several techniques such as traditional regression model, marginal structure model, adjustment, stratification, inverse probability weighing (IPW), matching based on propensity score cannot deal with unobserved confounding [5][6][7][8][9]. The observed data distribution may have compatibility with many contradictory causal explanations due to the existence of unobserved confounding. In this circumstance, we say the causal estimand is not identified. On the contrary, when the causal estimand can be obtained entirely from observable probability distributions, we say the query is identified.
Some methods have been developed to alleviate the problems caused by unobserved confounding. Instrumental variable analysis (IVA) is the commonly used method to eliminate unobserved confounding [10]. But in practice, choosing valid instruments (IVs) is a stumbling block in IVA [11][12][13]. The difference-in-differences (DID) is contingent on the availability of repeated outcomes in both periods, but invokes strict parallel trend assumptions, i.e., confounders varying across the groups are time invariant and time-varying confounders are group invariant [14,15]. Regression discontinuity design (RDD) is a quasi-experimental pretest-posttest design for controlling unobserved confounding by assigning a cutoff or threshold above or below to a treatment [16]. Nevertheless, RDD requires that treatment assignment is sufficiently randomized at the threshold [17]. Negative controls are widely used in epidemiologic practice to detect the presence of unobserved confounding. While a valid negative control outcome needs to be influenced by the same unobserved confounders of the exposure effects on the outcome in view, although not directly influenced by the exposure. But this approach fails to obtain causal estimation of the exposure on the outcome [18]. Moreover, strict assumptions of the methods above usually may be violated in practice and impose restrictions on their generalization.
In this article, we propose a novel method to control unobserved confounding through double confounders with two values satisfying a non-linear condition on the exposure. Under the assumption of ignorable treatment assignment, causal effects can be identified and estimated using commonly generalized moment estimate model. Furthermore, we relax the assumption that observed and unobserved confounders are independent in sensitivity analysis and observe that even when the correlations between binary observed confounders and unobserved confounders are relatively weak, we still obtain the almost unbiased causal effect estimation. Additionally, we explore the statistical properties of this method by a simulation study and compare with the traditional regression approach only adjusting for observed confounders. Finally, we apply this method to a cohort from a follow-up survey (136,895 individuals) from 2007 to 2015 in Jining, China to exam the causal associations of BMI on other factors, including SBP, DBP, FBG, TG, TC, HDL and LDL.

Notation and preliminaries
Throughout, we let X, Y, C 1 , C 2 and U denote the treatment, outcome, two observed confounders and unobserved confounders, respectively ( Fig. 1). Following the convention in causal inference, we use Y(x) to denote the potential outcome of Y under an intervention which sets X to x, and the observed outcome Y is a realization of the potential outcome under the exposure actually received: Y = Y(x) when X = x. We focus on the average causal effect (ACE) of X on Y which is the difference in expectation of potential outcome at two different exposure levels 0 and 1, for instance, ACE X → Y = E(Y(1) − Y(0)) for a binary exposure. The conditional ignorability assumption Y(x) ⊥ x | C 1 , C 2 is conventionally made in causal inference, but it does not hold in the present of unmeasured confounding U. In this case, latent ignorability Y(x) ⊥ x | C 1 , C 2 , U is more reasonable, allowing for an unobserved confounder U that captures the source of non-ignorability of the exposure mechanism. The model with continuous exposure and outcome can be developed as follows.
where ε Y and ε X are two mutually independent random errors with means 0 and variances σ 2 Y and σ 2 X respectively, φ(·) and ϕ(·) are two arbitrary functions. The causal effect of X on Y is interpreted as β 1 . Similarly, for binary exposure and outcome, we can also construct corresponding linear probability model (LPM) for P(X = 1| C 1 , C 2 , U), P(Y = 1| C 1 , C 2 , U), respectively. In addition, the "complementary log link" of transformation of Y (i.e. − log(1 − Risk), Risk denotes cumulative completion risk) is also appropriate for our method [19].

Identification and estimation of causal effects with double binary confounders
Obviously, the estimation of regression model adjusting for C 1 and C 2 is biased when there exists an unobserved confounder U. Unfortunately, the no unobserved confounding assumption usually not be satisfied in practical studies. Next, we explain how to identify and estimate the causal effect by relaxing the no unobserved confounding assumption in four cases with different types of exposure and outcome (continuous or binary). It will be shown below that causal effect β 1 in (2) is not identifiable if the function F(X| C 1 , C 2 ) is linear with respect to C 1 and C 2 .
Within the causal framework provided by Fig. 1, we propose four assumptions and discuss the necessary and sufficient condition for identification of parameters in the model (2).
Assumption 3: The effect of (X, C 1 , C 2 ) on Y is linear (or no interaction).
Assumption 4: The effect of (C 1 , C 2 ) on X is nonlinear (e.g. with an interaction effect or the quadratic term of C 1 and C 2 on X).
Under Assumptions 1 and 2, E(ϕ(U, ε Y )| C 1 , C 2 ) = 0 is satisfied. The Assumption 2 is the same as the exchangeability assumption in instrumental variable (IV) analysisan IV is not associated with any confounder of the exposure-outcome relationship. In addition, a valid IV must have a main effect on X and no direct effect on Y, which are called the Relevance and the Exclusion restriction assumption, respectively [20]. In our method, (C 1 , C 2 ) can be regarded as two near-IVs with a non-linear effect on X and a linear effect on Y as stated in the Assumption 3 and 4.
Theorem 1: The causal effect β 1 of X on Y in model (2) is identifiable if and only if the Assumptions 1-4 are satisfied.
See Appendix A for the proof of Theorem 1. Causal effect is identified based on above four assumptions. And various estimation approaches have been applied to estimate the causal effect, including moment estimation, maximum likelihood estimation and generalized moment estimate model (GMM) with different assumptions [21][22][23]. In this section, we aim to find an efficient estimation of causal effect in the model (2).
In our estimation approach, the pivotal "orthogonality condition" is the independence (C 1 , C 2 ) ⊥ ϕ(U, ε Y ), which implies the following equation: where f(·) = (f 1 (·), ⋯, f K (·)) T is an arbitrary vector function and 0 is a K × 1 zero vector. For the case of confounders C 1 and C 2 with two values, we chose the function f * (C 1 , the Eq. (4) can be rewritten as where the elements of c Note that the nonlinearity of E[X| C 1 , C 2 ] with respect to C 1 and C 2 implies The estimated causal effect b β 1 is reasonable compared with the "ratio estimator" of a binary IV when C 1 and C 2 are regarded as two near-IVs. Denote the sub-sample average of y and x by y 1 and x 1 when z = 1 and by y 0 and x 0 when z = 0. Then Δy=Δz ¼ y 1 − y 0 and Δx=Δz ¼ x 1 − x 0 , and "ratio estimator" of an IV [24] iŝ , and β can be identified only when rank(Q) = l + 2, l = 2, where l denotes the number of the observed confounders. For any f(·) that makes rank(Q) = l + 2, a GMM estimate of β iŝ whereŴ is denoted as a symmetric and positive definite weight matrix for K ≥ (l + 2), K = rank(Q). The elements ofĝðβÞ;Q andR are sample means of the corresponding elements of g(β), Q and R respectively [20]. IfŴ →W as N → ∞ in probability where W is positive semi-definite and N denotes the sample size. The main general properties about GMM estimators under the appropriate regularity conditions are that, 1.β→β in probability as N → ∞ where β denotes the true parameter.
2. ffiffiffiffi N p ðβ − βÞ converges in distribution to a normal distribution with mean zero and variance ðQ T WQÞ Then we describe a GMM estimator with the efficient instrument (a function of C 1 , C 2 ) proposed by Newey and McFadden [25], which has the minimum variance among all estimators satisfying the Eq. (4). It also shown that there is an efficient estimatorβ when the instrument f(C 1 , C 2 ) is defined as From (5), the efficient GMM estimator of β is defined aŝ The matrix Q eff equals E[f(C 1 , C 2 )f(C 1 , C 2 ) T ] proved in Appendix B and has full rank if the nonlinearity condition in Theorem 1 holds. Since Q eff and the positive definiteŴ for l = 2 are full rank,β eff is a valid estimator which can be simplified as We findβ eff does not rely on the choice ofŴ at this point. From the above property of the GMM estimator, the asymptotic variance ofβ eff is easily obtained by Q eff and σ 2 Additionally, we show for two binary observed confounders that any f(·) which makes the Eq. (4) have the unique solution leads to the same estimator of parameters as that obtained by f * (C 1 , C 2 ) in Appendix B. Therefore, for the case of two binary observed confounders, our estimator b β Ã is efficient and it is not necessary to choose an extra function f(·) to improve the efficiency.
Extend to the case of more than two binary confounders Note that the method proposed is an easy tool to detect the causal effect in the absence of enough confounding information. Because of a growing appreciation of the power gains of multivariate association analyses, more than two covariates are generally selected to analysis in practice. In this section, for the case of more than two binary confounders, we discuss the identification of parameter β 1 in the following model Define β = (β 0 , β 1 , β 2 , ⋯, β l + 1 ) T , Eq. (7) can be rewritten as Q × β = R, where Q = (1, E(X| C 1 , C 2 , ⋯C l ), C 1 , C 2 , ⋯C l ) denotes a 2 l × (l + 2) matrix, β is a (l + 2) vector. According to the equation Q × β = R, parameter β 1 is identifiable if K = rank(Q) ≥ (l + 2), such that the matrix Q has full column rank. We extend Assumptions 2-4 to the case of more than two binary confounders as follows: Assumption 2*: (C 1 , C 2 , ⋯, C l ) ⊥ ϕ(U, ε Y ), i.e. (C 1 , C 2 , ⋯, C l ) are not associated with any confounder (U) of the exposure-outcome relationship and random errors ε Y .
Assumption 3*: The effect of (X, C 1 , C 2 , ⋯, C l ) on Y is linear.
Assumption 4*: The effect of (C 1 , C 2 , ⋯, C l ) on X is non-linear (e.g. with an interaction effect or the quadratic term of (C 1 , C 2 , ⋯, C l ) on X).
Thus, under the Assumption 1 and the Assumptions 2*-4*, the causal effect of X on Y can be identified.
Similar to the case of two binary observed confounders C 1 and C 2 , different f(·) for C 1 , C 2 , ⋯, C l in (7) leads to different estimators. Here, we have Under the conditional expectation (8), the efficient From the above property of the GMM estimator, we can obtain the asymptotic variance ofβ eff in this case whereQ eff andR eff are the sample means of

Simulations
In order to investigate the performance of our method in simulation study, as well as to determine in which scenario it performs well or badly in comparison with the traditional regression approach adjusting for observed confounders, we perform simulations with four cases. In our simulations, data are generated based on the causal diagram depicted in Fig. 1.
In each case (exposure and outcome are continuous or binary respectively, denoted as Simulation A: exposure and outcome are both continuous; Simulation B: exposure is binary while outcome is continuous; Simulation C: exposure is continuous while outcome is binary; Simulation D: exposure and outcome are both binary), we take into account two scenarios where C 1 , C 2 and U are independent, C 1 , C 2 and U are correlated respectively (For Simulation A, two scenarios denotes as Simulation A1, Simulation A2, respectively, similar notation for Simulation B, C, D). We performed simulation to compare performances of three methods including crude association without adjusting for C 1 , C 2 and U (model 1), model of adjusting for C 1 and C 2 (model 2) and the method we proposed here (model 3). We simulated baseline covariates and a quantitative exposure in a large population consisting of 2000 subjects in Simulation A, 10000 subjects in Simulation B and 100,000 subjects in Simulation C, D.
The data of Simulation B1-B2, Simulation C1-C2, Simulation D1-D2 were similarly generated for each individual. Details are listed in the Appendix C.

Data application
Numerous epidemiological studies have evaluated the relationships between BMI and SBP, DBP, FBG, TG, TC, HDL and LDL, respectively, but the causal effects are inconclusive due to the existence of unobserved confounding [26][27][28]. Hence, we use the method proposed in this article to evaluate the potential causal effect of BMI on SBP, DBP, FBG, TG, TC, HDL and LDL compared with the traditional regression approach adjusting for age (discrete) and gender, adjusting for age (continuous) and gender and adjusting for age (discrete), gender and other health factors (including SBP, DBP, FBG, TC, TG, HDL and LDL) using data from a follow-up survey in Jining, Shandong Province.

Simulation results
We separately varied across the main effect value α 1 of C 1 on X, the main effect value α 2 of C 2 on X, the interaction effect α 3 of C 1 and C 2 on X, the causal effect β 1 of X on Y, the confounding effect β 2 of C 1 on Y, the correlation coefficient c between C 1 and C 2 as well as correlation coefficient d between C 1 and U.
We changed one parameter at a time while keeping all others at their basic values, when N = 2000 for Simulation A1-A2, N = 10,000 for Simulation B1-B2, N = 100, 000 for Simulation C1-C2 and N = 100,000 for Simulation D1-D2. Results showed the estimated bias, standard error (SE) and the mean squared error (MSE) from the three models for varied effects of C 1 on X, the interaction effects of C 1 and C 2 on X, U on X, X on Y, C 1 on Y, U on Y, and the correlations among C 1 , C 2 and U. The results showed the estimates of the model without any adjustment (model 1) and the model adjusting for both C 1 and C 2 (model 2) were biased. When we used the method we proposed (model 3) to analyze simulated data sets, the estimates were unbiased and had acceptable standard error. And our method had better MSE than other methods (i.e. model 1 and model 2). Figures 2  and 3 showed the results of Simulation A1, Fig. 4 showed the partial results of Simulation A2, the rest of results of Simulation A2 and other six different scenarios were showed in Figure S1, S2, S3, S4, S5, S6, S7, S8, S9, S10, S11, S12, S13 (Figure S1-S2 for Simulation A2, Figure S3-S4 for Simulation B1, Figure S5, S6, S7 for Simulation B2, Figure S8-S9 for Simulation C1, Figure S10 for Simulation C2, Figure S11-S12 for Simulation D1, Figure  S13 for Simulation D2). Specifically, when varying across the effects of C 1 on Y, the biases of unadjusted model (model 1) had a significant positive linear correlation with β 2 and the ones of adjustment for C 1 and C 2 (model 2) remained stable. When varying across the effects of U on Y, the biases of both model 1 and model 2 linearly increased. Our method still had unbiased causal effect estimates and lower MSE than other two methods. Similarly, when varying across the effects of C 1 , U on X, respectively, the biases of model 1 and model 2 monotonically varied. Furthermore, the larger interaction effect of C 1 and C 2 on X, the causal effect estimation had higher precision for our proposed method (model 3). In order to make sure model 3 remains the smallest MSE among three models, we suggested the interaction effect of C 1 and C 2 on X moderately larger. Certainly, the bias did not change significantly from the basic scenario in any of the models when we changed the causal effect of X on Y. Moreover, when there existed an correlation between observed confounders C 1 and C 2 , our method still got unbiased causal effect estimated. We still obtained the almost unbiased causal effect estimation if P(C 1 = 1) = P(C 2 = 1) = 0.5 when observed confounder C 1 (or C 2 ) and unobserved confounders U were correlated.
Compared to the situation of continuous variables, sample size N needs to be larger for discrete exposure and outcome variable. Under the linear probability model setting of X, C 1 , C 2 , U on Y (i.e. 0 ≤ P(Y| X, C 1 , C 2 , U) ≤ 1), the effects of C 1 on X, C 2 on X, the interaction effect of C 1 and C 2 on X, U on X should be relatively small. Furthermore, our method had better MSE than model 2, and was similar with model 1, the estimates remained unbiased. We use R program (version 3.6.1) to reproduce all simulations and analyses which are available on Github (https://github.com/LULIU1816/Two-binary-confounder).

Data application results
In this section, we used the method proposed in this article to evaluate the potential causal effects of BMI on Fig. 2 The Simulation A1 result. Results shows the estimated biases, SE and MSE from the 3 models for varied effects of a C 1 on X, b the interaction effect of C 1 and C 2 on X, c U on X SBP, DBP, FBG, TG, TC, HDL and LDL using data from a follow-up survey in Jining, Shandong Province. Furthermore, we selected age and gender as near-IVs compared with the traditional regression approaches adjusting for age (discrete) and gender, adjusting for age (continuous) and gender and adjusting for age (discrete), gender and other health factors (including SBP, DBP, FBG, TC, TG, HDL and LDL). The cohort recruited 136, 895 individuals aged 20-years between 2007 and 2015. In order to avoid reverse causation, we selected the data of BMI, age, gender in 2014 and SBP, DBP, FBG, TG, TC, HDL and LDL in 2015(N = 7013), respectively. Age and gender were divided into two discrete variables 0, 1. Age was discretized by median 40 as 0 and 1. Subject characteristics showed at Table 1. Proportion of female was 41.01%. Kolmogorov-Smirnov test is one of normality tests when the sample size over 5000. The median SBP was 120 (95% CI: 109-160) mmHg, the median DBP was 75 (95% CI: 66-101) mmHg. The median FBG was 5.20 (95% CI: 4.80-8.20) mmol/L, the median TC Fig. 3 The Simulation A1 result. Results shows the estimated biases, SE and MSE from the 3 models for varied effects of a X on Y, b C 1 on Y, c U on Y was 4.64 (95% CI: 4.01-6.76) mmol/L, the median TG was 1.04 (95% CI: 0.60-4.67) mmol/L, the median HDL was 1.29 (95% CI: 1.10-1.87) mmol/L, the median LDL was 2.76 (95% CI: 2.23-4.30) mmol/L. And, the median BMI was 24.68 (95% CI: 21.83-32.07) kg/m 2 .
Results of our proposed method and the traditional regression approach adjusting for age and gender are showed in Table 2. Firstly, we examine whether two independent or correlated available confounders satisfy the non-linear condition on the exposure. Results confirmed the interaction between age and gender on BMI (P = 2.82 × 10 −9 ). Traditional regression approach implied the significant associations between SBP, DBP, FBG, TG, TC, HDL, LDL and BMI, respectively. BMI had obviously positive correlation with SBP, DBP, FBG, TG, TC and LDL while obviously negative correlation with HDL. However, our proposed method revealed the causal effect of BMI on indicators including SBP, DBP, TG, TC, LDL. In addition, the causal effects of BMI on HDL and FBG were not significant. SBP increased 1.60 (95% CI: 0.99-2.93) mmol/L with per 1-kg/m 2 higher BMI and DBP increased 0.37 (95% CI: 0.03-0.76) mmol/L with per 1-kg/m 2 higher BMI. Moreover, 1-kg/m 2 increase in BMI had potential causality with a 1.61-SD increase in TC (β, 1.61; 95% CI: 0.96-2.97), a 1.66-SD increase in TG (β, 1.66; 95% CI: 0.91-55.30) and a 2.01-SD increase in LDL (β, 2.01; 95% CI: 1.09-4.31). However, BMI had Fig. 4 The Simulation A2 result. Results shows the estimated biases, SE and MSE from the 3 models for varied effects of a the correlation between X and C 1 , b the correlation between C 1 and C 2

Discussion
In this paper, we present a simple and intuitive method to identify and correct for confounding bias in observation studies using a confounder-exposure nonlinear condition. In cases where the independence assumption between observed confounder and unobserved confounder is violated, a sensible approach shows the almost unbiased causal effect estimation if P(C 1 = 1) = P(C 2 = 1) = 0.5. In this sense, our proposed method can be viewed very much as a tool for identifying causal effect in epidemiology.
To identify and estimate causal effect with unobserved confounders, different approaches require different untestable assumptions, such as DID needs an untestable common trend assumption. Fortunately, the approach proposed in this paper requires the nonlinearity assumption of at least two observed confounders (C 1 , C 2 , ⋯, C l ) and treatment variable X, as well as the conditional expectation of unobserved confounders given observed confounders U is 0. Furthermore, the observed data contain (C 1 , C 2 , ⋯, C l ) and X which can be utilized to test our nonlinearity assumption. This is exclusive to our method compared with other methods.
For binary variable, the parameter estimates from the LPM can be directly interpreted as the effect of the exposure on the prevalence rate of the outcome which is consistent with the ACE. Conversely, logistic regression or log-linear regression are not applicable in causal inference, since they do not provide a direct interpretation of bivariate associations. However, error terms of LPM with ordinary least square estimation are heteroskedastic and predicted probability can be above 1 or below 0. Therefore, we adopt GMM to deal with these problems. Additionally, our proposed method has no assumption on the independence among observed confounders. This can be widely used in epidemiology to obtain the causal inference of the exposure on the outcome.
Another advantage is the accessibility of two observed confounders satisfying nonlinear condition. Further, we find out that using GMM still fairly reliable in our proposed method if the number of observed confounders is over two.
A number of factors must be considered before implementing the method. First, the independence assumption between observed and unobserved confounders is essential for causal estimate correction. One strategy to overcome this limitation is to select other observed confounders with the probabilities of binary values 0.5. Second, interaction between observed confounders and unobserved confounders on exposure is sufficiently strong. Finally, our method cannot identify the reverse causation. Requiring a priori knowledge or using the idea of Cross-lagged Panel Analysis may avoid this problem.