Estimation of average treatment effect based on a multi-index propensity score

Background Estimating the average effect of a treatment, exposure, or intervention on health outcomes is a primary aim of many medical studies. However, unbalanced covariates between groups can lead to confounding bias when using observational data to estimate the average treatment effect (ATE). In this study, we proposed an estimator to correct confounding bias and provide multiple protection for estimation consistency. Methods With reference to the kernel function-based double-index propensity score (Ker.DiPS) estimator, we proposed the artificial neural network-based multi-index propensity score (ANN.MiPS) estimator. The ANN.MiPS estimator employed the artificial neural network to estimate the MiPS that combines the information from multiple candidate models for propensity score and outcome regression. A Monte Carlo simulation study was designed to evaluate the performance of the proposed ANN.MiPS estimator. Furthermore, we applied our estimator to real data to discuss its practicability. Results The simulation study showed the bias of the ANN.MiPS estimators is very small and the standard error is similar if any one of the candidate models is correctly specified under all evaluated sample sizes, treatment rates, and covariate types. Compared to the kernel function-based estimator, the ANN.MiPS estimator usually yields smaller standard error when the correct model is incorporated in the estimator. The empirical study indicated the point estimation for ATE and its bootstrap standard error of the ANN.MiPS estimator is stable under different model specifications. Conclusions The proposed estimator extended the combination of information from two models to multiple models and achieved multiply robust estimation for ATE. Extra efficiency was gained by our estimator compared to the kernel-based estimator. The proposed estimator provided a novel approach for estimating the causal effects in observational studies. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-022-01822-3.

noncompliance. Therefore, observational studies are increasingly used for estimating ATE. However, the baseline covariates are commonly unbalanced between treated and controlled groups in observational studies, and simply comparing mean outcomes may induce confounding bias [2].
Inverse probability weighting (IPW) under potential outcome framework is a popular approach for correcting confounding bias [3][4][5]. The IPW approach specifies a propensity score (PS) model to estimate subjects' PS and uses the inverse of PS to balance baseline covariates between groups [6,7]. For binary treatment, the mostly used PS model is the logistic regression. Some machine learning models, such as decision tree [8] and artificial neural network [9][10][11][12] are also used to estimate the PS. Another widely used approach is outcome regression (OR) [13]. The OR approach specifies an OR model, such as generalized linear model [14] to model the outcome as a function of the treatment and covariates to correct confounding bias directly. Some machine learning models, such as random forest [15] and artificial neural network [16] are also used as the OR model. Both IPW and OR approaches yield consistent estimation only if the corresponding model is correctly specified, but neither can be verified by the data alone.
Doubly robust approach, combining the models of PS and OR, can yield consistent estimation when any one of these two models is correctly specified (not necessarily both). Recently, a variety of doubly robust estimators for ATE have been proposed, such as augmented estimating equations estimator [17] and target maximum likelihood estimator [18]. The kernel function-based double-index propensity score (Ker.DiPS) estimator proposed by Cheng et al. [19] is one of the weighting-based doubly robust estimators. They used the Nadaraya-Watsontype kernel function to combine the information from one PS model and one OR model to obtain an integrated PS, which they named as double-index propensity score (DiPS). Using IPW approach based on the DiPS, the Ker. DiPS estimator achieved doubly robust estimation for ATE. However, the integrated PS estimated by Nadaraya-Watson-type kernel may be out of range between 0 to 1. The unreasonable PS violates the causal inference assumption and may yield uncertain estimation. Moreover, the Ker.DiPS estimator allows only two opportunities for estimation consistency.
To provide more protection on estimation consistency, we would like to develop an estimator allowing specifying multiple candidate models and can achieve estimation consistency when any one model is correctly specified. Such type of estimator is defined as multiply robust estimator [20,21]. When combining the information from multiple candidate models to obtain the multi-index propensity score (MiPS), the Nadaraya-Watson-type kernel function may yield unstable estimation as it suffers from the "curse of dimensionality" [22][23][24]. With the development of scalable computing and optimization techniques [25,26], the use of machine learning, such as artificial neural network (ANN) has been one of the most promising approaches in connection with applications related to approximation and estimation of multivariate functions [27,28]. The ANN has the potential of overcoming the curse of dimensionality [29,30] and has been used as a universal approximators for various functional representations [31][32][33]. Therefore, we replaced the kernel function with ANN to conduct nonparametric regression to estimate the MiPS. We aim to achieve multiply robust estimation for ATE using the ANN-based MiPS.
The rest of the article is organized as follows. In the Notations and assumptions section, we introduce necessary notations and causal inference assumptions. In the Some existing approaches section, we introduce some existing estimators that leads to the development of our estimator. In the Proposed multi-index propensity score section, we describe the origin and construction of the proposed estimator in detail. In the Simulation studies section, we perform simulations to evaluate the performance of the proposed estimator. A real data analysis was conducted in the Application to NHEFS data section. We make further discussion in the Discussion section and conclude the paper in the Conclusions section.

Notations and assumptions
. . , n be the observed data for i th subject from independent and identically distributed copies of Z = Y , A, X ⊤ ⊤ , where Y is the outcome, A is the binary indicator of treatment ( A = 1 if treated and A = 0 if controlled), and X is the p-dimensional vector of pretreatment covariates. Let Y 1 and Y 0 represent the potential outcomes if a subject was assigned to treated or controlled group, respectively. The formula for average treatment effect (ATE) is Under causal inference framework, the identifiability assumptions are usually assumed, that is [6], with probability 1; Assumption 2. Ignorability: (Y 1 , Y 0 ) ⫫ A | X, ⫫ denotes statistical independence; Assumption 3. Positivity: 0 < π(X) < 1 , where π (X) = P(A = 1|X) denotes the propensity score.

Some existing approaches
The IPW estimator is usually used for correcting confounding bias. The propensity score (PS) π (X) = P(A = 1|X) can be modeled as π (X; α) = g π α 0 + α T 1 X , where g π (·) is a specified link function, for example, the inverse of the logit function for the logistic regression, and α = α 0 , α T 1 T are the unknown parameters and can be estimated from maximum likelihood estimation. Under causal inference assumptions, the ATE can be estimated by the IPW estimator where α is the estimated value of α . If π (X; α) is correctly specified, IPW is a consistent estimator of .
The OR estimator is another commonly used approach for correcting confounding bias. Let is a specified link function, for example, the identity function for the linear regression, β = β 0 , β ⊤ 1 , β 2 ⊤ are the unknown parameters and can be estimated from maximum likelihood estimation. Interactions between A and X in OR model can also be accommodated by estimating the OR separately by treated and controlled groups [19]. Under causal inference assumptions, the ATE also can be estimated by the OR estimator where β is the estimated value of β . If µ(X, A; β) is correctly specified, OR is a consistent estimator of . If the PS model for IPW estimator or the OR model for OR estimator is incorrectly specified, the estimation consistency of IPW or OR with can not be guaranteed. To provide protection against model misspecification, Cheng et al. [19] considered integrating the information of PS π (X; α) and OR µ a (X; β) to construct double-index propensity score (DiPS), which is denoted by π X; α 1 , β 1 = E A|α T 1 X, β T 1 X . In order to estimate this conditional expectation, Cheng et al. [19] firstly got the estimated value α 1 of PS model and the estimated value β 1 of OR model, then used the Nadaraya-Watson kernel estimator [34] to conduct nonparametric regression of A on α T 1 X and β T 1 X , to get the estimated value of DiPS as (1) is a kernel function with a bandwidth H of 2 × 2 matrix. Using the estimated DiPS π X; α 1 , β 1 , the ATE can be estimated by Cheng et al. [19] demonstrated that DiPS is a doubly robust estimator: it is consistent when π(X; α) is correctly specified, or µ A (X; β) is correctly specified, but not necessarily both.

Proposed multi-index propensity score
Although DiPS in (3) can achieve doubly robust estimation for ATE, the DiPS estimated by the Nadaraya-Watson kernel estimator in (2), which may make the estimated probability outside the range of 0 to1, then the above Assumption 3 is violated. Furthermore, DiPS in (3) only allows a single model for PS and a single model for OR, the estimation consistency cannot be guaranteed when both models are incorrect. To provide more protection on estimation consistency, we would like to develop an approach that allows multiple candidate models for PS and/or OR, to achieve multiple robustness: the estimator is consistent when any model for PS or any model for OR is correctly specified.
Specifically, we consider multiple candidate models for PS  However, if there are no additional assumptions about the regression structure, the performance of Nadaraya-Watson kernel estimator in (5) degrades as the number of regressors increases. This degradation in performance is often referred to as the "curse of dimensionality" [22][23][24]. Our following simulation results also show that Ker MiPS has obvious bias when multiple candidate models are included in π Ker X; α 1 1 , ..., α K 1 , β 1 1 , ..., β L 1 , even if the correct PS and/or OR model is covered.
With the development of scalable computing and optimization techniques [25,26], the use of machine learning has been one of the most promising approaches in connection with applications related to approximation and estimation of multivariate functions [27,28]. Artificial neural network (ANN) is one of machine learning approaches. Benefiting from its flexible structure, the ANN becomes a universal approximator of a variety of functions [31][32][33]. The ANN comprises an input layer, a researcher-specified number of hidden layer(s), and an output layer. The hidden layer(s) and output layer consist of a number of neurons (also specified by researchers) with activation functions [35]. The operation of ANN includes following steps: 1) Information is input from the input layer, which passes it to the hidden layer; 2) In the hidden layer(s), the information is multiplied by the weight and a bias is added, and then passed to the next (5) layer after transforming by the activation function; 3) The information is passed layer by layer until the last layer, where it is multiplied by the weight and then transformed by the activation function to provide the output; and 4) Calculate the error between the output and the actual value, and minimize the error by optimizing the weight parameters and bias parameters through the backpropagation algorithm [36]. In addition to having the potential of overcoming the "curse of dimensionality" [29,30], the ANN is capable of automatically capturing complex relationships between variables [27]. It may be suited for modeling the relationship between treatment and multiindex because interactions commonly exist between indexes due to shared covariates in candidate PS and/ or OR models. Therefore, we replaced the kernel function by ANN and proposed our ANN-based MiPS (ANN. MiPS) estimator. Now we propose the ANN-based MiPS. We firstly get the estimated values α 1 1 ,…, α K 1 of multiple PS models and the estimated values β MiPS : its bias is ignorable when any model for PS or any model for OR is correctly specified.
We implemented the ANN that contains 2 hidden layers with 4 neurons in each hidden layer using AMORE package [37] for ANN.MiPS estimator. Therefore, the total number of parameters to be estimated in the ANN is 4 * (K + L) + 32 , including 4 * (K + L) + 24 weight parameters and 8 bias parameters. The learning rate is set as 0.001 [10,12]. The momentum is set as 0.5, the default value in the AMORE package. The hyperbolic tangent function was specified as the activation function for hidden layer. The sigmoid function was specified as the activation function for output layer to ensure the estimated ANN-based MiPS is between 0 to 1 [38]. To examine the performance stability of the estimator, we performed a sensitivity analysis using different hyperparameter selections. The simulations, real data analysis, and all statistical tests were conducted using R software (Version 4.1.0) [39]. A zip file of AMORE package and an example code for implementing the ANN.MiPS approach can be found in the attachment.

Simulation studies
We conducted simulation studies to evaluate the performance of (i) single model-based estimators: IPW estimator in (1) and OR estimator in (2); (ii) doubly robust estimators: augmented inverse probability weighting (AIPW) [17] and target maximum likelihood estimator (TMLE) [18], which allows a single model for PS and a single model for OR; (iii) multiple models-based estimators: kernel-based estimator in (6) and ANN-based estimator in (7), which allows multiple candidate models for PS and/or OR. Ten covariates X 1 − X 10 were generated from standard normal distribution, and the correlation between them are shown in Fig. 1. The binary treatment indicator A was generated from a Bernoulli distribution according to the following propensity score α 0 was set to be 0 or -1.1 to make approximately 50% or 25% subjects entering the treatment group. The continuous outcome Y was generated from where ε follows the standard normal distribution. The In the estimation, two estimation models were specified for propensity score, and two estimation models were specified for outcome regression. According to the data-generating mechanism, π 1 X; α 1 and µ A 1 X; β 1 were correct PS and correct OR models, whereas π 2 X; α 2 and µ A 2 X; β 2 were incorrect PS and incorrect OR models, due to the mis-specified functional forms of covariates.
To distinguish these estimation methods, each estimator is denoted as "method-0000". Each of the four numbers, from left to right, represents if π 1 X; where "1" indicates yes and "0" indicates no.
We investigated sample sizes of n = 300 and n = 1000 with 1000 replications in all settings. Tables 1 and 2 show the estimation results of all estimators, along with five evaluation measures including percentage of bias (BIAS, in percentage), root mean square error (RMSE), Monte Carlo standard error (MC-SE), bootstrapping standard error (BS-SE) based on 100 resamples, and coverage rate of 95% Wald confidence interval (CI-Cov). Our bootstrapping procedure resamples from the original sample set with replacement until the bootstrapping sample size reaches the original sample size. Fig. S1 shows the Fig. 1 The simulation data structure in our simulation studies   To evaluate the performance of the MiPS estimator when the number of specified models increases, we have considered three additional estimators: MiPS-1111-2PS, adding two additional incorrect PS models on the basis of the MiPS-1111; MiPS-1111-2PS-2OR, adding two additional incorrect PS models π 3 X; α 3 and π 4 X; α 4 and two additional incorrect OR models µ 3 A X; β 3 and µ 4 A X; β 4 on the basis of the MiPS-1111. Table 3 shows the estimation results. The following conclusions can be obtained. MiPS-1111-2OR, and Ker-1111-2PS2OR estimators is close to or larger than 10%. The MC-SE of these estimators is obviously larger than that of Ker.MiPS-1010 estimator.
We also evaluated the performance of ANN.MiPS estimator under the simulation scenario with both continuous and discrete covariates. The simulation setting was described in Supplementary Document. Similar conclusions can be obtained as the above scenario with all continuous covariates (Table S1, S2). The sensitivity analysis of hyperparameters selection in ANN revealed the performance stability of ANN.MiPS estimator (Table S3).

Application to NHEFS data
To illustrate our proposed method, we analyzed a subset of real data from the National Health and Nutrition The estimator which contains correct and/or incorrect models for propensity score and/or outcome regression is denoted as "method-0000", where each digit of the four numbers, from left to right, indicates if π 1 X; α 1 , π 2 X; α 2 ,µ A 1 X; β 1 orµ A 2 X; β 2 is included in the estimator ("1" indicates yes and "0" indicates no)  Examination Survey Data | Epidemiologic Follow-up Study (NHEFS) (wwwn.cdc.gov/nchs/nhanes/nhefs/). The dataset consists of 1,507 participants aged 25-74 who smoked at the first survey and were followed for approximately 10 years. The empirical study aimed to estimate the ATE of smoking cessation (coded as quitting and non-quitting, with non-quitting as the reference group) on weight gain. Participants were categorized as treated if they quit smoking during follow-up, otherwise controlled. Weight gain for each individual was measured as weight at the end of follow-up minus weight at baseline survey (in kilograms). During the 10-year followup, 379 (25.15%) participants quit smoking. The average weight gain was greater for those who quit smoking with an unadjusted difference of 2.4 kg. Table 4 summarized the baseline characteristics, including age, gender, race, baseline weight, active life level, education level, exercise, smoking intensity, smoking years, and ever use of weight loss medication between the smoking quitters and non-quitters. As shown in the Table 3 Estimation results for multi-index propensity score estimator incorporating extra incorrect models based on 1000 replications   The estimator which contains correct and/or incorrect models for propensity score and/or outcome regression is denoted as "method-0000", where each digit of the four numbers, from left to right, indicates ifπ 1 X; α 1 , π 2 X; α 2 ,µ A 1 X; β 1 orµ A 2 X; β 2 is included in the estimator ("1" indicates yes and "0" indicates no) table, the distribution of age, gender, race, education level, smoking intensity, and smoking years was different between quitters and non-quitters. When estimating the ATE of smoking cessation on weight gain, these factors should be adjusted for if they are confounders.
To identify candidate models for ANN.MiPS estimator, we explored the association of smoking cessation with all potential risk factors by logistic regression, and explored the association of weight gain with all potential risk factors by linear regression. The covariates in model 1 and model 2 for both PS and OR models were identified at significant levels of 0.05 and 0.1, respectively. The covariates in PS model 1 and model 2 were (i) age, gender, race, smoking intensity, and smoking years; (ii) age, gender, race, smoking intensity, smoking years, education level, and exercise situation. The covariates in OR model 1 and model 2 were (i) age, weight at baseline, smoking intensity, education level, and active life level; (ii) age, weight at baseline, smoking intensity, education level, active life level, and family income level. We applied the single model-based IPW estimator, single model-based OR estimator, and our proposed ANN.MiPS estimator to estimate the ATE. The four numbers in the ANN.MiPS estimator, from left to right, represents if PS model 1, PS model 2, OR model 1, or OR model 2 is included in the estimator, where "1" indicates yes and "0" indicates no. For example, "ANN.MiPS-1010" represents that the PS model 1 and OR model 1 are included in the estimator. The standard error of estimation was estimated based on 500 resampled bootstrapping.
The estimation results in Table 5 indicated that all estimators suggested quitting smoking significantly increased participants' weight gain. Most of the estimated adjusted effects based on these estimators were greater than the estimated unadjusted effects of 2.4, which seems more precise and reliable. The point estimation and its bootstrap standard error for ATE of the ANN.MiPS estimator was stable under different model specifications.

Discussion
In this paper, we considered causal inference in observational studies where effects estimation was susceptible to confounding bias due to imbalanced covariates between groups. With reference to the Ker.DiPS estimator [19], we proposed the ANN.MiPS estimator to provide more Table 4 The NHEFS data analysis: baseline characteristics between non-quitters and quitters The continuous variable is presented as mean (standard deviance) and the difference between non-quitters and quitters is compared by t-test. The categorical variable is presented as counts (percentage) and the difference between non-quitters and quitters is compared by Chi-square test chances for correcting the confounding bias. We evaluated the performance of our estimator under simulation scenarios with small ( n = 300 ) or large ( n = 1000 ) sample size, with treatment rate of 25% or 50%, and with covariates consisting of all continuous type or both continuous and discrete types. The results indicated the multiple robustness property of our estimator: the estimation bias is small if any model for PS or any model for OR is correctly specified. In addition to achieving multiply robust estimation for ATE, the proposed estimator showed a higher estimation efficiency than the kernelbased estimator when any model for PS or OR is correctly specified, especially when only the OR model is correctly specified. One limitation of our approach is that the multiple candidate models for PS { k ; k = g , in which we need to conduct nonparametric regression of A on α 1T 1 X,…, α K T 1 X and ̂ 1T 1 ,…, β LT 1 X . Therefore, the nonparametric models, such as the kernel function, ANN, and random forest are not suitable as candidate models for the MiPS estimator because the coefficients of covariates cannot be obtained. When the candidate models are constructed by nonparametric models, some other multiply robust approaches may be adopted to integrate the information from multiple candidate models, such as the regression-based estimator under least square's framework [40], the estimator based on empirical likelihood weighting [20], and the estimator based on model mixture procedures [41]. At this point, double/debiased machine learning approach may be extended to multiple/debiased machine learning for obtaining valid inference about ATE [42].
Although the performance of ANN.MiPS estimator remains stable when specifying eight candidate models, an excessive number of models can impose a heavy computational burden. Therefore, we recommend carefully constructing a comprehensive set of reasonable but less similar candidate models to control the model number in practical applications, using both subject knowledge and reliable datadriven tools, such as causality diagrams [43], variable selection techniques [44], and covariate balancing diagnostics [45].
Finally, we give some intuitive discussions about the theoretical properties of the proposed estimator. Referring to proof Chen et al. [19], ANN MiPS is consistent for where ̂ 1 1 , ...,̂   Table 5 The NHEFS data analysis: estimated average treatment effect of quitting smoking on weight gain (not quitting smoking as reference) BS-SE bootstrapping standard error based on 500 resamples, 95%-CI 95% Wald confidence interval. The artificial neural network-based MiPS estimator which contains propensity score model and/or outcome regression model is denoted as "method-0000", where each digit of the four numbers, from left to right, indicates if propensity score model 1, propensity score model 2, outcome regression model 1, outcome regression model 2 is included in the estimator ("1" indicates yes and "0" indicates no)