
Estimation of average treatment effect based on a multi-index propensity score

Abstract

Background

Estimating the average effect of a treatment, exposure, or intervention on health outcomes is a primary aim of many medical studies. However, unbalanced covariates between groups can lead to confounding bias when using observational data to estimate the average treatment effect (ATE). In this study, we propose an estimator that corrects confounding bias and provides multiple protections for estimation consistency.

Methods

With reference to the kernel function-based double-index propensity score (Ker.DiPS) estimator, we propose the artificial neural network-based multi-index propensity score (ANN.MiPS) estimator. The ANN.MiPS estimator employs an artificial neural network to estimate the MiPS, which combines the information from multiple candidate models for the propensity score and the outcome regression. A Monte Carlo simulation study was designed to evaluate the performance of the proposed ANN.MiPS estimator. Furthermore, we applied the estimator to real data to illustrate its practical use.

Results

The simulation study showed that the ANN.MiPS estimator has very small bias and similar standard errors whenever any one of the candidate models is correctly specified, across all evaluated sample sizes, treatment rates, and covariate types. Compared to the kernel function-based estimator, the ANN.MiPS estimator usually yields a smaller standard error when the correct model is incorporated. The empirical study indicated that the point estimate of the ATE and its bootstrap standard error from the ANN.MiPS estimator are stable under different model specifications.

Conclusions

The proposed estimator extends the combination of information from two models to multiple models and achieves multiply robust estimation of the ATE. It also gains extra efficiency compared with the kernel-based estimator. The proposed estimator provides a novel approach for estimating causal effects in observational studies.


Background

Estimating the average treatment effect (ATE) is essential for assessing the causal effects of treatments or interventions in biometrics, epidemiology, econometrics, and sociology. The ATE can be estimated by directly comparing mean outcomes between the treated and control groups in randomized controlled trials [1]. However, randomized controlled trials are often difficult to implement because of budget restrictions, ethics, and subjects’ noncompliance. Therefore, observational studies are increasingly used for estimating the ATE. However, baseline covariates are commonly unbalanced between the treated and control groups in observational studies, and simply comparing mean outcomes may induce confounding bias [2].

Inverse probability weighting (IPW) under the potential outcome framework is a popular approach for correcting confounding bias [3,4,5]. The IPW approach specifies a propensity score (PS) model to estimate subjects’ PS and uses the inverse of the PS to balance baseline covariates between groups [6, 7]. For binary treatment, the most commonly used PS model is logistic regression. Machine learning models, such as decision trees [8] and artificial neural networks [9,10,11,12], are also used to estimate the PS. Another widely used approach is outcome regression (OR) [13]. The OR approach specifies an OR model, such as a generalized linear model [14], that expresses the outcome as a function of the treatment and covariates to correct confounding bias directly. Machine learning models, such as random forests [15] and artificial neural networks [16], are also used as OR models. Both the IPW and OR approaches yield consistent estimation only if the corresponding model is correctly specified, and neither specification can be verified from the data alone.

The doubly robust approach, which combines a PS model and an OR model, yields consistent estimation when either of the two models is correctly specified (not necessarily both). A variety of doubly robust estimators for the ATE have recently been proposed, such as the augmented estimating equations estimator [17] and the targeted maximum likelihood estimator [18]. The kernel function-based double-index propensity score (Ker.DiPS) estimator proposed by Cheng et al. [19] is one of the weighting-based doubly robust estimators. They used a Nadaraya-Watson-type kernel function to combine the information from one PS model and one OR model into an integrated PS, which they named the double-index propensity score (DiPS). Using the IPW approach based on the DiPS, the Ker.DiPS estimator achieves doubly robust estimation of the ATE. However, the integrated PS estimated by the Nadaraya-Watson-type kernel may fall outside the range of 0 to 1. Such an invalid PS violates the causal inference assumptions and may yield unreliable estimates. Moreover, the Ker.DiPS estimator offers only two opportunities for estimation consistency.

To provide more protection for estimation consistency, we aimed to develop an estimator that allows multiple candidate models to be specified and achieves estimation consistency when any one of them is correctly specified. Such an estimator is called multiply robust [20, 21]. When combining the information from multiple candidate models to obtain a multi-index propensity score (MiPS), the Nadaraya-Watson-type kernel function may yield unstable estimation because it suffers from the “curse of dimensionality” [22,23,24]. With the development of scalable computing and optimization techniques [25, 26], machine learning, such as the artificial neural network (ANN), has become one of the most promising approaches for the approximation and estimation of multivariate functions [27, 28]. The ANN has the potential to overcome the curse of dimensionality [29, 30] and has been used as a universal approximator for various functional representations [31,32,33]. Therefore, we replaced the kernel function with an ANN to conduct the nonparametric regression that estimates the MiPS, aiming to achieve multiply robust estimation of the ATE with the ANN-based MiPS.

The rest of the article is organized as follows. In the Notations and assumptions section, we introduce the necessary notation and causal inference assumptions. In the Some existing approaches section, we review the existing estimators that lead to the development of our estimator. In the Proposed multi-index propensity score section, we describe the origin and construction of the proposed estimator in detail. In the Simulation studies section, we perform simulations to evaluate the performance of the proposed estimator. A real-data analysis is presented in the Application to NHEFS data section. We provide further discussion in the Discussion section and conclude the paper in the Conclusions section.

Methods

Notations and assumptions

Let \({\mathbf{Z}}_{i}={\left({Y}_{i},{A}_{i},{\mathbf{X}}_{i}^{{\top }}\right)}^{{\top }}, i=1,\dots ,n\) be the observed data for the \({i}^{\mathrm{th}}\) subject, drawn as independent and identically distributed copies of \(\mathbf{Z}={\left(Y,A,{\mathbf{X}}^{{\top }}\right)}^{{\top }}\), where \(Y\) is the outcome, \(A\) is the binary indicator of treatment (\(A=1\) if treated and \(A=0\) if controlled), and \(\mathbf{X}\) is the \(p\)-dimensional vector of pretreatment covariates. Let \({Y}^{1}\) and \({Y}^{0}\) denote the potential outcomes if a subject were assigned to the treated or control group, respectively. The average treatment effect (ATE) is defined as

$$\Delta ={\mu }_{1}-{\mu }_{0}=E\left({Y}^{1}\right)-E\left({Y}^{0}\right).$$

Under causal inference framework, the identifiability assumptions are usually assumed, that is [6],

  • Assumption 1. Consistency: \(Y=A{Y}^{1}+(1-A){Y}^{0}\) with probability 1;

  • Assumption 2. Ignorability: \(\left({Y}^{1},{Y}^{0}\right)\perp A\mid \mathbf{X}\), where \(\perp\) denotes statistical independence;

  • Assumption 3. Positivity: \(0<\pi \left(\mathbf{X}\right)<1\), where \(\pi \left(\mathbf{X}\right)=P\left(A=1 \right| \mathbf{X})\) denotes the propensity score.

Some existing approaches

The IPW estimator is commonly used for correcting confounding bias. The propensity score (PS) \(\pi \left(\mathbf{X}\right)=P\left(A=1 \right| \mathbf{X})\) can be modeled as \(\pi \left(\mathbf{X};\boldsymbol{\alpha }\right)={g}_{\pi }\left({\alpha }_{0}+{\boldsymbol{\alpha }}_{1}^{\mathrm{T}}\mathbf{X}\right)\), where \({g}_{\pi }\left(\cdot \right)\) is a specified link function, for example, the inverse of the logit function for logistic regression, and \(\boldsymbol{\alpha }={\left({\alpha }_{0},{\boldsymbol{\alpha }}_{1}^{\mathrm{T}}\right)}^{\mathrm{T}}\) are unknown parameters that can be estimated by maximum likelihood. Under the causal inference assumptions, the ATE can be estimated by the IPW estimator

$$\begin{array}{c}{\widehat\Delta}_{IPW}=\left(\sum\limits_{i=1}^n\frac{A_i}{\pi\left({\mathbf X}_i;\widehat{\boldsymbol\alpha}\right)}\right)^{-1}\sum\limits_{i=1}^n\frac{A_i}{\pi\left({\mathbf X}_i;\widehat{\boldsymbol\alpha}\right)}Y_i-\\ \left(\sum\limits_{i=1}^n\frac{1-A_i}{1-\pi\left({\mathbf X}_i;\widehat{\boldsymbol\alpha}\right)}\right)^{-1}\sum\limits_{i=1}^n\frac{1-A_i}{1-\pi\left({\mathbf X}_i;\widehat{\boldsymbol\alpha}\right)}Y_i,\end{array}$$
(1)

where \(\widehat{\boldsymbol{\alpha }}\) is the estimated value of \(\boldsymbol{\alpha }\). If \(\pi \left(\mathbf{X};\boldsymbol{\alpha }\right)\) is correctly specified, \({\widehat{\Delta }}_{IPW}\) is a consistent estimator of \(\Delta\).
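
The Hájek-normalized IPW estimator in (1) can be sketched in a few lines. The following Python snippet is illustrative only (the function names and the toy data are our own assumptions, not from the paper): it fits a working logistic PS model by Newton-Raphson maximum likelihood and plugs the fitted scores into (1).

```python
import numpy as np

def fit_logistic(D, A, n_iter=25):
    """Logistic-regression MLE by Newton-Raphson; D must include an intercept column."""
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-D @ alpha))
        # Newton step: alpha += (D' W D)^{-1} D' (A - p), with W = diag(p(1-p))
        alpha += np.linalg.solve(D.T @ ((p * (1 - p))[:, None] * D), D.T @ (A - p))
    return alpha

def ipw_ate(Y, A, ps):
    """Hajek-normalized IPW estimate of the ATE, term by term as in Eq. (1)."""
    w1, w0 = A / ps, (1 - A) / (1 - ps)
    return (w1 @ Y) / w1.sum() - (w0 @ Y) / w0.sum()

# Toy confounded data: X1 drives both treatment and outcome; true ATE = 2
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
A = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1]))))
Y = 2.0 * A + X[:, 0] + rng.normal(size=n)

D = np.column_stack([np.ones(n), X])
ps_hat = 1.0 / (1.0 + np.exp(-D @ fit_logistic(D, A)))
ate_hat = ipw_ate(Y, A, ps_hat)   # should land near the true ATE of 2
```

The normalization by the summed weights (the Hájek form) matches (1) and tends to be more stable than dividing by \(n\) when some estimated propensities are small.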

The OR estimator is another commonly used approach for correcting confounding bias. Let \({\mu }_{A}\left(\mathbf{X}\right)=E\left(Y \right| \mathbf{X},A)\) denote the outcome regression (OR), where \(A\in \{\mathrm{0,1}\}\). It can be modeled as \({\mu }_{A}\left(\mathbf{X};{\varvec{\beta}}\right)={g}_{\mu }\left({\beta }_{0}+{{\varvec{\beta}}}_{1}^{T}\mathbf{X}+{\beta }_{2}A\right)\), where \({g}_{\mu }(\cdot )\) is a specified link function, for example, the identity function for linear regression, and \({\varvec{\beta}}={\left({\beta }_{0},{{\varvec{\beta}}}_{1}^{{\top }},{\beta }_{2}\right)}^{{\top }}\) are unknown parameters that can be estimated by maximum likelihood. Interactions between \(A\) and \(\mathbf{X}\) in the OR model can also be accommodated by estimating the OR separately in the treated and control groups [19]. Under the causal inference assumptions, the ATE can also be estimated by the OR estimator

$${\widehat{\Delta }}_{OR}=\frac{1}{n}\sum_{i=1}^{n} {\mu }_{1}\left({\mathbf{X}}_{i};\widehat{{\varvec{\beta}}}\right)-\frac{1}{n}\sum_{i=1}^{n} {\mu }_{0}\left({\mathbf{X}}_{i};\widehat{{\varvec{\beta}}}\right),$$
(2)

where \(\widehat{{\varvec{\beta}}}\) is the estimated value of \({\varvec{\beta}}\). If \({\mu }_{A}\left(\mathbf{X};{\varvec{\beta}}\right)\) is correctly specified, \({\widehat{\Delta }}_{OR}\) is a consistent estimator of \(\Delta\).
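
Equation (2) is equally direct to sketch: fit the linear OR model by ordinary least squares, then average the predicted contrast between \(A=1\) and \(A=0\). Again a hedged Python illustration with invented names and toy data:

```python
import numpy as np

def or_ate(Y, A, X):
    """Outcome-regression estimate of the ATE, as in Eq. (2): fit
    mu(X, A) = b0 + b1'X + b2*A by OLS, then average mu(X,1) - mu(X,0)."""
    n = len(A)
    D = np.column_stack([np.ones(n), X, A])
    beta, *_ = np.linalg.lstsq(D, Y, rcond=None)
    D1 = np.column_stack([np.ones(n), X, np.ones(n)])   # everyone set to treated
    D0 = np.column_stack([np.ones(n), X, np.zeros(n)])  # everyone set to control
    return float(np.mean(D1 @ beta - D0 @ beta))

# Same confounded toy setup as above: true ATE = 2
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
A = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1]))))
Y = 2.0 * A + X[:, 0] + rng.normal(size=n)
```

For this purely linear model the averaged contrast equals the OLS coefficient on \(A\); the averaging form in (2) matters once the OR model contains interactions or a nonlinear link.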

If the PS model for the IPW estimator or the OR model for the OR estimator is incorrectly specified, the consistency of \({\widehat{\Delta }}_{IPW}\) or \({\widehat{\Delta }}_{OR}\) for \(\Delta\) cannot be guaranteed. To provide protection against model misspecification, Cheng et al. [19] considered integrating the information of the PS \(\pi \left(\mathbf{X};\boldsymbol{\alpha }\right)\) and the OR \({\mu }_{A}\left(\mathbf{X};{\varvec{\beta}}\right)\) to construct the double-index propensity score (DiPS), denoted by \(\pi \left(\mathbf{X};{\boldsymbol{\alpha }}_{1},{{\varvec{\beta}}}_{1}\right)=E\left[A | {\boldsymbol{\alpha }}_{1}^{\mathrm{T}}\mathbf{X},{{\varvec{\beta}}}_{1}^{\mathrm{T}}\mathbf{X}\right]\). To estimate this conditional expectation, Cheng et al. [19] first obtained the estimate \({\widehat{\boldsymbol{\alpha }}}_{1}\) from the PS model and the estimate \({\widehat{{\varvec{\beta}}}}_{1}\) from the OR model, and then used the Nadaraya-Watson kernel estimator [34] to conduct a nonparametric regression of \(A\) on \({\widehat{\boldsymbol{\alpha }}}_{1}^{\mathrm{T}}\mathbf{X}\) and \({\widehat{{\varvec{\beta}}}}_{1}^{\mathrm{T}}\mathbf{X}\), yielding the estimated DiPS

$$\widehat{\pi }\left(\mathbf{X};{\widehat{\boldsymbol{\alpha }}}_{1},{\widehat{{\varvec{\beta}}}}_{1}\right)=\frac{\sum_{j=1}^{n} {\mathcal{K}}_{\mathbf{H}}\left\{\left({\widehat{\mathbf{S}}}_{j}-\widehat{\mathbf{S}}\right)\right\}{A}_{j}}{\sum_{j=1}^{n} {\mathcal{K}}_{\mathbf{H}}\left\{\left({\widehat{\mathbf{S}}}_{j}-\widehat{\mathbf{S}}\right)\right\}}$$
(3)

where \({\widehat{\mathbf{S}}}_{j}=\left({\widehat{\boldsymbol{\alpha }}}_{1}^{\mathrm{T}}{\mathbf{X}}_{j},{\widehat{{\varvec{\beta}}}}_{1}^{\mathrm{T}}{\mathbf{X}}_{j}\right)\) and \(\widehat{\mathbf{S}}=\left({\widehat{\boldsymbol{\alpha }}}_{1}^{\mathrm{T}}\mathbf{X},{\widehat{{\varvec{\beta}}}}_{1}^{\mathrm{T}}\mathbf{X}\right)\) are bivariate regressors, together called the double index. \({\mathcal{K}}_{\mathbf{H}}\left(\cdot \right)\) is a kernel function with a \(2\times 2\) bandwidth matrix \(\mathbf{H}\). Using the estimated DiPS \(\widehat{\pi }\left(\mathbf{X};{\widehat{\boldsymbol{\alpha }}}_{1},{\widehat{{\varvec{\beta}}}}_{1}\right)\), the ATE can be estimated by

$$\begin{array}{c}{\widehat\Delta}_{DiPS}=\left(\sum\limits_{i=1}^n\frac{A_i}{\widehat\pi\left({\mathbf X}_i;{\widehat{\boldsymbol\alpha}}_1,{\widehat{\beta}}_1\right)}\right)^{-1}\sum\limits_{i=1}^n\frac{A_i}{\widehat\pi\left({\mathbf X}_i;{\widehat{\boldsymbol\alpha}}_1,{\widehat{\beta}}_1\right)}Y_i-\\ \left(\sum\limits_{i=1}^n\frac{1-A_i}{1-\widehat\pi\left({\mathbf X}_i;{\widehat{\boldsymbol\alpha}}_1,{\widehat{\beta}}_1\right)}\right)^{-1}\sum\limits_{i=1}^n\frac{1-A_i}{1-\widehat\pi\left({\mathbf X}_i;{\widehat{\boldsymbol\alpha}}_1,{\widehat{\beta}}_1\right)}Y_i.\end{array}$$
(4)

Cheng et al. [19] demonstrated that \({\widehat{\Delta }}_{DiPS}\) is a doubly robust estimator: it is consistent when \(\pi \left(\mathbf{X};\boldsymbol{\alpha }\right)\) is correctly specified, or \({\mu }_{A}\left(\mathbf{X};{\varvec{\beta}}\right)\) is correctly specified, but not necessarily both.
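
The core of the Ker.DiPS construction in (3) is a bivariate Nadaraya-Watson regression of \(A\) on the two fitted indices. A compact Python sketch follows; a product Gaussian kernel with a common bandwidth is one common choice, and the index coefficients below are illustrative stand-ins for \({\widehat{\boldsymbol{\alpha }}}_{1}\) and \({\widehat{{\varvec{\beta}}}}_{1}\), not values from the paper.

```python
import numpy as np

def nw_regression(S, A, h=0.5):
    """Nadaraya-Watson regression of A on the index matrix S (n x d),
    using a product Gaussian kernel with common bandwidth h, as in Eq. (3)."""
    U = (S[:, None, :] - S[None, :, :]) / h      # pairwise scaled index differences
    K = np.exp(-0.5 * (U ** 2).sum(axis=2))      # product Gaussian kernel weights
    return (K @ A) / K.sum(axis=1)               # weighted average of the A_j

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 4))
ps_index = 0.8 * X[:, 0] - 0.5 * X[:, 1]         # stand-in for alpha_1' X
or_index = X[:, 2] + X[:, 3]                     # stand-in for beta_1' X
A = rng.binomial(1, 1.0 / (1.0 + np.exp(-ps_index)))
dips = nw_regression(np.column_stack([ps_index, or_index]), A)
```

Note that with a nonnegative Gaussian kernel the fitted value is a convex combination of the \(A_j\) and so stays inside \([0,1]\); the out-of-range behavior discussed above can arise with kernels that take negative values (e.g. higher-order kernels).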

Proposed multi-index propensity score

Although \({\widehat{\Delta }}_{DiPS}\) in (4) achieves doubly robust estimation of the ATE, the DiPS is estimated by the Nadaraya-Watson kernel estimator in (3), which may push the estimated probability outside the range of 0 to 1, violating Assumption 3 above. Furthermore, \({\widehat{\Delta }}_{DiPS}\) in (4) allows only a single model for the PS and a single model for the OR, so estimation consistency cannot be guaranteed when both models are incorrect. To provide more protection for estimation consistency, we develop an approach that allows multiple candidate models for the PS and/or OR and achieves multiple robustness: the estimator is consistent when any model for the PS or any model for the OR is correctly specified.

Specifically, we consider multiple candidate models for the PS \(\{{\pi }^{k}\left(\mathbf{X};{\boldsymbol{\alpha }}^{k}\right)={g}_{\pi }\left({\alpha }_{0}^{k}+{\boldsymbol{\alpha }}_{1}^{k\mathrm{T}}\mathbf{X}\right),k=1,\dots ,K\}\) and multiple candidate models for the OR \(\left\{{\mu }_{A}^{l}\left(\mathbf{X};{{\varvec{\beta}}}^{l}\right)={g}_{\mu }\left({\beta }_{0}^{l}+{{\varvec{\beta}}}_{1}^{l\mathrm{T}}\mathbf{X}+{\beta }_{2}^{l}A\right),l=1,\dots ,L\right\}\), possibly with different choices or functional forms of covariates. We then integrate the information from the multiple PS models and multiple OR models to construct the multi-index propensity score (MiPS), denoted by \(\pi \left(\mathbf{X};{\boldsymbol{\alpha }}_{1}^{1},...,{\boldsymbol{\alpha }}_{1}^{K},{{\varvec{\beta}}}_{1}^{1},...,{{\varvec{\beta}}}_{1}^{L}\right)=E\left[A | {\boldsymbol{\alpha }}_{1}^{1\mathrm{T}}\mathbf{X},...,{\boldsymbol{\alpha }}_{1}^{K\mathrm{T}}\mathbf{X},{{\varvec{\beta}}}_{1}^{1\mathrm{T}}\mathbf{X},...,{{\varvec{\beta}}}_{1}^{L\mathrm{T}}\mathbf{X}\right]\). To estimate this conditional expectation, we first obtain the estimates \({\widehat{\boldsymbol{\alpha }}}_{1}^{1}\),…, \({\widehat{\boldsymbol{\alpha }}}_{1}^{K}\) from the multiple PS models and \({\widehat{{\varvec{\beta}}}}_{1}^{1}\),…, \({\widehat{{\varvec{\beta}}}}_{1}^{L}\) from the multiple OR models; a naive idea is then to use the multivariate Nadaraya-Watson kernel estimator to conduct a nonparametric regression of \(A\) on \({\widehat{\boldsymbol{\alpha }}}_{1}^{1\mathrm{T}}\mathbf{X}\),…, \({\widehat{\boldsymbol{\alpha }}}_{1}^{K\mathrm{T}}\mathbf{X}\) and \({\widehat{{\varvec{\beta}}}}_{1}^{1\mathrm{T}}\mathbf{X}\),…, \({\widehat{{\varvec{\beta}}}}_{1}^{L\mathrm{T}}\mathbf{X}\), obtaining the estimated MiPS as

$${\widehat{\pi }}^{Ker}\left(\mathbf{X};{\widehat{\boldsymbol{\alpha }}}_{1}^{1},...,{\widehat{\boldsymbol{\alpha }}}_{1}^{K},{\widehat{{\varvec{\beta}}}}_{1}^{1},...,{\widehat{{\varvec{\beta}}}}_{1}^{L}\right)=\frac{\sum_{j=1}^{n} {\mathcal{K}}_{\mathbf{H}}\left\{\left({\widehat{\mathbf{S}}}_{j}-\widehat{\mathbf{S}}\right)\right\}{A}_{j}}{\sum_{j=1}^{n} {\mathcal{K}}_{\mathbf{H}}\left\{\left({\widehat{\mathbf{S}}}_{j}-\widehat{\mathbf{S}}\right)\right\}},$$
(5)

where \({\widehat{\mathbf{S}}}_{j}=\left({\widehat{\boldsymbol{\alpha }}}_{1}^{1\mathrm{T}}{\mathbf{X}}_{j},\dots , {\widehat{\boldsymbol{\alpha }}}_{1}^{K\mathrm{T}}{\mathbf{X}}_{j},{\widehat{{\varvec{\beta}}}}_{1}^{1\mathrm{T}}{\mathbf{X}}_{j},\dots , {\widehat{{\varvec{\beta}}}}_{1}^{L\mathrm{T}}{\mathbf{X}}_{j}\right)\) and \(\widehat{\mathbf{S}}=\left({\widehat{\boldsymbol{\alpha }}}_{1}^{1\mathrm{T}}\mathbf{X},\dots , {\widehat{\boldsymbol{\alpha }}}_{1}^{K\mathrm{T}}\mathbf{X},{\widehat{{\varvec{\beta}}}}_{1}^{1\mathrm{T}}\mathbf{X},\dots , {\widehat{{\varvec{\beta}}}}_{1}^{L\mathrm{T}}\mathbf{X}\right)\) are multivariate regressors, together called the multi-index. \({\mathcal{K}}_{\mathbf{H}}\left(\cdot \right)\) is a kernel function with a \(\left(K+L\right)\times \left(K+L\right)\) bandwidth matrix \(\mathbf{H}\). Using the estimated kernel-based MiPS \({\widehat{\pi }}^{Ker}\left(\mathbf{X};{\widehat{\boldsymbol{\alpha }}}_{1}^{1},...,{\widehat{\boldsymbol{\alpha }}}_{1}^{K},{\widehat{{\varvec{\beta}}}}_{1}^{1},...,{\widehat{{\varvec{\beta}}}}_{1}^{L}\right)\), the ATE can be estimated by

$$\begin{array}{c}\widehat\Delta_{MiPS}^{Ker}=\left(\sum\limits_{i=1}^n\frac{A_i}{\widehat\pi^{Ker}\left({\mathbf X}_i;\widehat{\boldsymbol\alpha}_1^1,...,\widehat{\boldsymbol\alpha}_1^K,\widehat{\beta}_1^1,...,\widehat{\beta}_1^L\right)}\right)^{-1}\sum\limits_{i=1}^n\frac{A_i}{\widehat\pi^{Ker}\left({\mathbf X}_i;\widehat{\boldsymbol\alpha}_1^1,...,\widehat{\boldsymbol\alpha}_1^K,\widehat{\beta}_1^1,...,\widehat{\beta}_1^L\right)}Y_i-\\ \left(\sum\limits_{i=1}^n\frac{1-A_i}{1-\widehat\pi^{Ker}\left({\mathbf X}_i;\widehat{\boldsymbol\alpha}_1^1,...,\widehat{\boldsymbol\alpha}_1^K,\widehat{\beta}_1^1,...,\widehat{\beta}_1^L\right)}\right)^{-1}\sum\limits_{i=1}^n\frac{1-A_i}{1-\widehat\pi^{Ker}\left({\mathbf X}_i;\widehat{\boldsymbol\alpha}_1^1,...,\widehat{\boldsymbol\alpha}_1^K,\widehat{\beta}_1^1,...,\widehat{\beta}_1^L\right)}Y_i.\end{array}$$
(6)

However, without additional assumptions on the regression structure, the performance of the Nadaraya-Watson kernel estimator in (5) degrades as the number of regressors increases; this degradation is often referred to as the “curse of dimensionality” [22,23,24]. Our simulation results below also show that \({\widehat{\Delta }}_{MiPS}^{Ker}\) has substantial bias when multiple candidate models are included in \({\widehat{\pi }}^{Ker}\left(\mathbf{X};{\widehat{\boldsymbol{\alpha }}}_{1}^{1},...,{\widehat{\boldsymbol{\alpha }}}_{1}^{K},{\widehat{{\varvec{\beta}}}}_{1}^{1},...,{\widehat{{\varvec{\beta}}}}_{1}^{L}\right)\), even if the correct PS and/or OR model is included.

With the development of scalable computing and optimization techniques [25, 26], machine learning has become one of the most promising approaches for the approximation and estimation of multivariate functions [27, 28]. The artificial neural network (ANN) is one such approach. Benefiting from its flexible structure, the ANN serves as a universal approximator for a wide variety of functions [31,32,33]. An ANN comprises an input layer, a researcher-specified number of hidden layers, and an output layer; the hidden and output layers consist of a number of neurons (also specified by the researcher) with activation functions [35]. The operation of an ANN includes the following steps: 1) information enters through the input layer, which passes it to the first hidden layer; 2) in each hidden layer, the information is multiplied by weights, a bias is added, and the result is transformed by the activation function before being passed to the next layer; 3) the information is passed layer by layer until the output layer, where it is again weighted and transformed by the activation function to produce the output; and 4) the error between the output and the actual value is computed and minimized by optimizing the weight and bias parameters through the backpropagation algorithm [36]. In addition to its potential for overcoming the “curse of dimensionality” [29, 30], the ANN automatically captures complex relationships between variables [27]. It is therefore well suited for modeling the relationship between the treatment and the multi-index, because interactions commonly exist between indexes owing to shared covariates in the candidate PS and/or OR models. We therefore replaced the kernel function with an ANN and propose the ANN-based MiPS (ANN.MiPS) estimator.

We now propose the ANN-based MiPS. We first obtain the estimates \({\widehat{\boldsymbol{\alpha }}}_{1}^{1}\),…, \({\widehat{\boldsymbol{\alpha }}}_{1}^{K}\) from the multiple PS models and \({\widehat{{\varvec{\beta}}}}_{1}^{1}\),…, \({\widehat{{\varvec{\beta}}}}_{1}^{L}\) from the multiple OR models, and then use the ANN to conduct a nonparametric regression of \(A\) on the multiple indexes \({\widehat{\boldsymbol{\alpha }}}_{1}^{1\mathrm{T}}\mathbf{X}\),…, \({\widehat{\boldsymbol{\alpha }}}_{1}^{K\mathrm{T}}\mathbf{X}\) and \({\widehat{{\varvec{\beta}}}}_{1}^{1\mathrm{T}}\mathbf{X}\),…, \({\widehat{{\varvec{\beta}}}}_{1}^{L\mathrm{T}}\mathbf{X}\), obtaining the estimated MiPS \({\widehat{\pi }}^{Ann}\left(\mathbf{X};{\widehat{\boldsymbol{\alpha }}}_{1}^{1},...,{\widehat{\boldsymbol{\alpha }}}_{1}^{K},{\widehat{{\varvec{\beta}}}}_{1}^{1},...,{\widehat{{\varvec{\beta}}}}_{1}^{L}\right)\). The ATE can then be estimated by

$$\begin{array}{c}\widehat\Delta_{MiPS}^{Ann}=\left(\sum\limits_{i=1}^n\frac{A_i}{\widehat\pi^{Ann}\left({\mathbf X}_i;\widehat{\boldsymbol\alpha}_1^1,...,\widehat{\boldsymbol\alpha}_1^K,\widehat{\beta}_1^1,...,\widehat{\beta}_1^L\right)}\right)^{-1}\sum\limits_{i=1}^n\frac{A_i}{\widehat\pi^{Ann}\left({\mathbf X}_i;\widehat{\boldsymbol\alpha}_1^1,...,\widehat{\boldsymbol\alpha}_1^K,\widehat{\beta}_1^1,...,\widehat{\beta}_1^L\right)}Y_i-\\ \left(\sum\limits_{i=1}^n\frac{1-A_i}{1-\widehat\pi^{Ann}\left({\mathbf X}_i;\widehat{\boldsymbol\alpha}_1^1,...,\widehat{\boldsymbol\alpha}_1^K,\widehat{\beta}_1^1,...,\widehat{\beta}_1^L\right)}\right)^{-1}\sum\limits_{i=1}^n\frac{1-A_i}{1-\widehat\pi^{Ann}\left({\mathbf X}_i;\widehat{\boldsymbol\alpha}_1^1,...,\widehat{\boldsymbol\alpha}_1^K,\widehat{\beta}_1^1,...,\widehat{\beta}_1^L\right)}Y_i.\end{array}$$
(7)

Our simulations below demonstrate the multiple robustness of \({\widehat{\Delta }}_{MiPS}^{Ann}\): its bias is negligible when any model for the PS or any model for the OR is correctly specified.

We implemented the ANN with 2 hidden layers of 4 neurons each using the AMORE package [37] for the ANN.MiPS estimator. The total number of parameters to be estimated in the ANN is \(4*(K+L)+32\), comprising \(4*(K+L)+24\) weight parameters and 8 bias parameters. The learning rate was set to 0.001 [10, 12] and the momentum to 0.5, the default value in the AMORE package. The hyperbolic tangent function was specified as the activation function for the hidden layers, and the sigmoid function as the activation function for the output layer to ensure the estimated ANN-based MiPS lies between 0 and 1 [38]. To examine the stability of the estimator’s performance, we conducted a sensitivity analysis with different hyperparameter selections. The simulations, real-data analysis, and all statistical tests were conducted using R software (Version 4.1.0) [39]. A zip file of the AMORE package and example code for implementing the ANN.MiPS approach can be found in the attachment.
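
To illustrate the estimator's mechanics, the following is a minimal Python sketch of the ANN.MiPS pipeline with the stated architecture: two tanh hidden layers of four neurons each and a sigmoid output trained by backpropagation. It is not the paper's implementation (which used the R package AMORE); the function names, training settings, and toy indices here are our own assumptions.

```python
import numpy as np

def train_ann_mips(S, A, n_hidden=4, lr=0.05, epochs=2000, seed=0):
    """Regress A on the multi-index S (n x d) with two tanh hidden layers of
    n_hidden neurons and a sigmoid output, trained by full-batch gradient
    descent on the cross-entropy loss. Returns the fitted MiPS for each row."""
    rng = np.random.default_rng(seed)
    d = S.shape[1]
    W1 = rng.normal(scale=0.5, size=(d, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.5, size=(n_hidden, n_hidden)); b2 = np.zeros(n_hidden)
    W3 = rng.normal(scale=0.5, size=(n_hidden, 1)); b3 = np.zeros(1)
    for _ in range(epochs):
        H1 = np.tanh(S @ W1 + b1)
        H2 = np.tanh(H1 @ W2 + b2)
        P = 1.0 / (1.0 + np.exp(-(H2 @ W3 + b3)))   # sigmoid output = MiPS
        # Backpropagation of the averaged cross-entropy gradient
        dZ3 = (P - A[:, None]) / len(A)
        dW3 = H2.T @ dZ3; db3 = dZ3.sum(0)
        dZ2 = (dZ3 @ W3.T) * (1 - H2 ** 2)
        dW2 = H1.T @ dZ2; db2 = dZ2.sum(0)
        dZ1 = (dZ2 @ W2.T) * (1 - H1 ** 2)
        dW1 = S.T @ dZ1; db1 = dZ1.sum(0)
        for w, g in ((W1, dW1), (b1, db1), (W2, dW2),
                     (b2, db2), (W3, dW3), (b3, db3)):
            w -= lr * g                              # in-place parameter update
    H1 = np.tanh(S @ W1 + b1); H2 = np.tanh(H1 @ W2 + b2)
    return (1.0 / (1.0 + np.exp(-(H2 @ W3 + b3)))).ravel()

def hajek_ate(Y, A, ps):
    """IPW estimate of the ATE with the estimated MiPS, as in Eq. (7)."""
    w1, w0 = A / ps, (1 - A) / (1 - ps)
    return (w1 @ Y) / w1.sum() - (w0 @ Y) / w0.sum()

# Toy data: two indices standing in for one PS-model and one OR-model index
rng = np.random.default_rng(2)
n = 400
X = rng.normal(size=(n, 3))
idx1 = 0.9 * X[:, 0] - 0.6 * X[:, 1]
idx2 = X[:, 2]
A = rng.binomial(1, 1.0 / (1.0 + np.exp(-idx1)))
Y = -0.4 * A + X[:, 0] + rng.normal(size=n)
mips = train_ann_mips(np.column_stack([idx1, idx2]), A)
ate_hat = hajek_ate(Y, A, mips)   # MiPS-weighted ATE estimate
```

Because the output neuron is a sigmoid, the fitted MiPS is guaranteed to lie in \((0,1)\), which is the motivation given above for replacing the kernel regression.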

Simulation studies

We conducted simulation studies to evaluate the performance of (i) single model-based estimators: the IPW estimator in (1) and the OR estimator in (2); (ii) doubly robust estimators: the augmented inverse probability weighting (AIPW) estimator [17] and the targeted maximum likelihood estimator (TMLE) [18], which allow a single model for the PS and a single model for the OR; and (iii) multiple models-based estimators: the kernel-based estimator in (6) and the ANN-based estimator in (7), which allow multiple candidate models for the PS and/or OR.

Ten covariates \({X}_{1}-{X}_{10}\) were generated from the standard normal distribution, and the correlations among them are shown in Fig. 1. The binary treatment indicator \(A\) was generated from a Bernoulli distribution according to the following propensity score

Fig. 1 The simulation data structure in our simulation studies

$$\begin{array}{c}\mathrm{logit}\left[\pi\left(\mathbf X;\alpha\right)\right]=\alpha_0+0.16X_1-0.05X_2+0.12X_3-\\ 0.1X_4-0.16X_5-0.1X_6+0.15X_7\end{array}$$

\({\alpha }_{0}\) was set to 0 or -1.1 so that approximately 50% or 25% of subjects entered the treatment group. The continuous outcome \(Y\) was generated from

$$\begin{array}{c}Y=-3.85-0.4A-0.8X_1-0.36X_2-0.73X_3-\\ 0.2X_4+0.71X_8-0.19X_9+0.26X_{10}+\varepsilon,\end{array}$$

where \(\varepsilon\) follows the standard normal distribution. The true ATE was \(\Delta =E\left({Y}^{1}\right)-E\left({Y}^{0}\right)=-0.4\).
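
The data-generating mechanism above can be reproduced with a short script. A sketch follows; for simplicity the ten covariates are drawn independently here, whereas the paper's Fig. 1 additionally specifies correlations among them.

```python
import numpy as np

def simulate(n, alpha0=0.0, seed=0):
    """One simulated dataset following the paper's design: ten standard-normal
    covariates (independence assumed here for simplicity), treatment from the
    logistic PS model, and a continuous outcome with true ATE = -0.4."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 10))
    logit_ps = (alpha0 + 0.16 * X[:, 0] - 0.05 * X[:, 1] + 0.12 * X[:, 2]
                - 0.1 * X[:, 3] - 0.16 * X[:, 4] - 0.1 * X[:, 5] + 0.15 * X[:, 6])
    A = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit_ps)))
    Y = (-3.85 - 0.4 * A - 0.8 * X[:, 0] - 0.36 * X[:, 1] - 0.73 * X[:, 2]
         - 0.2 * X[:, 3] + 0.71 * X[:, 7] - 0.19 * X[:, 8] + 0.26 * X[:, 9]
         + rng.normal(size=n))
    return Y, A, X

Y, A, X = simulate(1000, alpha0=0.0)      # alpha0 = 0: roughly 50% treated
Y2, A2, X2 = simulate(1000, alpha0=-1.1)  # alpha0 = -1.1: roughly 25% treated
```

Note that \(A\) enters the outcome only through the constant coefficient -0.4, so the true ATE equals that coefficient exactly.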

In the estimation, two estimation models were specified

$${\mathbb{A}}=\left\{\begin{array}{c}logit\left[{\pi }^{1}\left(\mathbf{X};{\boldsymbol{\alpha }}^{1}\right)\right]=\left(1,{X}_{1},{X}_{2},{X}_{3},{X}_{4},{X}_{5},{X}_{6},{X}_{7}\right){\boldsymbol{\alpha }}^{1}\\ logit\left[{\pi }^{2}\left(\mathbf{X};{\boldsymbol{\alpha }}^{2}\right)\right]=\left(1,{X}_{1}^{2},{X}_{2}^{2},{X}_{3}^{2},{X}_{4}^{2},{X}_{5}^{2},{X}_{6}^{2},{X}_{7}^{2}\right){\boldsymbol{\alpha }}^{2}\end{array}\right\}$$

for propensity score, and two estimation models were specified

$${\mathbb{B}}=\left\{\begin{array}{c}{{\mu }_{A}}^{1}\left(\mathbf{X};{{\varvec{\beta}}}^{1}\right)=\left(1,{A,X}_{1},{X}_{2},{X}_{3},{X}_{4},{X}_{8},{X}_{9},{X}_{10}\right){{\varvec{\beta}}}^{1}\\ {{\mu }_{A}}^{2}\left(\mathbf{X};{{\varvec{\beta}}}^{2}\right)=\left(1,{A,X}_{1}^{2},{X}_{2}^{2},{X}_{3}^{2},{X}_{4}^{2},{X}_{8}^{2},{X}_{9}^{2},{X}_{10}^{2}\right){{\varvec{\beta}}}^{2}\end{array}\right\}$$

for outcome regression. According to the data-generating mechanism, \({\pi }^{1}\left(\mathbf{X};{\boldsymbol{\alpha }}^{1}\right)\) and \({{\mu }_{A}}^{1}\left(\mathbf{X};{{\varvec{\beta}}}^{1}\right)\) were the correct PS and OR models, whereas \({\pi }^{2}\left(\mathbf{X};{\boldsymbol{\alpha }}^{2}\right)\) and \({{\mu }_{A}}^{2}\left(\mathbf{X};{{\varvec{\beta}}}^{2}\right)\) were incorrect PS and OR models because of the misspecified functional forms of the covariates. To distinguish the estimation methods, each estimator is denoted as "method-0000", where the four digits, from left to right, indicate whether \({\pi }^{1}\left(\mathbf{X};{\boldsymbol{\alpha }}^{1}\right)\), \({\pi }^{2}\left(\mathbf{X};{\boldsymbol{\alpha }}^{2}\right)\), \({{\mu }_{A}}^{1}\left(\mathbf{X};{{\varvec{\beta}}}^{1}\right)\), and \({{\mu }_{A}}^{2}\left(\mathbf{X};{{\varvec{\beta}}}^{2}\right)\) are included in the estimator ("1" indicates yes and "0" indicates no).
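
To make the "1111" specification concrete, the following sketch fits the two candidate PS models and the two candidate OR models and stacks the covariate parts of their fitted linear predictors into the multi-index matrix that is fed to the kernel or ANN regression of \(A\). It is an illustrative Python rendering of the workflow (the paper's code is in R), with simplified toy data in place of the paper's correlated design.

```python
import numpy as np

def fit_logistic(D, A, n_iter=25):
    """Logistic-regression MLE by Newton-Raphson; D includes an intercept column."""
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-D @ alpha))
        alpha += np.linalg.solve(D.T @ ((p * (1 - p))[:, None] * D), D.T @ (A - p))
    return alpha

def candidate_indices(Y, A, X):
    """Build the multi-index for a '1111' specification: the covariate parts of
    the linear predictors from two PS models (X1-X7 and their squares) and two
    OR models (X1-X4, X8-X10 and their squares), one column per candidate model."""
    one = np.ones((len(A), 1))
    idx = []
    for Z in (X[:, :7], X[:, :7] ** 2):                # pi^1 and pi^2
        a = fit_logistic(np.hstack([one, Z]), A)
        idx.append(Z @ a[1:])                          # alpha_1^k' X
    cols = [0, 1, 2, 3, 7, 8, 9]                       # X1-X4, X8-X10 (0-based)
    for Z in (X[:, cols], X[:, cols] ** 2):            # mu^1 and mu^2
        D = np.hstack([one, A[:, None], Z])
        b, *_ = np.linalg.lstsq(D, Y, rcond=None)
        idx.append(Z @ b[2:])                          # beta_1^l' X
    return np.column_stack(idx)                        # n x (K + L) multi-index

# Toy data with ten covariates, standing in for the paper's simulated design
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 10))
A = rng.binomial(1, 1.0 / (1.0 + np.exp(-0.5 * X[:, 0])))
Y = X[:, 0] - 0.4 * A + rng.normal(size=300)
S = candidate_indices(Y, A, X)   # ready for the regression of A in Eq. (5) or (7)
```

Only the covariate parts \({\widehat{\boldsymbol{\alpha }}}_{1}^{k\mathrm{T}}\mathbf{X}\) and \({\widehat{{\varvec{\beta}}}}_{1}^{l\mathrm{T}}\mathbf{X}\) enter the multi-index; the fitted intercepts and the treatment coefficient are discarded, matching the conditioning set in the MiPS definition.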

We investigated sample sizes of \(n=300\) and \(n=1000\) with 1000 replications in all settings. Tables 1 and 2 show the estimation results for all estimators, along with five evaluation measures: percentage bias (BIAS), root mean square error (RMSE), Monte Carlo standard error (MC-SE), bootstrap standard error (BS-SE) based on 100 resamples, and coverage rate of the 95% Wald confidence interval (CI-Cov). The bootstrap procedure resamples from the original sample with replacement until the bootstrap sample size reaches the original sample size. Fig. S1 shows the distributions of the estimated ATEs of the Ker.MiPS and ANN.MiPS estimators. The following conclusions can be drawn. For estimation bias,

  1. (i)

    If specifying one model for the PS or one for the OR: the IPW, Ker.MiPS, and ANN.MiPS estimators all have small bias if the PS model is correctly specified (IPW.correct, Ker.MiPS-1000, ANN.MiPS-1000). The OR, Ker.MiPS, and ANN.MiPS estimators all have small bias if the OR model is correctly specified (OR.correct, Ker.MiPS-0010, ANN.MiPS-0010).

  2. (ii)

    If specifying one model for PS and one model for OR: The AIPW, TMLE, Ker.MiPS and ANN.MiPS estimators all have a small bias if the PS model is correctly specified (AIPW-1010, AIPW-1001, Ker.MiPS-1010, Ker.MiPS-1001, ANN.MiPS-1010, ANN.MiPS-1001), or if the OR model is correctly specified (AIPW-1010, AIPW-0110, Ker.MiPS-1010, Ker.MiPS-0110, ANN.MiPS-1010, ANN.MiPS-0110).

  3. (iii)

    If specifying multiple candidate models for the PS and OR: the multiple robustness of the ANN.MiPS estimator is well demonstrated by the negligible bias of ANN.MiPS-1110, ANN.MiPS-1101, ANN.MiPS-1011, ANN.MiPS-0111, and ANN.MiPS-1111. In contrast, the biases of the Ker.MiPS estimators under all model specifications are close to or larger than 5%.

Table 1 Estimation results under 50% treated based on 1000 replications
Table 2 Estimation results under 25% treated based on 1000 replications

For estimation efficiency,

  1. (i)

    If models for both the PS and OR are correctly specified: the MC-SEs of the AIPW-1010, TMLE-1010, and ANN.MiPS-1010 estimators are all smaller than those of the IPW.correct and ANN.MiPS-1000 estimators. The improved efficiency likely comes from the information in the correct OR model.

  2. (ii)

    If the multiple candidate models include the correct PS and OR models: the MC-SEs of the ANN.MiPS-1110, ANN.MiPS-1011, and ANN.MiPS-1111 estimators are all close to that of ANN.MiPS-1010.

To evaluate the performance of the MiPS estimator as the number of specified models increases, we considered three additional estimators: MiPS-1111-2PS, which adds two incorrect PS models \(\left\{\begin{array}{c}logit\left[{\pi }^{3}\left(\mathbf{X};{\boldsymbol{\alpha }}^{3}\right)\right]=\left(1,{X}_{1},{X}_{2},{X}_{3}\right){\boldsymbol{\alpha }}^{3}\\ logit\left[{\pi }^{4}\left(\mathbf{X};{\boldsymbol{\alpha }}^{4}\right)\right]=\left(1,{X}_{1}^{2},{X}_{2}^{2},{X}_{3}^{2}\right){\boldsymbol{\alpha }}^{4}\end{array}\right\}\) to MiPS-1111; MiPS-1111-2OR, which adds two incorrect OR models \(\left\{\begin{array}{c}{\mu }_{A}^{3}\left(\mathbf{X};{{\varvec{\beta}}}^{3}\right)=\left(1,{X}_{1},{X}_{2},{X}_{3},A\right){{\varvec{\beta}}}^{3}\\ {\mu }_{A}^{4}\left(\mathbf{X};{{\varvec{\beta}}}^{4}\right)=\left(1,{X}_{1}^{2},{X}_{2}^{2},{X}_{3}^{2},A\right){{\varvec{\beta}}}^{4}\end{array}\right\}\) to MiPS-1111; and MiPS-1111-2PS2OR, which adds the two incorrect PS models \({\pi }^{3}\left(\mathbf{X};{\boldsymbol{\alpha }}^{3}\right)\) and \({\pi }^{4}\left(\mathbf{X};{\boldsymbol{\alpha }}^{4}\right)\) and the two incorrect OR models \({\mu }_{A}^{3}\left(\mathbf{X};{{\varvec{\beta}}}^{3}\right)\) and \({\mu }_{A}^{4}\left(\mathbf{X};{{\varvec{\beta}}}^{4}\right)\) to MiPS-1111. Table 3 shows the estimation results. The following conclusions can be drawn.

  1. (i)

    The estimation bias of the ANN.MiPS-1111-2PS, ANN.MiPS-1111-2OR, and ANN.MiPS-1111-2PS-2OR estimators remains negligible, and their estimation efficiency is hardly degraded compared to the ANN.MiPS-1010 estimator.

  2. (ii)

    The estimation bias of the Ker.MiPS-1111-2PS, Ker.MiPS-1111-2OR, and Ker.MiPS-1111-2PS-2OR estimators is close to or larger than 10%. The MC-SEs of these estimators are markedly larger than that of the Ker.MiPS-1010 estimator.

Table 3 Estimation results for multi-index propensity score estimator incorporating extra incorrect models based on 1000 replications

We also evaluated the performance of the ANN.MiPS estimator under a simulation scenario with both continuous and discrete covariates; the simulation setting is described in the Supplementary Document. The conclusions are similar to those of the scenario with all continuous covariates (Tables S1, S2). A sensitivity analysis of the hyperparameter selection in the ANN showed that the performance of the ANN.MiPS estimator is stable (Table S3).

Application to NHEFS data

To illustrate our proposed method, we analyzed a subset of real data from the National Health and Nutrition Examination Survey I Epidemiologic Follow-up Study (NHEFS) (wwwn.cdc.gov/nchs/nhanes/nhefs/). The dataset consists of 1,507 participants aged 25–74 who smoked at the first survey and were followed for approximately 10 years. The empirical study aimed to estimate the ATE of smoking cessation (quitting versus non-quitting, with non-quitting as the reference group) on weight gain. Participants were categorized as treated if they quit smoking during follow-up and as controls otherwise. Weight gain for each individual was measured as weight at the end of follow-up minus weight at the baseline survey (in kilograms). During the 10-year follow-up, 379 (25.15%) participants quit smoking. The average weight gain was greater for those who quit, with an unadjusted difference of 2.4 kg.

Table 4 summarizes the baseline characteristics (age, gender, race, baseline weight, active life level, education level, exercise, smoking intensity, smoking years, and ever use of weight-loss medication) of the smoking quitters and non-quitters. As the table shows, the distributions of age, gender, race, education level, smoking intensity, and smoking years differed between quitters and non-quitters. When estimating the ATE of smoking cessation on weight gain, these factors should be adjusted for if they are confounders.

Table 4 The NHEFS data analysis: baseline characteristics between non-quitters and quitters

To identify candidate models for the ANN.MiPS estimator, we explored the association of smoking cessation with all potential risk factors by logistic regression and the association of weight gain with all potential risk factors by linear regression. The covariates in model 1 and model 2, for both the PS and OR models, were selected at significance levels of 0.05 and 0.1, respectively. The covariates in PS models 1 and 2 were (i) age, gender, race, smoking intensity, and smoking years; and (ii) age, gender, race, smoking intensity, smoking years, education level, and exercise situation. The covariates in OR models 1 and 2 were (i) age, weight at baseline, smoking intensity, education level, and active life level; and (ii) age, weight at baseline, smoking intensity, education level, active life level, and family income level. We applied the single model-based IPW estimator, the single model-based OR estimator, and our proposed ANN.MiPS estimator to estimate the ATE. The four digits in the ANN.MiPS estimator's label indicate, from left to right, whether PS model 1, PS model 2, OR model 1, and OR model 2 are included in the estimator, where "1" indicates yes and "0" indicates no. For example, "ANN.MiPS-1010" means that PS model 1 and OR model 1 are included. The standard error of estimation was computed from 500 bootstrap resamples.
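The four-digit labels can be parsed mechanically. A minimal sketch (the function name and model labels are hypothetical illustrations, not part of the original software):

```python
def select_models(code, ps_models, or_models):
    """Parse an inclusion code such as '1010': the first len(ps_models)
    digits flag the PS candidates, the remaining digits flag the OR
    candidates ('1' = included, '0' = excluded)."""
    if len(code) != len(ps_models) + len(or_models):
        raise ValueError("code length must equal the number of candidate models")
    k = len(ps_models)
    ps = [m for m, c in zip(ps_models, code[:k]) if c == "1"]
    ors = [m for m, c in zip(or_models, code[k:]) if c == "1"]
    return ps, ors

# "ANN.MiPS-1010": PS model 1 and OR model 1 are included.
print(select_models("1010", ["PS1", "PS2"], ["OR1", "OR2"]))  # (['PS1'], ['OR1'])
```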

The estimation results in Table 5 indicated that all estimators suggested quitting smoking significantly increased participants' weight gain. Most of the adjusted estimates were greater than the unadjusted estimate of 2.4 kg; after adjusting for the imbalanced covariates, these estimates should be less subject to confounding bias. The point estimate of the ATE and its bootstrap standard error from the ANN.MiPS estimator were stable under different model specifications.

Table 5 The NHEFS data analysis: estimated average treatment effect of quitting smoking on weight gain (not quitting smoking as reference)
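The bootstrap standard errors reported above follow the usual nonparametric recipe: resample individuals with replacement, re-estimate the ATE on each resample, and take the standard deviation. A minimal sketch, with a simple difference-in-means estimator standing in for the full ANN.MiPS fit (all names and the toy data are hypothetical):

```python
import numpy as np

def bootstrap_se(estimate, A, Y, X, B=500, seed=1):
    """Nonparametric bootstrap SE: resample individuals with replacement
    B times and take the SD of the re-estimated treatment effects."""
    rng = np.random.default_rng(seed)
    n = len(A)
    reps = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)
        reps.append(estimate(A[idx], Y[idx], X[idx]))
    return float(np.std(reps, ddof=1))

# Toy check: difference in means on synthetic data with a true effect of 2.4.
rng = np.random.default_rng(0)
A = rng.binomial(1, 0.25, size=1500)
X = rng.normal(size=(1500, 3))
Y = 2.4 * A + rng.normal(size=1500)
diff_means = lambda A, Y, X: Y[A == 1].mean() - Y[A == 0].mean()
se = bootstrap_se(diff_means, A, Y, X, B=200)
```

In practice `estimate` would refit the candidate models and the MiPS on each resample, which is why the bootstrap is the computationally heavy part of the procedure.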

Discussion

In this paper, we considered causal inference in observational studies, where effect estimation is susceptible to confounding bias due to imbalanced covariates between groups. With reference to the Ker.DiPS estimator [19], we proposed the ANN.MiPS estimator, which offers multiple opportunities to correct confounding bias. We evaluated the performance of our estimator under simulation scenarios with a small (\(n=300\)) or large (\(n=1000\)) sample size, with a treatment rate of 25% or 50%, and with covariates that were all continuous or a mix of continuous and discrete. The results demonstrated the multiple robustness of our estimator: the estimation bias is small if any model for the PS or any model for the OR is correctly specified. In addition to achieving multiply robust estimation of the ATE, the proposed estimator showed higher estimation efficiency than the kernel-based estimator when any model for the PS or OR is correctly specified, especially when only the OR model is correctly specified.

One limitation of our approach is that the multiple candidate models for the PS \(\{{\pi }^{k}\left(\mathbf{X};{\boldsymbol{\alpha }}^{k}\right)={g}_{\pi }\left({\alpha }_{0}^{k}+{\boldsymbol{\alpha }}_{1}^{kT}\mathbf{X}\right),k=1,\dots ,K\}\) and for the OR \(\left\{{\mu }^{l}\left(\mathbf{X},A;{{\varvec{\beta}}}^{l}\right)={g}_{\mu }\left({\beta }_{0}^{l}+{{\varvec{\beta}}}_{1}^{lT}\mathbf{X}+{\beta }_{2}^{l}A\right),l=1,\dots ,L\right\}\) need to be parametric, since the MiPS is defined as \(\pi \left(\mathbf{X};{\boldsymbol{\alpha }}_{1}^{1},...,{\boldsymbol{\alpha }}_{1}^{K},{{\varvec{\beta}}}_{1}^{1},...,{{\varvec{\beta}}}_{1}^{L}\right)=E\left[A |{\boldsymbol{\alpha }}_{1}^{1T}\mathbf{X},...,{\boldsymbol{\alpha }}_{1}^{KT}\mathbf{X},{{\varvec{\beta}}}_{1}^{1T}\mathbf{X},...,{{\varvec{\beta}}}_{1}^{LT}\mathbf{X}\right]\), which requires a nonparametric regression of \(A\) on \({\widehat{\boldsymbol{\alpha }}}_{1}^{1\mathrm{T}}\mathbf{X}\),…, \({\widehat{\boldsymbol{\alpha }}}_{1}^{K\mathrm{T}}\mathbf{X}\) and \({\widehat{{\varvec{\beta}}}}_{1}^{1\mathrm{T}}\mathbf{X}\),…, \({\widehat{{\varvec{\beta}}}}_{1}^{L\mathrm{T}}\mathbf{X}\). Therefore, nonparametric models such as kernel functions, ANNs, and random forests are not suitable as candidate models for the MiPS estimator, because they do not yield covariate coefficients. When the candidate models are nonparametric, other multiply robust approaches may be adopted to integrate the information from multiple candidate models, such as the regression-based estimator under the least-squares framework [40], the estimator based on empirical likelihood weighting [20], and the estimator based on model mixture procedures [41]. In that setting, the double/debiased machine learning approach may be extended to multiple/debiased machine learning for obtaining valid inference about the ATE [42].
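To make the pipeline this definition implies concrete (parametric candidate fits, then linear indices, then nonparametric regression of \(A\) on the indices), here is a minimal numpy sketch with one PS and one OR candidate. A tiny one-hidden-layer network stands in for the full ANN described in the paper; the data-generating model and all tuning values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
A = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * X[:, 0] - 0.5 * X[:, 1]))))
Y = X[:, 0] + X[:, 2] + 2.0 * A + rng.normal(size=n)  # hypothetical outcome model

# Step 1: fit the parametric candidates and keep their slope vectors,
# which define the linear indices alpha' X and beta' X.
def logistic_slopes(X, y, steps=15):
    """Newton-Raphson logistic regression; returns the slopes (no intercept)."""
    Z = np.column_stack([np.ones(len(y)), X])
    b = np.zeros(Z.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-Z @ b))
        W = p * (1 - p)
        b += np.linalg.solve(Z.T @ (W[:, None] * Z), Z.T @ (y - p))
    return b[1:]

alpha1 = logistic_slopes(X, A)                       # one PS candidate
beta1 = np.linalg.lstsq(np.column_stack([np.ones(n), X, A]),
                        Y, rcond=None)[0][1:4]       # one OR candidate (slopes on X)
S = np.column_stack([X @ alpha1, X @ beta1])         # the multi-index

# Step 2: nonparametric regression of A on the indices with a tiny
# one-hidden-layer network; the sigmoid output keeps pi-hat inside (0, 1).
H, lr = 8, 0.05
W1 = rng.normal(scale=0.5, size=(S.shape[1], H)); b1 = np.zeros(H)
w2 = rng.normal(scale=0.5, size=H); b2 = 0.0
for _ in range(2000):                                # gradient descent on cross-entropy
    h = np.tanh(S @ W1 + b1)
    p = 1 / (1 + np.exp(-(h @ w2 + b2)))
    g = (p - A) / n                                  # d(mean loss)/d(output logit)
    gh = np.outer(g, w2) * (1 - h ** 2)              # backprop through tanh layer
    W1 -= lr * S.T @ gh; b1 -= lr * gh.sum(axis=0)
    w2 -= lr * h.T @ g;  b2 -= lr * g.sum()
pi_hat = 1 / (1 + np.exp(-(np.tanh(S @ W1 + b1) @ w2 + b2)))  # estimated MiPS
```

Note that the network sees only the low-dimensional indices, not \(\mathbf{X}\) itself, which is how the construction sidesteps the curse of dimensionality while still pooling information across candidates.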

Although the performance of the ANN.MiPS estimator remained stable with eight candidate models specified, an excessive number of models can impose a heavy computational burden. We therefore recommend constructing a comprehensive set of reasonable but dissimilar candidate models to keep the model count manageable in practice, drawing on both subject-matter knowledge and reliable data-driven tools such as causal diagrams [43], variable selection techniques [44], and covariate balancing diagnostics [45].

Finally, we offer some intuitive discussion of the theoretical properties of the proposed estimator. Referring to the proof in Cheng et al. [19], \({\widehat{\Delta }}_{MiPS}^{ANN}\) is consistent for

$${\overline{\Delta } }_{MiPS}^{ANN}=\frac{E\left\{\frac{{A}_{i}{Y}_{i}}{{\overline{\pi }}^{ANN}\left({\mathbf{X}}_{i};{\overline{\boldsymbol{\alpha }} }_{1}^{1},...,{\overline{\boldsymbol{\alpha }} }_{1}^{K},{\overline{{\varvec{\beta}}} }_{1}^{1},...,{\overline{{\varvec{\beta}}} }_{1}^{L}\right)}\right\}}{E\left\{\frac{{A}_{i}}{{\overline{\pi }}^{ANN}\left({\mathbf{X}}_{i};{\overline{\boldsymbol{\alpha }} }_{1}^{1},...,{\overline{\boldsymbol{\alpha }} }_{1}^{K},{\overline{{\varvec{\beta}}} }_{1}^{1},...,{\overline{{\varvec{\beta}}} }_{1}^{L}\right)}\right\}}-\frac{E\left\{\frac{\left(1-{A}_{i}\right){Y}_{i}}{\left[1-{\overline{\pi }}^{ANN}\left({\mathbf{X}}_{i};{\overline{\boldsymbol{\alpha }} }_{1}^{1},...,{\overline{\boldsymbol{\alpha }} }_{1}^{K},{\overline{{\varvec{\beta}}} }_{1}^{1},...,{\overline{{\varvec{\beta}}} }_{1}^{L}\right)\right]}\right\}}{E\left\{\frac{\left(1-{A}_{i}\right)}{\left[1-{\overline{\pi }}^{ANN}\left({\mathbf{X}}_{i};{\overline{\boldsymbol{\alpha }} }_{1}^{1},...,{\overline{\boldsymbol{\alpha }} }_{1}^{K},{\overline{{\varvec{\beta}}} }_{1}^{1},...,{\overline{{\varvec{\beta}}} }_{1}^{L}\right)\right]}\right\}}$$

where \({\widehat{\boldsymbol{\alpha }}}_{1}^{1},...,{\widehat{\boldsymbol{\alpha }}}_{1}^{K},{\widehat{{\varvec{\beta}}}}_{1}^{1},...,{\widehat{{\varvec{\beta}}}}_{1}^{L}\) converge to \({\overline{\boldsymbol{\alpha }} }_{1}^{1},...,{\overline{\boldsymbol{\alpha }} }_{1}^{K},{\overline{{\varvec{\beta}}} }_{1}^{1},...,{\overline{{\varvec{\beta}}} }_{1}^{L}\) and \({\widehat{\pi }}^{ANN}\left(\bullet \right)\) converges to \({\overline{\pi }}^{ANN}\left(\bullet \right)\). According to theoretical results on ANNs, under certain conditions, \({\overline{\pi }}^{ANN}\left(\mathbf{X};{\overline{\boldsymbol{\alpha }} }_{1}^{1},...,{\overline{\boldsymbol{\alpha }} }_{1}^{K},{\overline{{\varvec{\beta}}} }_{1}^{1},...,{\overline{{\varvec{\beta}}} }_{1}^{L}\right)=\pi \left(\mathbf{X};{\overline{\boldsymbol{\alpha }} }_{1}^{1},...,{\overline{\boldsymbol{\alpha }} }_{1}^{K},{\overline{{\varvec{\beta}}} }_{1}^{1},...,{\overline{{\varvec{\beta}}} }_{1}^{L}\right)\). Then, when one of the candidate models for the PS \(\{{\pi }^{k}\left(\mathbf{X};{\boldsymbol{\alpha }}^{k}\right)={g}_{\pi }\left({\alpha }_{0}^{k}+{\boldsymbol{\alpha }}_{1}^{kT}\mathbf{X}\right),k=1,\dots ,K\}\) is correctly specified, \(\pi \left(\mathbf{X};{\overline{\boldsymbol{\alpha }} }_{1}^{1},...,{\overline{\boldsymbol{\alpha }} }_{1}^{K},{\overline{{\varvec{\beta}}} }_{1}^{1},...,{\overline{{\varvec{\beta}}} }_{1}^{L}\right)=\pi \left(\mathbf{X}\right)\) and \({\overline{\Delta } }_{MiPS}^{ANN}=\Delta\). 
On the other hand, when one of the candidate models for the OR \(\left\{{\mu }_{A}^{l}\left(\mathbf{X};{{\varvec{\beta}}}^{l}\right)={g}_{\mu }\left({\beta }_{0}^{l}+{{\varvec{\beta}}}_{1}^{lT}\mathbf{X}+{\beta }_{2}^{l}A\right),l=1,\dots ,L\right\}\) is correctly specified, \(E\left[Y |{\overline{\boldsymbol{\alpha }} }_{1}^{1T}\mathbf{X},...,{\overline{\boldsymbol{\alpha }} }_{1}^{KT}\mathbf{X},{\overline{{\varvec{\beta}}} }_{1}^{1T}\mathbf{X},...,{\overline{{\varvec{\beta}}} }_{1}^{LT}\mathbf{X},A \right]={\mu }_{A}\left(\mathbf{X}\right)\) and \({\overline{\Delta } }_{MiPS}^{ANN}=\Delta\). As for the asymptotic distribution of the proposed estimator, the variability of \({\widehat{\Delta }}_{MiPS}^{ANN}\) mainly comes from (1) the estimated coefficients \({\widehat{\boldsymbol{\alpha }}}_{1}^{1}\),…, \({\widehat{\boldsymbol{\alpha }}}_{1}^{K}\) of the multiple PS models and \({\widehat{{\varvec{\beta}}}}_{1}^{1}\),…, \({\widehat{{\varvec{\beta}}}}_{1}^{L}\) of the multiple OR models, and (2) the estimated nonparametric function \({\widehat{\pi }}^{ANN}\left(\bullet \right)\) fitted by the ANN. For the first source of variation, if the parameters are estimated by maximum likelihood, the asymptotic normality of the estimators was obtained by White [46]. For the second, the error bound and convergence rate have been discussed in theoretical research [29, 47]. Establishing and proving the theoretical properties of the \({\widehat{\Delta }}_{MiPS}^{ANN}\) estimator rigorously and systematically is a topic for future research.
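Given fitted MiPS values, the Hájek-type (ratio-normalized) IPW contrast displayed above is straightforward to compute. A minimal sketch (the function name is hypothetical):

```python
import numpy as np

def hajek_ate(A, Y, pi_hat):
    """Ratio-normalized (Hajek-type) IPW contrast: weighted mean of
    treated outcomes minus weighted mean of control outcomes, with
    weights A / pi-hat and (1 - A) / (1 - pi-hat)."""
    w1 = A / pi_hat
    w0 = (1 - A) / (1 - pi_hat)
    return float(w1 @ Y / w1.sum() - w0 @ Y / w0.sum())

# With a constant propensity the estimator reduces to the plain
# difference in group means: 4 - 1 = 3 here.
A = np.array([1, 1, 0, 0])
Y = np.array([3.0, 5.0, 1.0, 1.0])
print(hajek_ate(A, Y, np.full(4, 0.5)))  # 3.0
```

The normalization by the sums of the weights (rather than by \(n\)) matches the ratio structure of the display and makes the estimate invariant to rescaling the weights.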

Conclusions

In this study, we proposed the ANN.MiPS estimator to correct confounding bias when using observational data to estimate the ATE. The proposed estimator allows multiple candidate models for the PS and the OR and guarantees that the estimated integrated PS lies between 0 and 1. Simulation studies illustrated the multiple robustness of the estimator, and extra efficiency was gained compared to the kernel function-based estimator. The proposed estimator provides a new option for multiply robust estimation of the ATE in observational studies.

Availability of data and materials

The simulated data can be generated using the example code in the attachment. The real-world data can be accessed from https://wwwn.cdc.gov/nchs/nhanes/nhefs/default.aspx/.

Abbreviations

ATE:

Average treatment effect

IPW:

Inverse probability weighting

PS:

Propensity score

OR:

Outcome regression

AIPW:

Augmented inverse probability weighting

TMLE:

Targeted maximum likelihood estimator

DiPS:

Double-index propensity score

Ker.DiPS:

Kernel function-based double-index propensity score

MiPS:

Multi-index propensity score

ANN:

Artificial neural network

ANN.MiPS:

Artificial neural network-based multi-index propensity score

Ker.MiPS:

Kernel function-based multi-index propensity score

RMSE:

Root mean square error

MC-SE:

Monte Carlo standard error

BS-SE:

Bootstrapping standard error

95CI-Cov:

95% Confidence interval coverage rate

NHEFS:

National Health and Nutrition Examination Survey I Epidemiologic Follow-up Study

References

  1. Kovesdy CP, Kalantar-Zadeh K. Observational studies versus randomized controlled trials: avenues to causal inference in nephrology. Adv Chronic Kidney Dis. 2012;19(1):11–8.

  2. Imbens GW, Rubin DB. Causal inference in statistics, social, and biomedical sciences. New York: Cambridge University Press; 2015.

  3. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41–55.

  4. Wooldridge JM. Inverse probability weighted M-estimators for sample selection, attrition, and stratification. Port Econ J. 2002;1(2):117–39.

  5. Lunceford JK, Davidian M. Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Stat Med. 2004;23(19):2937–60.

  6. Hernán MA, Robins JM. Causal Inference: What If. Boca Raton: Chapman & Hall/CRC; 2020.

  7. Joffe MM, Ten Have TR, Feldman HI, Kimmel SE. Model selection, confounder control, and marginal structural models: review and new applications. Am Stat. 2004;58(4):272–9.

  8. Lee BK, Lessler J, Stuart EA. Improving propensity score weighting using machine learning. Stat Med. 2010;29(3):337–46.

  9. Keller B, Kim JS, Steiner PM. Neural networks for propensity score estimation: simulation results and recommendations. In: Quantitative psychology research. Cham: Springer; 2015. p. 279–291.

  10. Collier ZK, Leite WL, Zhang H. Estimating propensity scores using neural networks and traditional methods: a comparative simulation study. Commun Stat-Simul Comput 2021:1–16.

  11. Collier ZK, Zhang H, Liu L. Explained: Artificial intelligence for propensity score estimation in multilevel educational settings. Pract Assess Res Eval. 2022;27(1):3.

  12. Setoguchi S, Schneeweiss S, Brookhart MA, Glynn RJ, Cook EF. Evaluating uses of data mining techniques in propensity score estimation: a simulation study. Pharmacoepidemiol Drug Saf. 2008;17(6):546–55.

  13. Elwert F, Winship C. Effect heterogeneity and bias in main-effects-only regression models. In: Heuristics, probability and causality: a tribute to Judea Pearl; 2010. p. 327–336.

  14. Vansteelandt S, Goetghebeur E. Causal inference with generalized structural mean models. J Roy Stat Soc Ser B (Stat Method). 2003;65(4):817–35.

  15. Lu M, Sadiq S, Feaster DJ, Ishwaran H. Estimating individual treatment effect in observational data using random forest methods. J Comput Graph Stat. 2018;27(1):209–19.

  16. Chen X, Liu Y, Ma S, Zhang Z. Efficient estimation of general treatment effects using neural networks with a diverging number of confounders. 2020. arXiv preprint arXiv:200907055.

  17. Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Amer Statistical Assoc. 1994;89(427):846–66.

  18. Van Der Laan MJ, Rubin D. Targeted maximum likelihood learning. Int J Biostat. 2006;2(1):1–38.

  19. Cheng D, Chakrabortty A, Ananthakrishnan AN, Cai T. Estimating average treatment effects with a double-index propensity score. Biometrics. 2020;76(3):767–77.

  20. Han P, Wang L. Estimation with missing data: beyond double robustness. Biometrika. 2013;100(2):417–30.

  21. Han P. Multiply robust estimation in regression analysis with missing data. J Amer Statistical Assoc. 2014;109(507):1159–73.

  22. Bellman RE. Curse of dimensionality. Adaptive control processes: a guided tour. New Jersey: Princeton University Press; 1961.

  23. Donoho DL. High-dimensional data analysis: The curses and blessings of dimensionality. AMS Math Challenges Lecture. 2000;2000(1):32.

  24. Rodríguez G. Smoothing and non-parametric regression. New Jersey: Princeton University; 2001.

  25. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. 2016. arXiv preprint arXiv:160304467.

  26. Kingma DP, Ba J. Adam: A method for stochastic optimization. 2014. arXiv preprint arXiv:14126980.

  27. Mitchell TM. Machine learning, vol. 1. New York: McGraw-Hill; 1997.

  28. Bzdok D, Krzywinski M, Altman N. Machine learning: a primer. Nat Methods. 2017;14(12):1119.

  29. Bauer B, Kohler M. On deep learning as a remedy for the curse of dimensionality in nonparametric regression. Ann Stat. 2019;47(4):2261–85.

  30. Chen X, White H. Improved rates and asymptotic normality for nonparametric neural network estimators. IEEE Trans Inf Theory. 1999;45(2):682–91.

  31. White H, Gallant AR. Artificial Neural Networks: Approximation and Learning Theory. Oxford: Blackwell; 1992.

  32. Hornik K, Stinchcombe M, White H, Auer P. Degree of approximation results for feedforward networks approximating unknown mappings and their derivatives. Neural Comput. 1994;6(6):1262–75.

  33. Yarotsky D. Optimal approximation of continuous functions by very deep ReLU networks. In: Proceedings of the 31st Conference on Learning Theory. Stockholm: PMLR; 2018. p. 639–649.

  34. Conn D, Li G. An oracle property of the Nadaraya-Watson kernel estimator for high-dimensional nonparametric regression. Scand J Stat. 2019;46(3):735–64.

  35. Hart PE, Stork DG, Duda RO. Pattern classification. New Jersey: Wiley Hoboken; 2000.

  36. Hecht-Nielsen R. Theory of the backpropagation neural network. Neural networks for perception. California: Academic Press; 1992:65–93.

  37. Limas MC, Meré JBO, Marcos AG, Ascacíbar FJMdP, Espinoza AVP, Elias F, Ramos JMP. AMORE: A MORE flexible neural network package; 2014.

  38. Kyurkchiev N, Markov S. Sigmoid functions: some approximation and modelling aspects. Saarbrucken: LAP LAMBERT Academic Publishing; 2015. p. 4.

  39. R Core Team. R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2013.

  40. Chan KCG. A simple multiply robust estimator for missing response problem. Stat. 2013;2(1):143–9.

  41. Li W, Gu Y, Liu L. Demystifying a class of multiply robust estimators. Biometrika. 2020;107(4):919–33.

  42. Chernozhukov V, Chetverikov D, Demirer M, Duflo E, Hansen C, Newey W, Robins J. Double/debiased machine learning for treatment and structural parameters. Econom J. 2018;21(1):C1–C68.

  43. Pearl J. Causal diagrams for empirical research. Biometrika. 1995;82(4):669–88.

  44. VanderWeele TJ. Principles of confounder selection. Eur J Epidemiol. 2019;34(3):211–9.

  45. Austin PC. Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Stat Med. 2009;28(25):3083–107.

  46. White H. Maximum likelihood estimation of misspecified models. Econometrica: J Econom Society. 1982;50(1):1–25.

  47. Schmidt-Hieber J. Nonparametric regression using deep neural networks with ReLU activation function. Ann Stat. 2020;48(4):1875–97.

Acknowledgements

None.

Funding

This work was funded by National Natural Science Foundation of China (No.82173612, No.82273730), Shanghai Rising-Star Program (21QA1401300), Shanghai Municipal Natural Science Foundation (22ZR1414900), Shanghai Special Program: Clinical Multidisciplinary Treatment System and Systems Epidemiology Research, and Shanghai Municipal Science and Technology Major Project (ZD2021CY001). The sponsors had no role in study design, data collection, data analysis, data interpretation, or writing of this report.

Author information

Authors and Affiliations

Authors

Contributions

GYQ and YFY designed the study. JQX and KCW wrote the manuscript. JQX performed simulations and analyzed the real-world data. CW, CH, YXX, and RZ revised the manuscript. All authors have provided critical comments on the draft, and read and approved the final manuscript.

Corresponding authors

Correspondence to Guoyou Qin or Yongfu Yu.

Ethics declarations

Ethics approval and consent to participate

Since the simulated datasets did not involve any human data, ethics approval was not applicable. Since the real data is publicly available, ethics approval was not required.

Consent for publication

Not applicable.

Competing interests

The authors declared no conflict of interest.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1:

Fig. S1. The distribution of the estimated average treatment effect for kernel-based MiPS estimator and artificial neural network-based MiPS estimator in 1000 simulated data sets. The range of the y-axis is restricted from -1.4 to 0.6 given that the kernel-based MiPS estimator yields highly biased estimation under some model specifications. The dashed line denotes the true average treatment effect. Table S1. Estimation results for scenario with both continuous and discrete covariates under 50% treated based on 1000 replications. Table S2. Estimation results of multi-index propensity score estimator incorporating extra incorrect models under scenario with both continuous and discrete covariates. Table S3. Sensitivity analysis of ANN.MiPS estimator with different tuning parameters selection for ANN under scenario of all continuous covariates and 50% treated.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Xu, J., Wei, K., Wang, C. et al. Estimation of average treatment effect based on a multi-index propensity score. BMC Med Res Methodol 22, 337 (2022). https://doi.org/10.1186/s12874-022-01822-3


Keywords

  • Average treatment effect
  • Multiply robust
  • Multi-index propensity score
  • Artificial neural network