
An improved multiply robust estimator for the average treatment effect

Abstract

Background

In observational studies, double robust and multiply robust (MR) approaches provide more protection against model misspecification than inverse probability weighting and g-computation for estimating the average treatment effect (ATE). However, these approaches are based on parametric models and therefore yield biased estimates when all of the models are incorrectly specified. Nonparametric methods, such as machine learning and nonparametric double robust approaches, are robust to model misspecification, but their efficiency is low.

Method

In this study, we proposed an improved MR method that combines parametric and nonparametric models, building on a previous MR method (Han, JASA 109(507):1159-73, 2014), to improve both robustness to model misspecification and efficiency. We performed comprehensive simulations to evaluate the performance of the proposed method.

Results

Our simulation study showed that the MR estimators using only outcome regression (OR) models, one of which is a nonparametric model, are the most recommended because they are robust to model misspecification and achieve the lowest root mean square error (RMSE) when a correct parametric OR model is included. Moreover, the performance of the recommended estimators remained comparable even when all parametric models were misspecified. As an application, the proposed method was used to estimate the effect of social activity on depression levels in the China Health and Retirement Longitudinal Study dataset.

Conclusions

The proposed estimator, which combines nonparametric and parametric models, is more robust to model misspecification.


Background

The primary goal of much applied research is to estimate the causal effect of a specific treatment (exposure or intervention) on an outcome. In randomized controlled trials, where treatments are randomly assigned to participants, the average treatment effect (ATE) can be estimated by directly comparing outcomes between the treatment and control groups [1]. In observational studies, however, covariates are usually unbalanced between the treatment and control groups because treatment assignment is not randomized. As a result, a direct comparison of outcomes between the treatment and control groups may lead to a biased estimate of the ATE [2].

Inverse probability weighting (IPW) with a propensity score (PS) model and direct confounding adjustment (known as g-computation) with an outcome regression (OR) model are the standard approaches to handling confounding bias [1, 3, 4]. Compared with IPW and g-computation, doubly robust approaches provide double protection against model misspecification [5,6,7]. However, doubly robust approaches do not offer sufficient protection for estimating the ATE in practice, as they allow only one PS model and one OR model. Recently, multiply robust (MR) approaches, which increase the likelihood of including a correct model, have been proposed for estimating the ATE or a population mean with missing values [8,9,10,11]. In particular, the previous MR approach [9] is robust against extreme values of the fitted probability of receiving treatment. However, because it considers only parametric models, the previous MR approach [9] may yield biased estimates when all of the included parametric models are incorrectly specified.

In addition, there is growing interest in developing nonparametric methods for estimating the ATE to protect against model misspecification. Machine learning, a general term for a diverse set of nonparametric algorithms, is particularly useful for classification and prediction and has been used to estimate the ATE [12,13,14,15,16,17]. However, the root mean square error (RMSE) of machine learning estimators tends to be higher than that of a correctly specified parametric model, possibly due to suboptimal hyperparameter settings [18,19,20]. Nonparametric double robust methods based on kernel smoothing [21, 22] or targeted minimum loss [23] have also been proposed to estimate the ATE. Yet the efficiency of these estimators is limited by their slow convergence rates.

In this study, building on the previous MR approach [9], we proposed an improved MR approach that considers both parametric and nonparametric models to improve robustness to model misspecification. Our simulation study showed that the proposed MR approach is more robust to model misspecification than the previous MR approach, and that the MR estimators using only OR models, one of which is a nonparametric model, were the best among all MR estimators for two reasons. First, these estimators were robust to model misspecification and had the lowest root mean square errors (RMSEs) when a correct parametric OR model was included. Second, their performance remained comparable even when all parametric models were misspecified.

Method

Notation and assumptions

Let \({{\varvec{X}}}_{i}\) be a \(p\)-dimensional vector of covariates, \({Y}_{i}\) be the observed outcome, and \({Z}_{i}\) be the treatment status, taking the value 1 if treated and 0 if untreated. Let (\({Y}^{1}, {Y}^{0}\)) be the two potential outcomes under treatment and control, respectively. The ATE is defined as

$$\Delta =E\left({Y}^{1}\right)-E\left({Y}^{0}\right)$$

To draw valid causal inferences in this study, we assume that the exchangeability, consistency, and positivity assumptions hold [24].

Previous multiply robust method

The previous MR approach proposed by Han [9] provides multiple layers of protection against model misspecification. Specifically, two sets of parametric models are specified: \(\mathcal{P}=\left\{{\pi }^{l}\left({\varvec{X}}\right),l=1, 2, 3,\dots ,L\right\}\) for the propensity score and \(\mathcal{M}=\left\{{m}_{z}^{k}\left({\varvec{X}}\right),k=1, 2, 3,\dots ,K\right\}\) for the outcome, where \({m}_{z}^{k}\left({\varvec{X}}\right)={m}^{k}\left({\varvec{X}},Z\right)\). Without loss of generality, let \({\mathbb{I}}=\left\{1,\dots ,{n}_{1}\right\}\) and \({\mathbb{J}}=\left\{1,\dots ,{n}_{0}\right\}\) be the index sets for treated and untreated subjects, respectively, where \({n}_{1}\) and \({n}_{0}=n-{n}_{1}\) denote the sizes of the treatment and control groups.

To recover the treated population average from subjects in the treatment group, the empirical likelihood weights \({w}_{i}\left(i\in {\mathbb{I}}\right)\) for the outcome \({Y}_{i}\left(i\in {\mathbb{I}}\right)\) in the treatment group are estimated by maximizing \(\prod_{i\in {\mathbb{I}}}{w}_{i}\) subject to the following constraints:

$${w}_{i}\ge 0 (i\in {\mathbb{I}})$$
$$\sum_{i\in {\mathbb{I}}}{w}_{i}=1$$
$$\sum_{i\in {\mathbb{I}}}{w}_{i}{\widehat{\pi }}^{l}\left({{\varvec{X}}}_{i}\right)={\widehat{\theta }}_{1}^{l}(l=1, 2, 3,\dots ,L)$$
$$\sum_{i\in {\mathbb{I}}}{w}_{i}{\widehat{m}}_{1}^{k}\left({{\varvec{X}}}_{i}\right)={\widehat{\eta }}_{1}^{k}(k=1, 2, 3,\dots ,K)$$

where \({\widehat{\theta }}_{1}^{l}={n}^{-1}{\sum }_{i=1}^{n}{\pi }^{l}\left({{\varvec{X}}}_{i}\right)\) and \({\widehat{\eta }}_{1}^{k}={n}^{-1}{\sum }_{i=1}^{n}{m}_{1}^{k}\left({{\varvec{X}}}_{i}\right)\). By symmetry, the weights \({w}_{j}\left(j\in {\mathbb{J}}\right)\) for the control group are given by maximizing \(\prod_{j\in {\mathbb{J}}}{w}_{j}\) according to the following constraints:

$${w}_{j}\ge 0 (j\in {\mathbb{J}})$$
$$\sum_{j\in {\mathbb{J}}}{w}_{j}=1$$
$$\sum_{j\in {\mathbb{J}}}{w}_{j}({1-\widehat{\pi }}^{l}\left({{\varvec{X}}}_{j}\right))={\widehat{\theta }}_{0}^{l}(l=1, 2, 3,\dots ,L)$$
$$\sum_{j\in {\mathbb{J}}}{w}_{j}{\widehat{m}}_{0}^{k}\left({{\varvec{X}}}_{j}\right)={\widehat{\eta }}_{0}^{k}(k=1, 2, 3,\dots ,K)$$

where \({\widehat{\theta }}_{0}^{l}={n}^{-1}{\sum }_{i=1}^{n}(1-{\pi }^{l}\left({{\varvec{X}}}_{i}\right))\) and \({\widehat{\eta }}_{0}^{k}={n}^{-1}{\sum }_{i=1}^{n}{m}_{0}^{k}\left({{\varvec{X}}}_{i}\right)\). The \({w}_{i}\) and \({w}_{j}\) are given by the Lagrange multiplier method as follows:

$${\widehat{w}}_{i}=\frac{1}{{n}_{1}}\frac{1}{1+{\widehat{{\varvec{\rho}}}}_{1}^{T}{\widehat{{\varvec{g}}}}_{1}\left({{\varvec{X}}}_{i}\right)} \left(i\in {\mathbb{I}}\right)$$
$${\widehat{w}}_{j}=\frac{1}{{n}_{0}}\frac{1}{1+{\widehat{{\varvec{\rho}}}}_{0}^{T}{\widehat{{\varvec{g}}}}_{0}\left({{\varvec{X}}}_{j}\right)} \left(j\in {\mathbb{J}}\right)$$

where

$${\widehat{{\varvec{g}}}}_{1}{\left({\varvec{X}}\right)}^{T}=\left\{{\widehat{\pi }}^{1}\left({\varvec{X}}\right)-{\widehat{\theta }}_{1}^{1},\dots ,{\widehat{\pi }}^{L}\left({\varvec{X}}\right)-{\widehat{\theta }}_{1}^{L}, {\widehat{m}}_{1}^{1}\left({\varvec{X}}\right)-{\widehat{\eta }}_{1}^{1},\dots ,{\widehat{m}}_{1}^{K}\left({\varvec{X}}\right)-{\widehat{\eta }}_{1}^{K}\right\}$$
$${\widehat{{\varvec{g}}}}_{0}{\left({\varvec{X}}\right)}^{T}=\left\{\left(1-{\widehat{\pi }}^{1}\left({\varvec{X}}\right)\right)-{\widehat{\theta }}_{0}^{1},\dots ,\left(1-{\widehat{\pi }}^{L}\left({\varvec{X}}\right)\right)-{\widehat{\theta }}_{0}^{L}, {\widehat{m}}_{0}^{1}\left({\varvec{X}}\right)-{\widehat{\eta }}_{0}^{1},\dots ,{\widehat{m}}_{0}^{K}\left({\varvec{X}}\right)-{\widehat{\eta }}_{0}^{K}\right\}$$

\({\widehat{{\varvec{\rho}}}}_{1}\) and \({\widehat{{\varvec{\rho}}}}_{0}\) are the \((L+K)\)-dimensional Lagrange multipliers solving

$$\frac1{n_1}\sum\limits_{i\in\mathbb{I}}\frac{{\widehat{{\varvec{g}}}}_1\left(\boldsymbol{X}_i\right)}{1+{\widehat{{\varvec{\rho}}}}_1^T{\widehat{{\varvec{g}}}}_1\left(\boldsymbol{X}_i\right)}=0$$
$$\frac1{n_0}\sum\limits_{j\in\mathbb{J}}\frac{{\widehat{{\varvec{g}}}}_0\left(\boldsymbol{X}_j\right)}{1+{\widehat{{\varvec{\rho}}}}_0^T{\widehat{{\varvec{g}}}}_0\left(\boldsymbol{X}_j\right)}=0$$

\({\widehat{{\varvec{\rho}}}}_{1}\) and \({\widehat{{\varvec{\rho}}}}_{0}\) must satisfy \(1+{\widehat{{\varvec{\rho}}}}_{1}^{T}{\widehat{{\varvec{g}}}}_{1}\left({{\varvec{X}}}_{i}\right)>0\) and \(1+{\widehat{{\varvec{\rho}}}}_{0}^{T}{\widehat{{\varvec{g}}}}_{0}\left({{\varvec{X}}}_{j}\right)>0\) due to the non-negativity of \({w}_{i}\) and \({w}_{j}\), respectively. The \({\widehat{w}}_{i}\) and \({\widehat{w}}_{j}\) can be computed with the Newton–Raphson algorithm [9].

In summary, the ATE estimated by the previous MR method [9] is defined as

$${\widehat{\Delta }}_{mr(Han)}=\sum_{i\in {\mathbb{I}}}{\widehat{w}}_{i}{Y}_{i}-\sum_{j\in {\mathbb{J}}}{\widehat{w}}_{j}{Y}_{j}$$
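To make the computation concrete, the following R sketch (our own illustration rather than code from [9]; the function name `el_weights` and the argument `G` are placeholders) solves the estimating equation for the Lagrange multipliers by Newton-Raphson and returns the empirical likelihood weights for one group, given the matrix whose rows are the centred fitted values \({\widehat{{\varvec{g}}}}_{1}\left({{\varvec{X}}}_{i}\right)\) (or \({\widehat{{\varvec{g}}}}_{0}\left({{\varvec{X}}}_{j}\right)\)):

```r
# Minimal sketch (assumed implementation, not the authors' code).
# G: n x (L + K) matrix whose i-th row is the centred vector g(X_i) for one group.
# Solves (1/n) * sum_i g_i / (1 + rho' g_i) = 0 for rho by Newton-Raphson and
# returns the weights w_i = (1/n) / (1 + rho' g_i).
el_weights <- function(G, max_iter = 100, tol = 1e-8) {
  n   <- nrow(G)
  rho <- rep(0, ncol(G))                          # starting value rho = 0
  for (iter in seq_len(max_iter)) {
    denom <- as.vector(1 + G %*% rho)
    if (any(denom <= 0)) stop("1 + rho'g <= 0: shrink the step or check the models")
    score <- colSums(G / denom) / n               # left-hand side of the estimating equation
    if (max(abs(score)) < tol) break
    jac <- -crossprod(G / denom) / n              # Jacobian: -(1/n) sum g g' / (1 + rho'g)^2
    rho <- rho - solve(jac, score)                # Newton-Raphson update
  }
  w <- 1 / (n * (1 + as.vector(G %*% rho)))
  list(rho = rho, w = w)
}
```

In practice a damped (step-halving) update is often used to keep \(1+{\widehat{{\varvec{\rho}}}}^{T}\widehat{{\varvec{g}}}>0\) throughout the iterations; the sketch simply stops if the constraint is violated.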

Proposed multiply robust method

The previous MR method [9] allows multiple parametric models, increasing the likelihood of including a correct model. However, the ATE estimate is still biased when all parametric models are misspecified. Building on the previous MR method, the proposed MR method allows multiple parametric models and additionally includes a nonparametric PS model and a nonparametric OR model.

We selected the neural network (NNET) as the nonparametric model in the proposed MR method. NNET, a machine learning algorithm, has been used to estimate the ATE [15, 16]. We specified a three-layer NNET (input layer, one hidden layer, output layer), which may be practical [12], with 4 nodes in the hidden layer. We fit the NNET using the nnet R package with default parameters. In summary, the proposed MR method adds an NNET-based outcome regression model (NN-OR) and an NNET-based propensity score model (NN-PS) to the previous MR method.
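As an illustration, the NN-PS and NN-OR fits could be obtained as follows (a sketch under our own naming; the data frame `dat` and the objects `pi_nn`, `m1_nn`, and `m0_nn` are placeholders, and the covariate names follow the simulation design described below):

```r
library(nnet)

# NN-PS: one hidden layer with 4 nodes; with a 0/1 response and the default
# logistic output unit, the fitted values lie in (0, 1).
fit_nnps <- nnet(Z ~ X1 + X2 + X3 + X4 + X5 + X6 + X7,
                 data = dat, size = 4, trace = FALSE)
pi_nn <- as.vector(predict(fit_nnps, newdata = dat, type = "raw"))   # fitted P(Z = 1 | X)

# NN-OR: same architecture with a linear output unit (linout = TRUE) for the
# continuous outcome; predict with Z set to 1 and 0 to get m_1(X) and m_0(X).
fit_nnor <- nnet(Y ~ X1 + X2 + X3 + X4 + X8 + X9 + X10 + Z,
                 data = dat, size = 4, linout = TRUE, trace = FALSE)
m1_nn <- as.vector(predict(fit_nnor, newdata = transform(dat, Z = 1)))
m0_nn <- as.vector(predict(fit_nnor, newdata = transform(dat, Z = 0)))
```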

Similar to the previous MR method [9], we specify two sets of models, \(\mathcal{P}=\left\{{\pi }^{l}\left({\varvec{X}}\right),l=1, 2, 3,\dots ,L,L+1\right\}\) for the propensity score and \(\mathcal{M}=\left\{{m}_{z}^{k}\left({\varvec{X}}\right),k=1, 2, 3,\dots ,K,K+1\right\}\) for the outcome. Here \({\pi }^{L+1}\left({\varvec{X}}\right)\) and \({m}_{z}^{K+1}({\varvec{X}})\) are the NN-PS and NN-OR models, respectively, and the others are parametric models. The empirical likelihood weights \({w}_{i}\left(i\in {\mathbb{I}}\right)\) for the outcome \({Y}_{i}\left(i\in {\mathbb{I}}\right)\) in the treatment group are estimated by maximizing \(\prod_{i\in {\mathbb{I}}}{w}_{i}\) subject to the following constraints:

$${w}_{i}\ge 0 (i\in {\mathbb{I}})$$
$$\sum_{i\in {\mathbb{I}}}{w}_{i}=1$$
$$\sum_{i\in {\mathbb{I}}}{w}_{i}{\widehat{\pi }}^{l}\left({{\varvec{X}}}_{i}\right)={\widehat{\theta }}_{1}^{l}(l=1, 2, 3,\dots ,L,L+1)$$
$$\sum_{i\in {\mathbb{I}}}{w}_{i}{\widehat{m}}_{1}^{k}\left({{\varvec{X}}}_{i}\right)={\widehat{\eta }}_{1}^{k}(k=1, 2, 3,\dots ,K,K+1)$$

where \({\widehat{\theta }}_{1}^{l}={n}^{-1}{\sum }_{i=1}^{n}{\pi }^{l}\left({{\varvec{X}}}_{{\varvec{i}}}\right)\) and \({\widehat{\eta }}_{1}^{k}={n}^{-1}{\sum }_{i=1}^{n}{m}_{1}^{k}\left({{\varvec{X}}}_{i}\right)\). By symmetry, the weights \({w}_{j}\left(j\in {\mathbb{J}}\right)\) for the control group are given by maximizing \(\prod_{j\in {\mathbb{J}}}{w}_{j}\) according to the following constraints:

$${w}_{j}\ge 0 (j\in {\mathbb{J}})$$
$$\sum_{j\in {\mathbb{J}}}{w}_{j}=1$$
$$\sum_{j\in {\mathbb{J}}}{w}_{j}({1-\widehat{\pi }}^{l}\left({{\varvec{X}}}_{j}\right))={\widehat{\theta }}_{0}^{l}(l=1, 2, 3,\dots ,L,L+1)$$
$$\sum_{j\in {\mathbb{J}}}{w}_{j}{\widehat{m}}_{0}^{k}\left({{\varvec{X}}}_{j}\right)={\widehat{\eta }}_{0}^{k}(k=1, 2, 3,\dots ,K,K+1)$$

where \({\widehat{\theta }}_{0}^{l}={n}^{-1}{\sum }_{i=1}^{n}(1-{\pi }^{l}\left({{\varvec{X}}}_{i}\right))\) and \({\widehat{\eta }}_{0}^{k}={n}^{-1}{\sum }_{i=1}^{n}{m}_{0}^{k}\left({{\varvec{X}}}_{i}\right)\). The \({w}_{i}\) and \({w}_{j}\) are given by the Lagrange multiplier method as follows:

$${\widehat{w}}_{i}=\frac{1}{{n}_{1}}\frac{1}{1+{\widehat{{\varvec{\rho}}}}_{1}^{T}{\widehat{{\varvec{g}}}}_{1}\left({{\varvec{X}}}_{i}\right)} \left(i\in {\mathbb{I}}\right)$$
$${\widehat{w}}_{j}=\frac{1}{{n}_{0}}\frac{1}{1+{\widehat{{\varvec{\rho}}}}_{0}^{T}{\widehat{{\varvec{g}}}}_{0}\left({{\varvec{X}}}_{j}\right)} \left(j\in {\mathbb{J}}\right)$$

where

$${\widehat{{\varvec{g}}}}_{1}{\left({\varvec{X}}\right)}^{T}=\left\{{\widehat{\pi }}^{1}\left({\varvec{X}}\right)-{\widehat{\theta }}_{1}^{1},\dots ,{\widehat{\pi }}^{L}\left({\varvec{X}}\right)-{\widehat{\theta }}_{1}^{L},{\widehat{\pi }}^{L+1}\left({\varvec{X}}\right)-{\widehat{\theta }}_{1}^{L+1}, {\widehat{m}}_{1}^{1}\left({\varvec{X}}\right)-{\widehat{\eta }}_{1}^{1},\dots ,{\widehat{m}}_{1}^{K}\left({\varvec{X}}\right)-{\widehat{\eta }}_{1}^{K},{\widehat{m}}_{1}^{K+1}\left({\varvec{X}}\right)-{\widehat{\eta }}_{1}^{K+1}\right\}$$
$${\widehat{{\varvec{g}}}}_{0}{\left({\varvec{X}}\right)}^{T}=\left\{\left(1-{\widehat{\pi }}^{1}\left({\varvec{X}}\right)\right)-{\widehat{\theta }}_{0}^{1},\dots ,\left(1-{\widehat{\pi }}^{L}\left({\varvec{X}}\right)\right)-{\widehat{\theta }}_{0}^{L}, \left(1-{\widehat{\pi }}^{L+1}\left({\varvec{X}}\right)\right)-{\widehat{\theta }}_{0}^{L+1}, {\widehat{m}}_{0}^{1}\left({\varvec{X}}\right)-{\widehat{\eta }}_{0}^{1},\dots ,{\widehat{m}}_{0}^{K}\left({\varvec{X}}\right)-{\widehat{\eta }}_{0}^{K},{\widehat{m}}_{0}^{K+1}\left({\varvec{X}}\right)-{\widehat{\eta }}_{0}^{K+1}\right\}$$

\({\widehat{{\varvec{\rho}}}}_{1}\) and \({\widehat{{\varvec{\rho}}}}_{0}\) are the \((L+K+2)\)-dimensional Lagrange multipliers solving

$$\frac1{n_1}\sum\limits_{i\in\mathbb{I}}\frac{{\widehat{{\varvec{g}}}}_1\left(\boldsymbol{X}_i\right)}{1+{\widehat{{\varvec{\rho}}}}_1^T{\widehat{{\varvec{g}}}}_1\left(\boldsymbol{X}_i\right)}=0$$
$$\frac1{n_0}\sum\limits_{j\in\mathbb{J}}\frac{{\widehat{{\varvec{g}}}}_0\left(\boldsymbol{X}_j\right)}{1+{\widehat{{\varvec{\rho}}}}_0^T{\widehat{{\varvec{g}}}}_0\left(\boldsymbol{X}_j\right)}=0$$

\({\widehat{{\varvec{\rho}}}}_{1}^{T}\) and \({\widehat{{\varvec{\rho}}}}_{0}^{T}\) must satisfy \(1+{\widehat{{\varvec{\rho}}}}_{1}^{{\varvec{T}}}{\widehat{{\varvec{g}}}}_{1}\left({{\varvec{X}}}_{i}\right)>0\) and \(1+{\widehat{{\varvec{\rho}}}}_{0}^{T}{\widehat{{\varvec{g}}}}_{0}\left({{\varvec{X}}}_{j}\right)>0\) due to the non-negativity of \({w}_{i}\) and \({w}_{j}\), respectively. The estimation of \({\widehat{w}}_{i}\) and \({\widehat{w}}_{j}\) can be solved by the Newton–Raphson algorithm [9].

The ATE estimated by the proposed method is defined as

$${\widehat{\Delta }}_{mr}=\sum_{i\in {\mathbb{I}}}{\widehat{w}}_{i}{Y}_{i}-\sum_{j\in {\mathbb{J}}}{\widehat{w}}_{j}{Y}_{j}$$
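Assembling the pieces, a sketch of the proposed estimator (reusing the `el_weights` helper above; the function name `mr_ate` and the argument layout are our own) collects the fitted values from all \(L+1\) PS models and \(K+1\) OR models, centres them at their full-sample means, and applies the resulting weights within each group:

```r
# Sketch of the proposed MR estimate (assumed implementation).
# pi_hat: n x (L+1) matrix of fitted propensity scores (parametric columns plus NN-PS).
# m1_hat, m0_hat: n x (K+1) matrices of fitted outcomes under Z = 1 and Z = 0.
mr_ate <- function(Y, Z, pi_hat, m1_hat, m0_hat) {
  theta1 <- colMeans(pi_hat);     eta1 <- colMeans(m1_hat)   # theta_1^l and eta_1^k
  theta0 <- colMeans(1 - pi_hat); eta0 <- colMeans(m0_hat)   # theta_0^l and eta_0^k
  # rows of G1 are g_1(X_i) for treated subjects; rows of G0 are g_0(X_j) for controls
  G1 <- cbind(sweep(pi_hat,     2, theta1), sweep(m1_hat, 2, eta1))[Z == 1, , drop = FALSE]
  G0 <- cbind(sweep(1 - pi_hat, 2, theta0), sweep(m0_hat, 2, eta0))[Z == 0, , drop = FALSE]
  w1 <- el_weights(G1)$w
  w0 <- el_weights(G0)$w
  sum(w1 * Y[Z == 1]) - sum(w0 * Y[Z == 0])                  # weighted difference in means
}
```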

Bootstrap confidence interval

A confidence interval for the estimator \(\widehat{\Delta }\) can be obtained by the bootstrap method, where \(\widehat{\Delta }\) may be the IPW, OR, or MR estimator. Specifically, \(n\) individuals are resampled with replacement from the original data \(B\) times to obtain \(B\) bootstrap samples, where \(B\) is a pre-specified number. For \(b=1,\dots ,B\), let \({\widehat{\Delta }}^{b}\) be the estimate from the \(b\)-th bootstrap sample. The bootstrap variance estimator for \(\widehat{\Delta }\) is given by

$$\widehat{var}\left(\widehat{\Delta }\right)= \frac{1}{B-1}\sum_{b=1}^{B}\left({\widehat{\Delta }}^{b}-\frac{1}{B}\sum_{b^{\prime}=1}^{B}{\widehat{\Delta }}^{b^{\prime}}\right)^{2}$$

A normality-based 95% confidence interval for \(\Delta\) is \(\widehat{\Delta }\pm 1.96\sqrt{\widehat{var}\left(\widehat{\Delta }\right)}\).
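A sketch of this procedure in R (the wrapper name `boot_ci` and the argument `est_fun`, a function returning \(\widehat{\Delta }\) for a given data frame, are placeholders):

```r
# Nonparametric bootstrap variance and normality-based 95% CI (assumed wrapper).
boot_ci <- function(data, est_fun, B = 500) {
  est  <- est_fun(data)                             # point estimate on the original data
  reps <- replicate(B, est_fun(data[sample(nrow(data), replace = TRUE), , drop = FALSE]))
  v    <- sum((reps - mean(reps))^2) / (B - 1)      # bootstrap variance estimator
  c(estimate = est, lower = est - 1.96 * sqrt(v), upper = est + 1.96 * sqrt(v))
}
```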

Simulation study

Simulation design

We conducted comprehensive simulation studies to evaluate the performance of the proposed MR method. Ten covariates \({X}_{1}-{X}_{10}\) were simulated from standard normal distributions, with \(\mathrm{corr}\left({X}_{1},{X}_{5}\right)=\mathrm{corr}\left({X}_{4},{X}_{9}\right)=0.9\) and \(\mathrm{corr}\left({X}_{2},{X}_{6}\right)=\mathrm{corr}\left({X}_{3},{X}_{8}\right)=0.2\). The treatment indicator was generated from a Bernoulli distribution with the following PS model

$$logit[P(Z=1|{\varvec{X}})]=0.8*{X}_{1}-0.25*{X}_{2} + 0.6*{X}_{3}-0.4*{X}_{4}-0.8*{X}_{5}-0.5*{X}_{6}+ 0.7*{X}_{7}$$

which produced a treatment prevalence of approximately 50%. The outcome was simulated from

$$Y=-3.85 + 0.3*{X}_{1}-0.36*{X}_{2}-0.73*{X}_{3}-0.2*{X}_{4} + 0.71*{X}_{8} + 0.19*{X}_{9} + 0.26*{X}_{10} + 0.3*{X}_{1}^{2}-0.36*{X}_{2}^{2} + Z + \varepsilon$$

where \(\varepsilon \sim N(0,1)\). The true value of the ATE is 1 in the simulation.
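A sketch of the data-generating step in R (assuming unit variances and independence apart from the four listed correlations; `gen_data` is a placeholder name):

```r
library(MASS)   # for mvrnorm

gen_data <- function(n) {
  # Covariance matrix: unit variances plus the four stated pairwise correlations
  Sigma <- diag(10)
  Sigma[1, 5] <- Sigma[5, 1] <- 0.9; Sigma[4, 9] <- Sigma[9, 4] <- 0.9
  Sigma[2, 6] <- Sigma[6, 2] <- 0.2; Sigma[3, 8] <- Sigma[8, 3] <- 0.2
  X <- mvrnorm(n, mu = rep(0, 10), Sigma = Sigma)
  colnames(X) <- paste0("X", 1:10)
  # Treatment from the logistic PS model, outcome from the linear model with N(0,1) error
  lp <- 0.8*X[,1] - 0.25*X[,2] + 0.6*X[,3] - 0.4*X[,4] - 0.8*X[,5] - 0.5*X[,6] + 0.7*X[,7]
  Z  <- rbinom(n, 1, plogis(lp))
  Y  <- -3.85 + 0.3*X[,1] - 0.36*X[,2] - 0.73*X[,3] - 0.2*X[,4] + 0.71*X[,8] +
        0.19*X[,9] + 0.26*X[,10] + 0.3*X[,1]^2 - 0.36*X[,2]^2 + Z + rnorm(n)
  data.frame(X, Z = Z, Y = Y)
}

dat <- gen_data(500)   # one simulated data set of size n = 500
```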

In the simulation, we specified three models, including an NN-PS model and two parametric models,

$${\mathbb{A}}=\left\{\begin{array}{c}{\pi }^{1}(X)={f}_{1}({X}_{1},{X}_{2},{X}_{3},{{X}_{4},X}_{5},{X}_{6},{X}_{7})\\ {\pi }^{2}(X)={f}_{2}\left({X}_{1},{X}_{2},{X}_{3}{,{X}_{4},X}_{5},{X}_{6},{X}_{7}\right)\\ {\pi }^{3}(X)={f}_{3}\left({X}_{1}^{2},{X}_{2}^{2},{X}_{3}^{2},{X}_{4}^{2},{X}_{5}^{2},{X}_{6}^{2},{X}_{7}^{2}\right)\end{array}\right\}$$

for the propensity score. Let \({\pi }^{1}({\varvec{X}})\) be the PS from the NN-PS model, and \({\pi }^{2}({\varvec{X}})\) and \({\pi }^{3}\left({\varvec{X}}\right)\) be the PSs from logistic regression models. We also specified three models, including an NN-OR model and two parametric models,

$${\mathbb{B}}=\left\{\begin{array}{c}{m}^{1}(X,Z)={h}_{1}({X}_{1},{X}_{2},{X}_{3},{{X}_{4},X}_{8},{X}_{9},{X}_{10},Z)\\ \begin{array}{c}{m}^{2}\left({\varvec{X}},Z\right)={h}_{2}\left({X}_{1},{X}_{2},{X}_{3}{,X}_{8},{X}_{9},{X}_{10},{X}_{1}^{2},{X}_{2}^{2},Z\right)\\ {m}^{3}\left({\varvec{X}},Z\right)={h}_{3}\left({X}_{1}{X}_{2},{X}_{3}{X}_{4},{X}_{8}{X}_{9},{X}_{1}{X}_{8},{X}_{2}{X}_{9},{X}_{3}{X}_{10},Z\right)\end{array}\end{array}\right\}$$

for outcome regression. Let \({m}^{1}({\varvec{X}},Z)\) be the outcome from the NN-OR model, and \({m}^{2}({\varvec{X}},Z)\) and \({m}^{3}\left({\varvec{X}},Z\right)\) be the outcomes from linear regression models. According to the data-generating process, \({\pi }^{2}\left({\varvec{X}}\right)\) and \({m}^{2}\left({\varvec{X}},Z\right)\) were correctly specified. To distinguish the estimation methods, the three IPW estimators are denoted “IPW.model1”, “IPW.model2”, and “IPW.model3”, where “IPW.model1” refers to the IPW estimator with the NN-PS model; the three OR estimators are denoted “OR.model1”, “OR.model2”, and “OR.model3”, where “OR.model1” refers to the OR estimator with the NN-OR model. Each MR estimator is denoted “MR000000”, where the six digits, from left to right, indicate whether \({\pi }^{1}\left({\varvec{X}}\right)\), \({\pi }^{2}\left({\varvec{X}}\right)\), \({\pi }^{3}\left({\varvec{X}}\right)\), \({m}^{1}\left({\varvec{X}},Z\right)\), \({m}^{2}\left({\varvec{X}},Z\right)\), or \({m}^{3}\left({\varvec{X}},Z\right)\) is included in the estimator (“1” means yes, and “0” means no).
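Under these specifications, the parametric models can be fit with glm/lm and combined with the nnet fits from the earlier sketch to form the fitted-value matrices for the full MR111111 estimator (dropping columns gives the other MR estimators); `el_weights` and `mr_ate` are the helper sketches defined above, and `dat` is one simulated data set:

```r
# Parametric PS models pi^2, pi^3 and parametric OR models m^2, m^3 from the lists above
ps2 <- glm(Z ~ X1 + X2 + X3 + X4 + X5 + X6 + X7, family = binomial, data = dat)
ps3 <- glm(Z ~ I(X1^2) + I(X2^2) + I(X3^2) + I(X4^2) + I(X5^2) + I(X6^2) + I(X7^2),
           family = binomial, data = dat)
or2 <- lm(Y ~ X1 + X2 + X3 + X8 + X9 + X10 + I(X1^2) + I(X2^2) + Z, data = dat)
or3 <- lm(Y ~ X1:X2 + X3:X4 + X8:X9 + X1:X8 + X2:X9 + X3:X10 + Z, data = dat)

# Columns: NNET model first, then the two parametric models
pi_hat <- cbind(pi_nn, fitted(ps2), fitted(ps3))
m1_hat <- cbind(m1_nn, predict(or2, transform(dat, Z = 1)), predict(or3, transform(dat, Z = 1)))
m0_hat <- cbind(m0_nn, predict(or2, transform(dat, Z = 0)), predict(or3, transform(dat, Z = 0)))

ate_hat <- mr_ate(dat$Y, dat$Z, pi_hat, m1_hat, m0_hat)   # the MR111111 estimator
```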

We also studied the performance of the MR estimators when all parametric models were misspecified, as follows:

$${\mathbb{A}}=\left\{\begin{array}{c}{\pi }^{1}(X)={f}_{1}({X}_{1},{X}_{2},{X}_{3},{{X}_{4},X}_{5},{X}_{6},{X}_{7})\\ {\pi }^{2}\left({\varvec{X}}\right)={f}_{2}\left({X}_{1},{X}_{2}{,X}_{5}\right)\\ {\pi }^{3}(X)={f}_{3}\left({X}_{1}^{2},{X}_{2}^{2},{X}_{3}^{2},{X}_{4}^{2},{X}_{5}^{2},{X}_{6}^{2},{X}_{7}^{2}\right)\end{array}\right\}$$

for the propensity score, and

$${\mathbb{B}}=\left\{\begin{array}{c}{m}^{1}(X,Z)={h}_{1}({X}_{1},{X}_{2},{X}_{3},{{X}_{4},X}_{8},{X}_{9},{X}_{10},Z)\\ \begin{array}{c}{m}^{2}\left({\varvec{X}},Z\right)={h}_{2}\left({X}_{1},{X}_{2},{X}_{10},Z\right)\\ {m}^{3}\left({\varvec{X}},Z\right)={h}_{3}\left({X}_{1}{X}_{2},{X}_{3}{X}_{4},{X}_{8}{X}_{9},{X}_{1}{X}_{8},{X}_{2}{X}_{9},{X}_{3}{X}_{10},Z\right)\end{array}\end{array}\right\}$$

for outcome regression.

In addition, because the factors related to the treatment and outcome are unknown in practice, we also explored the performance of the MR estimators when the NNET models included all covariates, in situations where the parametric models either did or did not include a correct model.

In the simulation, we calculated the mean relative bias, RMSE, and coverage rate to assess the performance of the proposed MR method. All results were based on 1000 simulation replications, with sample sizes \(n=200\), \(500\), and \(2000\).

Simulation results

Table 1 and Figure S1 show the simulation results for estimating the ATE when the proposed MR method included the correct parametric models, and Table 2 and Figure S2 show the results when no correct parametric model was included in the MR method. Several conclusions can be drawn from the simulation studies.

Table 1 Simulation results with different sample sizes = 200, 500 or 2000 in the situation where the parametric models included the correct models and the neural network model included true covariates
Table 2 Simulation results with different sample sizes = 200, 500 or 2000 in the situation where the parametric models did not include the correct models and the neural network model included true covariates

According to Table 1 and Figure S1, the biases of the IPW, OR, and MR estimators were negligible when the parametric models were correctly specified or when NNET models were included. The RMSEs of estimators with a correct PS or NN-PS model were larger than those with a correct OR or NN-OR model, respectively. The estimators with the correct parametric OR model (OR.model2, MR000010) had the smallest RMSEs among all estimators, noticeably smaller than those with the NN-OR model (OR.model1, MR000100). However, the biases and RMSEs of the estimators with a wrong parametric OR model (OR.model3, MR000001) were much larger.

According to Table 2 and Figure S2, the proposed MR estimators improved robustness to model misspecification even when all parametric models were incorrectly specified (MR111000, MR000111, MR111011, MR011111, MR111111 in Table 2 and Figure S2). Although the biases of these estimators were negligible, MR000111 had the smallest RMSE among them. Further, MR000111 with a correct OR model (MR000111 in Table 1 and Figure S1) had a smaller RMSE than the MR estimators including both parametric PS and OR models, and its RMSE was as small as that of the OR estimator with the correct parametric model (OR.model2 in Table 1 and Figure S1). Even when the two parametric OR models were incorrectly specified, the RMSE of MR000111 was similar to that of the NN-OR model.

In addition, Table 3 and Figure S3 show the simulation results when the NNET models included all covariates and the parametric models included a correct model, and Table 4 and Figure S4 show the results when the NNET models included all covariates and the parametric models did not include a correct model. Similar results were observed in Tables 3 and 4 and Figures S3 and S4.

Table 3 Simulation results with different sample sizes = 200, 500 or 2000 in the situation where the parametric models included the correct models and the neural network model included all covariates
Table 4 Simulation results with different sample sizes = 200, 500 or 2000 in the situation where the parametric models did not include the correct models and the neural network model included all covariates

In conclusion, the simulation results showed that the proposed MR estimators were robust to model misspecification even when all parametric models were incorrectly specified. Further, considering both robustness to model misspecification and RMSE, the MR estimators with only OR models, one of which is an NNET model, were the most recommended. The recommended estimators were robust to model misspecification and tended to have the smallest RMSE when a correct OR model was included, and their performance remained comparable even when all parametric models were misspecified.

Empirical study

The China Health and Retirement Longitudinal Study (CHARLS) is a large-scale, nationally representative longitudinal survey of people aged 45 or older and their spouses in China, including assessments of the social, economic, and health status of community residents [25]. This analysis aimed to estimate the treatment effect of social activity on depression level using real-world data from CHARLS. Depression level was evaluated with the CES-D Depression Scale, whose total score ranges from 0 to 30, with a higher total score denoting a higher level of depression. Self-reported social activity comprises 11 categories based on individual responses to the question “Have you done any of these activities in the last month?”. The variable takes the value 1 if the participant took part in any of the activities and 0 otherwise. Participants with a value of 1 were defined as the social activity group, and those with a value of 0 as the non-social activity group. Baseline information included age, marital status, sex, region, smoking status, self-reported hypertension, self-reported diabetes, self-reported heart disease, and self-reported stroke. The inclusion criteria were: (1) participation in the 2011–2012 survey; (2) complete baseline information. A total of 10,119 participants were included in the analysis. The baseline information is summarized in Table S1. We found that the non-social activity group had a higher level of depression.

In this analysis, we specified three PS models (one NN-PS model and two logistic models) and three OR models (one NN-OR model and two linear regression models) in the MR method. For the NNET models, we included all covariates. For the parametric models, we explored the association of social activity with all covariates via a logistic model and the association of depression with all covariates via a linear model, and we identified candidate models using significance levels of 0.05 and 0.01. Hence, the three sets of covariates in [\({\pi }^{1}\left({\varvec{X}}\right),{\pi }^{2}\left({\varvec{X}}\right),{\pi }^{3}({\varvec{X}})\)] are: (i) age, marital status, sex, region, smoking status, self-reported hypertension, self-reported diabetes, self-reported heart disease, and self-reported stroke; (ii) age, marital status, sex, region, smoking status, and self-reported diabetes; (iii) age, region, and smoking status. The three sets of covariates in [\({m}^{1}({\varvec{X}},Z),{m}^{2}({\varvec{X}},Z),{m}^{3}({\varvec{X}},Z)\)] are: (i) age, marital status, sex, region, smoking status, self-reported hypertension, self-reported diabetes, self-reported heart disease, self-reported stroke, and social activity; (ii) marital status, sex, region, self-reported diabetes, self-reported heart disease, self-reported stroke, and social activity; (iii) marital status, sex, region, self-reported heart disease, self-reported stroke, and social activity. We applied the MR methods to estimate the effect. The results are shown in Table 5 and Figure S5.

Table 5 Estimating effect of social activity on depression level (non-social activity group as reference group)

According to Table 5, all estimates showed that the social activity group had lower depression scores than the non-social activity group. However, when only the NN-PS and NN-OR models were specified (MR100000, MR000100, MR100100), the estimators tended to have larger estimates and larger standard errors. The other estimators had similar estimates, and MR000111 tended to have the smallest standard errors.

Discussion

In this study, we considered estimating the ATE between treatment and control groups in observational studies. The proposed MR estimators combine parametric and nonparametric models, building on the previous MR method [9]. Our simulation study showed that the MR estimators with only outcome regression (OR) models, one of which is a nonparametric model, were the most recommended because of their robustness to model misspecification and the lowest root mean square error (RMSE) when a correct parametric OR model was included. Moreover, the performance of the recommended estimators remained comparable even when all parametric models were misspecified. Although we focused on estimating the ATE in this study, the proposed method can easily be extended to estimate other causal parameters, such as the average treatment effect on the treated, \(E({Y}^{1}-{Y}^{0}|Z=1)\) [26], and the log of the causal risk ratio for a binary outcome, \(log\frac{E({Y}^{1})}{E({Y}^{0})}\) [27].

Our simulation results showed that the IPW, direct confounding adjustment, and MR estimators with only parametric models can have large bias, large RMSE, and low coverage rates when the parametric models are misspecified. By contrast, when an NNET model was added to the MR estimators, the bias was negligible and the coverage rate was close to 95% even when all parametric models were misspecified. In addition, the estimators with only OR models are the most recommended because of their robustness to model misspecification and the smallest RMSE when a correct OR model is included. Further, NNET can capture complex relationships among variables without prior information, so similar results could still be obtained even when variables unrelated to the exposure or outcome were included in the model.

A limitation of this study is that no theoretical proof was provided; future research will focus on the theoretical properties of the proposed method. In addition, we focused on estimating the ATE in a non-survival context, but time-to-event outcomes are common in observational studies, and extending the proposed method to survival outcomes will be a topic of our future research.

Conclusions

In this study, we proposed a new MR estimator, combining nonparametric and parametric models, which is more robust to model misspecification.

Availability of data and materials

Publicly available datasets were analyzed in this study. These data can be downloaded from http://charls.pku.edu.cn/.

Abbreviations

MR:

Multiply robust

ATE:

Average treatment effect

RMSE:

Root mean square error

IPW:

Inverse probability weighting

PS:

Propensity score

OR:

Outcome regression

NNET:

Neural network

CHARLS:

China Health and Retirement Longitudinal Study

References

  1. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41–55.


  2. Hernán MA, Robins JM. Causal inference. Boca Raton: CRC; 2010.

  3. Mansournia MA, Altman DG. Inverse probability weighting. BMJ. 2016;352.

  4. Robins J. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Math Model. 1986;7(9–12):1393–512.


  5. Bang H, Robins JM. Doubly robust estimation in missing data and causal inference models. Biometrics. 2005;61(4):962–73.


  6. Cao W, Tsiatis AA, Davidian M. Improving efficiency and robustness of the doubly robust estimator for a population mean with incomplete data. Biometrika. 2009;96(3):723–34.


  7. Glynn AN, Quinn KM. An introduction to the augmented inverse propensity weighted estimator. Polit Anal. 2010;18(1):36–56.


  8. Han P, Wang L. Estimation with missing data: beyond double robustness. Biometrika. 2013;100(2):417–30.


  9. Han P. A further study of the multiply robust estimator in missing data analysis. JASA. 2014;109(507):1159–73.


  10. Wang L. Multiple robustness estimation in causal inference. Commun Stat Theory Methods. 2019;48(23):5701–18.


  11. Shu D, Han P, Wang R, Toh S. Estimating the marginal hazard ratio by simultaneously using a set of propensity score models: a multiply robust approach. Stat Med. 2021;40(5):1224–42.


  12. Westreich D, Lessler J, Funk MJ. Propensity score estimation: neural networks, support vector machines, decision trees (CART), and meta-classifiers as alternatives to logistic regression. J Clin Epidemiol. 2010;63(8):826–33.


  13. Lee BK, Lessler J, Stuart EA. Improving propensity score weighting using machine learning. Stat Med. 2010;29(3):337–46.


  14. Cannas M, Arpino B. A comparison of machine learning algorithms and covariate balance measures for propensity score matching and weighting. Biom J. 2019;61(4):1049–72.


  15. Setoguchi S, Schneeweiss S, Brookhart MA, Glynn RJ, Cook EF. Evaluating uses of data mining techniques in propensity score estimation: a simulation study. Pharmacoepidemiol Drug Saf. 2008;17(6):546–55.


  16. Gharibzadeh S, Mansournia MA, Rahimiforoushani A, Alizadeh A, Amouzegar A, Mehrabani-Zeinabad K, et al. Comparing different propensity score estimation methods for estimating the marginal causal effect through standardization to propensity scores. Commun Stat Simul Comput. 2018;47(4):964–76.


  17. Chernozhukov V, Escanciano JC, Ichimura H, Newey WK, Robins JM. Locally robust semiparametric estimation. Econometrica. 2022;90(4):1501–35.


  18. Le Borgne F, Chatton A, Léger M, Lenain R, Foucher Y. G-computation and machine learning for estimating the causal effects of binary exposure statuses on binary outcomes. Sci Rep. 2021;11(1):1–12.


  19. Probst P, Boulesteix A-L, Bischl B. Tunability: importance of hyperparameters of machine learning algorithms. J Mach Learn Res. 2019;20(1):1934–65.


  20. Yang L, Shami A. On hyperparameter optimization of machine learning algorithms: theory and practice. Neurocomputing. 2020;415:295–316.


  21. Colangelo K, Lee Y-Y. Double debiased machine learning nonparametric inference with continuous treatments. arXiv preprint arXiv:200403036. 2020.

  22. Kennedy EH, Ma Z, McHugh MD, Small DS. Non-parametric methods for doubly robust estimation of continuous treatment effects. J R Stat Soc Series B Stat Methodol. 2017;79(4):1229–45.


  23. Benkeser D, Carone M, Laan MVD, Gilbert P. Doubly robust nonparametric inference on the average treatment effect. Biometrika. 2017;104(4):863–80.


  24. Hernan M, Robins J. Causal Inference: What if. Boca Raton: Chapman & Hill/CRC; 2020.


  25. Zhao Y, Hu Y, Smith JP, Strauss J, Yang G. Cohort profile: the China health and retirement longitudinal study (CHARLS). Int J Epidemiol. 2014;43(1):61–8.


  26. Hartman E, Grieve R, Ramsahai R, Sekhon JS. From sample average treatment effect to population average treatment effect on the treated: combining experimental with observational studies to estimate population treatment effects. J R Stat Soc A Stat Soc. 2015;178(3):757–78.


  27. Wei K, Qin G, Zhang J, Sui X. Multiply robust estimation of the average treatment effect with missing outcomes. J Stat Comput and Simul. 2023;93(10):1479–95.



Acknowledgements

Not applicable.

Funding

This work was supported by National Natural Science Foundation of China (No. 82273730 and 82173612), Shanghai Rising-Star Program (21QA1401300), Shanghai Municipal Natural Science Foundation (22ZR1414900) and Shanghai Municipal Science and Technology Major Project (ZD2021CY001). The sponsors had no role in study design, data collection, data analysis, data interpretation, or writing of this report.

Author information


Contributions

Ce Wang, Kecheng Wei, Yongfu Yu, and Guoyou Qin proposed the study concept and design; Ce Wang, Kecheng Wei, and Chen Huang drafted the manuscript; Ce Wang and Kecheng Wei performed the statistical analysis; Yongfu Yu and Guoyou Qin supervised the study; and all authors critically revised the manuscript for important intellectual content.

Corresponding authors

Correspondence to Yongfu Yu or Guoyou Qin.

Ethics declarations

Ethics approval and consent to participate

Ethics approval was not applicable for the simulated datasets, which did not involve any human data; the real data are publicly available, so ethics approval was not required.

Consent for publication

Not applicable.

Competing interests

None of the co-authors have a conflict of interest.





Cite this article

Wang, C., Wei, K., Huang, C. et al. An improved multiply robust estimator for the average treatment effect. BMC Med Res Methodol 23, 231 (2023). https://doi.org/10.1186/s12874-023-02056-7
