Skip to main content

High-dimensional mediation analysis for continuous outcome with confounders using overlap weighting method in observational epigenetic study



Mediation analysis is a powerful tool to identify factors mediating the causal pathway of exposure to health outcomes. Mediation analysis has been extended to study a large number of potential mediators in high-dimensional data settings. The presence of confounding in observational studies is inevitable. Hence, it’s an essential part of high-dimensional mediation analysis (HDMA) to adjust for the potential confounders. Although the propensity score (PS) related method such as propensity score regression adjustment (PSR) and inverse probability weighting (IPW) has been proposed to tackle this problem, the characteristics with extreme propensity score distribution of the PS-based method would result in the biased estimation.


In this article, we integrated the overlapping weighting (OW) technique into HDMA workflow and proposed a concise and powerful high-dimensional mediation analysis procedure consisting of OW confounding adjustment, sure independence screening (SIS), de-biased Lasso penalization, and joint-significance testing underlying the mixture null distribution. We compared the proposed method with the existing method consisting of PS-based confounding adjustment, SIS, minimax concave penalty (MCP) variable selection, and classical joint-significance testing.


Simulation studies demonstrate the proposed procedure has the best performance in mediator selection and estimation. The proposed procedure yielded the highest true positive rate, acceptable false discovery proportion level, and lower mean square error. In the empirical study based on the GSE117859 dataset in the Gene Expression Omnibus database using the proposed method, we found that smoking history may lead to the estimated natural killer (NK) cell level reduction through the mediation effect of some methylation markers, mainly including methylation sites cg13917614 in CNP gene and cg16893868 in LILRA2 gene.


The proposed method has higher power, sufficient false discovery rate control, and precise mediation effect estimation. Meanwhile, it is feasible to be implemented with the presence of confounders. Hence, our method is worth considering in HDMA studies.

Peer Review reports


The analysis of the mediating effect was first proposed by Baron and Kenny (1986) [1] and was broadly applied in many scientific fields, such as psychological, sociological, and biomedical studies [2,3,4]. Mediation analysis has become a powerful tool to investigate the underlying mechanism of environmental exposures on health outcomes and identify the factors mediating the effect of exposures on outcomes [5]. Currently, analytical methods including the single mediator model [6, 7], multiple-mediators model [8], and high-dimensional mediation model [9] are proposed and available for researchers in many scientific fields.

With the development of advanced data collection techniques, high-dimensional data has become common in biomedical research. For example, in the epigenetic study, the Illumina Infinium HumanMethylation450 BeadChip array platform allows to measure the DNA methylation levels of roughly 480 K probes [10] and generates high dimensional data. Focusing on practical research, smoking affects lung function, and some DNA methylation sites may mediate the effect of smoking on lung function [11, 12]. To identify the significant mediators (CpG sites) between smoking and lung function, we can conduct mediation analysis in the collected high-dimensional data [9, 13, 14]. Obviously, this method can be used to identify the methylation sites mediating the association between environmental factors other than smoking and other health outcomes including some physical signs and diseases.

However, there are also some issues in high dimensional mediation analysis (HDMA), such as the curse of dimensionality, the false positive rate inflation caused by multiplicity and the confounding existing in observational research. To overcome these issues, scholars have proposed a series of statistical methods. Zhang et al. [9]. proposed the HIMA model consisting of variable screening based on sure independence screening (SIS), variable selection techniques based on minimax concave penalty (MCP) estimation and joint significance test. HIMA extends the multiple mediator framework to the high-dimensional setting by incorporating variable screening and variable selection techniques into multiple mediation analysis. The following high-dimensional mediation analysis methods also employ the generic procedure [13,14,15,16], which reduces dimensionality from high to moderate or low scale and then conducts multiple mediation test. For example, the HIMA2 procedure proposed by Perera et al. [17], which employs the SIS method based on the indirect effect of every single mediator and conducts debiased Lasso to obtain more accurate estimates, then utilizes the multiple-testing procedure proposed by James et al. [18] to control the false discovery rate. Moreover, to adjust the confounders of observational epigenetic studies, researchers tried to integrate propensity score (PS) into the high-dimensional mediation model by weighting or considering it as a covariate [14, 16], except for the classic regression adjustment.

Although many works have been made to tackle these problems, there are still some issues remaining in the dimensionality reduction and adjustment for confounders. For high dimensional mediation analysis, the previous studies don’t take confounders into account, just consider them as covariates [15, 19], such as HDMA, HIMA, and HIMA2 [5, 9, 17]. As is known to all, the multivariable model cannot adequately account for confounding effects in the presence of a large number of confounders [20]. If we only control confounding during the mediation test, but not in the dimension reduction stage, then a biased variable selection result may be obtained [14]. Thus, it is necessary to adjust confounders to improve the performance of variable selection.

To address this issue, researchers have adopted the PS-based method including PS regression adjustment (termed PSR) and classical PS weighting (also called inverse probability weighting, IPW) to adjust confounding during both stages [14]. However, the adjustment for confounders using the IPW based on PS still faces the issue of extreme weights caused by extreme PS distribution [21, 22]. To address the issue of extreme PS distribution, Li et al. [23]. proposed the overlap weighting (OW) method, which emphasizes individuals with the most overlap in their observed characteristics and is beneficial to provide a consistent estimator of the effect of exposure on outcome in the presence of extreme PS tails. OW belongs to the weighting confounding adjustment method based on PS and is gaining more popularity because of excellent statistical properties [24, 25]. However, the above OW method is only applied to traditional epidemic analysis, which needs to be extended to mediation analysis and high-dimensional data setting. Besides, most of the existing methods all hold the independent assumption between potential mediators, which is hard to ensure in high dimensional epigenetic data analysis [5, 9, 13,14,15, 18].

In this article, we incorporated the OW method into HIMA [9] and HIMA2 [17] models, respectively. In order to develop the accuracy of the screening of potential mediators, we modified the framework of variable screening in the original HIMA2 procedure. Eventually, we proposed the OW-based modified HIMA2 (mHIMA2) procedure for HDMA. We evaluated the performance of the proposed procedure and the existing models through simulation studies. All the above evaluations are based on the simulation study and real data application.

The rest of the article is structured as follows. In the next section, we introduced the notions, assumptions, models, and the procedure of adjustment for confounders in the high-dimensional mediation analysis model. Then, we conducted the Monte Carlo simulation study to evaluate the performance of various methods of confounding adjustment and two different mediation test approaches. Additionally, we applied the proposed method to the dataset GSE117859 in the Gene Expression Omnibus (GEO) databases and identified some DNA methylation markers that mediate the effect of smoking on the estimated natural killer (NK) cell level. Finally, we concluded the advantages and limitations of this study.


Model definitions

Our high-dimensional mediation model is shown in Fig. 1. Let \(X\) be the exposure variable, where \(X=1\) represents the exposed group and \(X=0\) represents the controlled group. Denote the outcome as Y, here we mainly focus on continuous outcome. Let \(M={\left({M}_{1},{M}_{2},\cdots ,{M}_{P}\right)}^{T}\) be the set of the \(p\)-dimensional potential mediators, where \(p\gg n\), \(n\) is the sample size. Let \(C={\left({C}_{1},{C}_{2},\cdots ,{C}_{q}\right)}^{T}\) be the \(q\)-dimension baseline confounders which influence the relation of exposure-mediator, mediator-outcome, and exposure-outcome. For individual \(i\), \(i=\text{1,2},\cdots ,n\), we have the high-dimensional mediation models as follows:

$$\begin{array}{c}{M}_{ki}={a}_{k}+{\alpha }_{k}X+{{\phi }_{k}^{T}C}_{i}+{e}_{ki},k\in \left[p\right],\end{array}$$
$$\begin{array}{c}{Y}_{i}=a+{\gamma X}_{i}+{{\beta }^{T}M}_{i}+{\eta }^{T}{C}_{i}+{\epsilon }_{i}.\end{array}$$
Fig. 1
figure 1

Causal diagram. High-dimensional mediation model with confounders between exposure, mediator and outcome

where \(\alpha ={\left({\alpha }_{1},{\alpha }_{2},\cdots ,{\alpha }_{p}\right)}^{T}\) is the coefficient vector relating the exposure to the mediators, \(\beta ={\left({\beta }_{1},{\beta }_{2},\cdots ,{\beta }_{p}\right)}^{T}\) represents the effect of the mediators on the outcome, \({\alpha }_{k}{\beta }_{k}\) corresponds to the mediation effect of \({M}_{k}\) according to the definition of coefficients product method, and \(\left[p\right]\) denotes the set of \(\left\{\text{1,2},\cdots ,p\right\}\). One can consider whether \({M}_{k}\) is the statistically significant mediator or not by testing the null hypothesis \({H}_{0}:{\alpha }_{k}{\beta }_{k}=0\). \({\phi }_{k}\) and \(\eta\) are the effect of \(C\) on \(M\) and \(C\) on \(Y\), respectively. \({a}_{k}\) and \(a\) are the intercept term in the Eqs. 1 and 2, respectively. The same as above, \({e}_{k}\) and \(\epsilon\) are each the corresponding error term. We will compare the different variable selection strategies and methods of adjusting confounders.


To ensure the identification of path-specific mediating effects, some assumptions need to be held as below. These assumptions were proposed referring to necessary condition required for high-dimensional mediation analysis suggested in published studies [8, 15, 17, 19, 26, 27]:

A1: There is no causal association between mediators. This means the proposed model contains only parallel mediators.

A2: Sequential ignorability. That consists of four assumptions listed below:

(A2.1) There are no unmeasured confounders between the exposure and the outcome;

(A2.2) There are no unmeasured confounders between the mediators and the outcome;

(A2.3) There are no unmeasured confounders between the exposure and the mediators;

(A2.4) There is no exposure-induced confounding between the mediators and the outcome.

A3: Stable unit treatment value assumption (SUTVA) [28, 29] for both the mediators and the outcome. That is to say, there is no interference between individuals.

A4: Consistency for the mediators and the outcome. That is to say, there are no measurement errors in the mediators.

A5: Positivity assumption [30]. Every individual has some positive probability of being exposed to the factor of interest.

Proposed Procedure

We improved the HIMA procedure proposed by Zhang et al. (2016) [9] and the HIMA2 procedure proposed by Perera et al. (2022) [17] under the condition of adjusting confounders in observational data.

In this study, we developed two processes to conduct the confounding-controlled high-dimensional mediation analysis. The detailed procedure is described in the following text.

Step 1: PS-based methods for adjusting confounders

Since there are always some baseline confounders in observational data, we integrate propensity score (PS) into mediators (and/or outcome) models to reduce the selection bias and acquire as accurate estimates of the mediation effect as possible. Due to the PS approaches allowing the inclusion of a large scale of confounders, PS is widely used in observational research.

PS is defined as the conditional probability that a study individual with baseline covariates \(C=\left({C}_{1},{C}_{2},\cdots ,{C}_{l}\right)\) would be exposed to certain study factors of interest [31]:

$$PS=P\left(X=1|{C}_{1},\cdots ,{C}_{l}\right).$$

PS can be estimated by classic multivariable statistical methods such as logistic regression [32] or by machine learning methods such as random forest (RF) and generalized boosted model (GBM) [33, 34]. In practice, logistic regression is the most commonly used. The PS of \(i\)th individual \({\pi }_{i}=P\left({X}_{i}=1|{C}_{1i},\cdots ,{C}_{li}\right)\) can be expressed as follows:

$$\begin{array}{c}logit\left({\pi }_{i}=P\left({X}_{i}=1\right)\right)={\theta }_{0}+{\theta }_{1}{C}_{1i}+\cdots +{\theta }_{l}{C}_{li},\end{array}$$

where \(\theta ={\left({\theta }_{1},{\theta }_{2},\cdots ,{\theta }_{l}\right)}^{T}\) represents the effect of the confounders on the exposure. Then we can adopt some PS-based techniques to adjust confounders such as matching [35], stratification [36], regression [31], and weighting [37]. Here, we focus on PS regression (PSR) and PS weighting [14] (PSW, also called IPW short for inverse probability weighting) techniques to adjust potential confounders between exposure, mediators and outcome.

PSR approach incorporates PS as a covariate into the original regression model to adjust for the probability of being exposed to study factors and to reduce confounding [32]. That is similar to taking all confounders as covariates in a classical regression approach which usually uses the linear regression model for continuous outcomes and the logistic regression model for binary outcomes [38]. For the PSR approach, we can estimate the effect through the models below:

$$\begin{aligned} \begin{array}{ll}{M}_{ki}={a}_{k}+{\alpha }_{k}X+{{\phi }_{k}^{{\prime }}\pi }_{i}+{e}_{ki},k\in \left[p\right],\end{array}\\ \begin{array}{c}{Y}_{i}=a+{\gamma X}_{i}+{{\beta }^{T}M}_{i}+{PS}_{i}+{\epsilon }_{i}.\end{array}\end{aligned}$$

The PSW approach constructs the inverse probability weights by taking the reciprocal of PS. For binary exposure, the weight of the exposed group \(X=1\) is given as \(\frac{1}{PS}\), and that of the controlled group \(X=0\) as \(\frac{1}{1-PS}\). For \(i\)th individual:

$${ipw}_{i}=\frac{1}{P\left({X}_{i}=1|C\right)}=\frac{{X}_{i}}{{\pi }_{i}}+\frac{{(1-X}_{i})}{\left(1-{\pi }_{i}\right)} .$$

Then, we can estimate the coefficients of X in pathways \(X\to M\) and \(M\to Y\) by weighted estimation:

$$\begin{aligned} \begin{array}{ll}{M}_{ki}={a}_{k}+{\alpha }_{k,ipw}X+{{\phi }_{k}^{T}C}_{i}+{e}_{ki},k\in \left[p\right],\end{array}\\ \begin{array}{c}{Y}_{i}=a+{{\gamma }_{k,ipw}X}_{i}+{{\beta }^{T}M}_{i}+{\eta }^{T}{C}_{i}+{\epsilon }_{i},\end{array}\end{aligned}$$

where \({\alpha }_{k,ipw}\) and \({\gamma }_{ipw}\) are the weighted estimation according to the \(ipw\) weight vector. However, the IPW often faces extreme PSs issue which may lead to extreme weights and result in biased estimates and excessive variance [23, 24].

The overlap weighting (OW) approach was proposed to address the issue of extreme PSs [23]. The overlap weight is given as \(1-PS\) for the group \(X=1\) and \(PS\) for the group \(X=0\). Note that, individuals with \(PS\) of 0.5 make the largest contribution to the effect estimate, and individuals with \(PS\) close to 0 and 1 make the smallest contribution. OW is likely to be beneficial in the presence of extreme tail weights [23, 39]. For individual \(i\):

$${ow}_{i}=\left\{\begin{array}{c}1-{\pi }_{i}, { X}_{i}=1 \\ {\pi }_{i}, {X}_{i}=0 \end{array}\right. .$$

Then, the effect estimation of OW is similar to that of the PSW procedure:

$$\begin{aligned} \begin{array}{c}{M}_{ki}={a}_{k}+{\alpha }_{k,ow}X+{{\phi }_{k}^{T}C}_{i}+{e}_{ki},k\in \left[p\right],\end{array}\\ \begin{array}{c}{Y}_{i}=a+{{\gamma }_{k,ow}X}_{i}+{{\beta }^{T}M}_{i}+{\eta }^{T}{C}_{i}+{\epsilon }_{i},\end{array}\end{aligned}$$

In the same way, \({\alpha }_{k,ow}\) and \({\gamma }_{ow}\) are the weighted estimation using \(ow\) weight vector.

Step 2: Confounding-controlled SIS approach for dimensionality reduction

The SIS procedure is a general technique to reduce accurately high dimensions to below sample size [40]. We adopt the SIS method to reduce dimension \(p\) from ultra-high dimension to moderate scale \(d=\left[\frac{2n}{\text{log}\left(n\right)}\right]\) [9, 15].

In this study, we considered two preliminary screening strategies as described in HIMA [9] and HIMA2 [17], based on the effects of \(M\) on \(Y\) (\({\beta }_{k}\)) and the indirect effect \(\left|{\alpha }_{k}{\beta }_{k}\right|\) respectively. Because the indirect effects can be both positive and negative effects, to address the influence of the signs of the estimated indirect effects, the HIMA2 approach uses the absolute values of the indirect effect to obtain the size of the effect estimate regardless of the direction. This approach ensures that mediators with large effect size can be selected.

Due to the lack of screening accuracy in SIS based on indirect effects in the presence of confounders, we conducted the SIS screening based on the effects on the path \(M\to Y\) controlling confounding effects using the OW approach.

In simulation, we found that it is hard to select the true mediators based on \(\left|{\alpha }_{k}{\beta }_{k}\right|\) in the presence of confounding factors as applied in the original HIMA2 approach. So, we modified the frame of the HIMA2 method and both adopt SIS based on the effects on the path \(M\to Y\)\({\beta }_{k}\) in the preliminary screening to select the subset of potential mediators \({M}_{SIS}=\left\{{M}_{k}:{M}_{k} \text{i}\text{s} \text{a}\text{m}\text{o}\text{n}\text{g} \text{t}\text{h}\text{e} \text{t}\text{o}\text{p} d \text{l}\text{a}\text{r}\text{g}\text{e}\text{s}\text{t} \text{e}\text{f}\text{f}\text{e}\text{c}\text{t} of {\beta }_{k}\right\}\).

Noticing that we need to adopt a two-step weighting method [14] to estimate \({\beta }_{k}\) for the PSW and OW methods.

First, \({\gamma }_{k,w}\) can be obtained from the following sub-model:

$$\begin{array}{c}{Y}_{i}=a+{{\widehat{\gamma }}_{k,w}X}_{i}+{{\beta }_{k}M}_{ki}+{\epsilon }_{ki}\end{array}$$

where \({\widehat{\gamma }}_{k,w}\) is the estimator of \({\gamma }_{k,ow}\) or \({\gamma }_{k,ipw}\) for each \({M}_{k}\). In addition, the residual \({\widehat{e}}_{k}\) can be derived:

$${\widehat{e}}_{k}=Y-{\widehat{\gamma }}_{k,w}X.$$

Then \({\beta }_{k}\) can be estimated by regressing \({\widehat{e}}_{k}\) on \({M}_{k}\) without weighting. Through the above SIS procedure, we can identify the important mediators and achieve the goal of dimensionality reduction.

Step 3: Penalized estimation

According to the HIMA procedure, after the preliminary selection of candidate mediators, further variable selection can be accomplished by the penalized estimation method. Here, we adopt the MCP [41] rather than other penalty functions, since the MCP approach has the oracle property which can select the correct model with probability tending to 1 as \(n\to \infty\) [15, 41, 42].

For the \(d\)-dimensional subset \({M}_{SIS}\), we employed the MCP-penalized estimation to further select significant mediators set \({M}_{MCP}=\left\{{M}_{k}:{\beta }_{k}\ne 0,{M}_{k}\in {M}_{SIS}\right\}\), MCP penalty function can be defined as below:

$$\begin{array}{c}{P}_{\lambda ,\delta }\left({\beta }_{k}\right)=\lambda \left[\left|{\beta }_{k}\right|-\frac{{\left|{\beta }_{k}\right|}^{2}}{2\lambda \delta }\right]I\left\{0\le \left|{\beta }_{k}\right|<\delta \right\}+\frac{{\lambda }^{2}\delta }{2}I\left\{\left|\beta \right|\ge \delta \lambda \right\}\end{array}$$

where \(\lambda >0\) is the regularization parameter which can be selected by AIC or BIC, and \(\delta >0\) is the tuning parameter which determines the concavity of MCP. The MCP procedure can be implemented through the R package ncvreg [43]. Through MCP penalty estimation, we filtered out the mediators with too weak effects by combining SIS and MCP procedures and then acquired the small number of mediators that needed to be tested. That will help to obtain more accurate effect estimates.

Following the original HIMA2 procedure, the penalized estimation adopts the de-biased Lasso method to get the estimator \({\widehat{\beta }}_{k}\) and standard error \({\widehat{\sigma }}_{{\beta }_{k}}\). The sub-model of the de-biased Lasso method can be described below:

$$\begin{array}{c}Y=a+\gamma X+{{\beta }_{SIS}^{T}M}_{SIS}+{\eta }^{T}C+\epsilon \end{array}$$

where \({\beta }_{SIS}\) denote the effects of \({M}_{k}\in {M}_{SIS}\) on \(Y\). The corresponding P-values \({P}_{{\beta }_{k}}\) are given as:

$$\begin{array}{c}{P}_{{\beta }_{k}}=2\left\{1-{\Phi }\left(\left|{\widehat{\beta }}_{k}\right|/{\widehat{\sigma }}_{{\beta }_{k}}\right)\right\},\end{array}$$

where \({\Phi }\left(.\right)\) is the cumulative distribution function of standard normal distribution \(N\left(\text{0,1}\right)\). The de-biased Lasso method can be implemented with the R package hdi.

Step 4: PS-based multiple mediation test

After MCP-based penalized estimation, we use the Joint significance test [3, 44] (termed JS-uniform) to test the mediation effect of \({M}_{k}\in {M}_{MCP}\). The Joint significance test considers the \({M}_{k}\) as a true mediator when \({\alpha }_{k}\) and \({\beta }_{k}\) is significant simultaneously. Here, \({\alpha }_{k}\) can be estimated through different confounding adjustment methods as shown in Eqs. 1, 3, 4, and 5. \({\beta }_{k}\) can be obtained using the linear regression with considering all confounders as covariates or only including PS (summary of all confounders) as a covariate.

In other words, that is based on the P-values for testing the path-specific effects \({H}_{0}:{\alpha }_{k}=0\) or \({H}_{0}:{\beta }_{k}=0\). The raw P-value for the joint significance test [3] is defined below:

\(\begin{array}{c}{P}_{raw,k}=\text{max}\left({P}_{raw,{\alpha }_{k}}, {P}_{raw,{\beta }_{k}}\right),\#\end{array}\)where \({P}_{raw,{\alpha }_{k}}\) and \({P}_{raw,{\beta }_{k}}\) are the P-values for testing \({H}_{0}:{\alpha }_{k}=0\) and \({H}_{0}:{\beta }_{k}=0\). \({P}_{raw,{\alpha }_{k}}\) and \({P}_{raw,{\beta }_{k}}\) can be obtained from the mediator model (e.g. Equations 1, 3, 4, and 5) and outcome model (Eq. 2), respectively.

For the multiplicity (Type I error inflation) issue in multiple mediation testing, we adopted the Benjamini–Hochberg (BH) method [45, 46] to acquire the adjusted \(p\)-values as below,

$$\begin{array}{c}{P}_{BH,k}=\text{min}\left({P}_{raw,k}\cdot \frac{q}{{r}_{k}},1\right),\end{array}$$

where \(q\) is the number of potential mediators in the set \({M}_{MCP}\), and \({r}_{k}\) is the location number of \({P}_{raw,k}\) when all the P-values \({P}_{raw,k}\) are sorted ascending.

However, the Joint significance test assumes \({P}_{raw,k}\) follows a uniform null distribution. Although \({P}_{{\alpha }_{k}}\) and \({P}_{{\beta }_{k}}\) are each uniformly distributed, their maximum may not. Therefore, the Joint significance test results in a valid but overly conservative test with lower power [13, 17, 47].

Hence, we adopt the PS-based joint significance with mixture null distribution method [18] (termed JS-mixture) approach to conduct multiple mediation test after de-biased Lasso penalized estimation [17, 48] referring to the classical HIMA2 procedure. The PS-based JS-mixture approach adopts a 3-component mixture distribution as below:

$$\begin{array}{c}H_{00,k}:\alpha_k=0\;\mathrm{or}\;\beta_k\mathit=0,\;\\ H_{01,k}:\alpha_k=0\;\mathrm{or}\;\beta_k\;\neq0,\\ H_{10,k}:\alpha_k\neq0\;\mathrm{or}\;\beta_k=0.\end{array}$$

The estimated pointwise FDR for testing mediation can be computed as:


where \(t\in \left[\text{0,1}\right]\), \({V}_{00}\left(t\right),{V}_{01}\left(t\right),{V}_{10}\left(t\right)\) denoting the numbers of the three types of false positives and \(R\left(t\right)={V}_{00}\left(t\right)+{V}_{01}\left(t\right)+{V}_{10}\left(t\right)+{V}_{11}\left(t\right)\). The \({V}_{00}\left(t\right),{V}_{01}\left(t\right),{V}_{10}\left(t\right)\) and \(\widehat{FDR}\left(t\right)\) can be obtained using the R package HDMT.

We set the significance level of 0.05 for all the tests. The detailed processes of the proposed method are summarized in Fig. 2.

Fig. 2
figure 2

The overall workflow for high-dimensional mediation analysis under the adjusting for confounders condition

Simulation studies

Simulation design

In this section, we conducted the simulation studies to evaluate the performance of the proposed method. The implementation of the simulation was based on R (version 4.3.0, R Foundation for Statistical Computing, Vienna, Austria) and RStudio (version 2023.9.0.463, RStudio: Integrated Development Environment for R, Boston, MA). The setting of simulation parameters was based on the published studies [9, 14, 16]. The number of replications in simulation study was set to be 500 for each combination of parameter setting referring to the replication times settings in published methodogical studies [9, 14,15,16,17, 19, 49].

The model structure is shown in Fig. 1. We consider 8 confounders \(C=\left({C}_{1},{C}_{2},\cdots ,{C}_{8}\right)\) affecting the relationship of \(X\), \(M\), \(Y\), in which continuous confounders \({C}_{1}-{C}_{4}\) follow a multivariate normal distribution \(N\left(\mu ,{\Sigma }\right)\) with a mean vector \(\mu ={\left(\text{0,0},\text{0,0}\right)}^{T}\) and a covariance matrix \({\Sigma }\):

$${\Sigma } = \left[\begin{array}{cccc}1& 0.3 &0.3& 0.3\\ 0.3& 1& 0.3& 0.3\\ 0.3& 0.3 & 1& 0.3\\ 0.3& 0.3 & 0.3& 1\\ \end{array}\right].$$

The last four binary confounders \({C}_{5}-{C}_{8}\) are independently generated from the Binary distribution \(B\left(n,0.3\right)\), where \(n\) is the sample size.

Then exposure \(X\) can be generated from Binary distribution \(B\left(n,{P}_{c}\right)\), where \(n\) is the sample size, \({P}_{c}=1/\left(1+{e}^{-\left({\theta }^{T}C\right)}\right)\), and \({\theta }^{T}=\left({\theta }_{1},{\theta }_{2},\cdots ,{\theta }_{8}\right)=\left(\text{0.2,0.2,0.3,0.3,0.2,0.2,0.3,0.3}\right)\).

Mediators \(M\) and the outcome variable \(Y\) are generated according to Eqs. 1 and 2, respectively. For simplicity, we set all the effects of \(C\) on \(M\) to be the same. Let \({\phi }_{k}={\left({\phi }_{k1},\cdots ,{\phi }_{k8}\right)}^{T}={\left(\text{0.2,0.2,0.3,0.3,0.2,0.2,0.3,0.3}\right)}^{T}\) represent the effect of C on M. Let \(\eta ={\left({\eta }_{1},{\eta }_{2},\cdots ,{\eta }_{8}\right)}^{T}={\left(\text{0.2,0.2,0.3,0.3,0.2,0.2,0.3,0.3}\right)}^{T}\) denote the effects of \(C\) on \(Y\).

We set the first four potential mediators \({M}_{1}-{M}_{4}\) as the true significant mediators in this study. Let \(\alpha ={\left({\alpha }_{1},{\alpha }_{2},\cdots ,{\alpha }_{p}\right)}^{T}=\left(\text{0.4,0.4,0.5,0.5,0.5,0.5,0},0,\cdots ,0\right)\); \(\beta ={\left({\beta }_{1},{\beta }_{2},\cdots ,{\beta }_{p}\right)}^{T}=\left(\text{0.4,0.5,0.5,0.6,0},\text{0,0.5,0.5,0},\cdots ,0\right)\). The elements of both \(\alpha\) and \(\beta\) are equal to zero except for the first eight elements, and the first four are the significant mediators. The mediation effect size of the true mediators \({M}_{1}-{M}_{4}\) is \({\alpha \beta }_{1-4}=\left(\text{0.16,0.2,0.25,0.3}\right)\).

Let \(\gamma =0.5\); \(a=0.5\); \(a_k\sim U(0,1)\), \(\epsilon\sim N(0,1)\). The error term \({e}_{k}\) are generated from \(N\left(\text{0,1.2}\right)\) and the correlation between mediators mostly falls between 0.15 and 0.35.

To evaluate the impacts of sample size and potential mediators dimension, we set two sample size levels \(n=300, 500\), and two dimension levels \(p\)=1000,10000.

In addition, we take the correlation between mediators into account in the condition of \(p\)=1000 dimension. We simulate the strong correlation between mediators by generating the error terms \({e}_{k}\) from \(N\left(0,{{\Sigma }}_{e}\right)\), where \({{\Sigma }}_{e}={\left({\rho }^{\left|k-{k}^{{\prime }}\right|}\right)}_{k,{k}^{{\prime }}}\). It means the correlation between two mediators will decrease as the absolute difference in mediators’ subscript \(\left|k-{k}^{{\prime }}\right|\) increases. We set four correlation levels \(\rho =0, 0.25, \text{0.5,0.75}\) with dimension \(p\)=1000 and sample size \(n=300, 500\). In the simulation setting \(\rho =0, 0.25, \text{0.5,0.75}\), the corresponding Pearson correlation coefficients between two adjacent mediators are around 0.4, 0.5, 0.7, and 0.8, respectively. We evaluated the performance of the mHIMA2 and PS-based HIMA by conducting 500 replications of simulated data sets for each scenario [9, 14,15,16,17, 19, 49].

Simulation results

Simulation results are presented in Tables 1 and 2. Evaluation of the performance of mediator selection of the proposed approach is shown in Table 1 by measuring the true positive rate (TPR) and false discovery proportion (FDP) of selection after the significance test for mediation effects. The mediators have higher TPR as the indirect effect increases (i.e., larger mediation effect, higher detection rate).

Table 1 TPR and FDP for the four true mediators (M1–M4)
Table 2 Estimation results of mediation effects, expressed as Mean (MSE)

As presented in Table 1. Under most settings, the mHIMA2 mediation test approach has a higher TPR than PS-based HIMA while a higher FDP at the same time. Overall, the mHIMA2 is more powerful than the PS-based HIMA and is less conservative in selecting significant mediators.

As shown in Table 1, for the mHIMA2 mediation test approach, TPR is ranked as OW > IPW > PSR > RA, and FDP is not more than 0.1 and gradually decreases to close to 0.05 as the sample size increases. Among all models, the mHIMA2 mediation test approach with OW adjustment has the highest power and acceptable false positive level. When using the PS-based HIMA mediation test approach, TPR is ranked consistently as RA > PSR > OW > IPW, and all four models also keep FDP at an extremely low level.

Table 2 presents the estimation of mediation effects with the mean and mean square error (MSE). The estimators approach the true values as the mediation effect increases. All models tend to be more accurate as \(n\) gets larger and \(p\) gets smaller. Overall, the mHIMA2 mediation test approach has a smaller MSE than the PS-based HIMA approach in most cases. RA adjustment has a higher MSE than other adjustment methods especially when facing the large mediation effect, OW adjustment has the lower MSE among the four adjustment methods.

As shown in Table 2, similarly, the mHIMA2 approach with OW adjustment has the smallest MSE among all models. Moreover, similar results can be seen in the different strong correlation settings in Table S1-S8 in the supplementary file. The mHIMA2 methods have lower MSE (i.e. more precise estimation) and apparently higher TPR. That means the de-biased Lasso technique in mHIMA2 methods performs better when handling the moderate correlation between mediators. However, the FDP of all models slightly increases as the correlation between mediators increases. When correlation among the mediators is strong (for example, \(r\)>0.7), all models suffer in terms of increased MSE.

Data application

Smoking is an important environmental factor affecting the immune system and blood cell composition [50, 51]. Previous studies have demonstrated smokers had lower natural killer (NK) cell counts and activity [50, 51]. Smoking has also been found to be associated with DNA methylation levels [52]. Meanwhile, DNA methylation levels have also been found to be associated with associated with human NK cell activation [53, 54]. Therefore, DNA methylation may mediate the association between smoking and NK cell level. So we implemented the proposed high-dimensional mediation analysis methods to identify the specific functional CpG sites that may mediate the relationship between smoking and the estimated NK cell level.

Here we apply our method to the GSE117859 dataset obtained from the Gene Expression Omnibus (GEO) database. The aim of the study in which GSE117859 was originally measured is to explore the smoking-associated DNA methylation features linked to AIDS outcomes in the HIV-positive population [55]. The blood samples from the Veteran Aging Cohort Study (VACS) were collected in that study. The HumanMethylation450 BeadChip platform was used to measure the DNA methylation levels.

In total 608 samplesand 485,577 probes were included in the dataset. Clinical information such as age, sex, race, smoking history, adherence of antiretroviral therapy (ART), estimated CD4 T cells, estimated CD8 T cells, and estimated NK cells were collected. The estimated CD4/CD8/NK were obtained using a methylation-based cell type deconvolution algorithm proposed by Housman et al. [56]. To some extent, the estimated CD4 and CD8 levels can represent AIDS severity.

Smoking status was collected based on self-report. All included patients were classified into the smoker and the non-smoker groups according to their reported smoking history. After removing the individuals without available clinical information and DNAm sites with missing values, a total of 587 samples and 485,503 probes were included in the analysis.

We adjusted the potential confounders including age, race, adherence of antiretroviral therapy, estimated CD4 T cells, and estimated CD8 T cells. Demographic and clinical variables included in our analysis are presented in Table 3.

Table 3 Baseline characteristics of the HIV-positive patients included in the analysis

The analysis results using the proposed mHIMA2 method are presented in Table 4. Here, we mainly presented the CpGs mediators with a total effect proportion greater than 5%. Due to the limitation of text content, we didn’t present the whole summary results of the PS-based HIMA method, but that can be seen in Table S9 in the supplementary file.

Table 4 Summary of the selected CpGs mediators with a %TE > 5 by the mHIMA2 models

As shown in Table 4, we identified two methylation sites cg13917614 in CNP gene and cg16893868 in LILRA2 gene by most of mHIMA2 based methods. The similar result can be seen in Table S9 in the supplementary file. The existing studies have already demonstrated the site cg13917614 is associated with smoking [52, 57]. Although we don’t find direct evidence that the CNP gene is associated with immune function based on the existing literature, relevant studies showed a link between CNP and inflammatory responses in which the mechanism remains further study [58, 59].

The encoded protein of the LILRA2 gene can suppress innate immune response [60, 61]. The results reveal that smoking will promote the demethylation of cg16893868, leading to an increase in gene LILRA2 expression and ultimately reducing the estimated NK cell level. It has been found that the remaining CpG sites cg20460771, cg03164561, cg03605454, cg09529165, and cg01500140 are all associated with smoking [11, 52, 62,63,64]. Further insights into the discovered CpG mediators in genome-wide epigenetic studies will be meaningful.


The causal relationship obtained in high-dimensional mediation analysis usually depends on no-confounding assumption. However, confounding is almost inevitable in observational studies owing to the lack of randomization of the baseline covariates in practice. Previous studies show the utilization of PS method such as PS-adjustment and IPW in high-dimensional mediation analysis, but those face the issue of extreme PS distribution.

In this article, we integrated OW approach into the high-dimensional mediation model, which can address extreme PS distribution and better adjust for confounding. Finally, we developed a high-dimensional mediation analysis workflow consisting of OW confounding adjustment, SIS, de-biased Lasso penalization for potential mediator screening, and the high-dimensional mediation test underlying the mixture null distribution of P-values.

Simulation results indicate that the mHIMA2 with OW approach presented in this study performs best among all the compared models with the highest TPR, acceptable FDP level, and the smallest MSE in mediating effect estimation. In addition, the mHIMA2 embedded de-biased Lasso method performs better when moderate correlations between mediators exist.

Simulation study also suggestedthe proposed method would perform better when the sample size was increased. This result suggests that when the proposed method is used for the analysis of mediating effects on real data, a sufficient sample size should also be ensured. Such a feature is also consistent with other existing methods [5, 9, 14, 17, 19, 49]. Furthermore, the dimensionality of potential mediators has little effect on the performance of the proposed method.

In most of the previous studies [5, 9, 13, 17], it didn’t take confounding adjustment into account in the SIS process. However, we adopted the PS-based method to adjust confounding, thus improving the accuracy of mediators screening. Moreover, it has been assumed that mediators are linearly independent of each other, but such an assumption is often not strictly valid in real data. The violation of the mediators’ independence assumption often affects the accuracy of mediators selection and precision of mediating effect estimation. The proposed method can effectively deal with this issue which can tolerate the correlation between the mediators and ensure the robustness of mediators selection, multiple mediation testing, and mediating effect estimation.

Similar to other two-step approaches, the error of the first model may be introduced and cumulated in the second step, because the first-step can not quarantee 100% correctness. To avoid this, we set a relatively loose screening criterion with \(d=2n/log(n)\) to select the top \(d\) largest effect mediators [15,16,17, 49] in the first step to control false negative while avoiding the increase of false positive error according to the application recommendation of SIS approach. Though the errors cannot be totally avoid, this can reduce the error in the preliminary screening of mediators and prevent serious error cumulation in the second step to some extend. As shown in the simulation, the proposed two-step model performed well. Besides, previous published studies also have demonstrated the error cumulation issue in two-step models can be controlled well in the similar way as we did, and well not cause serious bias in the final results [14, 65,66,67,68,69,70].

Meanwhile, we applied the proposed method to the dataset GSE117859 obtained from the GEO databases and identified several significant DNAm mediators, including the sites cg13917614, cg16893868, cg20460771, cg03164561, cg03605454, cg09529165, and cg01500140. Among them, site cg16893868 in LILRA2 gene has been demonstrated to be associated with smoking and immune function [60, 61]. That indicates that the proposed method can identify reliable mediators in empirical data analysis.

The presence of confounding in observation studies always is a major challenge to obtaining causal relationships. Currently, most genetic studies are based on observational research without randomization of baseline characteristics. Particularly, the high-dimensional mediation analysis always faces some issues, such as the accuracy of the high-dimensional mediation selection and the low power of multiple mediation test [13, 14, 17, 18]. Although the utilization of PSR and IPW offers a solution of confounding adjustment in classical HDMA workflow, it still faces the issue of extreme PS distribution.

The proposed OW-based method can provide a more precise and stable mediating effect estimation. However, the misspecification of the outcome model and PS model can not be avoid in practice. Hence, the doubly robust methods may be desirable to be applied in HDMA workflow in future study. Even if the JS-mixture method was proposed to improve the power of multiple mediation testing, other more powerful test methods still are appealing in large-scale genome-wide epigenetic studies [13, 18]. Conducting further simulation and methodology studies to compare different powerful test methods may provide useful reference for future studies. It should also be noticed that the existence of unmeasured confounding is out of the scope of this paper. Previews published studies have provided serval applicable methods to deal with this issue [49, 71].


Overall, the mHIMA2 with OW adjustment has sufficient power in selecting potential true mediators and obtaining precise estimation for mediation effects. It can be recommended in practical high-dimensional mediation analysis, especially in epigenetic study.

Availability of data and materials

The dataset GSE117859 obtained from GEO database in our real data analysis can be accessed (at acc=GSE117859) without limitation. Our procedure is implemented using the R software. The corresponding R code can be found at



False discovery proportion


Gene Expression Omnibus


High-dimensional mediation analysis


Inverse probability weighting


Joint significant test with uniform distribution


Joint significance test with mixture null distribution


Mean square error


Modified HIMA2 model

NK cell:

Natural killer cell


Overlapping weighting


Propensity score regression adjustment


Propensity score


Sure independence screening


True positive rate


  1. Baron RM, Kenny DA. The moderator-mediator variable distinction in social psychological research: conceptual, strategic, and statistical considerations. J Personal Soc Psychol. 1986;51(6):1173–82.

    Article  CAS  Google Scholar 

  2. Huan T, Joehanes R, Schurmann C, Schramm K, Pilling LC, Peters MJ, et al. A whole-blood transcriptome meta-analysis identifies gene expression signatures of cigarette smoking. Hum Mol Genet. 2016;25(21):4611–23. .

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  3. MacKinnon DP, Lockwood CM, Hoffman JM, West SG, Sheets V. A comparison of methods to test mediation and other intervening variable effects. Psychol Methods. 2002;7(1):83–104.

    Article  PubMed  PubMed Central  Google Scholar 

  4. Biesanz JC, Falk CF, Savalei V. Assessing Mediational models: testing and interval estimation for Indirect effects. Multivar Behav Res. 2010;45(4):661–701.

    Article  Google Scholar 

  5. Gao Y, Yang H, Fang R, Zhang Y, Goode EL, Cui Y. Testing mediation effects in high-dimensional epigenetic studies. Front Genet. 2019;10:1195.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Taylor AB, MacKinnon DP. Four applications of permutation methods to testing a single-mediator model. Behav Res Methods. 2012;44(3):806–44.

    Article  PubMed  PubMed Central  Google Scholar 

  7. VanderWeele TJ. Marginal structural models for the estimation of direct and indirect effects. Epidemiol (Cambridge Mass). 2009;20(1):18–26. .

    Article  Google Scholar 

  8. VanderWeele TJ, Vansteelandt S. Mediation analysis with multiple mediators. Epidemiol Methods. 2014;2(1):95–115. .

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  9. Zhang H, Zheng Y, Zhang Z, Gao T, Joyce B, Yoon G, et al. Estimating and testing high-dimensional mediation effects in epigenetic studies. Bioinf (Oxford England). 2016;32(20):3150–4. .

    Article  CAS  Google Scholar 

  10. Bibikova M, Barnes B, Tsan C, Ho V, Klotzle B, Le JM, et al. High density DNA methylation array with single CpG site resolution. Genomics. 2011;98(4):288–95.

    Article  CAS  PubMed  Google Scholar 

  11. Harlid S, Xu Z, Panduri V, Sandler DP, Taylor JA. CpG sites associated with cigarette smoking: analysis of epigenome-wide data from the Sister Study. Environ Health Perspect. 2014;122(7):673–8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Toyooka S, Maruyama R, Toyooka KO, McLerran D, Feng Z, Fukuyama Y, et al. Smoke exposure, histologic type and geography-related differences in the methylation profiles of non-small cell lung cancer. Int J Cancer. 2003;103(2):153–60. .

    Article  CAS  PubMed  Google Scholar 

  13. Liu Z, Shen J, Barfield R, Schwartz J, Baccarelli AA, Lin X. Large-scale hypothesis testing for Causal Mediation effects with Applications in Genome-wide epigenetic studies. J Am Stat Assoc. 2022;117(537):67–81.

    Article  CAS  PubMed  Google Scholar 

  14. Luo L, Yan Y, Cui Y, Yuan X, Yu Z. Linear high-dimensional mediation models adjusting for confounders using propensity score method. Front Genet. 2022;13:961148.

    Article  PubMed  PubMed Central  Google Scholar 

  15. Luo C, Fa B, Yan Y, Wang Y, Zhou Y, Zhang Y, et al. High-dimensional mediation analysis in survival models. PLoS Comput Biol. 2020;16(4):e1007768.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  16. Yu Z, Cui Y, Wei T, Ma Y, Luo C. High-dimensional mediation analysis with confounders in Survival models. Front Genet. 2021;12:688871.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  17. Perera C, Zhang H, Zheng Y, Hou L, Qu A, Zheng C, et al. HIMA2: high-dimensional mediation analysis and its application in epigenome-wide DNA methylation data. BMC Bioinformatics. 2022;23(1):296.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  18. Dai JY, Stanford JL, LeBlanc M. A multiple-testing procedure for high-dimensional mediation hypotheses. J Am Stat Assoc. 2022;117(537):198–213.

    Article  CAS  PubMed  Google Scholar 

  19. Zhang H, Zheng Y, Hou L, Zheng C, Liu L. Mediation analysis for survival data with high-dimensional mediators. Bioinf (Oxford England). 2021;37(21):3815–21. .

    Article  CAS  Google Scholar 

  20. Heinze G, Jüni P. An overview of the objectives of and the approaches to propensity score analyses. Eur Heart J. 2011;32(14):1704–8.

    Article  PubMed  Google Scholar 

  21. Stuart EA. Matching methods for causal inference: a review and a look forward. Stat Science: Rev J Inst Math Stat. 2010;25(1):1–21. .

    Article  PubMed  PubMed Central  Google Scholar 

  22. Lee BK, Lessler J, Stuart EA. Weight trimming and propensity score weighting. PLoS One. 2011;6(3):e18174.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. Li F, Thomas LE, Li F. Addressing Extreme Propensity scores via the Overlap weights. Am J Epidemiol. 2019;188(1):250–7. .

    Article  PubMed  Google Scholar 

  24. Thomas LE, Li F, Pencina MJ. Overlap weighting: a propensity score method that mimics attributes of a Randomized Clinical Trial. JAMA. 2020;323(23):2417.

    Article  PubMed  Google Scholar 

  25. Mlcoch T, Hrnciarova T, Tuzil J, Zadak J, Marian M, Dolezal T. Propensity Score Weighting Using Overlap Weights: A New Method Applied to Regorafenib Clinical Data and a Cost-Effectiveness Analysis. Value in Health. 2019;22(12):1370-7. doi: 10.1016/j.jval.2019.06.010.

    Article  PubMed  Google Scholar 

  26. Vanderweele TJ, Vansteelandt S, Robins JM. Effect decomposition in the presence of an exposure-induced mediator-outcome confounder. Epidemiol (Cambridge Mass). 2014;25(2):300–6. .

    Article  Google Scholar 

  27. Perera C, Zhang H, Zheng Y, Hou L, Qu A, Zheng C, et al. HIMA2: high-dimensional mediation analysis and its application in epigenome-wide DNA methylation data. BMC Bioinformatics. 2022;23(1).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  28. Basu D. Randomization analysis of Experimental Data: the Fisher randomization test. J Am Stat Assoc. 1980;75(371):575–82.

    Article  Google Scholar 

  29. Rubin DB. Comment. J American Statis Assoc. 1986;81(396):961–2.

    Article  Google Scholar 

  30. Imbens GW, Rubin DB. Causal inference for statistics, Social, and Biomedical sciences: an introduction. Cambridge: Cambridge University Press; 2015.

    Book  Google Scholar 

  31. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41–55.

    Article  Google Scholar 

  32. Haukoos JS, Lewis RJ. The Propensity score. JAMA. 2015;314(15):1637–8.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  33. Lee BK, Lessler J, Stuart EA. Improving propensity score weighting using machine learning. Stat Med. 2010;29(3):337–46.

    Article  PubMed  PubMed Central  Google Scholar 

  34. Abdia Y, Kulasekera KB, Datta S, Boakye M, Kong M. Propensity scores based methods for estimating average treatment effect and average treatment effect among treated: a comparative study. Biometrical J Biometrische Z. 2017;59(5):967–85. .

    Article  PubMed  Google Scholar 

  35. Rosenbaum PR, Rubin DB. Constructing a Control Group Using Multivariate Matched Sampling Methods That Incorporate the Propensity Score. Am Stat. 1985;39(1):33–8.

    Article  Google Scholar 

  36. Rosenbaum PR, Rubin DB. Reducing Bias in Observational studies using subclassification on the Propensity score. J Am Stat Assoc. 1984;79(387):516–24.

    Article  Google Scholar 

  37. Robins JM, Rotnitzky A, Zhao LP. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J Am Stat Assoc. 1995;90(429):106–21.

    Article  Google Scholar 

  38. Austin PC. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivar Behav Res. 2011;46(3):399–424.

    Article  Google Scholar 

  39. Li F, Morgan KL, Zaslavsky AM. Balancing covariates via Propensity score weighting. J Am Stat Assoc. 2018;113(521):390–400.

    Article  CAS  Google Scholar 

  40. Fan J, Lv J. Sure Independence Screening for Ultrahigh Dimensional Feature Space. J Royal Stat Soc Ser B: Stat Methodol. 2008;70(5):849–911. .

    Article  Google Scholar 

  41. Zhang C-H. Nearly unbiased variable selection under minimax concave penalty. Annals Stat. 2010;38(2):894–942. .

    Article  Google Scholar 

  42. Maity AK, Basu S. Highest posterior model computation and variable selection via simulated annealing. The New England J Statis Data Sci. 2023;1(2):200–7.

    Article  Google Scholar 

  43. Breheny P, Huang J. Coordinate Descent algorithms for Nonconvex Penalized Regression, with applications to Biological feature selection. Annals Appl Stat. 2011;5(1):232–53. .

    Article  Google Scholar 

  44. Huang Y-T, Pan W-C. Hypothesis test of mediation effect in causal mediation model with high-dimensional continuous mediators. Biometrics. 2016;72(2):402–13.

    Article  PubMed  Google Scholar 

  45. Benjamini Y, Hochberg Y. Controlling the false Discovery rate: a practical and powerful Approach to multiple testing. J Roy Stat Soc: Ser B (Methodol). 1995;57(1):289–300.

    Article  Google Scholar 

  46. Hochberg Y. A sharper Bonferroni procedure for multiple tests of significance. Biometrika. 1988;75(4):800–2.

    Article  Google Scholar 

  47. Huang Y-T. Joint significance tests for mediation effects of socioeconomic adversity on adiposity via epigenetics. Annals Appl Stat. 2018;12(3):1535–57.

    Article  Google Scholar 

  48. Fang EX, Ning Y, Liu H. Testing and confidence intervals for high Dimensional Proportional hazards Model. J Royal Stat Soc Ser B Stat Methodol. 2017;79(5):1415–37. .

    Article  Google Scholar 

  49. Chen F, Hu W, Cai J, Chen S, Si A, Zhang Y, et al. Instrumental variable-based high-dimensional mediation analysis with unmeasured confounders for survival data in the observational epigenetic study. Front Genet. 2023;14:1092489.

    Article  PubMed  PubMed Central  Google Scholar 

  50. Qiu F, Liang C-L, Liu H, Zeng Y-Q, Hou S, Huang S, et al. Impacts of cigarette smoking on immune responsiveness: up and down or upside down? Oncotarget. 2017;8(1):268–84.

    Article  PubMed  Google Scholar 

  51. Elisia I, Lam V, Cho B, Hay M, Li MY, Yeung M, et al. The effect of smoking on chronic inflammation, immune function and blood cell composition. Sci Rep. 2020;10:19480.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  52. Breitling LP, Yang R, Korn B, Burwinkel B, Brenner H. Tobacco-smoking-related differential DNA methylation: 27K discovery and replication. Am J Hum Genet. 2011;88(4):450–7. .

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  53. Wiencke JK, Butler R, Hsuang G, Eliot M, Kim S, Sepulveda MA, et al. The DNA methylation profile of activated human natural killer cells. Epigenetics. 2016;11(5):363–80.

    Article  PubMed  PubMed Central  Google Scholar 

  54. Gao X, Jia M, Zhang Y, Breitling LP, Brenner H. DNA methylation changes of whole blood cells in response to active smoking exposure in adults: a systematic review of DNA methylation studies. Clin Epigenetics. 2015;7:113.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  55. Zhang X, Hu Y, Aouizerat BE, Peng G, Marconi VC, Corley MJ, et al. Machine learning selected smoking-associated DNA methylation signatures that predict HIV prognosis and mortality. Clin Epigenetics. 2018;10(1):155.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  56. Houseman EA, Accomando WP, Koestler DC, Christensen BC, Marsit CJ, Nelson HH, et al. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics. 2012;13(1):86.

    Article  PubMed  PubMed Central  Google Scholar 

  57. Wan ES, Qiu W, Baccarelli A, Carey VJ, Bacherman H, Rennard SI, et al. Cigarette smoking behaviors and time since quitting are associated with differential DNA methylation across the human genome. Hum Mol Genet. 2012;21(13):3073–82.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  58. Bao Q, Zhang B, Zhou L, Yang Q, Mu X, Liu X, et al. CNP Ameliorates Macrophage Inflammatory Response and Atherosclerosis. Circ Res. 0(0).

  59. Bae C-R, Hino J, Hosoda H, Arai Y, Son C, Makino H, et al. Overexpression of C-type natriuretic peptide in endothelial cells protects against Insulin Resistance and inflammation during Diet-induced obesity. Sci Rep. 2017;7(1):9807.

    Article  PubMed  PubMed Central  Google Scholar 

  60. Lu HK, Mitchell A, Endoh Y, Hampartzoumian T, Huynh O, Borges L, et al. LILRA2 selectively modulates LPS-mediated cytokine production and inhibits phagocytosis by monocytes. PLoS ONE. 2012;7(3):e33478.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  61. Lewis Marffy AL, McCarthy AJ. Leukocyte Immunoglobulin-Like receptors (LILRs) on human neutrophils: modulators of infection and immunity. Front Immunol. 2020;11:857.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  62. Sikdar S, Joehanes R, Joubert BR, Xu C-J, Vives-Usano M, Rezwan FI, et al. Comparison of smoking-related DNA methylation between newborns from prenatal exposure and adults from personal smoking. Epigenomics. 2019;11(13):1487–500.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  63. Joehanes R, Just AC, Marioni RE, Pilling LC, Reynolds LM, Mandaviya PR, et al. Epigenetic signatures of cigarette smoking. Circulation Cardiovasc Genet. 2016;9(5):436–47. .

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  64. Sun YV, Smith AK, Conneely KN, Chang Q, Li W, Lazarus A, et al. Epigenomic association analysis identifies smoking-related DNA methylation sites in African americans. Hum Genet. 2013;132(9):1027–37.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  65. Luo C, Wang G, Hu F. Two-Step Gene Feature Selection Algorithm Based on Permutation Test. In: Yao J, Yang Y, Słowiński R, Greco S, Li H, Mitra S, et al., editors. Springer; 2012. p. 249−58.

  66. Liu D, Yeung EH, McLain AC, Xie Y, Buck Louis GM, Sundaram R. A two-step Approach for Analysis of Nonignorable Missing outcomes in Longitudinal Regression: an application to Upstate KIDS Study. Paediatr Perinat Epidemiol. 2017;31(5):468–78.

    Article  PubMed  PubMed Central  Google Scholar 

  67. Newcombe PJ, Connolly S, Seaman S, Richardson S, Sharp SJ. A two-step method for variable selection in the analysis of a case-cohort study. Int J Epidemiol. 2018;47(2):597–604.

    Article  CAS  PubMed  Google Scholar 

  68. Song J, Shin SJ. A two-step approach for variable selection in linear regression with measurement error. Commun Stat Appl Methods. 2019;26(1):47–55. .

    Article  Google Scholar 

  69. Liu Y, Qin SJ. A Novel two-step sparse Learning Approach for Variable Selection and Optimal Predictive modeling. IFAC-PapersOnLine. 2022;55(7):57–64.

    Article  Google Scholar 

  70. Chamlal H, Benzmane A, Ouaderhman T. A Two-Step Feature Selection Procedure to Handle High-Dimensional Data in Regression Problems. 2023 International Conference on Decision Aid Sciences and Applications (DASA). 2023. p. 592–6.

    Google Scholar 

  71. Wickramarachchi DS, Lim LHM, Sun B. Mediation analysis with multiple mediators under unmeasured mediator-outcome confounding. Stat Med. 2023;42(4):422–32.

    Article  PubMed  Google Scholar 

Download references


We acknowledge GEO database for providing their platforms and contributors for uploading their meaningful datasets.


This work was supported by the National Social Science Found of China (21CTJ009) and National Nature Science Foundation of China (81703325).

Author information

Authors and Affiliations



F.C., H.Y. and W.H. led the conception and design of the work. W.H. and S.C. conducted the simulation study. W.H., S.C., and J.C. finished data cleaning and implemented real data application. W.H. completed the original draft. F.C., J.C., and Y.Y. guided analyses and provided advice. F.C. critically reviewed and edited the manuscript.

Corresponding author

Correspondence to Fangyao Chen.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information


Additional file 1: High-dimensional mediation analysis for continuous outcome with confounders using overlap weighting method in observational epigenetic study. Simulation results of different correlation levels  \(\rho\)=0,0.25,0.5,0.75 with dimension \(\rho\)=1000 and sample size \(n\)=300,500 were presented in the Table S1-8. Analysis result using the PS-based HIMA methods was shown in Table S9.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Hu, W., Chen, S., Cai, J. et al. High-dimensional mediation analysis for continuous outcome with confounders using overlap weighting method in observational epigenetic study. BMC Med Res Methodol 24, 125 (2024).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: