 Research
 Open access
 Published:
Highdimensional mediation analysis for continuous outcome with confounders using overlap weighting method in observational epigenetic study
BMC Medical Research Methodology volumeÂ 24, ArticleÂ number:Â 125 (2024)
Abstract
Background
Mediation analysis is a powerful tool to identify factors mediating the causal pathway of exposure to health outcomes. Mediation analysis has been extended to study a large number of potential mediators in highdimensional data settings. The presence of confounding in observational studies is inevitable. Hence, itâ€™s an essential part of highdimensional mediation analysis (HDMA) to adjust for the potential confounders. Although the propensity score (PS) related method such as propensity score regression adjustment (PSR) and inverse probability weighting (IPW) has been proposed to tackle this problem, the characteristics with extreme propensity score distribution of the PSbased method would result in the biased estimation.
Methods
In this article, we integrated the overlapping weighting (OW) technique into HDMA workflow and proposed a concise and powerful highdimensional mediation analysis procedure consisting of OW confounding adjustment, sure independence screening (SIS), debiased Lasso penalization, and jointsignificance testing underlying the mixture null distribution. We compared the proposed method with the existing method consisting of PSbased confounding adjustment, SIS, minimax concave penalty (MCP) variable selection, and classical jointsignificance testing.
Results
Simulation studies demonstrate the proposed procedure has the best performance in mediator selection and estimation. The proposed procedure yielded the highest true positive rate, acceptable false discovery proportion level, and lower mean square error. In the empirical study based on the GSE117859 dataset in the Gene Expression Omnibus database using the proposed method, we found that smoking history may lead to the estimated natural killer (NK) cell level reduction through the mediation effect of some methylation markers, mainly including methylation sites cg13917614 in CNP gene and cg16893868 in LILRA2 gene.
Conclusions
The proposed method has higher power, sufficient false discovery rate control, and precise mediation effect estimation. Meanwhile, it is feasible to be implemented with the presence of confounders. Hence, our method is worth considering in HDMA studies.
Background
The analysis of the mediating effect was first proposed by Baron and Kenny (1986) [1] and was broadly applied in many scientific fields, such as psychological, sociological, and biomedical studies [2,3,4]. Mediation analysis has become a powerful tool to investigate the underlying mechanism of environmental exposures on health outcomes and identify the factors mediating the effect of exposures on outcomes [5]. Currently, analytical methods including the single mediator model [6, 7], multiplemediators model [8], and highdimensional mediation model [9] are proposed and available for researchers in many scientific fields.
With the development of advanced data collection techniques, highdimensional data has become common in biomedical research. For example, in the epigenetic study, the Illumina Infinium HumanMethylation450 BeadChip array platform allows to measure the DNA methylation levels of roughly 480 K probes [10] and generates high dimensional data. Focusing on practical research, smoking affects lung function, and some DNA methylation sites may mediate the effect of smoking on lung function [11, 12]. To identify the significant mediators (CpG sites) between smoking and lung function, we can conduct mediation analysis in the collected highdimensional data [9, 13, 14]. Obviously, this method can be used to identify the methylation sites mediating the association between environmental factors other than smoking and other health outcomes including some physical signs and diseases.
However, there are also some issues in high dimensional mediation analysis (HDMA), such as the curse of dimensionality, the false positive rate inflation caused by multiplicity and the confounding existing in observational research. To overcome these issues, scholars have proposed a series of statistical methods. Zhang et al. [9]. proposed the HIMA model consisting of variable screening based on sure independence screening (SIS), variable selection techniques based on minimax concave penalty (MCP) estimation and joint significance test. HIMA extends the multiple mediator framework to the highdimensional setting by incorporating variable screening and variable selection techniques into multiple mediation analysis. The following highdimensional mediation analysis methods also employ the generic procedure [13,14,15,16], which reduces dimensionality from high to moderate or low scale and then conducts multiple mediation test. For example, the HIMA2 procedure proposed by Perera et al. [17], which employs the SIS method based on the indirect effect of every single mediator and conducts debiased Lasso to obtain more accurate estimates, then utilizes the multipletesting procedure proposed by James et al. [18] to control the false discovery rate. Moreover, to adjust the confounders of observational epigenetic studies, researchers tried to integrate propensity score (PS) into the highdimensional mediation model by weighting or considering it as a covariate [14, 16], except for the classic regression adjustment.
Although many works have been made to tackle these problems, there are still some issues remaining in the dimensionality reduction and adjustment for confounders. For high dimensional mediation analysis, the previous studies donâ€™t take confounders into account, just consider them as covariates [15, 19], such as HDMA, HIMA, and HIMA2 [5, 9, 17]. As is known to all, the multivariable model cannot adequately account for confounding effects in the presence of a large number of confounders [20]. If we only control confounding during the mediation test, but not in the dimension reduction stage, then a biased variable selection result may be obtained [14]. Thus, it is necessary to adjust confounders to improve the performance of variable selection.
To address this issue, researchers have adopted the PSbased method including PS regression adjustment (termed PSR) and classical PS weighting (also called inverse probability weighting, IPW) to adjust confounding during both stages [14]. However, the adjustment for confounders using the IPW based on PS still faces the issue of extreme weights caused by extreme PS distribution [21, 22]. To address the issue of extreme PS distribution, Li et al. [23]. proposed the overlap weighting (OW) method, which emphasizes individuals with the most overlap in their observed characteristics and is beneficial to provide a consistent estimator of the effect of exposure on outcome in the presence of extreme PS tails. OW belongs to the weighting confounding adjustment method based on PS and is gaining more popularity because of excellent statistical properties [24, 25]. However, the above OW method is only applied to traditional epidemic analysis, which needs to be extended to mediation analysis and highdimensional data setting. Besides, most of the existing methods all hold the independent assumption between potential mediators, which is hard to ensure in high dimensional epigenetic data analysis [5, 9, 13,14,15, 18].
In this article, we incorporated the OW method into HIMA [9] and HIMA2 [17] models, respectively. In order to develop the accuracy of the screening of potential mediators, we modified the framework of variable screening in the original HIMA2 procedure. Eventually, we proposed the OWbased modified HIMA2 (mHIMA2) procedure for HDMA. We evaluated the performance of the proposed procedure and the existing models through simulation studies. All the above evaluations are based on the simulation study and real data application.
The rest of the article is structured as follows. In the next section, we introduced the notions, assumptions, models, and the procedure of adjustment for confounders in the highdimensional mediation analysis model. Then, we conducted the Monte Carlo simulation study to evaluate the performance of various methods of confounding adjustment and two different mediation test approaches. Additionally, we applied the proposed method to the dataset GSE117859 in the Gene Expression Omnibus (GEO) databases and identified some DNA methylation markers that mediate the effect of smoking on the estimated natural killer (NK) cell level. Finally, we concluded the advantages and limitations of this study.
Methods
Model definitions
Our highdimensional mediation model is shown in Fig. 1. Let \(X\) be the exposure variable, where \(X=1\) represents the exposed group and \(X=0\) represents the controlled group. Denote the outcome as Y, here we mainly focus on continuous outcome. Let \(M={\left({M}_{1},{M}_{2},\cdots ,{M}_{P}\right)}^{T}\) be the set of the \(p\)dimensional potential mediators, where \(p\gg n\), \(n\) is the sample size. Let \(C={\left({C}_{1},{C}_{2},\cdots ,{C}_{q}\right)}^{T}\) be the \(q\)dimension baseline confounders which influence the relation of exposuremediator, mediatoroutcome, and exposureoutcome. For individual \(i\), \(i=\text{1,2},\cdots ,n\), we have the highdimensional mediation models as follows:
where \(\alpha ={\left({\alpha }_{1},{\alpha }_{2},\cdots ,{\alpha }_{p}\right)}^{T}\) is the coefficient vector relating the exposure to the mediators, \(\beta ={\left({\beta }_{1},{\beta }_{2},\cdots ,{\beta }_{p}\right)}^{T}\) represents the effect of the mediators on the outcome, \({\alpha }_{k}{\beta }_{k}\) corresponds to the mediation effect of \({M}_{k}\) according to the definition of coefficients product method, and \(\left[p\right]\) denotes the set of \(\left\{\text{1,2},\cdots ,p\right\}\). One can consider whether \({M}_{k}\) is the statistically significant mediator or not by testing the null hypothesis \({H}_{0}:{\alpha }_{k}{\beta }_{k}=0\). \({\phi }_{k}\) and \(\eta\) are the effect of \(C\) on \(M\) and \(C\) on \(Y\), respectively. \({a}_{k}\) and \(a\) are the intercept term in the Eqs. 1 and 2, respectively. The same as above, \({e}_{k}\) and \(\epsilon\) are each the corresponding error term. We will compare the different variable selection strategies and methods of adjusting confounders.
Assumptions
To ensure the identification of pathspecific mediating effects, some assumptions need to be held as below. These assumptions were proposed referring to necessary condition required for highdimensional mediation analysis suggested in published studies [8, 15, 17, 19, 26, 27]:
A1: There is no causal association between mediators. This means the proposed model contains only parallel mediators.
A2: Sequential ignorability. That consists of four assumptions listed below:
(A2.1) There are no unmeasured confounders between the exposure and the outcome;
(A2.2) There are no unmeasured confounders between the mediators and the outcome;
(A2.3) There are no unmeasured confounders between the exposure and the mediators;
(A2.4) There is no exposureinduced confounding between the mediators and the outcome.
A3: Stable unit treatment value assumption (SUTVA) [28, 29] for both the mediators and the outcome. That is to say, there is no interference between individuals.
A4: Consistency for the mediators and the outcome. That is to say, there are no measurement errors in the mediators.
A5: Positivity assumption [30]. Every individual has some positive probability of being exposed to the factor of interest.
Proposed Procedure
We improved the HIMA procedure proposed by Zhang et al. (2016) [9] and the HIMA2 procedure proposed by Perera et al. (2022) [17] under the condition of adjusting confounders in observational data.
In this study, we developed two processes to conduct the confoundingcontrolled highdimensional mediation analysis. The detailed procedure is described in the following text.
Step 1: PSbased methods for adjusting confounders
Since there are always some baseline confounders in observational data, we integrate propensity score (PS) into mediators (and/or outcome) models to reduce the selection bias and acquire as accurate estimates of the mediation effect as possible. Due to the PS approaches allowing the inclusion of a large scale of confounders, PS is widely used in observational research.
PS is defined as the conditional probability that a study individual with baseline covariates \(C=\left({C}_{1},{C}_{2},\cdots ,{C}_{l}\right)\) would be exposed to certain study factors of interest [31]:
PS can be estimated by classic multivariable statistical methods such as logistic regression [32] or by machine learning methods such as random forest (RF) and generalized boosted model (GBM) [33, 34]. In practice, logistic regression is the most commonly used. The PS of \(i\)th individual \({\pi }_{i}=P\left({X}_{i}=1{C}_{1i},\cdots ,{C}_{li}\right)\) can be expressed as follows:
where \(\theta ={\left({\theta }_{1},{\theta }_{2},\cdots ,{\theta }_{l}\right)}^{T}\) represents the effect of the confounders on the exposure. Then we can adopt some PSbased techniques to adjust confounders such as matching [35], stratification [36], regression [31], and weighting [37]. Here, we focus on PS regression (PSR) and PS weighting [14] (PSW, also called IPW short for inverse probability weighting) techniques to adjust potential confounders between exposure, mediators and outcome.
PSR approach incorporates PS as a covariate into the original regression model to adjust for the probability of being exposed to study factors and to reduce confounding [32]. That is similar to taking all confounders as covariates in a classical regression approach which usually uses the linear regression model for continuous outcomes and the logistic regression model for binary outcomes [38]. For the PSR approach, we can estimate the effect through the models below:
The PSW approach constructs the inverse probability weights by taking the reciprocal of PS. For binary exposure, the weight of the exposed group \(X=1\) is given as \(\frac{1}{PS}\), and that of the controlled group \(X=0\) as \(\frac{1}{1PS}\). For \(i\)th individual:
Then, we can estimate the coefficients of X in pathways \(X\to M\) and \(M\to Y\) by weighted estimation:
where \({\alpha }_{k,ipw}\) and \({\gamma }_{ipw}\) are the weighted estimation according to the \(ipw\) weight vector. However, the IPW often faces extreme PSs issue which may lead to extreme weights and result in biased estimates and excessive variance [23, 24].
The overlap weighting (OW) approach was proposed to address the issue of extreme PSs [23]. The overlap weight is given as \(1PS\) for the group \(X=1\) and \(PS\) for the group \(X=0\). Note that, individuals with \(PS\) of 0.5 make the largest contribution to the effect estimate, and individuals with \(PS\) close to 0 and 1 make the smallest contribution. OW is likely to be beneficial in the presence of extreme tail weights [23, 39]. For individual \(i\):
Then, the effect estimation of OW is similar to that of the PSW procedure:
In the same way, \({\alpha }_{k,ow}\) and \({\gamma }_{ow}\) are the weighted estimation using \(ow\) weight vector.
Step 2: Confoundingcontrolled SIS approach for dimensionality reduction
The SIS procedure is a general technique to reduce accurately high dimensions to below sample size [40]. We adopt the SIS method to reduce dimension \(p\) from ultrahigh dimension to moderate scale \(d=\left[\frac{2n}{\text{log}\left(n\right)}\right]\) [9, 15].
In this study, we considered two preliminary screening strategies as described in HIMA [9] and HIMA2 [17], based on the effects of \(M\) on \(Y\) (\({\beta }_{k}\)) and the indirect effect \(\left{\alpha }_{k}{\beta }_{k}\right\) respectively. Because the indirect effects can be both positive and negative effects, to address the influence of the signs of the estimated indirect effects, the HIMA2 approach uses the absolute values of the indirect effect to obtain the size of the effect estimate regardless of the direction. This approach ensures that mediators with large effect size can be selected.
Due to the lack of screening accuracy in SIS based on indirect effects in the presence of confounders, we conducted the SIS screening based on the effects on the path \(M\to Y\) controlling confounding effects using the OW approach.
In simulation, we found that it is hard to select the true mediators based on \(\left{\alpha }_{k}{\beta }_{k}\right\) in the presence of confounding factors as applied in the original HIMA2 approach. So, we modified the frame of the HIMA2 method and both adopt SIS based on the effects on the path \(M\to Y\)\({\beta }_{k}\) in the preliminary screening to select the subset of potential mediators \({M}_{SIS}=\left\{{M}_{k}:{M}_{k} \text{i}\text{s} \text{a}\text{m}\text{o}\text{n}\text{g} \text{t}\text{h}\text{e} \text{t}\text{o}\text{p} d \text{l}\text{a}\text{r}\text{g}\text{e}\text{s}\text{t} \text{e}\text{f}\text{f}\text{e}\text{c}\text{t} of {\beta }_{k}\right\}\).
Noticing that we need to adopt a twostep weighting method [14] to estimate \({\beta }_{k}\) for the PSW and OW methods.
First, \({\gamma }_{k,w}\) can be obtained from the following submodel:
where \({\widehat{\gamma }}_{k,w}\) is the estimator of \({\gamma }_{k,ow}\) or \({\gamma }_{k,ipw}\) for each \({M}_{k}\). In addition, the residual \({\widehat{e}}_{k}\) can be derived:
Then \({\beta }_{k}\) can be estimated by regressing \({\widehat{e}}_{k}\) on \({M}_{k}\) without weighting. Through the above SIS procedure, we can identify the important mediators and achieve the goal of dimensionality reduction.
Step 3: Penalized estimation
According to the HIMA procedure, after the preliminary selection of candidate mediators, further variable selection can be accomplished by the penalized estimation method. Here, we adopt the MCP [41] rather than other penalty functions, since the MCP approach has the oracle property which can select the correct model with probability tending to 1 as \(n\to \infty\) [15, 41, 42].
For the \(d\)dimensional subset \({M}_{SIS}\), we employed the MCPpenalized estimation to further select significant mediators set \({M}_{MCP}=\left\{{M}_{k}:{\beta }_{k}\ne 0,{M}_{k}\in {M}_{SIS}\right\}\), MCP penalty function can be defined as below:
where \(\lambda >0\) is the regularization parameter which can be selected by AIC or BIC, and \(\delta >0\) is the tuning parameter which determines the concavity of MCP. The MCP procedure can be implemented through the R package ncvreg [43]. Through MCP penalty estimation, we filtered out the mediators with too weak effects by combining SIS and MCP procedures and then acquired the small number of mediators that needed to be tested. That will help to obtain more accurate effect estimates.
Following the original HIMA2 procedure, the penalized estimation adopts the debiased Lasso method to get the estimator \({\widehat{\beta }}_{k}\) and standard error \({\widehat{\sigma }}_{{\beta }_{k}}\). The submodel of the debiased Lasso method can be described below:
where \({\beta }_{SIS}\) denote the effects of \({M}_{k}\in {M}_{SIS}\) on \(Y\). The corresponding Pvalues \({P}_{{\beta }_{k}}\) are given as:
where \({\Phi }\left(.\right)\) is the cumulative distribution function of standard normal distribution \(N\left(\text{0,1}\right)\). The debiased Lasso method can be implemented with the R package hdi.
Step 4: PSbased multiple mediation test
After MCPbased penalized estimation, we use the Joint significance test [3, 44] (termed JSuniform) to test the mediation effect of \({M}_{k}\in {M}_{MCP}\). The Joint significance test considers the \({M}_{k}\) as a true mediator when \({\alpha }_{k}\) and \({\beta }_{k}\) is significant simultaneously. Here, \({\alpha }_{k}\) can be estimated through different confounding adjustment methods as shown in Eqs. 1, 3, 4, and 5. \({\beta }_{k}\) can be obtained using the linear regression with considering all confounders as covariates or only including PS (summary of all confounders) as a covariate.
In other words, that is based on the Pvalues for testing the pathspecific effects \({H}_{0}:{\alpha }_{k}=0\) or \({H}_{0}:{\beta }_{k}=0\). The raw Pvalue for the joint significance test [3] is defined below:
\(\begin{array}{c}{P}_{raw,k}=\text{max}\left({P}_{raw,{\alpha }_{k}}, {P}_{raw,{\beta }_{k}}\right),\#\end{array}\)where \({P}_{raw,{\alpha }_{k}}\) and \({P}_{raw,{\beta }_{k}}\) are the Pvalues for testing \({H}_{0}:{\alpha }_{k}=0\) and \({H}_{0}:{\beta }_{k}=0\). \({P}_{raw,{\alpha }_{k}}\) and \({P}_{raw,{\beta }_{k}}\) can be obtained from the mediator model (e.g. Equations 1, 3, 4, and 5) and outcome model (Eq. 2), respectively.
For the multiplicity (Type I error inflation) issue in multiple mediation testing, we adopted the Benjaminiâ€“Hochberg (BH) method [45, 46] to acquire the adjusted \(p\)values as below,
where \(q\) is the number of potential mediators in the set \({M}_{MCP}\), and \({r}_{k}\) is the location number of \({P}_{raw,k}\) when all the Pvalues \({P}_{raw,k}\) are sorted ascending.
However, the Joint significance test assumes \({P}_{raw,k}\) follows a uniform null distribution. Although \({P}_{{\alpha }_{k}}\) and \({P}_{{\beta }_{k}}\) are each uniformly distributed, their maximum may not. Therefore, the Joint significance test results in a valid but overly conservative test with lower power [13, 17, 47].
Hence, we adopt the PSbased joint significance with mixture null distribution method [18] (termed JSmixture) approach to conduct multiple mediation test after debiased Lasso penalized estimation [17, 48] referring to the classical HIMA2 procedure. The PSbased JSmixture approach adopts a 3component mixture distribution as below:
The estimated pointwise FDR for testing mediation can be computed as:
where \(t\in \left[\text{0,1}\right]\), \({V}_{00}\left(t\right),{V}_{01}\left(t\right),{V}_{10}\left(t\right)\) denoting the numbers of the three types of false positives and \(R\left(t\right)={V}_{00}\left(t\right)+{V}_{01}\left(t\right)+{V}_{10}\left(t\right)+{V}_{11}\left(t\right)\). The \({V}_{00}\left(t\right),{V}_{01}\left(t\right),{V}_{10}\left(t\right)\) and \(\widehat{FDR}\left(t\right)\) can be obtained using the R package HDMT.
We set the significance level of 0.05 for all the tests. The detailed processes of the proposed method are summarized in Fig. 2.
Simulation studies
Simulation design
In this section, we conducted the simulation studies to evaluate the performance of the proposed method. The implementation of the simulation was based on R (version 4.3.0, R Foundation for Statistical Computing, Vienna, Austria) and RStudio (version 2023.9.0.463, RStudio: Integrated Development Environment for R, Boston, MA). The setting of simulation parameters was based on the published studies [9, 14, 16]. The number of replications in simulation study was set to be 500 for each combination of parameter setting referring to the replication times settings in published methodogical studies [9, 14,15,16,17, 19, 49].
The model structure is shown in Fig. 1. We consider 8 confounders \(C=\left({C}_{1},{C}_{2},\cdots ,{C}_{8}\right)\) affecting the relationship of \(X\), \(M\), \(Y\), in which continuous confounders \({C}_{1}{C}_{4}\) follow a multivariate normal distribution \(N\left(\mu ,{\Sigma }\right)\) with a mean vector \(\mu ={\left(\text{0,0},\text{0,0}\right)}^{T}\) and a covariance matrix \({\Sigma }\):
The last four binary confounders \({C}_{5}{C}_{8}\) are independently generated from the Binary distribution \(B\left(n,0.3\right)\), where \(n\) is the sample size.
Then exposure \(X\) can be generated from Binary distribution \(B\left(n,{P}_{c}\right)\), where \(n\) is the sample size, \({P}_{c}=1/\left(1+{e}^{\left({\theta }^{T}C\right)}\right)\), and \({\theta }^{T}=\left({\theta }_{1},{\theta }_{2},\cdots ,{\theta }_{8}\right)=\left(\text{0.2,0.2,0.3,0.3,0.2,0.2,0.3,0.3}\right)\).
Mediators \(M\) and the outcome variable \(Y\) are generated according to Eqs. 1 and 2, respectively. For simplicity, we set all the effects of \(C\) on \(M\) to be the same. Let \({\phi }_{k}={\left({\phi }_{k1},\cdots ,{\phi }_{k8}\right)}^{T}={\left(\text{0.2,0.2,0.3,0.3,0.2,0.2,0.3,0.3}\right)}^{T}\) represent the effect of C on M. Let \(\eta ={\left({\eta }_{1},{\eta }_{2},\cdots ,{\eta }_{8}\right)}^{T}={\left(\text{0.2,0.2,0.3,0.3,0.2,0.2,0.3,0.3}\right)}^{T}\) denote the effects of \(C\) on \(Y\).
We set the first four potential mediators \({M}_{1}{M}_{4}\) as the true significant mediators in this study. Let \(\alpha ={\left({\alpha }_{1},{\alpha }_{2},\cdots ,{\alpha }_{p}\right)}^{T}=\left(\text{0.4,0.4,0.5,0.5,0.5,0.5,0},0,\cdots ,0\right)\); \(\beta ={\left({\beta }_{1},{\beta }_{2},\cdots ,{\beta }_{p}\right)}^{T}=\left(\text{0.4,0.5,0.5,0.6,0},\text{0,0.5,0.5,0},\cdots ,0\right)\). The elements of both \(\alpha\) and \(\beta\) are equal to zero except for the first eight elements, and the first four are the significant mediators. The mediation effect size of the true mediators \({M}_{1}{M}_{4}\) is \({\alpha \beta }_{14}=\left(\text{0.16,0.2,0.25,0.3}\right)\).
Let \(\gamma =0.5\); \(a=0.5\); \(a_k\sim U(0,1)\), \(\epsilon\sim N(0,1)\). The error term \({e}_{k}\) are generated from \(N\left(\text{0,1.2}\right)\) and the correlation between mediators mostly falls between 0.15 and 0.35.
To evaluate the impacts of sample size and potential mediators dimension, we set two sample size levels \(n=300, 500\), and two dimension levels \(p\)=1000,10000.
In addition, we take the correlation between mediators into account in the condition of \(p\)=1000 dimension. We simulate the strong correlation between mediators by generating the error terms \({e}_{k}\) from \(N\left(0,{{\Sigma }}_{e}\right)\), where \({{\Sigma }}_{e}={\left({\rho }^{\leftk{k}^{{\prime }}\right}\right)}_{k,{k}^{{\prime }}}\). It means the correlation between two mediators will decrease as the absolute difference in mediatorsâ€™ subscript \(\leftk{k}^{{\prime }}\right\) increases. We set four correlation levels \(\rho =0, 0.25, \text{0.5,0.75}\) with dimension \(p\)=1000 and sample size \(n=300, 500\). In the simulation setting \(\rho =0, 0.25, \text{0.5,0.75}\), the corresponding Pearson correlation coefficients between two adjacent mediators are around 0.4, 0.5, 0.7, and 0.8, respectively. We evaluated the performance of the mHIMA2 and PSbased HIMA by conducting 500 replications of simulated data sets for each scenario [9, 14,15,16,17, 19, 49].
Simulation results
Simulation results are presented in Tables 1 and 2. Evaluation of the performance of mediator selection of the proposed approach is shown in Table 1 by measuring the true positive rate (TPR) and false discovery proportion (FDP) of selection after the significance test for mediation effects. The mediators have higher TPR as the indirect effect increases (i.e., larger mediation effect, higher detection rate).
As presented in Table 1. Under most settings, the mHIMA2 mediation test approach has a higher TPR than PSbased HIMA while a higher FDP at the same time. Overall, the mHIMA2 is more powerful than the PSbased HIMA and is less conservative in selecting significant mediators.
As shown in Table 1, for the mHIMA2 mediation test approach, TPR is ranked as OWâ€‰>â€‰IPWâ€‰>â€‰PSRâ€‰>â€‰RA, and FDP is not more than 0.1 and gradually decreases to close to 0.05 as the sample size increases. Among all models, the mHIMA2 mediation test approach with OW adjustment has the highest power and acceptable false positive level. When using the PSbased HIMA mediation test approach, TPR is ranked consistently as RAâ€‰>â€‰PSRâ€‰>â€‰OWâ€‰>â€‰IPW, and all four models also keep FDP at an extremely low level.
Table 2 presents the estimation of mediation effects with the mean and mean square error (MSE). The estimators approach the true values as the mediation effect increases. All models tend to be more accurate as \(n\) gets larger and \(p\) gets smaller. Overall, the mHIMA2 mediation test approach has a smaller MSE than the PSbased HIMA approach in most cases. RA adjustment has a higher MSE than other adjustment methods especially when facing the large mediation effect, OW adjustment has the lower MSE among the four adjustment methods.
As shown in Table 2, similarly, the mHIMA2 approach with OW adjustment has the smallest MSE among all models. Moreover, similar results can be seen in the different strong correlation settings in Table S1S8 in the supplementary file. The mHIMA2 methods have lower MSE (i.e. more precise estimation) and apparently higher TPR. That means the debiased Lasso technique in mHIMA2 methods performs better when handling the moderate correlation between mediators. However, the FDP of all models slightly increases as the correlation between mediators increases. When correlation among the mediators is strong (for example, \(r\)>0.7), all models suffer in terms of increased MSE.
Data application
Smoking is an important environmental factor affecting the immune system and blood cell composition [50, 51]. Previous studies have demonstrated smokers had lower natural killer (NK) cell counts and activity [50, 51]. Smoking has also been found to be associated with DNA methylation levels [52]. Meanwhile, DNA methylation levels have also been found to be associated with associated with human NK cell activation [53, 54]. Therefore, DNA methylation may mediate the association between smoking and NK cell level. So we implemented the proposed highdimensional mediation analysis methods to identify the specific functional CpG sites that may mediate the relationship between smoking and the estimated NK cell level.
Here we apply our method to the GSE117859 dataset obtained from the Gene Expression Omnibus (GEO) database. The aim of the study in which GSE117859 was originally measured is to explore the smokingassociated DNA methylation features linked to AIDS outcomes in the HIVpositive population [55]. The blood samples from the Veteran Aging Cohort Study (VACS) were collected in that study. The HumanMethylation450 BeadChip platform was used to measure the DNA methylation levels.
In total 608 samplesand 485,577 probes were included in the dataset. Clinical information such as age, sex, race, smoking history, adherence of antiretroviral therapy (ART), estimated CD4 T cells, estimated CD8 T cells, and estimated NK cells were collected. The estimated CD4/CD8/NK were obtained using a methylationbased cell type deconvolution algorithm proposed by Housman et al. [56]. To some extent, the estimated CD4 and CD8 levels can represent AIDS severity.
Smoking status was collected based on selfreport. All included patients were classified into the smoker and the nonsmoker groups according to their reported smoking history. After removing the individuals without available clinical information and DNAm sites with missing values, a total of 587 samples and 485,503 probes were included in the analysis.
We adjusted the potential confounders including age, race, adherence of antiretroviral therapy, estimated CD4 T cells, and estimated CD8 T cells. Demographic and clinical variables included in our analysis are presented in Table 3.
The analysis results using the proposed mHIMA2 method are presented in Table 4. Here, we mainly presented the CpGs mediators with a total effect proportion greater than 5%. Due to the limitation of text content, we didnâ€™t present the whole summary results of the PSbased HIMA method, but that can be seen in Table S9 in the supplementary file.
As shown in Table 4, we identified two methylation sites cg13917614 in CNP gene and cg16893868 in LILRA2 gene by most of mHIMA2 based methods. The similar result can be seen in Table S9 in the supplementary file. The existing studies have already demonstrated the site cg13917614 is associated with smoking [52, 57]. Although we donâ€™t find direct evidence that the CNP gene is associated with immune function based on the existing literature, relevant studies showed a link between CNP and inflammatory responses in which the mechanism remains further study [58, 59].
The encoded protein of the LILRA2 gene can suppress innate immune response [60, 61]. The results reveal that smoking will promote the demethylation of cg16893868, leading to an increase in gene LILRA2 expression and ultimately reducing the estimated NK cell level. It has been found that the remaining CpG sites cg20460771, cg03164561, cg03605454, cg09529165, and cg01500140 are all associated with smoking [11, 52, 62,63,64]. Further insights into the discovered CpG mediators in genomewide epigenetic studies will be meaningful.
Discussion
The causal relationship obtained in highdimensional mediation analysis usually depends on noconfounding assumption. However, confounding is almost inevitable in observational studies owing to the lack of randomization of the baseline covariates in practice. Previous studies show the utilization of PS method such as PSadjustment and IPW in highdimensional mediation analysis, but those face the issue of extreme PS distribution.
In this article, we integrated OW approach into the highdimensional mediation model, which can address extreme PS distribution and better adjust for confounding. Finally, we developed a highdimensional mediation analysis workflow consisting of OW confounding adjustment, SIS, debiased Lasso penalization for potential mediator screening, and the highdimensional mediation test underlying the mixture null distribution of Pvalues.
Simulation results indicate that the mHIMA2 with OW approach presented in this study performs best among all the compared models with the highest TPR, acceptable FDP level, and the smallest MSE in mediating effect estimation. In addition, the mHIMA2 embedded debiased Lasso method performs better when moderate correlations between mediators exist.
Simulation study also suggestedthe proposed method would perform better when the sample size was increased. This result suggests that when the proposed method is used for the analysis of mediating effects on real data, a sufficient sample size should also be ensured. Such a feature is also consistent with other existing methods [5, 9, 14, 17, 19, 49]. Furthermore, the dimensionality of potential mediators has little effect on the performance of the proposed method.
In most of the previous studies [5, 9, 13, 17], it didnâ€™t take confounding adjustment into account in the SIS process. However, we adopted the PSbased method to adjust confounding, thus improving the accuracy of mediators screening. Moreover, it has been assumed that mediators are linearly independent of each other, but such an assumption is often not strictly valid in real data. The violation of the mediatorsâ€™ independence assumption often affects the accuracy of mediators selection and precision of mediating effect estimation. The proposed method can effectively deal with this issue which can tolerate the correlation between the mediators and ensure the robustness of mediators selection, multiple mediation testing, and mediating effect estimation.
Similar to other twostep approaches, the error of the first model may be introduced and cumulated in the second step, because the firststep can not quarantee 100% correctness. To avoid this, we set a relatively loose screening criterion with \(d=2n/log(n)\) to select the top \(d\) largest effect mediators [15,16,17, 49] in the first step to control false negative while avoiding the increase of false positive error according to the application recommendation of SIS approach. Though the errors cannot be totally avoid, this can reduce the error in the preliminary screening of mediators and prevent serious error cumulation in the second step to some extend. As shown in the simulation, the proposed twostep model performed well. Besides, previous published studies also have demonstrated the error cumulation issue in twostep models can be controlled well in the similar way as we did, and well not cause serious bias in the final results [14, 65,66,67,68,69,70].
Meanwhile, we applied the proposed method to the dataset GSE117859 obtained from the GEO databases and identified several significant DNAm mediators, including the sites cg13917614, cg16893868, cg20460771, cg03164561, cg03605454, cg09529165, and cg01500140. Among them, site cg16893868 in LILRA2 gene has been demonstrated to be associated with smoking and immune function [60, 61]. That indicates that the proposed method can identify reliable mediators in empirical data analysis.
The presence of confounding in observation studies always is a major challenge to obtaining causal relationships. Currently, most genetic studies are based on observational research without randomization of baseline characteristics. Particularly, the highdimensional mediation analysis always faces some issues, such as the accuracy of the highdimensional mediation selection and the low power of multiple mediation test [13, 14, 17, 18]. Although the utilization of PSR and IPW offers a solution of confounding adjustment in classical HDMA workflow, it still faces the issue of extreme PS distribution.
The proposed OWbased method can provide a more precise and stable mediating effect estimation. However, the misspecification of the outcome model and PS model can not be avoid in practice. Hence, the doubly robust methods may be desirable to be applied in HDMA workflow in future study. Even if the JSmixture method was proposed to improve the power of multiple mediation testing, other more powerful test methods still are appealing in largescale genomewide epigenetic studies [13, 18]. Conducting further simulation and methodology studies to compare different powerful test methods may provide useful reference for future studies. It should also be noticed that the existence of unmeasured confounding is out of the scope of this paper. Previews published studies have provided serval applicable methods to deal with this issue [49, 71].
Conclusion
Overall, the mHIMA2 with OW adjustment has sufficient power in selecting potential true mediators and obtaining precise estimation for mediation effects. It can be recommended in practical highdimensional mediation analysis, especially in epigenetic study.
Availability of data and materials
The dataset GSE117859 obtained from GEO database in our real data analysis can be accessed (at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi? acc=GSE117859) without limitation. Our procedure is implemented using the R software. The corresponding R code can be found at https://github.com/huww1998/CONF_mHIMA2.
Abbreviations
 FDP:

False discovery proportion
 GEO:

Gene Expression Omnibus
 HDMA:

Highdimensional mediation analysis
 IPW:

Inverse probability weighting
 JSuniform:

Joint significant test with uniform distribution
 JSmixture:

Joint significance test with mixture null distribution
 MSE:

Mean square error
 mHIMA2:

Modified HIMA2 model
 NK cell:

Natural killer cell
 OW:

Overlapping weighting
 PSR:

Propensity score regression adjustment
 PS:

Propensity score
 SIS:

Sure independence screening
 TPR:

True positive rate
References
Baron RM, Kenny DA. The moderatormediator variable distinction in social psychological research: conceptual, strategic, and statistical considerations. J Personal Soc Psychol. 1986;51(6):1173â€“82. https://doi.org/10.1037//00223514.51.6.1173.
Huan T, Joehanes R, Schurmann C, Schramm K, Pilling LC, Peters MJ, et al. A wholeblood transcriptome metaanalysis identifies gene expression signatures of cigarette smoking. Hum Mol Genet. 2016;25(21):4611â€“23. https://doi.org/10.1093/hmg/ddw288 .
MacKinnon DP, Lockwood CM, Hoffman JM, West SG, Sheets V. A comparison of methods to test mediation and other intervening variable effects. Psychol Methods. 2002;7(1):83â€“104. https://doi.org/10.1037/1082989x.7.1.83.
Biesanz JC, Falk CF, Savalei V. Assessing Mediational models: testing and interval estimation for Indirect effects. Multivar Behav Res. 2010;45(4):661â€“701. https://doi.org/10.1080/00273171.2010.498292.
Gao Y, Yang H, Fang R, Zhang Y, Goode EL, Cui Y. Testing mediation effects in highdimensional epigenetic studies. Front Genet. 2019;10:1195.https://doi.org/10.3389/fgene.2019.01195.
Taylor AB, MacKinnon DP. Four applications of permutation methods to testing a singlemediator model. Behav Res Methods. 2012;44(3):806â€“44. https://doi.org/10.3758/s134280110181x.
VanderWeele TJ. Marginal structural models for the estimation of direct and indirect effects. Epidemiol (Cambridge Mass). 2009;20(1):18â€“26. https://doi.org/10.1097/EDE.0b013e31818f69ce .
VanderWeele TJ, Vansteelandt S. Mediation analysis with multiple mediators. Epidemiol Methods. 2014;2(1):95â€“115. https://doi.org/10.1515/em20120010 .
Zhang H, Zheng Y, Zhang Z, Gao T, Joyce B, Yoon G, et al. Estimating and testing highdimensional mediation effects in epigenetic studies. Bioinf (Oxford England). 2016;32(20):3150â€“4. https://doi.org/10.1093/bioinformatics/btw351 .
Bibikova M, Barnes B, Tsan C, Ho V, Klotzle B, Le JM, et al. High density DNA methylation array with single CpG site resolution. Genomics. 2011;98(4):288â€“95. https://doi.org/10.1016/j.ygeno.2011.07.007.
Harlid S, Xu Z, Panduri V, Sandler DP, Taylor JA. CpG sites associated with cigarette smoking: analysis of epigenomewide data from the Sister Study. Environ Health Perspect. 2014;122(7):673â€“8. https://doi.org/10.1289/ehp.1307480.
Toyooka S, Maruyama R, Toyooka KO, McLerran D, Feng Z, Fukuyama Y, et al. Smoke exposure, histologic type and geographyrelated differences in the methylation profiles of nonsmall cell lung cancer. Int J Cancer. 2003;103(2):153â€“60. https://doi.org/10.1002/ijc.10787 .
Liu Z, Shen J, Barfield R, Schwartz J, Baccarelli AA, Lin X. Largescale hypothesis testing for Causal Mediation effects with Applications in Genomewide epigenetic studies. J Am Stat Assoc. 2022;117(537):67â€“81. https://doi.org/10.1080/01621459.2021.1914634.
Luo L, Yan Y, Cui Y, Yuan X, Yu Z. Linear highdimensional mediation models adjusting for confounders using propensity score method. Front Genet. 2022;13:961148. https://doi.org/10.3389/fgene.2022.961148.
Luo C, Fa B, Yan Y, Wang Y, Zhou Y, Zhang Y, et al. Highdimensional mediation analysis in survival models. PLoS Comput Biol. 2020;16(4):e1007768. https://doi.org/10.1371/journal.pcbi.1007768.
Yu Z, Cui Y, Wei T, Ma Y, Luo C. Highdimensional mediation analysis with confounders in Survival models. Front Genet. 2021;12:688871. https://doi.org/10.3389/fgene.2021.688871.
Perera C, Zhang H, Zheng Y, Hou L, Qu A, Zheng C, et al. HIMA2: highdimensional mediation analysis and its application in epigenomewide DNA methylation data. BMC Bioinformatics. 2022;23(1):296. https://doi.org/10.1186/s12859022047481.
Dai JY, Stanford JL, LeBlanc M. A multipletesting procedure for highdimensional mediation hypotheses. J Am Stat Assoc. 2022;117(537):198â€“213. https://doi.org/10.1080/01621459.2020.1765785.
Zhang H, Zheng Y, Hou L, Zheng C, Liu L. Mediation analysis for survival data with highdimensional mediators. Bioinf (Oxford England). 2021;37(21):3815â€“21. https://doi.org/10.1093/bioinformatics/btab564 .
Heinze G, JÃ¼ni P. An overview of the objectives of and the approaches to propensity score analyses. Eur Heart J. 2011;32(14):1704â€“8. https://doi.org/10.1093/eurheartj/ehr031.
Stuart EA. Matching methods for causal inference: a review and a look forward. Stat Science: Rev J Inst Math Stat. 2010;25(1):1â€“21. https://doi.org/10.1214/09STS313 .
Lee BK, Lessler J, Stuart EA. Weight trimming and propensity score weighting. PLoS One. 2011;6(3):e18174. https://doi.org/10.1371/journal.pone.0018174.
Li F, Thomas LE, Li F. Addressing Extreme Propensity scores via the Overlap weights. Am J Epidemiol. 2019;188(1):250â€“7. https://doi.org/10.1093/aje/kwy201 .
Thomas LE, Li F, Pencina MJ. Overlap weighting: a propensity score method that mimics attributes of a Randomized Clinical Trial. JAMA. 2020;323(23):2417. https://doi.org/10.1001/jama.2020.7819.
Mlcoch T, Hrnciarova T, Tuzil J, Zadak J, Marian M, Dolezal T. Propensity Score Weighting Using Overlap Weights: A New Method Applied to Regorafenib Clinical Data and a CostEffectiveness Analysis. Value in Health. 2019;22(12):13707. doi: 10.1016/j.jval.2019.06.010.
Vanderweele TJ, Vansteelandt S, Robins JM. Effect decomposition in the presence of an exposureinduced mediatoroutcome confounder. Epidemiol (Cambridge Mass). 2014;25(2):300â€“6. https://doi.org/10.1097/EDE.0000000000000034 .
Perera C, Zhang H, Zheng Y, Hou L, Qu A, Zheng C, et al. HIMA2: highdimensional mediation analysis and its application in epigenomewide DNA methylation data. BMC Bioinformatics. 2022;23(1).Â https://doi.org/10.1186/s12859022047481.
Basu D. Randomization analysis of Experimental Data: the Fisher randomization test. J Am Stat Assoc. 1980;75(371):575â€“82. https://doi.org/10.1080/01621459.1980.10477512.
Rubin DB. Comment. J American Statis Assoc. 1986;81(396):961â€“2. https://doi.org/10.1080/01621459.1986.10478355.
Imbens GW, Rubin DB. Causal inference for statistics, Social, and Biomedical sciences: an introduction. Cambridge: Cambridge University Press; 2015.
Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika. 1983;70(1):41â€“55. https://doi.org/10.1093/biomet/70.1.41.
Haukoos JS, Lewis RJ. The Propensity score. JAMA. 2015;314(15):1637â€“8. https://doi.org/10.1001/jama.2015.13480.
Lee BK, Lessler J, Stuart EA. Improving propensity score weighting using machine learning. Stat Med. 2010;29(3):337â€“46. https://doi.org/10.1002/sim.3782.
Abdia Y, Kulasekera KB, Datta S, Boakye M, Kong M. Propensity scores based methods for estimating average treatment effect and average treatment effect among treated: a comparative study. Biometrical J Biometrische Z. 2017;59(5):967â€“85. https://doi.org/10.1002/bimj.201600094 .
Rosenbaum PR, Rubin DB. Constructing a Control Group Using Multivariate Matched Sampling Methods That Incorporate the Propensity Score. Am Stat. 1985;39(1):33â€“8. https://doi.org/10.1080/00031305.1985.10479383.
Rosenbaum PR, Rubin DB. Reducing Bias in Observational studies using subclassification on the Propensity score. J Am Stat Assoc. 1984;79(387):516â€“24. https://doi.org/10.1080/01621459.1984.10478078.
Robins JM, Rotnitzky A, Zhao LP. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J Am Stat Assoc. 1995;90(429):106â€“21. https://doi.org/10.1080/01621459.1995.10476493.
Austin PC. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivar Behav Res. 2011;46(3):399â€“424. https://doi.org/10.1080/00273171.2011.568786.
Li F, Morgan KL, Zaslavsky AM. Balancing covariates via Propensity score weighting. J Am Stat Assoc. 2018;113(521):390â€“400. https://doi.org/10.1080/01621459.2016.1260466.
Fan J, Lv J. Sure Independence Screening for Ultrahigh Dimensional Feature Space. J Royal Stat Soc Ser B: Stat Methodol. 2008;70(5):849â€“911. https://doi.org/10.1111/j.14679868.2008.00674.x .
Zhang CH. Nearly unbiased variable selection under minimax concave penalty. Annals Stat. 2010;38(2):894â€“942. https://doi.org/10.1214/09AOS729 .
Maity AK, Basu S. Highest posterior model computation and variable selection via simulated annealing. The New England J Statis Data Sci. 2023;1(2):200â€“7. https://doi.org/10.51387/23NEJSDS40.
Breheny P, Huang J. Coordinate Descent algorithms for Nonconvex Penalized Regression, with applications to Biological feature selection. Annals Appl Stat. 2011;5(1):232â€“53. https://doi.org/10.1214/10AOAS388 .
Huang YT, Pan WC. Hypothesis test of mediation effect in causal mediation model with highdimensional continuous mediators. Biometrics. 2016;72(2):402â€“13. https://doi.org/10.1111/biom.12421.
Benjamini Y, Hochberg Y. Controlling the false Discovery rate: a practical and powerful Approach to multiple testing. J Roy Stat Soc: Ser B (Methodol). 1995;57(1):289â€“300. https://doi.org/10.1111/j.25176161.1995.tb02031.x.
Hochberg Y. A sharper Bonferroni procedure for multiple tests of significance. Biometrika. 1988;75(4):800â€“2. https://doi.org/10.1093/biomet/75.4.800.
Huang YT. Joint significance tests for mediation effects of socioeconomic adversity on adiposity via epigenetics. Annals Appl Stat. 2018;12(3):1535â€“57. https://doi.org/10.1214/17AOAS1120.
Fang EX, Ning Y, Liu H. Testing and confidence intervals for high Dimensional Proportional hazards Model. J Royal Stat Soc Ser B Stat Methodol. 2017;79(5):1415â€“37. https://doi.org/10.1111/rssb.12224 .
Chen F, Hu W, Cai J, Chen S, Si A, Zhang Y, et al. Instrumental variablebased highdimensional mediation analysis with unmeasured confounders for survival data in the observational epigenetic study. Front Genet. 2023;14:1092489. https://doi.org/10.3389/fgene.2023.1092489.
Qiu F, Liang CL, Liu H, Zeng YQ, Hou S, Huang S, et al. Impacts of cigarette smoking on immune responsiveness: up and down or upside down? Oncotarget. 2017;8(1):268â€“84. https://doi.org/10.18632/oncotarget.13613.
Elisia I, Lam V, Cho B, Hay M, Li MY, Yeung M, et al. The effect of smoking on chronic inflammation, immune function and blood cell composition. Sci Rep. 2020;10:19480. https://doi.org/10.1038/s41598020765567.
Breitling LP, Yang R, Korn B, Burwinkel B, Brenner H. Tobaccosmokingrelated differential DNA methylation: 27K discovery and replication. Am J Hum Genet. 2011;88(4):450â€“7. https://doi.org/10.1016/j.ajhg.2011.03.003 .
Wiencke JK, Butler R, Hsuang G, Eliot M, Kim S, Sepulveda MA, et al. The DNA methylation profile of activated human natural killer cells. Epigenetics. 2016;11(5):363â€“80. https://doi.org/10.1080/15592294.2016.1163454.
Gao X, Jia M, Zhang Y, Breitling LP, Brenner H. DNA methylation changes of whole blood cells in response to active smoking exposure in adults: a systematic review of DNA methylation studies. Clin Epigenetics. 2015;7:113. https://doi.org/10.1186/s1314801501483.
Zhang X, Hu Y, Aouizerat BE, Peng G, Marconi VC, Corley MJ, et al. Machine learning selected smokingassociated DNA methylation signatures that predict HIV prognosis and mortality. Clin Epigenetics. 2018;10(1):155. https://doi.org/10.1186/s131480180591z.
Houseman EA, Accomando WP, Koestler DC, Christensen BC, Marsit CJ, Nelson HH, et al. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics. 2012;13(1):86. https://doi.org/10.1186/147121051386.
Wan ES, Qiu W, Baccarelli A, Carey VJ, Bacherman H, Rennard SI, et al. Cigarette smoking behaviors and time since quitting are associated with differential DNA methylation across the human genome. Hum Mol Genet. 2012;21(13):3073â€“82. https://doi.org/10.1093/hmg/dds135.
Bao Q, Zhang B, Zhou L, Yang Q, Mu X, Liu X, et al. CNP Ameliorates Macrophage Inflammatory Response and Atherosclerosis. Circ Res. 0(0). https://doi.org/10.1161/CIRCRESAHA.123.324086.
Bae CR, Hino J, Hosoda H, Arai Y, Son C, Makino H, et al. Overexpression of Ctype natriuretic peptide in endothelial cells protects against Insulin Resistance and inflammation during Dietinduced obesity. Sci Rep. 2017;7(1):9807. https://doi.org/10.1038/s41598017102401.
Lu HK, Mitchell A, Endoh Y, Hampartzoumian T, Huynh O, Borges L, et al. LILRA2 selectively modulates LPSmediated cytokine production and inhibits phagocytosis by monocytes. PLoS ONE. 2012;7(3):e33478. https://doi.org/10.1371/journal.pone.0033478.
Lewis Marffy AL, McCarthy AJ. Leukocyte ImmunoglobulinLike receptors (LILRs) on human neutrophils: modulators of infection and immunity. Front Immunol. 2020;11:857. https://doi.org/10.3389/fimmu.2020.00857.
Sikdar S, Joehanes R, Joubert BR, Xu CJ, VivesUsano M, Rezwan FI, et al. Comparison of smokingrelated DNA methylation between newborns from prenatal exposure and adults from personal smoking. Epigenomics. 2019;11(13):1487â€“500. https://doi.org/10.2217/epi20190066.
Joehanes R, Just AC, Marioni RE, Pilling LC, Reynolds LM, Mandaviya PR, et al. Epigenetic signatures of cigarette smoking. Circulation Cardiovasc Genet. 2016;9(5):436â€“47. https://doi.org/10.1161/CIRCGENETICS.116.001506 .
Sun YV, Smith AK, Conneely KN, Chang Q, Li W, Lazarus A, et al. Epigenomic association analysis identifies smokingrelated DNA methylation sites in African americans. Hum Genet. 2013;132(9):1027â€“37. https://doi.org/10.1007/s0043901313116.
Luo C, Wang G, Hu F. TwoStep Gene Feature Selection Algorithm Based on Permutation Test. In: Yao J, Yang Y, SÅ‚owiÅ„ski R, Greco S, Li H, Mitra S, et al., editors. Springer; 2012. p. 249âˆ’58.Â https://doi.org/10.1111/ppe.12382.
Liu D, Yeung EH, McLain AC, Xie Y, Buck Louis GM, Sundaram R. A twostep Approach for Analysis of Nonignorable Missing outcomes in Longitudinal Regression: an application to Upstate KIDS Study. Paediatr Perinat Epidemiol. 2017;31(5):468â€“78. https://doi.org/10.1111/ppe.12382.
Newcombe PJ, Connolly S, Seaman S, Richardson S, Sharp SJ. A twostep method for variable selection in the analysis of a casecohort study. Int J Epidemiol. 2018;47(2):597â€“604. https://doi.org/10.1093/ije/dyx224.
Song J, Shin SJ. A twostep approach for variable selection in linear regression with measurement error. Commun Stat Appl Methods. 2019;26(1):47â€“55. https://doi.org/10.29220/CSAM.2019.26.1.047 .
Liu Y, Qin SJ. A Novel twostep sparse Learning Approach for Variable Selection and Optimal Predictive modeling. IFACPapersOnLine. 2022;55(7):57â€“64. https://doi.org/10.1016/j.ifacol.2022.07.422.
Chamlal H, Benzmane A, Ouaderhman T. A TwoStep Feature Selection Procedure to Handle HighDimensional Data in Regression Problems. 2023 International Conference on Decision Aid Sciences and Applications (DASA). 2023. p. 592â€“6.
Wickramarachchi DS, Lim LHM, Sun B. Mediation analysis with multiple mediators under unmeasured mediatoroutcome confounding. Stat Med. 2023;42(4):422â€“32. https://doi.org/10.1002/sim.9624.
Acknowledgements
We acknowledge GEO database for providing their platforms and contributors for uploading their meaningful datasets.
Funding
This work was supported by the National Social Science Found of China (21CTJ009) and National Nature Science Foundation of China (81703325).
Author information
Authors and Affiliations
Contributions
F.C., H.Y. and W.H. led the conception and design of the work. W.H. and S.C. conducted the simulation study. W.H., S.C., and J.C. finished data cleaning and implemented real data application. W.H. completed the original draft. F.C., J.C., and Y.Y. guided analyses and provided advice. F.C. critically reviewed and edited the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisherâ€™s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
12874_2024_2254_MOESM1_ESM.docx
Additional file 1: Highdimensional mediation analysis for continuous outcome with confounders using overlap weighting method in observational epigenetic study. Simulation results of different correlation levels Â \(\rho\)=0,0.25,0.5,0.75 with dimension \(\rho\)=1000 and sample size \(n\)=300,500 were presented in the Table S18. Analysis result using the PSbased HIMA methods was shown in Table S9.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Hu, W., Chen, S., Cai, J. et al. Highdimensional mediation analysis for continuous outcome with confounders using overlap weighting method in observational epigenetic study. BMC Med Res Methodol 24, 125 (2024). https://doi.org/10.1186/s1287402402254x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1287402402254x