 Research
 Open access
 Published:
An empirical evaluation of approximate and exact regressionbased causal mediation approaches for a binary outcome and a continuous or a binary mediator for casecontrol study designs
BMC Medical Research Methodology volume 24, Article number: 72 (2024)
Abstract
Background
In the causal mediation analysis framework, several parametric regressionbased approaches have been introduced in past years for decomposing the total effect of an exposure on a binary outcome into a direct effect and an indirect effect through a target mediator. In this context, a wellknown strategy involves specifying a logistic model for the outcome and invoking the rare outcome assumption (ROA) to simplify estimation. Recently, exact estimators for natural direct and indirect effects have been introduced to circumvent the challenges prompted by the ROA. As for the approximate approaches relying on the ROA, these exact approaches cannot be used as is on casecontrol data where the sampling mechanism depends on the outcome.
Methods
Considering a continuous or a binary mediator, we empirically compare the approximate and exact approaches using simulated data under various casecontrol scenarios. An illustration of these approaches on casecontrol data is provided, where the natural mediation effects of longterm use of oral contraceptives on ovarian cancer, with lifetime number of ovulatory cycles as the mediator, are estimated.
Results
In the simulations, we found few differences between the performances of the approximate and exact approaches when the outcome was rare, both marginally and conditionally on variables. However, the performance of the approximate approaches degraded as the prevalence of the outcome increased in at least one stratum of variables. Differences in behavior were also observed among the approximate approaches. In the data analysis, all studied approaches were in agreement with respect to the natural direct and indirect effects estimates.
Conclusions
In the case where a violation of the ROA applies or is expected, approximate mediation approaches should be avoided or used with caution, and exact estimators favored.
Introduction
Mediation analysis aims at decomposing the effect of an exposure on an outcome into a direct effect and an indirect effect through a target mediator, thereby allowing for a better understanding of the mechanisms by which the exposure affects the outcome [1]. In causal mediation analysis, the total effect decomposition is accomplished via natural mediation effects [2, 3]. When the outcome is binary, standard regressionbased approaches for the estimation of natural effects use a logistic model for the outcome and either a linear or a logistic model for the continuous or binary mediator, respectively [4]. Wellknown approaches for performing causal mediation analysis with a binary outcome were developed using the rare outcome assumption (ROA), yielding simplified inference [5, 6]. To address the approximate nature of these approaches, there has recently been an interest in developing socalled exact regressionbased approaches for natural effects estimation, where the exact estimators circumvent the ROA and are applicable independently of the rareness or commonness of the outcome [7,8,9,10,11].
In Samoilenko, Blais and Lefebvre [7] and Samoilenko and Lefebvre [8, 9], the authors found that the exact estimators yielded more accurate estimates of natural effects than the approximate ones in simulation scenarios where the outcome was common, but also when the outcome was rare marginally but not conditionally. It is relevant to mention that these studies implemented and compared approaches on data which allowed estimation of the outcome and mediator models’ parameters consistently using standard fitting procedures, that is, on data arising from cohort or populationbased designs. As the casecontrol design is indicated when the frequency of the outcome is small or smaller than that of the exposure [12], it is pertinent to investigate whether the gain from using an exact mediation approach that does not rely on the ROA is preserved when implemented on casecontrol data. As detailed in the sequel, implementation of approximate and exact approaches on casecontrol data requires more attention than when implemented on cohort data since some parameters of the outcome and mediator models might not be estimated consistently using standard fitting procedures (e.g., [5, 6, 13, 14]). Causal mediation analysis with casecontrol data has been discussed in key articles such as [5, 6, 13], and a number of studies have been performed to shed further light on this topic [14,15,16,17]. However, as the latter articles do not specifically address the comparison of approximate versus exact approaches, we believe it worthwhile to investigate this issue.
As alluded to previously, approximate mediation analysis approaches relying on the ROA should not be used as is on casecontrol data where the sampling mechanism depends on the outcome. This wellknown fact occurs as a result of the selection of individuals based on their outcome status, which can yield biased estimators of the regression coefficients of the mediator model. Such a bias notably happens when there are arrows pointing from the exposure to the outcome and from the mediator to the outcome in a causal mediation diagram. For these approaches, Valeri and VanderWeele [6] have proposed to account for the casecontrol design by fitting the mediator model on the controls only. Alternatively, VanderWeele and Vansteelandt [5] and Valeri and VanderWeele [6] have suggested using inverse probability weighting (IPW) for fitting this model, but implementing IPW requires knowledge of the frequency of the outcome, the latter interpreted either as a prevalence or an incidence. Similar issues regarding the estimation of the mediator model prevail for the exact approaches. However, there is an additional difficulty for these, since, unlike for the approximate approaches [5], the casecontrol design also needs to be accounted for when estimating the outcome logistic model. This occurs because the intercept coefficient of this model is involved in the definition of the exact natural direct and indirect effects estimands [8, 9].
The objective of this work is to empirically compare the performance of approximate versus exact approaches for the estimation of natural mediation effects odds ratios (ORs) in the context of casecontrol study designs where the outcome is binary and the mediator is either continuous or binary. In our comparisons, we focus on the regressionbased approximate approaches available in the R package CMAverse [18], where the estimation of the mediator model is done according to either of the two aforementioned strategies. Similarly, to account for the casecontrol design in the exact approach, we use the IPW strategy implemented in the R package ExactMed [19], where the weighting is applied to both the mediator and outcome models. We also consider the unified likelihood approach of Satten et al. [17], which accounts for the casecontrol design using a joint prospective likelihood for the outcome and mediator. This approach does not require knowledge of the outcome prevalence (incidence) but nevertheless relies on the ROA.
The article is structured as follows. In Methods section, we present the definitions of mediation effects and associated models, and provide details on the approaches compared. In this section, we also describe the simulation study performed and report on the results in Results section. In Real data analysis section, we apply the approaches compared on casecontrol data from the PRevention of OVArian Cancer in Quebec (PROVAQ) study [20]. These analyses are performed to evaluate the direct and indirect effects of longterm use of oral contraceptives on ovarian cancer using the lifetime number of ovulatory cycles as the potential mediator.
Methods
Definitions, models and approaches
We first define the nested counterfactual outcome \(Y(a,M(a^*))\) which is the outcome that would be realized if the exposure were set to a and the mediator were set to the value it would have taken if the exposure had been set to \(a^*\). The conditional natural direct effect (NDE) and natural indirect effect (NIE) ORs corresponding to a change in the exposure level from \(A = a^*\) to \(A = a\) are defined as follows:
The total effect (TE) OR is defined as the product of the NDE and NIE ORs:
Identification of natural direct and indirect effects is accomplished via the mediation formula [21], which is established using consistency, positivity and conditional independence assumptions [22, 23]. Mediation formulas corresponding to a binary outcome and a continuous or binary mediator, respectively, are:
where \(\varvec{C}\) is a set of covariates sufficient to achieve ignorability for the \(AY\), \(AM\), and \(MY\) relationships [4].
Throughout, we assume the following logistic regression model for Y for consideration in mediation formulas (34):
Moreover, we use either of the following linear or logistic models, respectively:
In the case of a continuous M, a Gaussian distribution is assumed for the mediator in formula (3), that is \(M =\mathbb {E}(M  A, \varvec{C}) + \epsilon\) with \(\epsilon \sim N(0, \sigma ^2)\) and f is the corresponding density, while model (7) is used in formula (4).
Standard approximate approaches
The regressionbased approaches for a binary outcome and a continuous or binary mediator proposed by VanderWeele and Vansteelandt [5] and Valeri and VanderWeele [6] for the estimation of natural direct and indirect effects on the OR scale (\(OR^{NDE}\) and \(OR^{NIE}\)) rely on models (5) and either (6) or (7); these approaches also invoke the ROA for providing approximate closedform expressions for the natural effects. For a binary exposure coded 0/1, the expressions pertaining to a continuous mediator are:
while, for a binary mediator, these are:
The corresponding estimators for the natural effects ORs are obtained by substituting the parameters in (89) or (1011) by corresponding estimators obtained according to the type of mediator.
Implementation of these approximate approaches can be done using the R package CMAverse with the cmest function and option rb. In the context of a casecontrol study, the first strategy, which consists of fitting the mediator model among the controls only, is implemented by setting the options casecontrol and yrare to TRUE. The theoretical basis for this strategy uses the argument that when the outcome is rare, one could assume that
While the above equivalences are deemed valid when assuming the ROA, exclusion of cases when fitting the mediator model leads to a loss of efficiency for the estimators of the regression parameters of the mediator model. This is the rationale invoked by Satten et al. [17] for introducing their unified approach. More subtly, we conjecture, in light of Samoilenko and Lefebvre [8, 9], that equivalences (12) and (13) may not hold if one relies on the marginal prevalence (incidence) of the outcome only to establish the ROA. This is explored via simulation in the present work. The second strategy, based on IPW, is implemented in CMAverse by setting the option casecontrol to TRUE and specifying a value for the parameter yprevalence. Concretely, for this second strategy, each case is weighted by \(w=\pi /p\) and each control by \(w=(1\pi )/(1p)\), where p is the proportion of cases in the sample and \(\pi\) is the outcome’s prevalence or incidence (\(\equiv\) yprevalence) in the population. In general, there is an overrepresentation of cases in the sample as compared to the population (that is, \(p> \pi\)), and therefore each case (control) gets down weighted (up weighted) appropriately. These designbased weights are then used in a weighted regression using all the sample. A drawback of this second strategy, as opposed to the first one, is that it requires knowledge of the prevalence (incidence) of the outcome in the population.
Exact approaches
The exact regressionbased approaches for a binary outcome and a continuous or binary mediator proposed by Samoilenko and Lefebvre [8, 9] estimate natural direct and indirect effects also from models (5) and either (6) or (7). However, these approaches do not invoke the ROA, at the cost of more complex estimators for the NDE and NIE.
Under these approaches, the modelbased nested counterfactual outcome probabilities are not algebraically simplified, which yields
for a continuous (Gaussian) mediator, and
for a binary mediator, where \(\text {expit}(\alpha ) = \frac{\exp (\alpha )}{1 + \exp (\alpha )}\). It should be noted that integral (14) does not allow for a closedform formula [9], unlike the corresponding integral in the approximate approach.
Estimators for these probabilities are defined by first substituting the parameters in (14) or (15) by corresponding estimators, and for a continuous mediator, resorting to numerical integration for computing the integral in (14). The estimators for the probabilities are then pluggedback in (1) and (2) to provide estimators of natural effects on the OR scale.
Samoilenko and Lefebvre’s exact approaches are implemented in the R package ExactMed, where the functions exactmed_c and exactmed are available for estimating \(OR^{NDE}\) and \(OR^{NIE}\) according to the type of mediator (continuous or binary). The package allows for using an IPW strategy accounting for casecontrol data, also via the use of option yprevalence. The same weights as for the approximate approaches are used for fitting the mediator model. These weights are also used when fitting the outcome model to achieve consistent estimation for the intercept coefficient \(\theta _0\) involved in (1415). We refer readers to Additional file 1 for information on variance estimation and confidence intervals (CIs) for these approaches.
Unified likelihood approach
Satten et al. [17] recently introduced a joint likelihood approach for estimating natural effects ORs based on standard casecontrol data, assuming a binary outcome and either a continuous or a binary mediator. This socalled unified likelihood approach accounts for the casecontrol design while incorporating all case information in the likelihood and eliminating the need for a userspecified outcome prevalence (incidence) value. More precisely, they considered the joint likelihood \(L_p = \prod _i \textrm{P}(\mathit{Y} = \mathit{y}_{\mathit{i}}, \mathit{M} = \mathit{m}_{\mathit{i}} \mathit{A} = \mathit{a}_{\mathit{i}}, \boldsymbol{C}= \boldsymbol{c}_i)\), where i indexes individuals, which can be factored into
As in the Valeri and VanderWeele’s [6] “controls only" strategy described previously, Satten et al. [17] proposed to model \(\textrm{P}(\mathit{M}= \mathit{m}  \mathit{Y} =0,\mathit{A} =\mathit{a}, \boldsymbol{C}= \boldsymbol{c})\) invoking the ROA. For a binary mediator, their approach thus assumes that \(\text {logit}(\textrm{P}(\mathit{M} = 1  \mathit{Y} = 0, \mathit{A} = \mathit{a}, \boldsymbol{C}= \boldsymbol{c})) \approx \beta _0 + \beta _1 \mathit{a} + \boldsymbol{\beta }'_2 \boldsymbol{c}\). For a Gaussian mediator, a normal density function with mean \(\mathbb {E}(M = m Y = 0, A =a, \boldsymbol{C}= \boldsymbol{c}) \approx \beta _0 + \beta _1 a + \boldsymbol{\beta }'_2 \boldsymbol{c}\) and variance \(\sigma ^2\) defines \(\textrm{P}(\mathit{M} = \mathit{m} \mathit{Y} = 0, \mathit{A} =\mathit{a}, \boldsymbol{C}= \boldsymbol{c})\).
This joint likelihood approach then expresses the mediator model among the cases as a function of the outcome odds and the mediator model among the controls:
where
The denominator in (17) is written as:
that is, it corresponds to the outcome odds given exposure and covariates. Hence
and the mediator model among the cases can be reexpressed as:
Combining these results, the following joint likelihood is obtained:
Maximum likelihood estimators for the parameters of the mediator and outcome models are defined as arguments of the maxima of (21). Estimators for the natural effects ORs are then formed by substituting the parameters estimators in the approximate expressions (89) or (1011).
This approach can be implemented using the R code provided by Satten et al. [17] (https://github.com/epsteinsoftware/MediationCC).
Simulation study
The objective of the simulation study was to assess the performance of the four approaches described previously in different casecontrol design scenarios, namely the : 1) approximate approach with the mediator model fitted among the controls only (Approx_C) ; 2) approximate approach with IPW (Approx_IPW) ; 3) exact approach with IPW (Exact_IPW) ; 4) unified likelihood approach (Unified). For reference, we also obtained results using the approximate and exact approaches which do not account for the casecontrol design (Approx_Naive, Exact_Naive). All six approaches were evaluated with respect to both continuous and binary mediators.
Data generation
We considered five scenarios in the continuous mediator case. In all scenarios, covariates \(C_1\) and \(C_2\) were generated independently as Bernoulli(0.5) and \(\mathcal {N}(0,0.75^2)\) random variables, respectively. The binary exposure A was generated as a \(Bernoulli(p_A)\), where \(p_A =\text {expit}(0.5 + 0.1 c_1 0.15 c_2)\). The mediator M was generated as a \(\mathcal {N}(\beta _0 + \beta _1 a + \beta _{21} c_1 + \beta _{22} c_2, 0.5^2)\), and the binary outcome Y was generated as a \(Bernoulli(p_Y)\) with \(p_Y =\text {expit}(\theta _0 + \theta _1 a + \theta _2 m \, + \theta _3 a m + \theta _{41} c_1 + \theta _{42} c_2)\). The mediator and outcome simulation parameters used for each scenario in the continuous mediator case are presented in Table A1 of Additional file 2.
As in the continuous mediator case, we considered five scenarios in the binary mediator case. In all scenarios, covariates \(C_1\) and \(C_2\) were generated independently as Bernoulli(0.5) and \(\mathcal {N}(0,1)\), respectively. The binary exposure A was generated as a \(Bernoulli(p_A)\), where \(p_A =\text {expit}(0.5 + 0.1 c_1 0.15 c_2)\). The binary mediator M was generated as a \(Bernoulli(p_M)\) with \(p_M =\text {expit}(\beta _0 + \beta _1 a + \beta _{21} c_1 + \beta _{22} c_2)\), and the binary outcome as a \(Bernoulli(p_Y)\) with \(p_Y =\text {expit}(\theta _0 + \theta _1 a + \theta _2 m + \theta _3 a m + \theta _{41} c_1 + \theta _{42} c_2)\). The mediator and outcome simulation parameters used for each scenario in the binary mediator case are presented in Table A2 of Additional file 2.
We selected simulation scenarios to yield different marginal and conditional outcome prevalences (see Tables A3 and A4 of Additional file 2 for prevalences in the continuous and binary mediator cases, respectively). Values of the models’ parameters were also selected so to induce biased estimators of the regression parameters for the mediator model based on the selected samples. Specifically, in each scenario, a nonzero coefficient was specified for the mediator in the outcome model. Moreover, all variables (exposure and covariates) also had nonzero coefficients in the outcome model. It should be noted that the magnitude of selection bias induced is a function of the combined magnitude of these coefficients.
For both mediator types and each associated scenario, we constituted 1000 casecontrol samples from the corresponding population with an equal number of cases and controls, i.e., n/2 cases and n/2 controls for total sample sizes of \(n=500\) and 1000.
Analysis
To provide an indication of the impact of the casecontrol design on the estimation of the mediator and outcome models’ parameters, we calculated the averages of the parameter values obtained across each set of 1000 casecontrol samples of size \(n=1000\) according to different estimation strategies. First, by ignoring the design (Naive), second by using IPW in both the mediator and outcome models (IPW), third by estimating the mediator model using the controls only and the outcome model using all the sample (Controls), and fourth by using the joint likelihood (21) (Unified). For the first three strategies, a (possibly weighted) linear regression with the R function lm was used for the continuous mediator and a (possibly weighted) logistic regression using the R function glm was used for the binary variables. For each strategy and scenario, onesample bilateral ttests were performed to detect departures from the true values of the parameters.
We applied all six approaches to each casecontrol sample generated in order to estimate conditional natural direct and indirect effects on the OR scale. The implementation of the approximate approaches was done using the CMAverse package (available at https://bs1125.github.io/CMAverse/) with the R version 4.1.2. The implementation of the exact approaches was done using the ExactMed package version 0.3.0. All effects were computed by setting the covariates equal to the samplespecific means (that is, \(\boldsymbol{C}=\boldsymbol{\bar{c}}\)). For the approaches based on IPW, we used the true outcome prevalence (see Tables A3 and A4 in Additional file 2) as value for yprevalence. The bias, standard deviation (SD) and root mean squared error (RMSE) were calculated for each point estimator. Coverage probabilities of 95% CI estimators based on the delta method and percentile bootstrap were obtained for all approaches except for the unified likelihood approach where only the delta method is available from the R code provided by the authors. For each scenario, the true values of the NDE and NIE were computed using the true parameter values and population mean for the covariates.
For both mediator types we also investigated the impact of misspecifying the outcome prevalence \(\pi\) on the natural effects estimates obtained from the approximate and exact approaches with IPW. Specifically, for each scenario, we considered a grid of values for \(\pi\) which corresponded to a relative percentage of misspecification between 99% to 100%, by 5% from 95%. For each value of the prevalence’s grid, we computed the average natural effects based on the 1000 casecontrol samples of size \(n=1000\) that were generated for each scenario. Results are presented visually next. Results for the approximate approach with the controls only and the unified approach, which both do not require the user to specify a value for \(\pi\), are reported on corresponding figures for reference only.
Results
The average estimated regression parameters for the continuous mediator case when \(n=1000\) are presented in Table A1 of Additional file 3. From this table, we observe that for all scenarios, all coefficients estimated using IPW were globally in agreement with the true values. A few IPW averages showed small deviations from the true values, as highlighted by some smaller pvalues. The parameters of the mediator model estimated from the controls only and using the joint likelihood of the unified approach were generally close. Important departures from the true values were observed for these two strategies, especially for Scenarios 2 and 5 where the ROA does not apply at least conditionally. For the outcome regression coefficients, only the intercept term was seen markedly affected by the design and only IPW correctly estimated it. In general, the unified approach slightly departed from the other approaches for the outcome model coefficients.
The natural effects results for the continuous mediator case when \(n=1000\) are presented in Tables 1, 2, 3, 4, and 5. For Scenario 1, where the outcome is rare both marginally and conditionally, all approaches investigated, including the naive approaches that do not account for the casecontrol design, showed absolute relative biases less than 3% for all effects (NDE, NIE and TE). No significant undercoverage was observed throughout. For Scenario 2, where the outcome is rare marginally but is not conditionally rare in all quartiles of the mediator, both naive approaches yielded more important biases for the NDE and NIE. The approximate approach with IPW yielded relative biases below 10% for both the NDE and NIE but greater than 10% for the TE. The biases for the approximate approach with the mediator model fitted using the controls only were smaller, in absolute values, than for the approximate approach with IPW. The exact approach with IPW was the least biased for the NDE and NIE among all approaches compared. We observed that all approaches except the approximate naive and IPW approaches showed small biases for the TE. The results for Scenario 3, which features an outcome that is relatively common both marginally and conditionally, were similar to those obtained for Scenario 1. Although the outcome was relatively prevalent in Scenario 4 (marginal prevalence of 27.6%, conditional prevalences between 23% and 36%), the relative biases were small for all approaches and effects. In Scenario 5, the approximate approach with IPW yielded positive relative biases exceeding 10% for both NDE and NIE, largely impacting the TE (relative bias of 22.8%). The approximate approach with the mediator model fitted among the controls only yielded smaller biases in absolute value. For this approach, the positive bias seen for the NDE estimator coupled with the negative bias for the NIE estimator produced a TE estimator with only small bias. The exact approach with IPW yielded small relative biases for all effects in this scenario. Across all five scenarios, the unified approach often presented the smallest variability, however its performance in terms of RMSE depended on the amount of bias exhibited in a given scenario.
The results on the misspecification of the prevalence parameter \(\pi\) when \(n=1000\) are found in Figures A1A5 in Additional file 3. The impact of the misspecification of \(\pi\) on the natural direct and indirect effects results was generally minor for relative errors between 20% and 20% for the exact approach with IPW. When the parameter \(\pi\) was importantly underestimated (corresponding to relative errors towards 99%), all the approaches accounting for the casecontrol design behaved similarly and could exhibit large departures from the true values of the effects. In this extreme case of prevalence misspecification, the \(\theta _0\) parameter was largely underestimated by the exact approach with IPW in each scenario (results not shown), which yielded diminished differences between natural effects estimates obtained from the exact and approximate approaches. The exact approach was also seen affected when \(\pi\) was importantly overestimated, most notably in Scenarios 4 and 5, but such relative errors implied that the posited outcome prevalences were larger than the exposure prevalences in these cases (corresponding to relative errors larger than 45%). For the TE, the approximate approach with IPW was noticeably more impacted than the exact approach with IPW. For this effect, the latter approach exhibited highly stable average estimates through the misspecification grid. This is not unexpected given that the exposure coefficient of an outcome logistic model in a non mediation analysis would not be impacted by the casecontrol design.
The average estimated regression coefficients for the binary mediator case when \(n=1000\) are presented in Table A2 of Additional file 3. We observed greater differences between the regression coefficients of the mediator model estimated from the studied approaches in the binary mediator case than in the continuous one. Nonetheless, the qualitative conclusions regarding the impact of design on the estimated regression coefficients for the binary mediator case were similar to those for the continuous mediator case.
The natural effects results for the binary mediator case when \(n=1000\) are presented in Tables 6, 7, 8, 9, and 10. For Scenario 1, where the outcome is rare both marginally and conditionally, all approaches except the approximate naive approach showed absolute relative biases less than 2% for all effects (NDE, NIE and TE). Some undercoverage was observed for the exact naive approach for the NIE. For Scenario 2, where the outcome is rare marginally but is not conditionally rare in strata defined by the levels of the binary exposure and binary mediator, all approaches except the exact approach with IPW yielded large relative bias for the NDE. The latter approach was also found with minimal relative bias for the NIE. For this effect, a large relative bias was observed for the approximate approach with the mediator model fitted using controls only while the relative bias for the approximate IPW approach was near but below 10%. The exact approach with IPW was also found particularly performant in terms of RMSE for the NDE and NIE in this scenario. The approximate approach with IPW yielded a relative bias exceeding 50% for the TE, while all other approaches accounting for the design produced a relative bias below 3% for this effect. The results for Scenario 3, which features an outcome that is relatively common both marginally and conditionally, were similar to those obtained for Scenario 1. However, while the relative biases were small throughout, some undercoverage was observed for the NIE in this scenario. Upon inspection of CIs for the NIE in this scenario (see Table 8), all approaches yielded larger average widths for the bootstrap CIs compared to the delta CIs, most notably the approaches accounting for the casecontrol design (results not shown). This would explain why the corresponding bootstrap CIs were found having better coverage than the delta CIs. In Scenarios 4 and 5, which both feature a common outcome, only the exact approach with IPW exhibited acceptable relative biases and coverage. The bias and undercoverage of other approaches were larger in Scenario 5 than in Scenario 4. Similar to what was observed in the continuous mediator case, the approximate approach with the mediator model fitted using the controls only and the unified approach did not exhibit relative bias issues for the TE in Scenario 5, unlike the approximate approach with IPW.
The impact of misspecification of \(\pi\) on the natural effects estimates obtained from the IPW approaches was more visible in the binary mediator case with \(n=1000\) (see Figures A6A10 in Additional file 3). Nonetheless, the natural effects estimates from the exact approach with IPW were globally closer to the true effects over the middle of the misspecification grid for \(\pi\) (relative errors between 20% and 20%) in all scenarios except in Scenario 1 where the approximate approach with IPW was uniformly closest.
The results when \(n=500\) are presented in Additional file 4 (Tables A1A5 for the continuous mediator case and Tables A6A10 for the binary mediator case). The biases of the estimators were found generally larger when \(n=500\) as when \(n=1000\) and the bias patterns with respect to the studied estimators were roughly preserved. The coverage probabilities were found generally closer to the nominal value of 0.95 when \(n=500\) as opposed to when \(n=1000\). In the continuous mediator case, while the gain in using the exact approach with IPW was still visible from a bias perspective, it generally vanished when evaluated from a RMSE perspective. The exact approach with IPW still remained performant from a RMSE perspective in the binary mediator case.
Real data analysis
In this section, we apply the studied mediation analysis approaches to data from the PROVAQ study, a populationbased casecontrol study on ovarian cancer [20]. It is wellestablished that oral contraceptive use lowers the risk of developing epithelial ovarian cancer [24]. However, the mechanisms of this protection are not clear. A longstanding model of ovarian carcinogenesis is the “incessant ovulation hypothesis”, which posits that ovulation entails repeated trauma and repair of the ovarian surface epithelium, and thus increases the possibility of DNA mutations leading to cancer initiation [25]. The contraceptive mechanism of most oral contraceptive types is ovulation suppression [26], thus the reduced risk of ovarian cancer with oral contraceptive use supports this hypothesis. However, it has been suggested that the magnitude of risk reduction with oral contraceptive use is stronger than that would be expected based on number of ovulations alone, and thus other mechanisms may be involved [27, 28]. In this application, the aim was to estimate the association between oral contraceptive use and ovarian cancer risk considering the natural mediation effects via the total number of ovulatory cycles over the lifetime.
Participants in the PROVAQ study were recruited from 2011 to 2016 and included Canadian citizens aged 1879 years who resided in the greater Montreal area. Incident cases were identified in the major hospitals treating ovarian cancer in the study region while controls were selected from the Quebec electoral list and were frequency matched to cases by 5year age group and Montreal region. Data were collected in an inperson interview. The final number of eligible participants was 498 cases of borderline (\(n=134\)) and invasive (\(n=364\)) ovarian cancers and 908 controls. A detailed description of the PROVAQ study was published previously [20]. The current analysis was restricted to cases of invasive ovarian cancer (\(n=364\)), which is the ovarian cancer type that has been consistently associated with oral contraceptive use.
The binary exposure variable was defined as the duration of oral contraceptive use, dichotomized as \(\ge\) 10 years vs. < 10 years, the former level corresponding to the duration when a lower ovarian cancer risk is seen most consistently [24, 29]. The mediator, considered as a continuous variable, was defined as the lifetime number of ovulatory cycles, as calculated in the Cancer and Steroid Study (CASH) (equation 1) [30]. The binary outcome of casecontrol status represented incident invasive ovarian cancer cases and controls. Age and highest level of education attained were considered as potential confounding variables. Age was measured at diagnosis for cases and at interview for controls. Education was dichotomized as education level above high school or not. Lifetime number of ovulatory cycles could not be calculated for one control due to missing data, thus the final sample for the current analysis included 364 cases and 907 controls. Table 11 describes the cases and controls according to the variables used in our analysis.
All studied approaches were used to obtain conditional natural effects (NDE and NIE) assuming an interaction term between the mediator and exposure in the outcome regression model. As in the simulations, the conditioning values for the covariates were their average values in the sample (58.48 for age and 0.672 for education). In the exact and approximate approaches with IPW, we set \(\pi =13.5/100\,000\), which corresponds to the annual incidence rate of ovarian cancer in Canada [31]. For the approximate approach with the controls only, we note that, since the controls were obtained through incidence density sampling, the equivalence (12) should hold exactly rather than approximately [32]. Moreover, in this context, conditions to interpret the ORs as instantaneous rate ratios would be the proportionalhazards assumption over the 5year study period and the constant proportion of exposed (that is, longterm users of oral contraceptives) over that period [33].
Tables A1 and A2 of Additional file 5 show the estimated regression coefficients for the mediator and outcome models, respectively. The values shown in these tables correspond to those obtained using the exact approach with IPW (Exact_IPW). The point estimates are virtually the same as those obtained using the approximate approach with IPW (Approx_IPW), with very slight differences in the standard errors reported (results not shown). From Table A1 (Additional file 5), we see a strong association between the longterm use of oral contraceptives (exposure) and the lifetime number of ovulatory cycles (mediator), as expected. In the outcome model (see Table A2, Additional file 5), which included lifetime number of ovulatory cycles, longterm use of oral contraceptives was not found to be associated with ovarian cancer (outcome), either as a main effect term or as part of an interaction term with lifetime number of ovulatory cycles.
The results of the mediation analysis are found in Table 12. The total effect and the natural direct and indirect effects were found similar across the approaches. We note that the results obtained with the exact and approximate approaches based on IPW (Approx_IPW and Exact_IPW) are practically identical. The TE ORs suggest that the risk of ovarian cancer at any time point is reduced with the longterm use of oral contraceptives (exact approach TE estimate: 0.571; 95% CI: 0.414 to 0.787). Natural effects estimates suggest a protective effect of longterm use of oral contraceptives that is both direct and indirect, but the results are not statistically significant. NDE ORs were found to be farther away from the null effect value (\(OR=1\)) than the NIE ORs, suggesting that the decrease in risk with longterm use of oral contraceptives is more important through pathways not involving the total number of ovulatory cycles over life.
Because the exposuremediator interaction term included in the outcome model was not significant (Pvalue \(=0.59\), see Table A2 of Additional file 5), a secondary mediation analysis which excluded that term from the outcome model was performed. The corresponding estimated regression coefficients and mediation effects are presented in Tables A3A4 of Additional file 5, respectively. In this simpler outcome model, the exposure was found to be associated with the outcome (compare Tables A2 and A3 from Additional file 5). Some changes in the magnitude of the natural effects were observed : while the NDE ORs were again farther away from the null than the NIE ORs, the NDE and NIE ORs computed from this simpler model were respectively closer and farther to the null than when computed using the outcome model allowing for an exposuremediator term. Moreover, significance was achieved for the NIE. Specifically, considering a longterm use of oral contraceptives in all the population, we would obtain near 20% risk reduction for ovarian cancer if lifetime number of ovulatory cycles were allowed to vary according to the longterm use of oral contraceptives or not (exact approach NIE estimate: 0.814 and 95% CI: 0.693 to 0.955).
Discussion
In this article, we investigated the performance of different parametric regressionbased approaches for the estimation of natural direct and indirect effects with a binary outcome and either a continuous or a binary mediator using casecontrol data. We have found that all approaches investigated yielded essentially similar results when the outcome was rare or relatively rare both marginally and conditionally. However, some differences between approaches were observed when the outcome was more common marginally and/or conditionally. In particular, only the exact approach with IPW was found to yield acceptable results in all of the simulation scenarios investigated. Regarding both approximate approaches by VanderWeele and collaborators, we have observed that the approximate approach that used the controls for the estimation of the mediator model yielded an estimator of the total effect that was less biased than when IPW was used. Indeed, while the estimated regression coefficients were appropriately corrected for the casecontrol design using IPW, the closedform formulas used for the approximate NDE, NIE and TE estimands produced the biases observed for the natural effects estimators in some of the scenarios investigated. The unified approach proposed by Satten and collaborators was observed having similar issues with bias as the other approximate approaches for the estimation of the NDE and NIE. This unified approach was also found closest in behavior to the approximate approach with the controls only ; in particular, they agreed on the estimation of the regression parameters of the mediator model.
We have also investigated the impact of misspecifying the prevalence \(\pi\) in the approximate and exact approaches that rely on a userselected prevalence value (IPW). In our simulations, in which the relative misspecification of \(\pi\) was allowed to range between \(99\%\) and \(100\%\), we observed that the misspecification of \(\pi\) was less of a concern than the approximate or exact nature of the natural effects estimands when the misspecification was moderate. When the prevalence was importantly underestimated, all studied approaches were found to behave similarly. This is an interesting observation since one can thus view the “controls only” strategy as making implicitly an extreme choice for the user provided prevalence parameter \(\pi\). Indeed fitting the mediator model with controls only is equivalent to setting the yprevalence parameter to zero, in which situation the cases receive null weights when fitting this model.
In this work, we considered the exact estimators with IPW to allow for direct comparisons with the approximate approach with IPW and provide an evaluation of the ExactMed R package for the estimation of natural effects with casecontrol data. However, other strategies for the estimation of the regression parameters to be used in exact estimators could be envisaged in this context. In the case of a binary mediator, Doretti et al. [14] proposed Mestimation or maximum likelihood estimation for the simultaneous estimation of the regression coefficients of the mediator and outcome models, but nevertheless assume the population prevalence \(\pi\) known for implementing the correction related to the intercept coefficient of the outcome model. These authors found that such approaches yield estimators that both properly adjust for the casecontrol design and exhibit increased efficiency as compared to IPW.
Lastly, we believe worth raising the fact that, in the context of casecontrol study designs, the choice of the conditioning values of the covariates used for computing the conditional natural effects may produce interpretation issues. As pointed out in VanderWeele and Tchetgen Tchetgen [32], using the empirical means of the covariates \(\bar{\boldsymbol{C}}\) found in a selected sample may not well approximate the population averages E[C] (even with a large sample). Currently, and to the best of our understanding, this is the default procedure in packages CMAverse and ExactMed. Therefore, to the extent that \(\bar{\boldsymbol{C}}\) is not a convergent estimator for E[C], this has for consequence that the studied conditional natural effects estimators do not exactly target the correct estimands, which are conceptualized to be defined based on the population means. While this was found practically inconsequential in the simulations, it could be otherwise in other setsup. Automatically computing the conditional natural effects with the covariates means corrected using IPW could provide a sensible upgrade when a casecontrol option is used.
Conclusion
We have brought additional insights on existing regressionbased approaches for estimating natural direct and indirect effects for a binary outcome and a continuous or binary mediator using data from casecontrol study designs. Studied estimators rely either on the ROA or knowledge of the outcome prevalence (incidence) in the population, or both. Given that the former can be difficult to assess with respect to all relevant strata formed by the conditioning variables (exposure and mediator) of the outcome model and the latter difficult to specify exactly, we recommend evaluating the robustness of natural effects estimates by use of different estimators. However, approximate mediation approaches should be avoided or regarded with caution in situations where a violation of the ROA applies or is expected. As was found in the context of cohort study designs, the exact estimators investigated herein circumvented the difficulties associated with this assumption, and are thus to be favored in the previous situations. Nonetheless, the performance of these estimators, as the approximate estimators based on IPW, depends on the correct specification of the outcome prevalence (incidence) parameter \(\pi\) and we cannot eliminate the possibility that the exact estimators yield worse results than the approximate ones, even for moderate misspecification of \(\pi\). Sensitivity analyses with respect to the specification of \(\pi\) should be performed whenever there is significant uncertainty regarding the outcome prevalence (incidence) in the population.
As a final remark, the exact approaches studied herein have not been yet evaluated for mediation analysis with multiple mediators based on cohort data. Considering extant knowledge for multiple mediation analysis (e.g., [34]), it is reasonable to believe that these approaches for a single mediator could be used separately on each mediator when they are conditionally independent given the covariates in the population, and that IPW could be used to account for the design if implemented on casecontrol data. This interesting line of inquiry should be evaluated in future research.
Availability of data and materials
The data from PROVAQ that support these findings are available from the PROVAQ PI (AK) upon reasonable request and institutional approval.
Abbreviations
 boot:

Bootstrap
 CI:

Confidence interval
 CP:

Coverage probability
 IPW:

Inverse probability weighting
 NDE:

Natural direct effect
 NIE:

Natural indirect effect
 PROVAQ:

PRevention of OVArian Cancer in Quebec
 RMSE:

Root mean squared error
 ROA:

Rare outcome assumption
 SD:

Standard deviation
 SE:

Standard error
 TE:

Total effect
References
Baron RM, Kenny DA. The moderatormediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. J Pers Soc Psychol. 1986;51(3):1173–82. https://doi.org/10.1037/00223514.51.6.1173.
Robin JM, Greenland S. Identifiability and exchangeability for direct and indirect effects. Epidemiology. 1992;3(2):143–55. https://doi.org/10.1097/0000164819920300000013.
Pearl J. Direct and indirect effects. In: Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. UAI’01. San Francisco: Morgan Kaufmann Publishers Inc.; 2001. p. 411–420.
VanderWeele TJ. Explanation in causal inference: Methods for mediation and interaction. New York: Oxford University Press; 2015.
VanderWeele TJ, Vansteelandt S. Odds ratios for mediation analysis for a dichotomous outcome. Am J Epidemiol. 2010;172(12):1339–48.
Valeri L, Vanderweele TJ. Mediation analysis allowing for exposuremediator interactions and causal interpretation: theoretical assumptions and implementation with SAS and SPSS macros. Psychol Methods. 2013;18(2):137–50.
Samoilenko M, Blais L, Lefebvre G. Comparing logistic and logbinomial models for causal mediation analyses of binary mediators and rare binary outcomes: evidence to support crosschecking of mediation results in practice. Observational Stud. 2018;4(1):193–216.
Samoilenko M, Lefebvre G. ParametricRegressionBased Causal Mediation Analysis of Binary Outcomes and Binary Mediators: Moving Beyond the Rareness or Commonness of the Outcome. Am J Epidemiol. 2021;190(9):1846–58.
Samoilenko M, Lefebvre G. An exact regressionbased approach for the estimation of natural direct and indirect effects with a binary outcome and a continuous mediator. Stat Med. 2023;42(3):353–87.
Cheng C, Spiegelman D, Li F. Estimating the natural indirect effect and the mediation proportion via the product method. BMC Med Res Methodol. 2021;21:253.
Doretti M, Raggi M, Stanghellini E. Exact parametric causal mediation analysis for a binary outcome with a binary mediator. Stat Methods Appl. 2022;31(1):87–108.
Jewell NP. Statistics for Epidemiology. Boca Raton: Chapman & Hall/CRC; 2004.
VanderWeele TJ. Mediation analysis: a practitioner’s guide. Annu Rev Public Health. 2016;37:17–32.
Doretti M, Genbäck M, Stanghellini E. Mediation analysis with casecontrol sampling: Identification and estimation in the presence of a binary mediator. 2022. arXiv preprint arXiv:2211.09420v1.
Wang J, Ning J, Shete S. Mediation analysis in a casecontrol study when the mediator is a censored variable. Stat Med. 2019;38(7):1213–29.
Kim YM, Cologne JB, Jang E, Lange T, Tatsukawa Y, Ohishi W, et al. Causal mediation analysis in nested casecontrol studies using conditional logistic regression. Biom J. 2020;62(8):1939–59. https://doi.org/10.1002/bimj.201900120.
Satten GA, Curtis SW, SolisLemus C, Leslie EJ, Epstein MP. Efficient estimation of indirect effects in casecontrol studies using a unified likelihood framework. Stat Med. 2022;41(15):2879–93.
Shi B, Choirat C, Coull BA, VanderWeele TJ, Valeri L. CMAverse: A Suite of Functions for Reproducible Causal Mediation Analyses. Epidemiology. 2021;32(5):e20–2.
Caubet M, Samoilenko M, Lefebvre G. ExactMed: exact mediation analysis for binary outcomes. 2023. R package version 0.3.0. https://cran.rproject.org/web/packages/ExactMed/index.html. Accessed date 22 Sept 2023.
Koushik A, Grundy A, Abrahamowicz M, Arseneau J, Gilbert L, Gotlieb W, et al. Hormonal and reproductive factors and the risk of ovarian cancer. Cancer Causes Control. 2017;28(5):393–403. https://doi.org/10.1007/s1055201608489.
Pearl J. The Causal Mediation Formula  A Guide to the Assessment of Pathways and Mechanisms. Prev Sci. 2012;13(4):426–36. https://doi.org/10.1007/s1112101102701.
VanderWeele TJ, Vansteelandt S. Conceptual issues concerning mediation, interventions and composition. Stat Interf. 2009;2(4):457–68.
Nguyen TQ, Schmid I, Ogburn EL, Stuart EA. Clarifying causal mediation analysis: Effect identification via three assumptions and five potential outcomes. J Causal Infer. 2022;10(1):246–79. https://doi.org/10.1515/jci20210049. [cited 20230905].
Collaborative Group on Epidemiological Studies of Ovarian Cancer, et al. Ovarian cancer and oral contraceptives: collaborative reanalysis of data from 45 epidemiological studies including 23 257 women with ovarian cancer and 87 303 controls. Lancet. 2008;371(9609):303–314.
Fathalla MF. Incessant ovulationa factor in ovarian neoplasia? Lancet. 1971;298(7716):163.
Milsom I, Korver T. Ovulation incidence with oral contraceptives: a literature review. BMJ Sex Reprod Health. 2008;34(4):237–46.
Ness RB, Cottreau C. Possible role of ovarian epithelial inflammation in ovarian cancer. J Natl Cancer Inst. 1999;91(17):1459–67.
Risch HA. Hormonal etiology of epithelial ovarian cancer, with a hypothesis concerning the role of androgens and progesterone. J Natl Cancer Inst. 1998;90(23):1774–86.
Havrilesky LJ, Moorman PG, Lowery WJ, Gierisch JM, Coeytaux RR, Urrutia RP, et al. Oral contraceptive pills as primary prevention for ovarian cancer: a systematic review and metaanalysis. Obstet Gynecol. 2013;122(1):139–47.
Schildkraut JM, Bastos E, Berchuck A. Relationship Between Lifetime Ovulatory Cycles and Overexpression of Mutant p53 in 932 Epithelial Ovarian Cancer. J Natl Cancer Inst. 1997;89(13):932–8.
Canadian Cancer Statistics Advisory Committee in collaboration with the Canadian Cancer Society, Statistics Canada and the Public Health Agency of Canada. Canadian Cancer Statistics 2021. Toronto: Canadian Cancer Society; 2021.
VanderWeele TJ, Tchetgen Tchetgen EJ. Mediation Analysis With Matched CaseControl Study Designs. Am J Epidemiol. 2016;183(9):869–70.
Kerr S, Greenland S, Jeffrey K, Millington T, Bedston S, Ritchie L, Simpson CR, Fagbamigbe AF, Kurdi A, Robertson C, Sheikh A, Rudan I. Understanding and reporting odds ratios as rateratio estimates in casecontrol studies. J Glob Health. 2023;13:04101.
VanderWeele T, Vansteelandt S. Mediation analysis with multiple mediators. Epidemiol Methods. 2014;2(1):95–115.
Acknowledgements
This research was enabled in part by support provided by Calcul Québec (www.calculquebec.ca) and the Digital Research Alliance of Canada (www.alliancecan.ca). The first and last authors thank Mariia Samoilenko for a review of this article.
Funding
This work was supported by the Natural Sciences and Engineering Research Council of Canada (RGPIN202005473) and the Fonds de recherche du QuébecSanté (FRQS). Geneviève Lefebvre is a FRQS Research Scholar. Kevin L’Espérance is supported by a doctoral scholarship from FRQS.
Author information
Authors and Affiliations
Contributions
GL and MC designed the study; MC performed the simulations; AK provided the data from the PROVAQ study; KL and MC analyzed the PROVAQ data, and AK and KL interpreted the results; MC and GL prepared the manuscript; GL directed the project. All authors have read and approved the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
This study was conducted according to the guidelines laid down in the Declaration of Helsinki, and all procedures involving human subjects were approved by the Research Ethics Committee of the Centre de recherche du CHUM. Written informed consent was obtained from all participants.
Written informed consent was obtained from participants, and the study was approved by the Institutional Review Boards of participating hospitals.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
Variance of natural effects estimators. We explain how the variance of natural effects estimators and the confidence intervals for these effects were obtained via the exact approaches (naive and IPW) for both the continuous and the binary mediator cases.
Additional file 2.
Details on the simulation scenarios. We present the true values of the parameters used in the five simulation scenarios for each type of mediator (continuous and binary). We also present the corresponding prevalences for the outcome and exposure.
Additional file 3.
Additional simulation results when n = 1000. We present the average regression parameters estimated using the studied strategies in our main simulations with n = 1000. We also present figures showing the impact of the misspecification of the prevalence parameter \(\pi\) on the performance of the exact and approximate approaches with IPW in the simulations with n = 1000.
Additional file 4.
Simulation results with n = 500. We present the natural effects estimates obtained with the studied mediation approaches with generated samples of size n = 500.
Additional file 5.
Additional results for the realdata analysis. We present mediation analysis results based on the data from the PROVAQ study, including the estimated mediator and outcome models for the main analysis as well as results corresponding to the simpler outcome model which omits the interaction term between the exposure and mediator.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Caubet, M., L’Espérance, K., Koushik, A. et al. An empirical evaluation of approximate and exact regressionbased causal mediation approaches for a binary outcome and a continuous or a binary mediator for casecontrol study designs. BMC Med Res Methodol 24, 72 (2024). https://doi.org/10.1186/s1287402402156y
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1287402402156y