Empirical confounder identification strategies
Overview
Five strategies were used: significance criteria with p-value cutoffs fixed at ≤0.05 and ≤0.2 (in which a putative confounder is adjusted for if the p-value of the t-test of the null hypothesis that its effect on the outcome equals zero is smaller than the cutoff), and the change-in-estimate (CIE) criterion with three different cutoff levels (fixed a priori at 10 %, with type I error controlled to a desired level, and with type II error controlled to a desired level). The observed change in estimate due to covariate Z is calculated as Δ = |(θ_0 − θ_Z)/θ_0|, where θ_0 is the effect estimate of interest not adjusted for the suspected confounder Z and θ_Z is the effect estimate adjusted for Z. When the CIE approach is used, a covariate Z is included in the final model if its inclusion in the regression model produces Δ ≥ δ_c, where δ_c is 0.1 in the 10 % CIE approach or is determined by simulations as described below. We describe the simulation-based CIE approaches in more detail below, as well as a pre-screening step aimed at reducing chance identification of a risk factor as a confounder.
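As a minimal sketch of this calculation in R (assuming a linear outcome model and an illustrative simulated data frame dat with columns y, x and z; the variable names and values are ours, not those of the analyses reported here):

# Illustrative data (not from the study): z confounds the x-y association
set.seed(1)
n <- 500
z <- rnorm(n)
x <- 0.4 * z + rnorm(n)
y <- 0.1 * x + 0.1 * z + rnorm(n)
dat <- data.frame(y, x, z)

# Change in estimate due to covariate z, and the fixed 10 % CIE rule
theta_0 <- coef(lm(y ~ x, data = dat))["x"]        # unadjusted effect estimate
theta_z <- coef(lm(y ~ x + z, data = dat))["x"]    # estimate adjusted for z
delta   <- abs((theta_0 - theta_z) / theta_0)      # observed change in estimate
adjust_for_z <- (delta >= 0.1)                     # delta_c = 0.1 in the 10 % CIE approach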
Simulation-based change in estimate (CIE) approach
As a way of improving on an empirical approach with criteria fixed a priori, we previously proposed a simulation-informed CIE strategy [5] that performs better in confounder identification and causal effect estimation [17]. In brief, the simulation-informed CIE criterion determines the change in the effect estimate of interest that arises when the exposure-outcome association is adjusted for an independent random variable. With this approach, an independent random variable with a distribution identical to that of the observed putative confounder is drawn, and the causal effect estimates of the exposure on the outcome with and without adjustment for this independent random variable are obtained. Next, we record the change in estimate that results from adjusting for this independent random variable. The procedure is repeated, and the resulting distribution of changes in effect estimates upon adjustment indicates where to place the cut-off for the CIE criterion in order to achieve the desired type I error; e.g., for a 5 % error the 95th percentile of the distribution is used. One can also adopt a CIE criterion with a desired type II error. To do so, one repeatedly simulates a variable with particular correlations with the exposure and outcome, and compares the CIE from models that do and do not adjust for this simulated confounding variable. Using the sth percentile of the simulated CIEs as the cutoff then yields a type II error of s. In our simulations, we focus on selection of these two CIE cutoffs. In the next section, we describe this procedure in more detail, incorporating consideration of measurement error.
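A minimal sketch of the type-I-error-controlled cutoff in R, continuing the illustrative data frame dat from the sketch above (permuting z is one way to obtain an independent variable with the same marginal distribution as the observed putative confounder):

# Simulation-based CIE cutoff controlling type I error at 5 % (sketch)
K   <- 1000
th0 <- coef(lm(y ~ x, data = dat))["x"]              # unadjusted estimate (fixed)
delta_k <- replicate(K, {
  z0   <- sample(dat$z)                              # independent pseudo-confounder
  thz0 <- coef(lm(y ~ x + z0, data = dat))["x"]      # estimate adjusted for z0
  abs((th0 - thz0) / th0)                            # change in estimate in this draw
})
delta_c <- quantile(delta_k, 0.95)                   # 95th percentile -> 5 % type I error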
Screening potential structural confounders
In preliminary investigations, application of the simulated CIE approach resulted in an unacceptably high rate (e.g. 50–80 % in some instances) of identification of a risk factor as a structural confounder when it was in fact not correlated with the exposure of interest in the population (i.e., by the data-generating process). We identified chance correlation of exposure and covariate in finite samples as the culprit of this artifact and developed a screening step that evaluated the correlation of exposure and putative confounder before the covariate was evaluated further via the five strategies described above. Specifically, the covariate was considered further in the identification of structural confounding only if the hypothesis that the observed exposure and covariate were uncorrelated was rejected; if that hypothesis was not rejected, the covariate was excluded from further evaluation.
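A minimal sketch of this screening step in R, again using the illustrative dat from above:

# Screening step (sketch): carry z forward as a candidate structural confounder
# only if the null hypothesis of zero Pearson correlation between exposure and
# covariate is rejected at the 0.05 level
screen <- cor.test(dat$x, dat$z, method = "pearson")
keep_z_as_candidate <- (screen$p.value <= 0.05)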
Simulation study: overall framework and method of analysis
In specific simulations that we undertook, we assumed that (a) the exposure is associated with the outcome and (b) the putative confounder is indeed a confounder by virtue of its association with both exposure and the outcome (but not the descendant of them). As in many real-life situations, the exposure and confounder are measured with error: for simplicity, we focus on additive classical measurement error models with uncorrelated errors (but our simulation framework can readily be amended to accommodate more complex situations).
The disease model considered in our investigation was of the form g(Y) = α + βX + γZ, with g() representing the link function of a generalized linear model and fixed effects α (background rate or intercept), β (influence of exposure X on outcome Y), and γ (influence of covariate Z on outcome Y). The regression coefficient β is identical to the true value of the effect of interest only in linear regression; for logistic and Cox proportional hazards regression, the effects of interest are calculated as the relative risk (RR) and hazard ratio (HR), respectively. We denote these true effects of interest as θ for generality.
We assumed that we can only observe realizations of the true exposure and confounder with classical additive measurement error models X* = X + ε_x and Z* = Z + ε_z, where the error terms are unbiased and independent of each other. The estimates of the regression coefficient β from (Y, X*, Z*) data with and without adjustment for Z* are denoted by β_Z* and β_0*, respectively. These regression coefficients can be used to calculate estimates of the effect of interest θ as θ_Z* and θ_0*, with and without adjustment, respectively; the superscript “*” denotes variables and estimates contaminated by measurement error.
The screening test for the Pearson correlation of X* and Z* being different from zero used a p ≤ 0.05 cutoff. The datasets in which covariate Z* passed this screening (i.e., the null hypothesis of zero correlation was rejected) were evaluated using the simulated CIE cutoff calculated as follows.
The simulated CIE cut-offs in the presence of measurement error are determined by comparing effect estimates relating X* to Y with and without adjusting the regressions of Y on X* for an independent random variable Z_0 with a distribution identical to that of Z*, over K simulations. Let us denote these effect estimates, which are functions of the regression coefficients, as θ_0,k* when the unadjusted coefficient is used and θ_Z0,k* when the coefficient adjusted for Z_0 (not the same as adjusted for Z) is used in the kth simulation. The changes in the estimates in each simulation are then calculated, in general, as
$$ \delta_k = \left|\left(\theta_{0,k}^{*} - \theta_{Z_0,k}^{*}\right)/\theta_{0,k}^{*}\right| $$
and the qth percentile of δ_k determined over the K simulations is the cutoff for the CIE that would lead to a type I error of 1 − q, i.e. δ_c. The CIE cutoff simulated to achieve a desired type II error can be obtained in a similar manner, with the independent random variable Z_0 replaced by a random variable correlated with X* and Z* according to the simulation setting (i.e., under the assumption that we guessed correctly the true nature of the associations in the data-generating process); the sth percentile of δ determined over the K simulations is then the cutoff for the CIE that would lead to a type II error of s. As with all power calculations, this requires an informed guess of the structure we aim to detect and is therefore the more difficult criterion to establish objectively (e.g., we do not know the true values of all the correlations from data contaminated by measurement error), as opposed to the criterion that strives to control the type I error.
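A minimal sketch of how the type-II-error-controlled cutoff might be obtained, under our reading of the procedure above: data are repeatedly generated under the assumed confounding structure, measurement error is added, and a low percentile of the resulting changes in estimate is used as the cutoff. A continuous outcome and all parameter values below are illustrative assumptions, not those of the paper.

# Type-II-error-controlled CIE cutoff (sketch)
set.seed(2)
K <- 1000; n <- 500
rho <- 0.3; beta <- 0.1; gamma <- 0.1          # assumed data-generating structure
sigma2_x <- 0.25; sigma2_z <- 0.25             # assumed measurement error variances
delta_alt <- replicate(K, {
  z <- rnorm(n)
  x <- rho * z + sqrt(1 - rho^2) * rnorm(n)    # corr(X, Z) = rho, both standard normal
  y <- beta * x + gamma * z + rnorm(n)         # continuous outcome
  x_star <- x + rnorm(n, sd = sqrt(sigma2_x))  # classical additive measurement error
  z_star <- z + rnorm(n, sd = sqrt(sigma2_z))
  th0 <- coef(lm(y ~ x_star))["x_star"]
  thz <- coef(lm(y ~ x_star + z_star))["x_star"]
  abs((th0 - thz) / th0)
})
delta_c_typeII <- quantile(delta_alt, 0.05)    # 5th percentile -> type II error of 5 %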
Simulation study: the specific scenarios
Our example synthetic data scenario features an outcome Y; three different types of outcome were generated, namely binary (with a disease prevalence at follow-up of 10 %), continuous (with a variance of 1), and survival (with a death rate at follow-up of 10 %). The exposure X and the true confounder Z were both simulated to follow standard normal distributions; Z is associated with both X (via Pearson correlation ρ ≠ 0) and Y (γ ≠ 0). The binary, continuous, and survival data were generated and fitted using a logistic model (ln(P(Y = 1)/P(Y = 0)) = ln(1/9) + βX + γZ), a linear model (Y = βX + γZ + ε_y, ε_y ~ N(0, 1)), and a Cox model (survival time ~ exp(βX + γZ − min(βX + γZ)), censored at survival time > 0.1), respectively. The survival times were generated as follows: (1) the mean survival time for all subjects equaled βX + γZ; (2) these mean survival times were linearly transformed to make them all positive by subtracting the minimum value of (βX + γZ); (3) the survival time for each subject was generated to follow an exponential distribution with rate parameter equal to the mean survival time from step (2); (4) survival times were censored at a value of 0.1 so that the outcome was observed in only 10 % of subjects.
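A minimal sketch in R of generating the three outcome types for one dataset, following the models above (parameter values are illustrative; the handling of the zero rate for the subject at min(βX + γZ) is our pragmatic choice and is not specified in the text):

# Generate exposure, confounder and the three outcome types (sketch)
set.seed(3)
n <- 500; beta <- 0.1; gamma <- 0.1; rho <- 0.3
z  <- rnorm(n)
x  <- rho * z + sqrt(1 - rho^2) * rnorm(n)       # X, Z standard normal, corr = rho
lp <- beta * x + gamma * z                       # linear predictor

# Binary outcome: logit P(Y = 1) = ln(1/9) + beta*X + gamma*Z (~10 % prevalence)
y_binary <- rbinom(n, size = 1, prob = plogis(log(1/9) + lp))

# Continuous outcome: Y = beta*X + gamma*Z + N(0, 1) error
y_cont <- lp + rnorm(n)

# Survival outcome: exponential times with parameter shifted to be positive,
# censored at 0.1 so that the event is observed for only a fraction of subjects
rate <- lp - min(lp)
rate[rate == 0] <- 1e-8                          # avoid a zero rate (our choice)
t_raw  <- rexp(n, rate = rate)
event  <- as.numeric(t_raw <= 0.1)               # 1 = event observed, 0 = censored
y_surv <- pmin(t_raw, 0.1)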
In illustrating the kind of information that this tool can yield, we obtained N = 10,000 simulation realizations of a cohort study (yielding a standard error of 0.5 %) of either n = 500 or 2,000 subjects, with ρ ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}, true causal associations of X-Y and Z-Y with β = γ = 0.1, and situations in which the exposure of interest is measured with error smaller than or equal to that of the putative confounder, i.e. ε_x ~ N(0, σ_x² ∈ {0.25, 1}) and ε_z ~ N(0, σ_z² ∈ {0.25, 1}). We used K = 10,000 to determine the simulation-based CIE cutoffs for each combination of parameters that defined the simulation framework above.
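A minimal sketch of the factorial parameter grid implied by these settings (the naming is ours):

# Factorial combinations of simulation parameters (sketch)
grid <- expand.grid(n        = c(500, 2000),
                    rho      = seq(0.1, 0.9, by = 0.1),
                    sigma2_x = c(0.25, 1),
                    sigma2_z = c(0.25, 1),
                    beta     = 0.1,
                    gamma    = 0.1)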
In each nth simulation realization (n = 1, …, N), when screening potential confounders, we evaluated the Pearson correlation of X* and Z* (ρ_n*) and rejected Z* as a potential confounder when the p-value for the null hypothesis ρ_n* = 0 was larger than 0.05. In such instances, the final model selected excluded Z*. If Z* remained in contention for the role of structural confounder after the screening test, we next estimated the effect of X* on Y in the simulated dataset by
- (a) fitting a regression model appropriate for each outcome with X* as the independent variable and Y as the dependent variable (i.e., not adjusting for Z*), resulting in an estimate of the effect of X on Y as θ_0,n*, which is a function of β_0,n*, and
- (b) fitting a regression model appropriate for each outcome with X* and Z* as independent variables and Y as the dependent variable (i.e., adjusting for Z*), resulting in an estimate of the effect of X* on Y as θ_Z,n*, which is a function of β_Z,n*.
The effect estimates and p-values resulting from these models of the nth simulation realization were compared and, depending on the confounder selection rule that was triggered, the final estimate of the effect of X on Y in that particular simulated dataset was taken from either model (a) or model (b).
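A minimal sketch of steps (a) and (b) and the selection of the final estimate for one simulated dataset with a continuous outcome, reusing the illustrative variables (y_cont, x, z, n) and the cutoff delta_c from the sketches above (sigma2_x and sigma2_z are again illustrative):

# Add classical measurement error to exposure and confounder
sigma2_x <- 0.25; sigma2_z <- 0.25
x_star <- x + rnorm(n, sd = sqrt(sigma2_x))
z_star <- z + rnorm(n, sd = sqrt(sigma2_z))

fit_a <- lm(y_cont ~ x_star)                       # (a) unadjusted for Z*
fit_b <- lm(y_cont ~ x_star + z_star)              # (b) adjusted for Z*
theta_0n <- coef(fit_a)["x_star"]
theta_Zn <- coef(fit_b)["x_star"]

# Example selection rule: simulated CIE criterion with cutoff delta_c;
# a significance criterion would instead inspect the p-value for z_star in fit_b
delta_n   <- abs((theta_0n - theta_Zn) / theta_0n)
theta_fin <- if (delta_n >= delta_c) theta_Zn else theta_0n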
We also calculated the root mean squared error (RMSE) of the effect estimate,
$$ \sqrt{\sum_{n=1}^{N} \left(\varphi_n - \theta\right)^2 / N}, $$
where in the nth simulation (n ∈ {1, …, N}) φ_n = θ_Z,n* if the confounder identification criteria suggested an adjustment and φ_n = θ_0,n* otherwise; recall that θ is the true value of the effect estimate set by the simulation.
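In R this is simply (sketch; phi is the vector of N final estimates and theta_true the value set by the simulation):

# RMSE of the selected effect estimates across simulation realizations
rmse <- function(phi, theta_true) sqrt(mean((phi - theta_true)^2))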
All simulations were carried out using R 3.2.0. The R code we provide allows one to test various CIE cutoffs in order to determine the percentage of the simulated datasets that correctly identify Z* as a confounder or the effect of X* as significant, as well as the RMSE resulting from the model selected after application of each confounder-identification strategy (Additional file 1).
Application to study design to clarify role of a confounder in NHANES
We illustrate the application of our approach in an example arising from an earlier analysis of exposure to total blood mercury (E) and depression (D) in 6,911 adults aged ≥40 years in the National Health and Nutrition Examination Survey (NHANES) 2009–2010 [18], approved by the National Center for Health Statistics Research Ethics Review Board; informed consent was obtained from all participants at the time of data collection, and further consent for specific analyses of these publicly available data is not needed. The dataset can be downloaded at http://wwwn.cdc.gov/Nchs/Nhanes/Search/nhanes09_10.aspx.
Contrary to an expected null association between exposure and outcome, a nonsensical inverse (protective) association was reported, and the original investigators argued that this was probably due to measurement error leading to residual confounding by fish consumption (F) [18]. That study assessed binary recall of fish consumption in the 30 days prior to data collection (F_obs). This variable does not demonstrate statistical properties that support its inclusion as a confounder in the final model because (a) the p-value for F_obs in the logistic regression model is 0.82, and (b) inclusion of F_obs in the final model does not affect the observed RR_ED|Fobs = 0.83 (OR_ED|Fobs = 0.79) of depression due to mercury to the third decimal place. Nonetheless, it is important to note that our preliminary test for potential confounding would not have rejected F_obs from the final model because there is evidence that it is correlated with exposure to mercury, albeit “weakly”: Pearson correlation of 0.39 with mercury exposure (95 % CI 0.37–0.41, p < 0.05). Furthermore, given the established effects of habitual fish consumption F on blood mercury E [19] and depression D [20], Ng et al. [18] suspected that F is a confounder of the association of total blood mercury with depression (see Fig. 1 for the DAG of the causal association), and that the pattern of results arose because F_obs is a poor measure of the unobserved F.
Let us suppose that in the course of future research we have a particular method of measuring fish consumption (W) with a known degree of measurement error that may prove to be superior to F_obs. It is important to note that we do not wish to use F_obs in the new research project with the same sample: it yielded what we believe to be confounded estimates, and the motivation of the new research would be to improve the quality of the data in order to understand the problem better. We need to simulate W because it is not given in the data but is a reflection of what we can hope to obtain in the study being planned under the assumption of a given causal structure: we can never hope to measure F itself but need to generate W and thereby evaluate its performance. We want to know two things: (a) whether W is a confounder of the mercury-depression relationship, and (b) whether models adjusted for this measure of fish consumption W will result in the hypothesized null mercury-depression association (i.e., RR_ED|W = 1, as opposed to the observed estimate RR_ED|Fobs = 0.83). Here, W is related to true habitual fish consumption F by classical measurement error: W = F + ε, ε ~ N(0, σ²), F ~ N(0, 1), and ε is independent of both E and F. To motivate these assumptions more specifically, reflecting common experience in exposure assessment, we consider F to be a measure of fatty acid intake obtained from a biomarker and normalized to a Gaussian distribution via log-transformation; hence the additive measurement error model for W and the distributional assumptions can be appropriate. In practice, such assumptions would be verified in a validation or reliability study.
We assumed that total blood mercury E is near-perfectly measured because a precise analytical technique exists. To simplify, we ignored the matter of etiologically relevant windows of exposure, although this may not be trivial because the biologic half-life of mercury in blood is short.
Based on prior research [18], we also assumed (a) that the association between F and D is based on the correlation of underlying continuous measures of F and D, set to ρ_FD = −0.35, and (b) that the correlation of F and E is ρ_FE = 0.39, the same as the observed correlation of F_obs and E. With these inputs, we simulated true (F) and mis-measured (W) values of fish consumption subject to different magnitudes of measurement error. Under various conditions of measurement error, we simulated W 10,000 times. Different degrees of error in the measured confounder, σ², were examined. We acknowledge that a different model for confounding could have been postulated and included in our simulation framework, but we followed the path deemed sensible by the original investigators in [18].
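A minimal sketch, under stated assumptions, of generating latent fish consumption F with the correlations above and an error-prone measurement W = F + error. In the actual application E and D come from NHANES; here everything is simulated, and rho_ED is a purely illustrative placeholder for the E-D correlation.

# Jointly simulate standardized E, a continuous liability for D, and F (sketch)
library(MASS)
set.seed(4)
n_obs  <- 6911
rho_FE <- 0.39; rho_FD <- -0.35; rho_ED <- -0.05       # rho_ED: illustrative only
Sigma  <- matrix(c(1,      rho_ED, rho_FE,
                   rho_ED, 1,      rho_FD,
                   rho_FE, rho_FD, 1), nrow = 3, byrow = TRUE)   # order: E, D, F
edf <- mvrnorm(n_obs, mu = c(0, 0, 0), Sigma = Sigma)
E        <- edf[, 1]
D_latent <- edf[, 2]
F_true   <- edf[, 3]
sigma2   <- 1                                          # measurement error variance in W
W <- F_true + rnorm(n_obs, sd = sqrt(sigma2))          # classical additive error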
To empirically determine whether measured fish consumption (W) would correctly remove confounding from the effect of mercury on depression, we recorded the proportion of simulated datasets in which the adjusted association of mercury with depression, RR_ED|W, had p > 0.05. This is akin to asking how well we would need to measure the confounder in order to have the power to observe the true exposure-response association upon adjustment. We also report the simulation-determined confounder identification criterion described above (i.e., aimed at controlling type I error at 5 %) to compare it to the conventional CIE of 10 %. Finally, we also determined the average and the 95 % empirical confidence intervals of the estimates of the mercury-depression association with adjustment for simulated values of W, based on the 10,000 simulations, in order to determine how well adjustment for W is able to recover a specified true causal effect of E on D adjusted for F. (The number of simulations was informed by the observation that it was sufficient to obtain stable behavior of the simulation realizations; in other situations, a different number of simulations may be needed.) This reflects a theoretical perspective on confounder identification in which, based on a pre-determined DAG, W is theoretically a confounder of the exposure-outcome association and should therefore be included in models regardless of its measurement properties. To visualize the empirical distribution of RR_ED|W, we plotted a histogram of the 10,000 simulated estimates with σ² = 1. The R code for the simulations can be found in the Online Supplementary Materials (Additional file 2).
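A minimal sketch of this empirical check for one simulated W, continuing the illustrative variables from the sketch above (D_binary is an illustrative dichotomization of the latent liability; in the application the observed NHANES depression indicator would be used):

# Fit the depression model adjusted for W and record whether the adjusted
# mercury-depression association is compatible with the null (p > 0.05)
D_binary <- as.numeric(D_latent > quantile(D_latent, 0.9))   # ~10 % "depressed"
fit_adj  <- glm(D_binary ~ E + W, family = binomial)
p_E_adj  <- summary(fit_adj)$coefficients["E", "Pr(>|z|)"]
null_after_adjustment <- (p_E_adj > 0.05)
# Repeating this over many simulated W (and F) gives the proportion of datasets
# in which adjustment for W recovers the hypothesized null association.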