Empirical confounder identification strategies
Overview
Five strategies were used: significance criteria with p-value cutoffs fixed at ≤0.05 and ≤0.2 (in which a putative confounder is adjusted for if the p-value of the t-test of the null hypothesis that its effect on the outcome equals zero is smaller than the cutoff), and the change-in-estimate (CIE) criterion with three different cutoff levels (fixed a priori at 10 %, with type I error controlled to a desired level, and with type II error controlled to a desired level). The observed change in estimate due to covariate Z is calculated as Δ = (θ_0 − θ_Z)/θ_0, where θ_0 is the effect estimate of interest not adjusted for the suspected confounder Z and θ_Z is the effect estimate adjusted for Z. When the CIE approach is used, a covariate Z is included in the final model if its inclusion in the regression model produces Δ ≥ δ_c, where δ_c is 0.1 in the 10 % CIE approach, or is determined by simulations as described below. We will describe the simulation-based CIE approaches in more detail below, as well as pre-screening aimed at reducing confounding by a risk factor by chance.
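As a minimal illustration of the CIE rule above (a Python sketch rather than the paper's R code; the function names are ours, and we take the absolute change, which is one reading of Δ):

```python
# Hypothetical effect estimates: theta_0 (unadjusted) and theta_z (adjusted for Z).
def change_in_estimate(theta_0, theta_z):
    """Relative change in the effect estimate caused by adjusting for Z."""
    return (theta_0 - theta_z) / theta_0

def include_covariate(theta_0, theta_z, delta_c=0.10):
    """CIE rule: retain Z in the final model if the change reaches delta_c.
    We use the absolute change; the text's Delta can be read either way."""
    return abs(change_in_estimate(theta_0, theta_z)) >= delta_c

# Adjustment moves the estimate from 1.50 to 1.20, a 20 % change,
# so Z is retained under the conventional 10 % cutoff.
print(include_covariate(1.50, 1.20))  # True
```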
Simulation-based change-in-estimate (CIE) approach
As a way of improving on an empirical approach with criteria fixed a priori, we previously proposed a simulation-informed CIE strategy [5] that performs better in confounder identification and causal effect estimation [17]. In brief, the simulation-informed CIE criterion determines the change in the effect estimate of interest that arises when the exposure of interest is adjusted for an independent random variable. With this approach, an independent random variable with a distribution identical to that of the observed putative confounder is drawn, and the causal effect estimates of the exposure on the outcome with and without adjustment for this independent random variable are obtained. Next, we record the change-in-estimate that results from adjusting for this independent random variable. The above procedure is repeated, and the resulting distribution of changes in effect estimates upon adjustment indicates where to place the cutoff for the CIE criterion in order to achieve the desired type I error; e.g., for a 5 % error the 95th percentile of the distribution is used. One can also adopt a CIE criterion with a desired type II error. To do so, one repeatedly simulates a variable with particular correlations with the exposure and outcome, and compares the CIEs from models that do and do not adjust for this simulated confounding variable. Using the sth percentile of the simulated CIEs as a cutoff yields a type II error of s. In our simulations, we focus on the selection of these two CIE cutoffs. In the next section, we describe this procedure in more detail, infusing it with consideration of measurement error.
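For the linear-outcome case, the type I error-controlled cutoff can be sketched as follows (Python rather than the paper's R; `ols_beta`, the sample size, and the seed are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def ols_beta(covariates, y):
    """Slope on the first covariate from least squares with an intercept."""
    A = np.column_stack([covariates, np.ones(len(y))])
    return np.linalg.lstsq(A, y, rcond=None)[0][0]

def simulated_cie_cutoff(x, y, K=2000, alpha=0.05):
    """(1 - alpha) quantile of the changes-in-estimate produced by adjusting
    for an independent N(0, 1) variable: the cutoff holding type I error near alpha."""
    theta_0 = ols_beta(x[:, None], y)                 # unadjusted estimate
    deltas = np.empty(K)
    for k in range(K):
        z0 = rng.standard_normal(len(y))              # independent pseudo-confounder
        theta_z0 = ols_beta(np.column_stack([x, z0]), y)
        deltas[k] = abs((theta_0 - theta_z0) / theta_0)
    return float(np.quantile(deltas, 1 - alpha))

n = 500
x = rng.standard_normal(n)
y = 0.1 * x + rng.standard_normal(n)                  # true beta = 0.1
cutoff = simulated_cie_cutoff(x, y)
print(round(cutoff, 3))
```

Covariates whose observed CIE exceeds `cutoff` would be retained; with a fixed seed the cutoff is reproducible.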
Screening potential structural confounders
In preliminary investigations, application of the simulated CIE approach resulted in an unacceptably high rate (e.g., 50–80 % in some instances) of identification of a risk factor as a structural confounder when it was in fact not correlated with the exposure of interest in the population (i.e., by the data-generating process). We identified chance correlation of exposure and covariate in finite samples as the culprit of this artifact, and developed a screening step that evaluated the correlation of exposure and putative confounder before evaluating it further via the five strategies described above. Specifically, only if the hypothesis that the observed exposure and covariate were uncorrelated was rejected was the covariate considered further in the identification of structural confounding; otherwise, the covariate was excluded from further evaluation.
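The screening step can be sketched as follows (standard-library Python; we use the Fisher z-transform with a normal approximation as a simple stand-in for the correlation test, which is our choice and not necessarily the paper's exact implementation):

```python
import math
import random

def pearson_r(x, z):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    sxx = sum((a - mx) ** 2 for a in x)
    szz = sum((b - mz) ** 2 for b in z)
    return sxz / math.sqrt(sxx * szz)

def passes_screening(x, z, alpha=0.05):
    """Keep Z as a candidate structural confounder only if H0: rho = 0 is
    rejected (Fisher z-transform, normal approximation to the null)."""
    n = len(x)
    r = pearson_r(x, z)
    zstat = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - 3)
    p = math.erfc(abs(zstat) / math.sqrt(2))  # two-sided p-value
    return p <= alpha

random.seed(7)
x = [random.gauss(0, 1) for _ in range(500)]
z_corr = [0.5 * a + 0.5 * random.gauss(0, 1) for a in x]  # correlated with x
print(passes_screening(x, z_corr))  # True: clearly correlated, so evaluated further
```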
Simulation study: overall framework and method of analysis
In the specific simulations that we undertook, we assumed that (a) the exposure is associated with the outcome and (b) the putative confounder is indeed a confounder by virtue of its association with both the exposure and the outcome (but is not a descendant of them). As in many real-life situations, the exposure and confounder are measured with error: for simplicity, we focus on additive classical measurement error models with uncorrelated errors (but our simulation framework can readily be amended to accommodate more complex situations).
The disease model considered in our investigation was of the form g(Y) = α + βX + γZ, with g(·) representing the link function of a generalized linear model, and fixed effects α (background rate, or intercept), β (influence of exposure X on outcome Y), and γ (influence of covariate Z on outcome Y). The regression coefficient β is identical to the true value of the effect of interest only in linear regression; for logistic and Cox proportional hazards regression, the effect of interest is calculated as a relative risk (RR) or hazard ratio (HR), respectively. We denote the true effect of interest as θ for generality.
We assumed that we can only observe realizations of the true exposure and confounder under the classical additive measurement error models X^{*} = X + ε_x and Z^{*} = Z + ε_z, where the error terms are unbiased and independent of each other. The estimates of the regression coefficient β from (Y, X^{*}, Z^{*}) data with and without adjustment for Z^{*} are denoted by β_Z^{*} and β_0^{*}, respectively. These regression coefficients can be used to calculate estimates of the effect of interest θ as θ_Z^{*} and θ_0^{*}, with and without adjustment, respectively; the superscript “^{*}” denotes variables and estimates contaminated by measurement error.
The screening test for the Pearson correlation of X^{*} and Z^{*} being different from zero used a p ≤ 0.05 cutoff. The datasets in which covariate Z was not rejected were then evaluated using the simulated CIE cutoff, calculated as follows.
The simulated CIE cutoffs in the presence of measurement error are determined by comparing effect estimates relating X^{*} to Y with and without adjusting regressions of Y on X^{*} for an independent random variable Z_0 with distribution identical to that of Z^{*}, over K simulations. Let us denote these effect estimates, functions of the regression coefficients, as θ_{0,k}^{*} when the unadjusted coefficient is used, and as θ_{Z0,k}^{*} when the coefficient adjusted for Z_0 (not the same as adjusted for Z) is used in the k^{th} simulation. The changes in the estimates in each simulation are then calculated, in general, as
$$ \delta_k = \left| \left( \theta_{0,k}^{*} - \theta_{Z_0,k}^{*} \right) / \theta_{0,k}^{*} \right| $$
and the q^{th} percentile of δ_k over the K simulations determines the cutoff δ_c for the CIE that would lead to a type I error of 1 − q. The CIE cutoff simulated to achieve a desired type II error can be obtained in a similar manner, with the independent random variable Z_0 replaced by a random variable correlated with X^{*} and Z^{*} according to the simulation setting (i.e., under the assumption that we guessed correctly the true nature of the associations in the data-generating process); the s^{th} percentile of δ over the K simulations then determines the cutoff for the CIE that would lead to a type II error of s. As with all power calculations, this requires an informed guess of the structure we aim to detect and is therefore the more difficult criterion to establish objectively (e.g., we do not know the true values of all the correlations from data contaminated by measurement error), as opposed to the criterion that strives to control type I error.
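A sketch of the type II error cutoff for the linear-outcome case, under an assumed confounding structure (Python rather than the paper's R; the parameter values mirror the simulation settings described below, and the helper names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)

def ols_beta(covariates, y):
    """Slope on the first covariate from least squares with an intercept."""
    A = np.column_stack([covariates, np.ones(len(y))])
    return np.linalg.lstsq(A, y, rcond=None)[0][0]

def type2_cie_cutoff(n=500, rho=0.4, beta=0.1, gamma=0.1, K=1000, s=0.10):
    """s-th quantile of CIEs simulated under the *assumed* structure:
    a confounder correlated with X at rho and affecting Y through gamma."""
    deltas = np.empty(K)
    for k in range(K):
        x = rng.standard_normal(n)
        z = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)  # corr(x, z) = rho
        y = beta * x + gamma * z + rng.standard_normal(n)
        theta_0 = ols_beta(x[:, None], y)
        theta_z = ols_beta(np.column_stack([x, z]), y)
        deltas[k] = abs((theta_0 - theta_z) / theta_0)
    return float(np.quantile(deltas, s))

print(round(type2_cie_cutoff(), 3))
```

Because this cutoff presupposes the true correlation structure, it inherits the guesswork the text warns about.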
Simulation study: the specific scenarios
Our example synthetic data scenario features an outcome Y; three different types of outcome were generated, namely binary (with a disease prevalence at follow-up of 10 %), continuous (with a variance of 1), and survival (with a death rate at follow-up of 10 %). The exposure X and true confounder Z were both simulated to follow standard normal distributions; Z is associated with both X (via Pearson correlation ρ ≠ 0) and Y (γ ≠ 0). The binary, continuous, and survival data were generated and fitted using a logistic model (ln(P(Y = 1)/P(Y = 0)) = ln(1/9) + βX + γZ), a linear model (Y = βX + γZ + ε_y, ε_y ~ N(0, 1)), and a Cox model (survival time ~ exp(βX + γZ − min(βX + γZ)), censored at survival time > 0.1), respectively. The survival times were generated as follows: (1) the mean survival time for all subjects equaled βX + γZ; (2) these mean survival times were linearly transformed to make them all positive by subtracting the minimum value of (βX + γZ); (3) the survival time for each subject was generated to follow an exponential distribution with rate parameter equal to the mean survival time from step (2); (4) survival times were censored at a value of 0.1 so that the outcome was observed in only 10 % of subjects.
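The three data-generating processes above can be sketched as follows (a Python sketch of our reading, not the paper's R code; where the text is ambiguous about the exponential parameterization, we use the shifted linear predictor as the exponential mean):

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_cohort(n=500, rho=0.4, beta=0.1, gamma=0.1):
    """One cohort realization with binary, continuous, and survival outcomes."""
    x = rng.standard_normal(n)
    z = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)  # corr(x, z) = rho
    lin_pred = beta * x + gamma * z

    # Binary outcome: logistic model with ~10 % baseline prevalence (ln(1/9) intercept).
    p = 1.0 / (1.0 + np.exp(-(np.log(1 / 9) + lin_pred)))
    y_bin = rng.binomial(1, p)

    # Continuous outcome: linear model with unit-variance error.
    y_cont = lin_pred + rng.standard_normal(n)

    # Survival outcome: exponential times from the shifted (all-positive)
    # linear predictor, administratively censored at 0.1.
    mean_t = lin_pred - lin_pred.min() + 1e-12
    t = rng.exponential(mean_t)
    event = t <= 0.1
    y_surv = np.minimum(t, 0.1)
    return x, z, y_bin, y_cont, y_surv, event

x, z, y_bin, y_cont, y_surv, event = simulate_cohort()
print(y_bin.mean(), float(y_surv.max()))
```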
In illustrating the kind of information that this tool can yield, we obtained N = 10,000 simulation realizations of a cohort study (yielding a standard error of 0.5 %) of either n = 500 or 2,000 subjects, with ρ ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9} and true causal associations of X–Y and Z–Y with β = γ = 0.1, as well as a situation in which the exposure of interest is measured with error smaller than or equal to that of the putative confounder, i.e., ε_x ~ N(0, σ_x^{2} ∈ {0.25, 1}) and ε_z ~ N(0, σ_z^{2} ∈ {0.25, 1}). We used K = 10,000 to determine the simulation-based CIE cutoff for each combination of parameters defining the simulation framework above.
In each n^{th} simulation realization (n = 1, …, N), when screening potential confounders, we evaluated the Pearson correlation of X^{*} and Z^{*} (ρ_n^{*}) and rejected Z^{*} as a potential confounder when the p-value for the null hypothesis ρ_n^{*} = 0 was larger than 0.05. In such instances, the selected final model excluded Z^{*}. If Z^{*} remained in contention for the role of structural confounder after the screening test, we next estimated the effects of X^{*} on Y in the simulated datasets by

(a) fitting a regression model appropriate for each outcome with X^{*} as the independent variable and Y as the dependent variable (i.e., not adjusting for Z^{*}), resulting in an estimate of the effect of X on Y as θ_{0,n}^{*}, which is a function of β_{0,n}^{*}, and

(b) fitting a regression model appropriate for each outcome with X^{*} and Z^{*} as independent variables and Y as the dependent variable (i.e., adjusting for Z^{*}), resulting in an estimate of the effect of X^{*} on Y as θ_{Z,n}^{*}, which is a function of β_{Z,n}^{*}.
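For the continuous outcome, one realization of models (a) and (b) on error-contaminated data might look like this (an illustrative Python sketch; variable names and parameter values are ours, chosen from the settings described in this section):

```python
import numpy as np

rng = np.random.default_rng(4)

# One simulation realization: contaminate X and Z with classical additive
# error, then obtain theta_0* (model (a)) and theta_Z* (model (b)).
n, rho, beta, gamma = 2000, 0.4, 0.1, 0.1
x = rng.standard_normal(n)
z = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)
y = beta * x + gamma * z + rng.standard_normal(n)

x_star = x + rng.normal(0.0, 0.5, n)  # sigma_x^2 = 0.25
z_star = z + rng.normal(0.0, 1.0, n)  # sigma_z^2 = 1

def ols_beta(covariates, y):
    """Slope on the first covariate from least squares with an intercept."""
    A = np.column_stack([covariates, np.ones(len(y))])
    return float(np.linalg.lstsq(A, y, rcond=None)[0][0])

theta_0_star = ols_beta(x_star[:, None], y)                    # model (a)
theta_z_star = ols_beta(np.column_stack([x_star, z_star]), y)  # model (b)
delta = abs((theta_0_star - theta_z_star) / theta_0_star)
print(theta_0_star, theta_z_star, delta)
```

In the linear case θ^{*} is simply β^{*}; for logistic or Cox models one would transform the coefficient to an RR or HR as described above.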
The effect estimates and p-values resulting from these models in the n^{th} simulation realization were compared and, depending on the confounder selection rule that was triggered, the final estimate of the effect of X on Y in that particular simulated dataset was computed from either model (a) or model (b).
We also calculated the root mean squared error (RMSE) of the effect estimate,
$$ \sqrt{ \sum_{n=1}^{N} \left( \varphi_n - \theta \right)^2 / N } $$
where in the n^{th} simulation (n ∈ {1, …, N}) φ_n = θ_{Z,n}^{*} if the confounder identification criterion suggested an adjustment, and φ_n = θ_{0,n}^{*} otherwise; recall that θ is the true value of the effect set by simulation.
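The RMSE computation itself is straightforward; a minimal sketch (the four "final" estimates below are made up for illustration):

```python
import math

def rmse(final_estimates, theta):
    """Root mean squared error of the final estimates phi_n around the true theta."""
    return math.sqrt(sum((phi - theta) ** 2 for phi in final_estimates) / len(final_estimates))

# Four toy final estimates around a true effect of 0.1.
print(round(rmse([0.12, 0.08, 0.11, 0.09], 0.1), 4))  # 0.0158
```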
All simulations were carried out using R 3.2.0. The R code we provide allows one to test various CIE cutoffs in order to determine the percentage of the simulated datasets correctly identifying Z^{*} as a confounder or the effect of X^{*} as significant, as well as the RMSE resulting from the model selected after application of each confounder-identification strategy (Additional file 1).
Application to study design to clarify the role of a confounder in NHANES
We illustrate the application of our approach in an example arising from an earlier analysis of exposure to total blood mercury (E) and depression (D) in 6,911 adults aged ≥40 years in the National Health and Nutrition Examination Survey (NHANES) 2009–2010 [18], approved by the National Center for Health Statistics Research Ethics Review Board; informed consent was obtained from all participants at the time of data collection, and further consent for specific analyses of these publicly available data is not needed. The dataset can be downloaded at http://wwwn.cdc.gov/Nchs/Nhanes/Search/nhanes09_10.aspx.
Contrary to an expected null association between exposure and outcome, a nonsensical inverse (protective) association was reported, and the original investigators argued that this was probably due to measurement error leading to residual confounding by fish consumption (F) [18]. That study assessed binary recall of fish consumption in the 30 days prior to data collection (F_obs). This variable does not demonstrate statistical properties that support its inclusion as a confounder in the final model, because (a) the p-value for F_obs in the logistic regression model is 0.82, and (b) inclusion of F_obs in the final models does not affect the observed RR_{ED|Fobs} = 0.83 (OR_{ED|Fobs} = 0.79) of depression due to mercury to the third decimal place. Nonetheless, it is important to note that our preliminary test for potential confounding would not have rejected F_obs from the final model, because there is evidence that it is correlated with exposure to mercury, albeit “weakly”: a Pearson correlation of 0.39 with mercury exposure (95 % CI 0.37–0.41, p < 0.05). Furthermore, given the established effects of habitual fish consumption F on blood mercury E [19] and depression D [20], Ng et al. [18] suspected that F is a confounder of the association of total blood mercury with depression (see Fig. 1 for the DAG of the causal association), and that the pattern of results arose because F_obs is a poor measure of unobserved F.
Let us suppose that in the course of future research we have a particular method of measuring fish consumption (W), with a known degree of measurement error, that may prove to be superior to F_obs. It is important to note that we do not wish to use F_obs in the new research project with the same sample: it yielded what we believe to be confounded estimates, and the motivation of the new research would be to improve the quality of the data in order to understand the problem better. We need to simulate W because it is not given in the data but is a reflection of what we can hope to obtain in the study that is being planned under the assumption of a given causal structure: we can never hope to measure F itself, but need to generate W and thereby evaluate the performance of W. We want to know two things: (a) whether W is a confounder of the mercury–depression relationship, and (b) whether models adjusted for this measure of fish consumption W will result in the hypothesized null mercury–depression association (i.e., RR_{ED|W} = 1, as opposed to the observed estimate RR_{ED|Fobs} = 0.83). Here, W is related to true habitual fish consumption F by classical measurement error: W = F + ε, ε ~ N(0, σ^{2}); F ~ N(0, 1); ε is independent of both E and F. To motivate these assumptions more specifically, reflecting common experience in exposure assessment, we consider that F is a measure of fatty acid intake that is measured by a biomarker and then normalized to a Gaussian distribution via log-transformation; hence, the additive measurement error model for W and the distributional assumptions can be appropriate. In practice, such assumptions would be verified in a validation or reliability study.
We assumed that total blood mercury E is near-perfectly measured because a precise analytical technique exists. To simplify, we ignored the matter of etiologically relevant windows of exposure, although this may not be trivial because the biologic half-life of mercury in blood is short.
Based on prior research [18], we also assumed (a) that the association between F and D is based on the correlation of underlying continuous measures of F and D, and set it to ρ_FD = −0.35, and (b) that the correlation of F and E is ρ_FE = 0.39, the same as the observed correlation of F_obs and E. With these inputs, we simulated true (F) and mismeasured (W) values of fish consumption subject to different magnitudes of measurement error. Under various conditions of measurement error, we simulated W 10,000 times. Different degrees of error in the measured confounder, σ^{2}, were examined. We acknowledge that a different model for confounding could have been postulated and included in our simulation framework, but we followed the path deemed sensible by the original investigators in [18].
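The generation of F, E, and the planned measure W can be sketched as follows (Python rather than the paper's R; the bivariate-normal construction is our choice, and generation of the depression outcome D is omitted):

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate_w(n=6911, rho_fe=0.39, sigma2=1.0):
    """True fish consumption F, blood mercury E (corr rho_fe with F), and the
    error-prone planned measure W = F + eps, eps ~ N(0, sigma2)."""
    f = rng.standard_normal(n)
    e = rho_fe * f + np.sqrt(1 - rho_fe**2) * rng.standard_normal(n)
    w = f + rng.normal(0.0, np.sqrt(sigma2), n)
    return f, e, w

f, e, w = simulate_w()
# Under this model corr(E, W) = rho_fe / sqrt(1 + sigma2), so larger error
# in W weakens its observed association with exposure.
print(round(float(np.corrcoef(e, w)[0, 1]), 2))
```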
To empirically determine whether measured fish consumption (W) would correctly remove confounding from the effect of mercury on depression, we recorded the proportion of simulated datasets in which the adjusted association of mercury with depression, RR_{ED|W}, had p > 0.05. This is akin to asking how well we should measure the confounder in order to have the power to observe the true exposure–response association upon adjustment. We also report the simulation-determined confounder identification criterion described above (i.e., aimed at controlling type I error at 5 %) to compare it to the conventional CIE of 10 %. Finally, we determined the average and the 95 % empirical confidence intervals of the estimates of the mercury–depression association with adjustment for simulated values of W based on the 10,000 simulations, in order to determine how well adjustment for W is able to recover a specified true causal effect of E on D adjusted for F. (The number of simulations was informed by the observation that it was sufficient to obtain stable behavior of simulation realizations; in every specific situation, a different number of simulations may be needed.) This reflects a theoretical perspective on confounder identification whereby, based on some predetermined DAG, W is theoretically a confounder of the exposure–outcome association and should therefore be included in models regardless of measurement properties. To visualize the empirical distribution of RR_{ED|W}, we plotted its histogram from the 10,000 simulated estimates with σ^{2} = 1. The R code for the simulations can be found in the Online Supplementary Materials (Additional file 2).