 Research article
 Open Access
Instrumental variable analysis in the context of dichotomous outcome and exposure with a numerical experiment in pharmacoepidemiology
BMC Medical Research Methodology volume 18, Article number: 61 (2018)
Abstract
Background
In pharmacoepidemiology, prescription preference-based instrumental variables (IV) are often used with linear models to address the endogeneity due to unobserved confounders, even when the outcome and the endogenous treatment are dichotomous variables. Using this instrumental variable, we conduct Monte-Carlo simulations to compare the IV-based generalized method of moments (IV-GMM) and the two-stage residual inclusion (2SRI) method in this context.
Methods
We established the formulas allowing us to compute the instrument's strength and the confounding level in the context of logistic regression models. We then varied the instrument's strength and the confounding level to cover a large range of scenarios in the simulation study. We also explored two prescription preference-based instruments.
Results
We found that the 2SRI is less biased than the other methods and yields satisfactory confidence intervals. The proportion of previous patients of the same physician who were prescribed the treatment of interest displayed a good performance as a proxy of the physician’s preference instrument.
Conclusions
This work shows that when analysing real data with dichotomous outcome and exposure, appropriate 2SRI estimation can be used in the presence of unmeasured confounding.
Background
In observational studies, unobserved confounding may bias the estimation of the target effect. Over the last decade, this issue has received growing attention in the field of epidemiological studies attempting to assess adverse effects of drugs, with a few works focusing on instrumental variable (IV) approaches. Instrumental variable estimation is a well-known approach for addressing endogeneity in statistical modelling [1, 2]. Endogeneity often arises when a causal model is poorly specified, thereby introducing a structural bias in the estimation of its parameters. This may result from a measurement error in variables [3], an unobserved variable [4] or an inverse causality between the outcome and some regressors. The general IV method of estimation attempts to remove this bias by using structural equations which incorporate instrumental variables in the model. Several theoretical approaches have been developed to build estimators of parameters and to study the properties of these estimators in causal models with endogeneity (see [5]). A well-known example is the case of linear models, in which IV estimation leads to estimators with good convergence properties such as the consistency discussed in [6]. The structural bias can be completely removed in this case.
In pharmacoepidemiology, we most often deal with binary covariables (drug exposure), binary responses (adverse event indicator) and confounding variables, i.e. variables that are correlated with both exposure and response. A nonlinear model should be the first choice in this context to match the specific nature of the variables. However, to quantify the risk of an adverse effect of a treatment in the presence of unobserved confounding, researchers investigating IV-based estimation often model the probability of dichotomous events as a linear function of covariables, thus ignoring the basic features of a probability. Terza and colleagues [7] investigated the influence of misspecification on the estimation when a linear IV model is used in an inherently nonlinear regression setting with endogeneity. A substantial bias was demonstrated in their results. In the context of pharmacoepidemiological modelling, endogeneity is often due to unobserved confounding, and various nonlinear IV methods such as the generalized method of moments (GMM; [8, 9]) or the two-stage residual inclusion (2SRI; [10]) can be used to address this issue. However, in a review of IV methods, Klungel and colleagues [11] claimed that the GMM estimator with the logistic regression model is not consistent for causal odds ratio (OR) estimation owing to the non-collapsibility of the OR. Consistency is also not guaranteed for the 2SRI when the regression models are nonlinear in both stages. For these nonlinear IV methods, theoretical results exist under very restrictive assumptions which do not cover the possible frameworks of real data. Overall, in the context of a binary outcome, simulation studies have mainly investigated two-stage IV methods with a linear first stage and a logistic second stage, as in [12]. A few articles concern double probit models (Chapman et al. [13]).
Very few address the comparison of GMM and two-stage approaches, and none studies GMM and two-stage double-logistic models using prescription preference-based instrumental variables.
In a simulation study, we compare the IV-based GMM and 2SRI methods to the conventional method which does not account for endogeneity. These comparisons are based on the estimation of the regression coefficient of the exposure variable in a nonlinear logistic model. Our numerical comparison of the methods involves several scenarios with different confounding levels and different instrument strengths, for which computation formulas are established in the context of dichotomous outcome and exposure. We recall the general formula of the covariance matrix for two-step estimation methods and give the corresponding expression for the two-stage nonlinear least squares method in the context of logistic regressions.
The paper is organized as follows: we specify the model and describe the methods of estimation that will be analysed. Then we describe the simulation design, the criteria for evaluating the performances of the methods and the results of our simulations. The final sections discuss the results and make some concluding remarks. Details on the computation of the covariance matrix of the 2SRI method, the instrument’s strength and the confounding level are to be found in the appendices, as well as a detailed description of the simulation model and supplementary results.
Methods
Model
We consider a general model with dichotomous outcome and exposure that can be written as
with Y and T as binary outcome (event or not) and treatment (T_{1} or T_{2}) respectively, X_{1}, X_{2} some covariables and X_{u} an unobserved confounder of the outcome and treatment. The functions F(.) and r(.) both denote the logistic distribution function, also known as expit(.): expit(v)=exp(v)/(1+exp(v)), and e denotes the error term. The parameter β=(β_{0},β_{t},β_{1},β_{2}) denotes the vector of unknown parameters to be estimated. Without a confounding variable, all observed regressors are exogenous. In this case, the true model is written
and conventional regression methods are suitable for estimating the parameters β. We will refer to (3) as the conventional model. If this model is fitted to data in the presence of an unobserved confounder, the estimated coefficients will be biased, with a magnitude depending on the confounding level. As the confounder X_{u} is not independent of the treatment, the residuals of the conventional model are associated with the treatment. This causes endogeneity, so a single regression of the outcome on the observed covariables will fail to estimate β_{t} consistently. A common strategy is to consider another regression model that links the endogenous variable with the others. Equation (2) defines the auxiliary model that predicts treatment T as a function of the covariables X_{1}, X_{2}, the confounder X_{u} and another variable Z. Variable Z denotes the instrumental variable (or instrument) related to the treatment, i.e. a variable correlated with the treatment and with no direct association with the outcome.
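The display equations for models (1)–(3) did not survive extraction. From the symbol definitions above and the data-generating probabilities described in the simulation section, they plausibly read as follows (a reconstruction under those definitions, not the authors' exact typesetting; v denotes the auxiliary-model error term and β_u the confounder's coefficient):

```latex
% (1) true (structural) outcome model, with unobserved confounder X_u
Y = F\left(\beta_0 + T\beta_t + X_1\beta_1 + X_2\beta_2 + X_u\beta_u\right) + e
% (2) auxiliary treatment model, with instrument Z
T = r\left(\alpha_0 + Z\alpha_z + X_1\alpha_1 + X_2\alpha_2 + X_u\right) + v
% (3) conventional model: model (1) with X_u omitted
Y = F\left(\beta_0 + T\beta_t + X_1\beta_1 + X_2\beta_2\right) + e
```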
The bias due to the unobserved confounder can be significantly reduced by means of a two-stage regression model using a valid instrument. As defined by Johnston and colleagues [14] and Greenland [15], a valid instrument must not be correlated with an unobserved confounder or with the error term in the true model (1). Formally, we assume that the instrument Z meets the following assumptions:

\({\mathbb {C}ov}(Z,Y \mid T,X_{u},X_{1},X_{2})=0,\)

\({\mathbb {C}ov}(Z,T) \neq 0,\)

\({\mathbb {C}ov}(Z,X_{u}) = 0, {\mathbb {C}ov}(Z,X_{1}) = 0\) and \({\mathbb {C}ov}(Z,X_{2}) = 0\).
We also assume that the confounder X_{u} is not associated with the covariables X_{1} and X_{2}, that is \({\mathbb {C}ov}(X_{u},X_{1}) = 0\) and \({\mathbb {C}ov}(X_{u},X_{2}) = 0\). The main goal is to estimate β_{t}, the treatment coefficient, which is the basis of risk evaluation; in general, however, β_{t} is obtained by estimating the whole vector β, as discussed below.
As already proved in the simple case of a linear model (for which the functions F and r in Eqs. (1) and (2) are the identity), a strong association between the treatment and an instrument improves the IV estimation of β. Finding a strong instrument is therefore a crucial step in any procedure of instrumental variable estimation.
In what follows, we first present some specific instrumental variables often used in pharmacoepidemiology, then discuss some IV estimators of β to obtain an estimate of β_{ t } before addressing the properties of these estimators.
Instrument in pharmacoepidemiology
An instrumental variable can be determined in many ways, provided that it meets the assumptions listed above. One of the problems is to find a valid instrument of reasonable strength. The strength of an instrumental variable can be defined as the level of its association with the endogenous treatment; as such, it can be quantified by the correlation coefficient between the treatment and the instrument. In the wide range of pharmacoepidemiological applications, the various instrumental variables can be summarized in three categories:
Geographical variation. Proximity to the care provider can positively influence access to treatment of a patient compared to others who live far away from health services. To account for this difference between patients, some researchers (see [16]) consider the distance between a patient and a care provider as an instrumental variable. Although this seems realistic as there is no direct association between this distance and the occurrence of disease, the presence or absence of health services can be associated with some socioeconomic characteristics. The latter are often considered as unmeasured confounders that call into question the suitability of using this instrument.
Calendar time. The use of calendar time as an instrument in pharmacoepidemiology often relies on the occurrence of an event that could change the attitude of the physician or patient regarding a treatment. This could be, for example, a change in guidelines or the arrival of a new drug on the market. The time from that event to the date of treatment defines the calendar time, which clearly affects the prescribed treatment since the change in physician or patient attitude will be more pronounced immediately after the event has occurred than later. An example of the use of calendar time as an instrument can be found in [17].
Physician’s prescription preference. The instrumental variables most often used in pharmacoepidemiology are preference-based [18–20]. The issue is to compare the effectiveness of two treatments T_{1} and T_{2} when the assignment of treatment to the patient is not randomized. This is the case in observational studies, where the prescriber of the treatment (the physician) introduces an effect that influences the outcome via the prescribed treatment. This effect results in the instrumental variable that reflects the influence of care providers on the patient–treatment relationship. Brookhart and colleagues [21] define this instrumental variable as the “Physician’s Preference” (PP) and propose to use the treatment prescribed by a physician to his/her previous patient as a proxy of this IV for his/her new patient.
The instrumental variable \(\left (Z_{i}^{*}\right)\) of the i^{th} patient will then be the treatment prescribed to the previous patient of the same physician. As a physician’s preference could change over time, Abrahamowicz and colleagues [22] introduce a new procedure to detect the change point and build a new proxy of PP that includes not only the treatment prescribed to the previous patient but all the previous prescriptions since the change point.
Those instrumental variables and some others are presented in a more detailed form in [23], [24, 25] or in [26] with enlightening discussion on their validity. In this work, we carry out a simulation study to examine how the proxy Z^{∗} of physician’s preference performs in the context of logistic regression. We also examine the physician’s preferencebased IV in the continuous form (see [22]), i.e. the proportion (pr) of all previous patients of the same physician who were prescribed the treatment of interest. This corresponds to the empirical estimator of the probability for a physician to prefer the treatment of interest.
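Both preference-based proxies can be computed from each physician's chronologically ordered prescriptions. A minimal illustrative sketch (Python rather than the authors' R; the function name is ours):

```python
def preference_proxies(prescriptions):
    """Given one physician's prescriptions in chronological order
    (1 = treatment of interest, 0 = comparator), return, for each
    patient after the first, the two preference-based instruments:
    z_star: treatment prescribed to the immediately preceding patient,
    pr:     proportion of all previous patients given the treatment."""
    z_star, pr = [], []
    running_sum = 0
    for i, t in enumerate(prescriptions):
        if i > 0:
            z_star.append(prescriptions[i - 1])
            pr.append(running_sum / i)
        running_sum += t
    return z_star, pr
```

Note that neither proxy is defined for a physician's first patient, which is why real analyses typically drop the first patient(s) of each prescriber.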
IV estimation of β_{t}
Estimating β_{ t } in model (1) with an unobserved confounder X_{ u } amounts to estimating vector β and taking the corresponding component of treatment T. Below, we present two methods that can provide consistent estimation of β and then that of β_{ t }: the twostage residual inclusion (2SRI) method and the generalized method of moment (GMM).
The two-stage residual inclusion method is a modified version of the two-stage least squares (2SLS) method used to estimate the parameters in linear models with instrumental variables. As mentioned by Greene [5], the first stage of the 2SLS method predicts the endogenous variable (here, the treatment) using the instruments and the other covariables (Eq. (2)). In the second stage, the endogenous variable is simply replaced by its prediction from the first stage. This method is called two-stage predictor substitution (2SPS) when the first stage is nonlinear. Unlike the 2SLS, in the 2SRI method the residuals of the first regression serve as an additional regressor in the second stage. This method also generalizes to nonlinear models, i.e. when both the first and second stages are nonlinear. The rationale of this approach can be intrinsically related to the form of the true model: sometimes, the prediction equation of the outcome includes the error term of the auxiliary regression. An example is the case where the confounder is the only source of error in the auxiliary regression, as considered in [27]. In a linear model, the 2SPS and 2SRI approaches are equivalent.
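The 2SRI with logistic models in both stages can be sketched as follows. The paper's implementation uses R (glm); this is a self-contained Python sketch with a hand-rolled Newton–Raphson logistic fit, so the function names and the use of the raw residual T − P̂ are our illustrative choices, not the authors' exact code:

```python
import numpy as np

def logit_fit(X, y, n_iter=25):
    """Logistic regression by Newton-Raphson (IRLS); X includes the intercept."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1 - p)
        # Newton step: (X' W X)^{-1} X' (y - p)
        beta += np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (y - p))
    return beta

def two_stage_residual_inclusion(y, t, z, x):
    """2SRI with a logistic model at each stage.
    Stage 1: fit P(T=1 | Z, X); stage 2: refit the outcome model
    adding the stage-1 raw residual as an extra regressor."""
    ones = np.ones(len(y))
    X1 = np.column_stack([ones, z, x])
    alpha = logit_fit(X1, t)
    resid = t - 1.0 / (1.0 + np.exp(-X1 @ alpha))  # raw residual T - P_hat
    X2 = np.column_stack([ones, t, x, resid])
    beta = logit_fit(X2, y)
    return beta[1]  # coefficient of the endogenous treatment
```

As the paper stresses, the naive second-stage standard errors are invalid because the residual is itself estimated; Appendix A gives the corrected covariance matrix.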
The GMM is an alternative method for obtaining a reliable estimator of the parameter β in a model with an endogenous variable. It is based on the classical exogeneity assumption \(\mathbb {E}(e \mid T,X_{1},X_{2}) = 0\) on the error term in Eq. (1). This assumption does not hold in general because the confounder X_{u} is unobserved (i.e. \(\mathbb {E}(e \mid T,X_{1},X_{2}) \neq 0\)). In this case, the treatment is endogenous and its coefficient β_{t} cannot be consistently estimated. One generally supposes that there exist some observable instruments w_{1} such that \(\mathbb {E}(ew) = 0\) with w=(w_{1},X_{1},X_{2}). Typically, w corresponds to the vector of exogenous and endogenous variables with the endogenous regressors replaced by their corresponding instruments. Using the law of iterated expectations, the last condition implies
The method of moments solves the empirical version of (5) in β, i.e. \(\frac {1}{n} \sum _{i = 1}^{n} e_{i}w_{i} = 0\), where n is the sample size, e_{i} is the i^{th} component of e and w_{i} the i^{th} row of w. In turn, the GMM minimizes the quadratic form
where Ω denotes a weighting matrix. As discussed in [28], there are several choices for matrix Ω leading to different estimators of β. The optimal approach is to define Ω as the inverse of the asymptotic covariance matrix (depending on β) of the estimator. Some alternative procedures are also suggested by Hansen and colleagues [29].
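In the just-identified case (as many instruments as regressors), minimizing the quadratic form reduces to solving the sample moment equations exactly, whatever Ω. A hedged Python sketch of this special case (the paper uses the R gmm package; the Newton solver and names below are our illustration):

```python
import numpy as np

def gmm_iv_logit(y, t, z, x, n_iter=50):
    """Just-identified IV-GMM for a logistic outcome model.
    Sample moments: g(beta) = (1/n) W'(y - expit(X beta)), with the
    endogenous treatment replaced by the instrument in W.  The root
    g(beta) = 0 is found by Newton's method."""
    n = len(y)
    ones = np.ones(n)
    X = np.column_stack([ones, t, x])  # regressors of the outcome model
    W = np.column_stack([ones, z, x])  # instruments (t replaced by z)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        g = W.T @ (y - p) / n                            # sample moments
        J = -(W.T @ (X * (p * (1 - p))[:, None])) / n    # Jacobian of g
        beta -= np.linalg.solve(J, g)
        if np.max(np.abs(g)) < 1e-10:
            break
    return beta
```

The singular Jacobians that can arise here with a weak instrument mirror the convergence failures of the GMM reported in the Results section.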
Properties
The properties of the 2SRI are addressed by Terza in [27] when the residual, which also acts as the unobserved confounder in the first-stage regression (Eq. (2)), is additive. Under this assumption and using nonlinear least squares regression in each stage, they show the consistency of the estimator \(\hat {\beta }\) from the second stage. Since the confounder is not additive in model (2), the first-stage estimate of the 2SRI will not be consistent, nor will \(\hat {\beta }\). The residual from the first stage is indeed an unknown function of an unobserved confounder. There is therefore a bias that depends on the form of this unknown function when one applies the 2SRI method to the structural model of Eqs. (1) and (2). The derivation of the covariance matrix of the 2SRI estimator follows a two-step regression covariance matrix of the form
where the computation of matrices A and S is given in Appendix A.
The GMM is a well-documented estimation procedure. Several results on the estimator have been established, both in linear and nonlinear models. In the literature on econometric analysis, the nonlinear GMM with an instrumental variable has received particular attention. In pioneering work, Amemiya [30] demonstrated the consistency and derived the asymptotic distribution of the nonlinear two-stage least squares estimator (NL2SLS).
This result provided important insight into how to handle nonlinear models with endogeneity. Later, Hansen [31] established the asymptotic properties (consistency and asymptotic distribution) of the GMM as a kind of generalization of the NL2SLS. More recently, Cameron and Trivedi [28] reviewed the method and gave details on the computation of the estimator's covariance matrix in some specific cases. Despite the various results on the GMM with an instrumental variable, its performance in terms of bias and variance depends on the validity and strength of the instrument and on the nature of the variables in the model. The computation of the covariance matrix of the GMM is given in [28] and implemented in dedicated software, of which [32] is a good example.
Below, we investigate the performances of these methods with numerical experiments.
Simulation design and data generation
A numerical experiment was conducted to investigate and compare several methods of IV estimation in the context of a dichotomous outcome and exposure with endogeneity in pharmacoepidemiology. In this experiment, we cover a wide range of possible scenarios. We choose values of the parameter α=(α_{0},α_{z},α_{1},α_{2}) corresponding to given values of the correlation between the variable T^{∗}=α_{0}+PPα_{z}+X_{1}α_{1}+X_{2}α_{2}+X_{u} and the physician's prescribing preference instrument PP. In fact, we keep α_{0}, α_{1} and α_{2} fixed and only α_{z} varies from one scenario to another. The computation of this correlation is given in Appendix B. It reflects the strength of the instrument when the confounder and the other covariables are kept fixed. For each value of the instrument's strength, there are three levels of confounding measured by the standard deviation σ_{u} of the confounding variable X_{u}, σ_{u}∈{0.5,1,1.5}, which leads to a set of correlations between T^{∗} and X_{u}. We thus have nine scenarios of instrument strength and confounding level.
For each scenario, we generate ns=1000 Monte-Carlo samples of size n, n=10000, 20000 and 30000. The number of patients per physician is kept fixed and equals 100; the confounder X_{u} and the covariates X_{1} and X_{2} are assumed to have the normal distributions N(0,σ_{u}), N(−2,1) and N(−3,1) respectively, and the physician's prescribing preference has the Bernoulli distribution B(0.7). We first simulate the covariables X_{1} and X_{2}, the confounder X_{u} and the physician's prescribing preference, which is the same for all the patients of a given physician. Using the already fixed values of the parameters α, the probability p_{i} that patient i will be prescribed the drug of interest is calculated by inverting the logit function, i.e. p_{i}=F(α_{0}+PP_{i}α_{z}+X_{1i}α_{1}+X_{2i}α_{2}+X_{ui}). Treatment T_{i} of patient i is then generated as a Bernoulli realisation with parameter p_{i}. The same procedure is used to simulate the outcome of each patient i: with the fixed value of β, we compute the probability F(β_{0}+T_{i}β_{t}+X_{1i}β_{1}+X_{2i}β_{2}+X_{ui}β_{u}) of the event for patient i and then simulate his outcome y_{i}. We fixed the parameter α such that the proportion of exposed patients ranges between 2 and 6%, and the prevalence of the event of interest is chosen to be smaller than 5%, to reflect a real-life situation of a new treatment (not frequently prescribed) and a rare adverse event. We also explored a more balanced situation in terms of exposure frequencies (between 26 and 45%).
Finally, proxy Z^{∗} of PP is the treatment given to the previous patient and the continuous instrument pr is the proportion of patients of the same physician who were previously prescribed the treatment of interest. More details on the simulation model and the data generating R code are given in Appendix D.
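The generation scheme above can be condensed as follows (a Python sketch rather than the authors' R code; the α and β values shown are illustrative placeholders, since the paper's fixed values appear only in the appendix):

```python
import numpy as np

def simulate(n_physicians=100, n_per_phys=100, sigma_u=1.0, seed=0):
    """Generate one sample following the paper's scheme: per-physician
    preference PP ~ B(0.7), covariables X1 ~ N(-2,1), X2 ~ N(-3,1),
    unobserved confounder Xu ~ N(0, sigma_u), then Bernoulli treatment
    and outcome through the logistic link."""
    rng = np.random.default_rng(seed)
    n = n_physicians * n_per_phys
    expit = lambda v: 1.0 / (1.0 + np.exp(-v))
    # preference shared by all patients of the same physician
    pp = np.repeat(rng.binomial(1, 0.7, n_physicians), n_per_phys)
    x1 = rng.normal(-2, 1, n)
    x2 = rng.normal(-3, 1, n)
    xu = rng.normal(0, sigma_u, n)  # unobserved confounder
    a0, az, a1, a2 = -2.0, 1.5, 0.3, 0.2            # assumed alpha values
    t = rng.binomial(1, expit(a0 + az * pp + a1 * x1 + a2 * x2 + xu))
    b0, bt, b1, b2, bu = -3.0, 1.0, 0.3, 0.2, 1.0   # assumed beta values
    y = rng.binomial(1, expit(b0 + bt * t + b1 * x1 + b2 * x2 + bu * xu))
    return y, t, pp, x1, x2
```

In the paper's design the coefficients are tuned so that exposure and event frequencies match the targeted ranges; the placeholders above make no such guarantee.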
Estimation methods
For the true and conventional models, which do not assume endogeneity, the classical one-step regression method without an instrumental variable is used. The estimations are performed with the existing regression functions (glm and nls) of the stats package in the R statistical software (R Development Core Team, 2008).
For the 2SRI method, a two-step regression is used following the procedure outlined in the section dedicated to IV estimation. Recall that the covariance matrices of \(\hat {\beta }\) from the second-step regression retrieved from the software output are not valid, since their calculation ignores the fact that some estimated parameters from the first-stage regression are included in the second stage. We therefore re-evaluated these covariance matrices using the sequential two-step estimation procedure, taking into account the fact that an estimated variable is used in the second step. The computation of these covariance matrices is given in Appendix A.
For the GMM, the R package gmm by Chaussé [32] is a very helpful tool for computing the parameter estimates and their covariance. The user needs to implement the sample version of the moment condition function of Eq. (5) and, if possible, its gradient; otherwise, a numerical approximation of the gradient will be used by the gmm function to perform the estimation.
From these estimations, we calculate the asymptotic covariance matrices and the corresponding confidence intervals whose levels are evaluated below.
Evaluation criteria
To evaluate the performances of the various methods, we consider several criteria including the percentage of relative bias (rB in %) defined as
the asymptotic standard deviation, estimated by the square root of the Monte-Carlo mean of the variances \(\bar {\hat {\sigma }}^{2} = \frac {1}{ns}\sum _{j=1}^{ns} \hat {\sigma }_{j}^{2},\) with \(\hat {\sigma }_{j}^{2}\) the asymptotic variance of \(\hat {\beta }_{t}^{(j)}\), and the Monte-Carlo estimator of the true variance \({\mathbb {V}ar}(\hat {\beta }_{t}) = \frac {1}{ns-1}\sum _{j=1}^{ns} \left (\hat {\beta }_{t}^{(j)} - \bar {\hat {\beta }_{t}}\right)^{2}\). We also consider the square root of the mean squares error (rMSE) given by
and the lower and upper noncoverage probabilities (in %) defined as
and
where \( IC^{(j)} = \left [{IC}_{inf}^{(j)};{IC}_{sup}^{(j)}\right ]\) denotes the confidence interval of β_{t} from the j^{th} Monte-Carlo sample, using the asymptotic distribution of \(\hat {\beta }_{t}^{(j)}\). The nonparametric bootstrap-based estimates of the variance and non-coverage probabilities are also investigated in these simulations and the results are analysed below. We complete these criteria with the equivalent of the first-stage F-statistic in linear regression (see [33]), testing the exclusion of the instrument in the treatment choice model.
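These criteria can be computed from the ns Monte-Carlo estimates as follows (an illustrative Python sketch; the non-coverage convention shown — counting a sample when β_t falls below IC_inf or above IC_sup — is our reading of the definitions above, whose display equations were lost in extraction):

```python
import numpy as np

def evaluation_criteria(estimates, ci_low, ci_high, beta_true):
    """Monte-Carlo performance criteria: relative bias (%), MC standard
    deviation, root mean squares error, and lower/upper non-coverage
    probabilities (%)."""
    est = np.asarray(estimates, dtype=float)
    rB = 100 * (est.mean() - beta_true) / beta_true
    sd = est.std(ddof=1)  # MC estimate of the true standard deviation
    rMSE = np.sqrt(((est - beta_true) ** 2).mean())
    er_inf = 100 * np.mean(beta_true < np.asarray(ci_low))   # CI entirely above beta
    er_sup = 100 * np.mean(beta_true > np.asarray(ci_high))  # CI entirely below beta
    return rB, sd, rMSE, er_inf, er_sup
```

For a 95% confidence interval, er_inf + er_sup should be close to the nominal 5% when the interval estimate is well calibrated.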
Results
Table 1 summarizes the performances of each method in terms of relative bias (rB), standard deviation (sd), square root of the mean squares error (rMSE) and non-coverage probabilities (pval=Er_{inf}+Er_{sup}). It presents the results related to instrument pr. There were only slight differences between the results with instrument pr and those with Z^{∗}, so the results related to instrument Z^{∗} are omitted. For some samples, the GMM fails to converge owing to singularity problems in the covariance matrix. The estimations from these samples are simply removed (cases marked ‘−’ in Table 1) for all methods. For the other scenarios, infinite variance estimates or outlier coefficient estimates may be observed; the corresponding samples were dropped from the calculation of the criteria. Table 2 shows the number of samples leading to an outlier estimation (rB>100% or infinite variance) among the 1000 simulated samples. Figure 1 complements the results in Table 1 by displaying the boxplot distributions of rB. Each series of letters a, b, c and d corresponds to the results for one instrument strength, with each letter corresponding to a method as detailed in the legend of Fig. 1. In Table 1, the sd values refer to the Monte-Carlo-based standard deviation. Except for the GMM estimator, for which outlier values of the asymptotic variance were observed, the Monte-Carlo-based standard deviation, the bootstrap-based estimate (not shown) and the asymptotic estimate were very close.
As expected, the relative bias shows that the estimations from the true and conventional models are insensitive to instrument strength, but the confounding level affects the estimation in the conventional model: the relative bias increases with the level of confounding. In the presence of a strong instrument, the 2SRI tends to improve the estimate when the level of confounding increases. This trend is reversed when the instrument is weak, i.e. the relative bias and the confounding level then vary in the same direction. The percentage of relative bias of the GMM does not seem very sensitive to instrument strength: it changes only slightly when the strength of the instrument grows. However, this bias increases with the magnitude of confounding, which shows the impact of endogeneity on this method (see Fig. 1).
For the standard deviation (sd) and the square root of the mean squares error (rMSE), the asymptotic results (n=30000) show that both criteria decrease when the level of confounding or the instrument strength grows. This is the case for all methods except the GMM, for which the rMSE decreases very slowly or remains almost constant in some cases. This trend confirms the already observed low sensitivity of the GMM to the strength of the instruments used in this simulation. Even though the 2SRI has a larger sd than the other methods in all scenarios, with an impact on the rMSE in several cases, the rMSE of the 2SRI method improves with high confounding and a strong instrument.
Concerning the non-coverage probabilities (pval), the true model estimation and the 2SRI displayed an estimated non-coverage probability around the nominal level of 5% in almost all scenarios. Their values ranged from 4 to 6% and reached 7% in rare cases. The non-coverage probability was very large for the other methods, even with large samples: the results showed an overestimation of the coefficient of treatment for the GMM and conventional approaches. Even at low confounding levels, the conventional method yields very poor coverage probabilities, which is coherent with what was observed for the relative bias.
We observe that the metric of instrument strength we use varies in the same direction as the F-statistic equivalents (Tables 3 and 4 of Appendix C). Finally, Table 5 in Appendix D summarizes the performances of each method for more balanced exposure frequencies and a large sample size (n=30000). In general, performance is poorer than in the small exposure frequency situation. One can also note that numerical problems arise more often (see Table 6 of Appendix D). Nevertheless, the 2SRI is better in terms of relative bias and non-coverage probability than the conventional method and the GMM.
Overall, the results are satisfactory for the 2SRI approach which achieves a similar level of performance to the true model regarding estimation of the confidence interval in the imbalanced situation. We close this section by pointing out the strong numerical instability observed when computing the GMM estimator during these simulations. This could explain the modest performance displayed by the GMM in the simulation results.
Discussion
In this paper, we focus on the effectiveness of regression coefficient estimation in a context of endogeneity, particularly endogeneity due to unobserved confounding. We are interested in the coefficient of an endogenous treatment, which is the basis of risk assessment in pharmacoepidemiology. Linear models are often used in this context to model the probability of dichotomous events (see [7]). Through a simulation study, we investigate the behavior of parameter estimation in nonlinear models, specifically logistic regression, using some IV-based methods that could potentially be used to overcome the endogeneity issue. The simulation study also made it possible to assess two preference-based instrumental variables in pharmacoepidemiology.
The results reported from the simulation study show that the 2SRI using nonlinear regression at each stage is an interesting alternative for estimating the coefficient of the endogenous treatment in a logistic regression model. It is very simple to implement and yields satisfactory results regarding the bias and the confidence interval estimate. Among the IV-based methods compared, it yielded the most accurate estimate of the non-coverage probabilities and thus the most accurate confidence intervals. However, the conventional approach behaved better than the 2SRI in some cases, especially when the confounding level was weak. We believe that in these cases the level of confounding is not high enough to require the use of an instrument in the estimation. However, to our knowledge there is still no way to assess the level of unmeasured confounding. For the GMM, the estimation procedure was remarkably unstable. That instability may be attributable to the dichotomous nature of the variables (outcome and exposure) in the context of pharmacoepidemiology with the preference-based instrument. The GMM approach is therefore not recommended in this context unless another instrument has been shown to behave satisfactorily with this method.
Concerning the instruments under investigation in this study, the proportion of all previous patients of the same physician who were prescribed the treatment of interest proved to be a good proxy of the physician's preference instrument. This instrument was previously considered by Abrahamowicz and colleagues [22] for investigating the detection of a possible change point in the physician's preference, and their results also seemed satisfactory. This proxy of the physician's preference instrument is thus a credible alternative to the other well-known proxy based only on the treatment prescribed to the single previous patient of the same physician.
Even though this work throws light on the performance of IV estimators in the context of a nonlinear model with endogeneity, more work is needed to explore the behavior of these estimators in other contexts, where the prevalence of exposure and/or outcome varies.
Conclusions
In observational studies, when assessing the effect of drug exposure on a dichotomous outcome, investigators could use appropriate 2SRI estimation to account for unmeasured confounding. This work showed that two logistic regressions together with a physician's preference proxy as IV yield satisfactory results.
Appendix A: Asymptotic variance of the nonlinear 2SRI
Using the sequential twostep estimation procedure, the nonlinear 2SRI estimator minimizes the least squares criterion
where \(\hat{\alpha}\) denotes the nonlinear least squares estimator of α from the first stage. If we set \(\eta_{i\beta} = X_{i\hat{\alpha}}^{\prime}\beta\), the first-order condition in β is given by
Given that \(\hat{\alpha}\) is a consistent estimator of α_{0}, the Taylor–Lagrange expansion of (9) around the true value (α_{0}, β_{0}) gives
for some (α_{c}; β_{c}) between \((\hat{\alpha};\hat{\beta})\) and (α_{0}; β_{0}). Under the assumption \({\mathbb{E}}\left(\frac{\partial Q}{\partial \beta}\right)_{\beta_0} = 0\), the terms involving the residuals \(\left(y_{i} - F_{\alpha}(\eta_{i\beta})\right)\) in the above expansion, except the first, all tend in probability to zero and we obtain
The quantity \(\sqrt{n}(\hat{\beta}-\beta_{0})\) may then be written
with \(\eta_{i\alpha} = w_{i}^{\prime}\alpha\). The latter is derived from the asymptotic approximation of \(\sqrt{n}(\hat{\alpha}-\alpha_{0})\) using a similar Taylor–Lagrange expansion for the first-stage nonlinear least squares regression. The previous expansion leads to a covariance matrix of \(\hat{\beta}\) of the form
Under the assumption of independence between observations, the matrices involved in this covariance matrix are given by
where \(p\lim\) denotes the limit in probability, \(U_{\alpha} = T - P_{\alpha}\), \(U_{\beta} = y - P_{\beta}\), with \(P_{\alpha} = r(w\alpha)\) and \(P_{\beta} = F_{\alpha}(X\beta)\).
An estimate of this covariance matrix can be obtained using the plug-in estimator of each matrix involved in its expression. Let \(P_{\hat{\alpha}} = r(w\hat{\alpha})\), \(P_{\hat{\beta}} = F_{\hat{\alpha}}(X\hat{\beta})\), \(U_{\hat{\alpha}} = T - P_{\hat{\alpha}}\) and \(U_{\hat{\beta}} = y - P_{\hat{\beta}}\); the corresponding estimators of the matrices in equation (10) are then such that
Appendix B: Instrument strength and confounding level
We give below the computation of the instrument strength and the confounding level.

Instrument strength
The strength of an instrument results from its correlation with the corresponding endogenous variable. In the model considered here, the strength of an instrument Z is given by \({\mathbb{C}orr}(T,Z) = \frac{{\mathbb{C}ov}(T,Z)}{\sqrt{{\mathbb{V}ar}(T)}\sqrt{{\mathbb{V}ar}(Z)}}\). As the treatment has a causal link with the variables Z, X_{1}, X_{2} and X_{u}, we have
$$\begin{aligned} {\mathbb{V}ar}(T) &= {\mathbb{V}ar}_{Z}\left[{\mathbb{E}}(T\mid Z,X_{1},X_{2},X_{u})\right]\\ &\quad+ {\mathbb{E}}_{Z}\left[{\mathbb{V}ar}(T\mid Z,X_{1},X_{2},X_{u})\right]. \end{aligned}$$If we consider only the explanatory effect of the instrument on the treatment T and replace the other covariates and the confounder by their average effects, we have \({\mathbb{V}ar}(T) = {\mathbb{V}ar}_{Z}\left(\frac{1}{1+A_Z}\right) + {\mathbb{E}}_{Z}\left(\frac{A_Z}{(1+A_{Z})^{2}}\right)\) with \(A_{Z} = \exp(-(\alpha_{0}+Z\alpha_{z}+\mu_{1}\alpha_{1}+\mu_{2}\alpha_{2}+\mu_{u}))\), \(\mu_{1} = {\mathbb{E}}(X_{1})\), \(\mu_{2} = {\mathbb{E}}(X_{2})\) and \(\mu_{u} = {\mathbb{E}}(X_{u})\). This variance may then be written
$$\begin{aligned} {\mathbb{V}ar}(T) &= {\mathbb{E}}_{Z}\left(\frac{1}{1+A_Z}\right)^{2} - \left({\mathbb{E}}_{Z}\left(\frac{1}{1+A_Z}\right)\right)^{2} + {\mathbb{E}}_{Z}\left(\frac{A_Z}{(1+A_{Z})^{2}}\right) && (11)\\ &= {\mathbb{E}}_{Z}\left(\frac{1+A_Z}{(1+A_{Z})^{2}}\right) - \left({\mathbb{E}}_{Z}\left(\frac{1}{1+A_Z}\right)\right)^{2} && (12)\\ &= {\mathbb{E}}_{Z}\left(\frac{1}{1+A_Z}\right) - \left({\mathbb{E}}_{Z}\left(\frac{1}{1+A_Z}\right)\right)^{2}. && (13) \end{aligned}$$Considering a dichotomous instrument Z having the Bernoulli distribution B(p), we have \({\mathbb{E}}_{Z}\left(\frac{1}{1+A_Z}\right) = \frac{p}{1+A_1} + \frac{1-p}{1+A_0}\), where \(A_{j}\), j=0,1, is \(A_{Z}\) with Z replaced by j. We finally obtain
$$ {\mathbb{V}ar}(T) = \left(\frac{p}{1+A_1} + \frac{1-p}{1+A_0}\right)\left(1-\left(\frac{p}{1+A_1} + \frac{1-p}{1+A_0}\right)\right). $$We also have
$$\begin{aligned} {\mathbb{E}}(T) &= {\mathbb{E}}_{Z}\left({\mathbb{E}}(T\mid Z,X_{1},X_{2},X_{u})\right) && (14)\\ &= {\mathbb{E}}_{Z}\left(\frac{1}{1+A_Z}\right) && (15)\\ &= \frac{p}{1+A_1} + \frac{1-p}{1+A_0} && (16) \end{aligned}$$and then \({\mathbb{E}}(Z){\mathbb{E}}(T) = \frac{p^{2}}{1+A_1} + \frac{p(1-p)}{1+A_0}\). Furthermore, \({\mathbb{E}}(ZT) = {\mathbb{E}}_{Z}({\mathbb{E}}(ZT\mid Z,X_{1},X_{2},X_{u})) = {\mathbb{E}}_{Z}(Z\,{\mathbb{E}}(T\mid Z,X_{1},X_{2},X_{u}))\), which leads to \({\mathbb{E}}(ZT) = {\mathbb{E}}_{Z}\left(\frac{Z}{1+A_Z}\right) = \frac{p}{1+A_1}\). The covariance between Z and T is then given by
$$\begin{aligned} {\mathbb{C}ov}(Z,T) &= {\mathbb{E}}(ZT) - {\mathbb{E}}(Z){\mathbb{E}}(T) && (17)\\ &= \frac{p}{1+A_1} - \frac{p^{2}}{1+A_1} - \frac{p(1-p)}{1+A_0} && (18)\\ &= p(1-p)\left(\frac{1}{1+A_1} - \frac{1}{1+A_0}\right). && (19) \end{aligned}$$The correlation between Z and T is
$$ {\mathbb{C}orr}(T,Z) = \frac{\left(\frac{1}{1+A_1}-\frac{1}{1+A_0}\right)\sqrt{p(1-p)}}{\sqrt{\left(\frac{p}{1+A_1}+\frac{1-p}{1+A_0}\right)\left(1-\left(\frac{p}{1+A_1}+\frac{1-p}{1+A_0}\right)\right)}}. \tag{20} $$Then, for a given instrument Z with p fixed in (20), α_{0}, α_{z}, α_{1} and α_{2} can be chosen to reach a value of A_{Z} that leads to a desired value of \({\mathbb{C}orr}(T,Z)\).
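Formula (20) is straightforward to evaluate numerically. The sketch below (in Python, the function name is ours) computes it with the other covariates replaced by their means, and illustrates with values matching the simulation design (p = 0.7, α_0 = 0.2, α_1 = 2, α_2 = 1.2, μ_1 = −2, μ_2 = −3, μ_u = 0) that the correlation increases with α_z.

```python
import math

def corr_TZ(p, a0, az, a1, a2, mu1, mu2, mu_u):
    """Correlation (20) between binary treatment T and instrument Z ~ B(p),
    with covariates and confounder replaced by their means."""
    def A(z):
        # A_Z = exp(-(a0 + z*az + mu1*a1 + mu2*a2 + mu_u))
        return math.exp(-(a0 + z * az + mu1 * a1 + mu2 * a2 + mu_u))
    q1, q0 = 1 / (1 + A(1)), 1 / (1 + A(0))   # P(T=1 | Z=1), P(T=1 | Z=0)
    m = p * q1 + (1 - p) * q0                 # E(T)
    return (q1 - q0) * math.sqrt(p * (1 - p)) / math.sqrt(m * (1 - m))

for az in (1, 2, 3):
    print(az, corr_TZ(0.7, 0.2, az, 2, 1.2, -2, -3, 0))
```

This is how a target instrument strength can be tuned in the simulation design: fix p and the other coefficients, then move α_z until (20) reaches the desired value.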
Another criterion that could be used to quantify an instrument’s strength is the correlation between T^{∗}=α_{0}+Zα_{ z }+X_{1}α_{1}+X_{2}α_{2}+X_{ u } and the instrument Z. Since \({\mathbb {C}ov}(Z,T^{*}) ={\mathbb {C}ov}(Z,\alpha _{0} + Z \alpha _{z}+ X_{1}\alpha _{1}+ X_{2}\alpha _{2}+ X_{u}) = \alpha _{z}{\mathbb {V}ar}(Z)\) and \( {\mathbb {V}ar}(T^{*}) = \alpha _{z}^{2}{\mathbb {V}ar}(Z) + \alpha _{1}^{2}{\mathbb {V}ar}(X_{1}) + \alpha _{2}^{2}{\mathbb {V}ar}(X_{2}) + {\mathbb {V}ar}(X_{u}),\) under the assumptions on these variables, we have
$$ {\mathbb{C}orr}(Z,T^{*}) = \frac{\alpha_{z}\sqrt{p(1-p)}}{\left(\alpha_{z}^{2}\,p(1-p) + \alpha_{1}^{2}\sigma_{1}^{2} + \alpha_{2}^{2}\sigma_{2}^{2} + \sigma_{u}^{2}\right)^{1/2}} \tag{21} $$where \(\sigma_{i}^{2} = {\mathbb{V}ar}(X_{i})\) and \(\sigma_{u}^{2} = {\mathbb{V}ar}(X_{u})\).

Confounding level
A straightforward calculation along the same lines leads to the following correlation between T^{∗} and X_{u}, which expresses the level of confounding.
$$ {\mathbb{C}orr}(X_{u},T^{*}) = \frac{\sigma_{u}}{\left(\alpha_{z}^{2}\,p(1-p) + \alpha_{1}^{2}\sigma_{1}^{2} + \alpha_{2}^{2}\sigma_{2}^{2} + \sigma_{u}^{2}\right)^{1/2}}. \tag{22} $$
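Formulas (21) and (22) share the same denominator, so they can be computed together. The following Python sketch (the function name is ours) evaluates both for design values of the kind used in the simulation study (p = 0.7, α_1 = 2, α_2 = 1.2, σ_1 = σ_2 = 1).

```python
import math

def latent_scale_corrs(az, p, s1, s2, su, a1=2.0, a2=1.2):
    """Instrument strength Corr(Z, T*) from (21) and confounding level
    Corr(Xu, T*) from (22), both on the latent-index scale T*."""
    denom = math.sqrt(az ** 2 * p * (1 - p)
                      + a1 ** 2 * s1 ** 2
                      + a2 ** 2 * s2 ** 2
                      + su ** 2)
    strength = az * math.sqrt(p * (1 - p)) / denom
    confounding = su / denom
    return strength, confounding

print(latent_scale_corrs(2.0, 0.7, 1.0, 1.0, 1.0))
```

Because the denominator is shared, strengthening the instrument (larger α_z) mechanically lowers the confounding correlation for a fixed σ_u, which is worth keeping in mind when reading a grid of scenarios built from these two quantities.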
Appendix C: F-statistics for each scenario
The following tables display the equivalent of the first-stage F-statistics in linear regression (see [33]) testing instrument exclusion in the treatment choice model.
Appendix D: Description of the simulation model and parameter values, results of the second scenario
For all scenarios, the model generating the binary outcome is the index function
with \(Y_{i}^{*} = \beta_{0} + T_{i}\beta_{t} + X_{1i}\beta_{1} + X_{2i}\beta_{2} + X_{ui}\beta_{u}\) and ε_{i} following the standard logistic distribution. T_{i} is the observed binary treatment of individual i, X_{1i} and X_{2i} are characteristics of patient i, and X_{ui} is the unmeasured confounding factor. β_{0} is the intercept, and β_{t}, β_{1}, β_{2} and β_{u} are the parameters related to T, X_{1}, X_{2} and X_{u} respectively. We fixed these parameters at β_{0}=−0.6, β_{t}=3, β_{1}=1, β_{2}=1 and β_{u}=1 to keep the prevalence of the event below 5%.
The treatment choice for the i^{th} patient was generated from a Bernoulli model with success probability p_{i}, which depends on the patient's characteristics X_{1}∼N(−2,1) and X_{2}∼N(−3,1), on a binary instrument Z∼B(0.7) and on the confounding factor X_{u}∼N(0,σ_{u}). The probability p_{i} is given by p_{i}=F(α_{0}+Z_{i}α_{z}+X_{1i}α_{1}+X_{2i}α_{2}+X_{ui}), where F denotes the logistic distribution function. The standard deviation σ_{u} of the confounding factor takes the values 0.5, 1 and 1.5, corresponding to low, medium and high levels of confounding respectively. For a fixed level of confounding, we varied only the value of α_{z} over {1,2,3}, each element corresponding to an instrument strength: 1 for a "low", 2 for a "moderate" and 3 for a "high" instrument. All other parameters in the treatment choice model remained constant (α_{0}=0.2, α_{1}=2 and α_{2}=1.2). The data were generated using the R function sim2Logit2().
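The design above can be transcribed as follows. The paper's own generator is the R function sim2Logit2(), which we do not have; this Python sketch only mirrors the design described in this appendix.

```python
import numpy as np

def simulate(n, az, sigma_u, seed=0):
    """One sample from the Appendix D design: logistic treatment-choice
    model with instrument Z and confounder Xu, then logistic outcome model.
    (Python transcription of the design; not the authors' sim2Logit2().)"""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(-2.0, 1.0, n)        # patient characteristic X1
    x2 = rng.normal(-3.0, 1.0, n)        # patient characteristic X2
    xu = rng.normal(0.0, sigma_u, n)     # unmeasured confounder
    z = rng.binomial(1, 0.7, n)          # binary instrument Z ~ B(0.7)
    # treatment choice model: alpha_0=0.2, alpha_1=2, alpha_2=1.2
    p_t = 1 / (1 + np.exp(-(0.2 + az * z + 2.0 * x1 + 1.2 * x2 + xu)))
    t = rng.binomial(1, p_t)
    # outcome model: beta_0=-0.6, beta_t=3, beta_1=beta_2=beta_u=1
    p_y = 1 / (1 + np.exp(-(-0.6 + 3.0 * t + x1 + x2 + xu)))
    y = rng.binomial(1, p_y)
    return z, x1, x2, xu, t, y

z, x1, x2, xu, t, y = simulate(30000, 2, 1.0)
print(t.mean(), y.mean())                # exposure and event prevalences
```

With these parameter values both prevalences stay low, consistent with the rare-exposure, rare-event setting described below.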
We investigated the performance of the methods in the context of rare exposure (2 to 6%) and rare events (less than 5%). To check whether the results remain valid in the context of higher exposure (near 50%), we designed new simulations in which only the intercepts α_{0} and β_{0} were modified from the previous design. We held all other parameters constant and fixed α_{0}=5 and β_{0}=−2.3. The prevalence of exposure then ranged between 26% and 45%, whereas that of the event was maintained below 6%. We present in Table 5 the results of this second set of simulations over 500 Monte Carlo samples of size 30,000. Table 6 displays the number of Monte Carlo samples with outliers.
This function allows one to simulate a compound logistic model with covariates.
Abbreviations
2SLS: Two-Stage Least Squares
2SPS: Two-Stage Predictor Substitution
2SRI: Two-Stage Residual Inclusion
GMM: Generalized Method of Moments
Inserm: Institut national de la santé et de la recherche médicale
IV: Instrumental variable
OR: Odds Ratio
PP: Physician Preference
pval: non-coverage probability
rB: Relative Bias
rMSE: root Mean Square Error
sd: standard deviation
UVSQ: Université de Versailles Saint-Quentin-en-Yvelines
References
1. Rassen JA, Schneeweiss S, Glynn RJ, Mittleman MA, Brookhart MA. Instrumental variable analysis for estimation of treatment effects with dichotomous outcomes. Am J Epidemiol. 2009;169(3):273–84. https://doi.org/10.1093/aje/kwn299.
2. Gowrisankaran G, Town RJ. Estimating the quality of care in hospitals using instrumental variables. J Health Econ. 1999;18(6):747–67. https://doi.org/10.1016/S0167-6296(99)00022-3.
3. Carroll RJ, Ruppert D, Stefanski LA. Measurement Error in Nonlinear Models. Monographs on Statistics and Applied Probability. London: Chapman & Hall; 1995.
4. Nestler S. Using instrumental variables to estimate the parameters in unconditional and conditional second-order latent growth models. Struct Equ Model A Multidiscip J. 2015;22(3):461–73.
5. Greene WH. Econometric Analysis, 7th edition, international edition. Boston: Pearson; 2012. pp. 259–89.
6. Bound J, Jaeger DA, Baker RM. Problems with instrumental variables estimation when the correlation between the instruments and the endogenous explanatory variable is weak. J Am Stat Assoc. 1995;90(430):443–50.
7. Terza JV, Bradford WD, Dismuke CE. The use of linear instrumental variables methods in health services research and health economics: a cautionary note. Health Serv Res. 2008;43(3):1102–20.
8. Foster EM. Instrumental variables for logistic regression: an illustration. Soc Sci Res. 1997;26(4):487–504. https://doi.org/10.1006/ssre.1997.0606.
9. Terza JV. Estimation of policy effects using parametric nonlinear models: a contextual critique of the generalized method of moments. Health Serv Outcome Res Methodol. 2006;6(3):177–98. https://doi.org/10.1007/s10742-006-0013-0.
10. Cai B, Small DS, Ten Have TR. Two-stage instrumental variable methods for estimating the causal odds ratio: analysis of bias. Stat Med. 2011;30(15):1809–24. https://doi.org/10.1002/sim.4241.
11. Klungel OH, Martens EP, Psaty BM, Grobbee DE, Sullivan SD, Stricker BHC, Leufkens HGM, de Boer A. Methods to assess intended effects of drug treatment in observational studies are reviewed. J Clin Epidemiol. 2004;57:1223–31. https://doi.org/10.1016/j.jclinepi.2004.03.011.
12. Palmer TM, Sterne JAC, Harbord RM, Lawlor DA, Sheehan NA, Meng S, Granell R, Smith GD, Didelez V. Instrumental variable estimation of causal risk ratios and causal odds ratios in Mendelian randomization analyses. Am J Epidemiol. 2011;173(12):1392.
13. Chapman CG, Brooks JM. Treatment effect estimation using nonlinear two-stage instrumental variable estimators: another cautionary note. Health Serv Res. 2016;51(6):2375–94. https://doi.org/10.1111/1475-6773.12463.
14. Johnston KM, Gustafson P, Levy AR, Grootendorst P. Use of instrumental variables in the analysis of generalized linear models in the presence of unmeasured confounding with applications to epidemiological research. Stat Med. 2008;27(9):1539–56. https://doi.org/10.1002/sim.3036.
15. Greenland S. An introduction to instrumental variables for epidemiologists. Int J Epidemiol. 2000;29(4):722–9. https://doi.org/10.1093/ije/29.4.722.
16. McClellan M, McNeil BJ, Newhouse JP. Does more intensive treatment of acute myocardial infarction in the elderly reduce mortality? Analysis using instrumental variables. JAMA. 1994;272(11):859–66. https://doi.org/10.1001/jama.1994.03520110039026.
17. Cain LE, Cole SR, Greenland S, Brown TT, Chmiel JS, Kingsley L, Detels R. Effect of highly active antiretroviral therapy on incident AIDS using calendar period as an instrumental variable. Am J Epidemiol. 2009;169(9):1124–32. https://doi.org/10.1093/aje/kwp002.
18. Brookhart MA, Schneeweiss S. Preference-based instrumental variable methods for the estimation of treatment effects: assessing validity and interpreting results. Int J Biostat. 2007;3(1):14. https://doi.org/10.2202/1557-4679.1072.
19. Brooks JM, Chrischilles EA, Scott SD, Chen-Hardee SS. Was breast conserving surgery underutilized for early stage breast cancer? Instrumental variables evidence for stage II patients from Iowa. Health Serv Res. 2003;38(6p1):1385–402. https://doi.org/10.1111/j.1475-6773.2003.00184.x.
20. Johnston SC. Combining ecological and individual variables to reduce confounding by indication: case study – subarachnoid hemorrhage treatment. J Clin Epidemiol. 2000;53(12):1236–41. https://doi.org/10.1016/S0895-4356(00)00251-1.
21. Brookhart MA, Wang PS, Solomon DH, Schneeweiss S. Evaluating short-term drug effects using a physician-specific prescribing preference as an instrumental variable. Epidemiology. 2006;17(3):268–75.
22. Abrahamowicz M, Beauchamp ME, Ionescu-Ittu R, Delaney JAC, Pilote L. Reducing the variance of the prescribing preference-based instrumental variable estimates of the treatment effect. Am J Epidemiol. 2011;174(4):494–502. https://doi.org/10.1093/aje/kwr057.
23. Baiocchi M, Cheng J, Small DS. Instrumental variable methods for causal inference. Stat Med. 2014;33(13):2297–340. https://doi.org/10.1002/sim.6128.
24. Uddin MJ, Groenwold RHH, de Boer A, Gardarsdottir H, Martin E, Candore G, Belitser SV, Hoes AW, Roes KCB, Klungel OH. Instrumental variables analysis using multiple databases: an example of antidepressant use and risk of hip fracture. Pharmacoepidemiol Drug Saf. 2016;25:122–31. https://doi.org/10.1002/pds.3863.
25. Uddin MJ, Groenwold RHH, de Boer A, Afonso ASM, Primatesta P, Becker C, Belitser SV, Hoes AW, Roes KCB, Klungel OH. Evaluating different physician's prescribing preference based instrumental variables in two primary care databases: a study of inhaled long-acting beta2-agonist use and the risk of myocardial infarction. Pharmacoepidemiol Drug Saf. 2016;25:132–41. https://doi.org/10.1002/pds.3860.
26. Brookhart MA, Rassen JA, Schneeweiss S. Instrumental variable methods in comparative safety and effectiveness research. Pharmacoepidemiol Drug Saf. 2010;19(6):537–54. https://doi.org/10.1002/pds.1908.
27. Terza JV, Basu A, Rathouz PJ. Two-stage residual inclusion estimation: addressing endogeneity in health econometric modeling. J Health Econ. 2008;27(3):531–43. https://doi.org/10.1016/j.jhealeco.2007.09.009.
28. Cameron AC, Trivedi PK. Microeconometrics: Methods and Applications. Cambridge: Cambridge University Press; 2005. pp. 166–220. https://doi.org/10.1017/CBO9780511811241.
29. Hansen LP, Heaton J, Yaron A. Finite-sample properties of some alternative GMM estimators. J Bus Econ Stat. 1996;14(3):262–80.
30. Amemiya T. The nonlinear two-stage least-squares estimator. J Econom. 1974;2(2):105–10. https://doi.org/10.1016/0304-4076(74)90033-5.
31. Hansen LP. Large sample properties of generalized method of moments estimators. Econometrica. 1982;50(4):1029–54.
32. Chausse P. Computing generalized method of moments and generalized empirical likelihood with R. J Stat Softw. 2010;34(1):1–35.
33. Stock JH, Wright JH, Yogo M. A survey of weak instruments and weak identification in generalized method of moments. J Bus Econ Stat. 2002;20(4):518–29.
Funding
This work was supported by grants from the Fondation pour la Recherche Médicale (“Iatrogénie des médicaments 2013”) and the Agence Nationale pour la Recherche (“Appel à projet générique 2015”). The funding bodies had no role in the contents and the writing of the manuscript.
Availability of data and materials
The R programs used for the simulations are available from the corresponding author.
Author information
Affiliations
Contributions
BFK and PTB conceived the study. BFK performed the simulation study and drafted the manuscript. BFK, PTB and SE interpreted the results, read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Koladjo, B., Escolano, S. & Tubert-Bitter, P. Instrumental variable analysis in the context of dichotomous outcome and exposure with a numerical experiment in pharmacoepidemiology. BMC Med Res Methodol 18, 61 (2018). https://doi.org/10.1186/s12874-018-0513-y
Received:
Accepted:
Published:
Keywords
 Instrumental variable
 Nonlinear least squares
 Logistic regression
 Physician’s prescription preference
 Pharmacoepidemiology
 Observational studies
 Simulation study