Instrumental variable analysis in the context of dichotomous outcome and exposure with a numerical experiment in pharmacoepidemiology

Background In pharmacoepidemiology, prescription preference-based instrumental variables (IV) are often used with linear models to address the endogeneity due to unobserved confounders, even when the outcome and the endogenous treatment are dichotomous variables. Using this type of instrumental variable, we perform Monte-Carlo simulations to compare the IV-based generalized method of moments (IV-GMM) and the two-stage residual inclusion (2SRI) method in this context. Methods We established the formulas allowing us to compute the instrument's strength and the confounding level in the context of logistic regression models. We then varied the instrument's strength and the confounding level to cover a large range of scenarios in the simulation study. We also explored two prescription preference-based instruments. Results We found that the 2SRI is less biased than the other methods and yields satisfactory confidence intervals. The proportion of previous patients of the same physician who were prescribed the treatment of interest performed well as a proxy of the physician's preference instrument. Conclusions This work shows that, when analysing real data with dichotomous outcome and exposure, appropriate 2SRI estimation can be used in the presence of unmeasured confounding.


Background
In observational studies, unobserved confounding may bias the estimation of the target effect. Over the last decade, this issue has received growing attention in the field of epidemiological studies attempting to assess adverse effects of drugs, with a few works focusing on instrumental variable (IV) approaches. Instrumental variable estimation is a well-known approach for addressing endogeneity in statistical modelling [1,2]. Endogeneity often arises when a causal model is poorly specified, thereby introducing a structural bias in the estimation of its parameters. This may result from a measurement error in variables [3], an unobserved variable [4] or an inverse causality between the outcome and some regressors. The general IV method of estimation attempts to remove this bias by using structural equations which incorporate instrumental variables in the model. Several theoretical approaches have been developed to build estimators of parameters and to study the properties of these estimators in causal models with endogeneity (see [5]). A well-known example is the case of linear models, in which IV estimation leads to estimators with good convergence properties such as the consistency discussed in [6]. The structural bias can be completely removed in this case.

*Correspondence: francois.koladjo@gmail.com. 1 Biostatistics, Biomathematics, Pharmacoepidemiology and Infectious Diseases (B2PHI), Inserm, UVSQ, Institut Pasteur, Université Paris-Saclay, 16 Avenue Paul Vaillant-Couturier, 94807 Villejuif, France
In pharmacoepidemiology, we most often deal with binary covariables (drug exposure), binary responses (adverse event indicator) and confounding variables, i.e. variables that are correlated with both exposure and response. A nonlinear model should be the first choice in this context to match the specific nature of the variables. However, to quantify the risk of an adverse effect of a treatment in the presence of unobserved confounding, researchers investigating IV-based estimation often model the probability of dichotomous events as a linear function of covariables, thus ignoring the basic features of a probability. Terza and colleagues [7] investigated the influence of misspecification on the estimation when a linear IV model is used in an inherently nonlinear regression setting with endogeneity. A substantial bias was demonstrated in their results. In the context of pharmacoepidemiology modelling, endogeneity is often due to unobserved confounding, and various nonlinear IV methods such as the generalized method of moments (GMM; [8,9]) or the two-stage residual inclusion (2SRI; [10]) can be used to address this issue. However, in a review of IV methods, Klungel and colleagues [11] claimed that the GMM estimator with the logistic regression model is not consistent for causal odds ratio (OR) estimation owing to the non-collapsibility of the OR. Consistency is also not guaranteed for the 2SRI when the regression models are nonlinear in both stages. For these nonlinear IV methods, theoretical results exist under very restrictive assumptions which do not cover the possible frameworks of real data. Overall, in the context of a binary outcome, simulation studies have mainly investigated two-stage IV methods with a linear first step and a logistic second step, as in [12]. A few articles concern double probit models (Chapman et al. [13]).
Very few address the comparison of the GMM and two-stage approaches, and none studies the GMM and the two-stage double-logistic approach using prescription preference-based instrumental variables.
In a simulation study, we compare the IV-based GMM and the 2SRI methods to the conventional method which does not account for endogeneity. These comparisons are based on the estimation of the regression coefficient of the exposure variable in a nonlinear logistic model. Our numerical comparison of the methods involves several scenarios with different confounding levels and different instrument strengths, for which computation formulas are established in the context of dichotomous outcome and exposure. We recall the general formula of the covariance matrix for two-step estimation methods and give the corresponding expression for the two-stage nonlinear least squares method in the context of logistic regressions.
The paper is organized as follows: we specify the model and describe the methods of estimation that will be analysed. Then we describe the simulation design, the criteria for evaluating the performances of the methods and the results of our simulations. The final sections discuss the results and make some concluding remarks. Details on the computation of the covariance matrix of the 2SRI method, the instrument's strength and the confounding level are to be found in the appendices, as well as a detailed description of the simulation model and supplementary results.

Model
We consider a general model with dichotomous outcome and exposure that can be written as

Y = F(β0 + Tβt + X1β1 + X2β2 + Xuβu) + e,  (1)
E(T | Z, X1, X2, Xu) = r(α0 + Zαz + X1α1 + X2α2 + Xu),  (2)

with Y and T the binary outcome (event or not) and treatment (T1 or T2) respectively, X1, X2 some covariables and Xu an unobserved confounder of the outcome and treatment. The function F(.) = r(.) denotes the logistic distribution function, also known as expit(.): expit(x) = exp(x)/(1+exp(x)), and e denotes the error term. The parameter β = (β0, βt, β1, β2) denotes the vector of unknown parameters to be estimated. Without a confounding variable, all observed regressors are exogenous. In this case, the true model is written

Y = F(β0 + Tβt + X1β1 + X2β2) + e,  (3)

and conventional regression methods are suitable for estimating the parameters β. We will denote (3) the conventional model. If this model is fitted to data in the presence of an unobserved confounder, the estimated coefficients will be biased, with a level of bias depending on the confounding level. As the confounder Xu is not independent of treatment, the residuals of the conventional model are associated with the treatment. This causes endogeneity, so a single regression of the outcome on observed covariables will fail to estimate βt efficiently. A common strategy is to consider another regression model that links the endogenous variable with the others. Equation (2) defines the auxiliary model that predicts treatment T as a function of the covariables X1, X2, the confounder Xu and another variable Z. Variable Z denotes the instrumental variable (or instrument) related to the treatment, i.e. a variable correlated with the treatment and which has no direct association with the outcome. The bias due to the unobserved confounder can be significantly reduced by means of the two-stage regression model using a valid instrument. As defined by Johnston and colleagues [14] and Greenland [15], a valid instrument must not be correlated with an unobserved confounder or with the error term in the true model (1).
Formally, we assume that the instrument Z meets the following assumptions: (i) Z is associated with the treatment T (Cov(Z, T) ≠ 0); (ii) Z is not correlated with the unobserved confounder Xu (Cov(Z, Xu) = 0); and (iii) Z has no direct effect on the outcome, i.e. it is not correlated with the error term e in the true model (1) (Cov(Z, e) = 0). We also assume that the confounder Xu is not associated with the covariables X1 and X2, that is, Cov(Xu, X1) = 0 and Cov(Xu, X2) = 0. The main goal is to estimate βt, the treatment coefficient which is the basis of risk evaluation; however, an estimation of βt is in general obtained by estimating the vector β, which is discussed below.
As already proved in the simple case of a linear model (for which the functions F and r are equal to identity in Eqs. (1) and (2)), a high association between the treatment and an instrument should improve the IV estimation of β. Finding a strong instrument is then a crucial step in all procedures of instrumental variable estimation.
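To make the endogeneity mechanism concrete, the following minimal Python sketch (the paper's own code is in R; all parameter values here are illustrative, not those of the paper) simulates a binary treatment and outcome from the two structural equations with a shared confounder Xu and a binary instrument Z:

```python
import numpy as np

rng = np.random.default_rng(0)
expit = lambda v: 1.0 / (1.0 + np.exp(-v))

n = 50_000
z = rng.binomial(1, 0.7, n)        # binary instrument (e.g. a preference proxy)
x1 = rng.normal(-2, 1, n)          # observed covariables
x2 = rng.normal(-3, 1, n)
xu = rng.normal(0, 1, n)           # unobserved confounder

# auxiliary model (2): treatment depends on Z, X1, X2 and the confounder Xu
t = rng.binomial(1, expit(1.0 + 1.5 * z + 0.3 * x1 + 0.3 * x2 + xu))
# true model (1): outcome depends on T, X1, X2 and the same confounder Xu
y = rng.binomial(1, expit(-1.0 + 0.8 * t + 0.2 * x1 + 0.2 * x2 + xu))

# Xu drives both T and Y, so the error of a regression of Y on (T, X1, X2)
# alone is associated with T: the treatment is endogenous in model (3).
print(np.corrcoef(t, xu)[0, 1])    # clearly positive
```

Because Xu enters both linear predictors, omitting it makes T correlated with the error term, which is exactly the endogeneity the IV methods below are designed to handle.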
In what follows, we first present some specific instrumental variables often used in pharmacoepidemiology, then discuss some IV estimators of β to obtain an estimate of β t before addressing the properties of these estimators.

Instrument in pharmacoepidemiology
An instrumental variable can be determined in many ways, provided that it meets the assumptions listed above. One of the problems is to find a valid instrument with reasonable strength. The strength of an instrumental variable can be defined as resulting from the level of its association with the endogenous treatment. As such, it can be quantified by the correlation coefficient between the treatment and the related instrument. In the wide range of pharmacoepidemiologic applications, the various instrumental variables can be summarized in three categories. Geographical variation. Proximity to the care provider can positively influence a patient's access to treatment compared with that of patients who live far away from health services. To account for this difference between patients, some researchers (see [16]) consider the distance between a patient and a care provider as an instrumental variable. Although this seems realistic, as there is no direct association between this distance and the occurrence of disease, the presence or absence of health services can be associated with some socioeconomic characteristics. The latter are often considered as unmeasured confounders that call into question the suitability of this instrument.
Calendar time. The use of calendar time as an instrument in pharmacoepidemiology often relies on the occurrence of an event that could change the attitude of the physician or patient regarding a treatment. This could be a change in guidelines for example or a change due to the arrival of a new drug on the market. The time from that event to the date of treatment defines the calendar time which clearly affects the outcome of the treatment since the change in physician or patient attitude will be more pronounced immediately after the event has occurred than later. An example of use of calendar time as an instrument can be found in [17].
Physician's prescription preference. The most frequently used instrumental variables in pharmacoepidemiology are preference-based [18][19][20]. The issue is to compare the effectiveness of two treatments T1 and T2 when the assignment of treatment to the patient is not randomized. This is the case in observational studies, where the prescriber of the treatment (the physician) introduces an effect that influences the outcome via the prescribed treatment. This effect gives rise to the instrumental variable that reflects the influence of care providers on the patient-treatment relationship. Brookhart and colleagues [21] define this instrumental variable as the "Physician's Preference" (PP) and propose using the treatment prescribed by a physician to his/her previous patient as a proxy of this IV for his/her new patient.
The instrumental variable Z*_i of the i-th patient will then be the treatment prescribed to the previous patient of the same physician. As a physician's preference may change over time, Abrahamowicz and colleagues [22] introduced a procedure to detect the change point and build a new proxy of PP that includes not only the treatment prescribed to the previous patient but all previous prescriptions since the change point.
Those instrumental variables and some others are presented in more detail in [23], [24,25] or [26], with enlightening discussions of their validity. In this work, we carry out a simulation study to examine how the proxy Z* of physician's preference performs in the context of logistic regression. We also examine the physician's preference-based IV in continuous form (see [22]), i.e. the proportion (pr) of all previous patients of the same physician who were prescribed the treatment of interest. This corresponds to the empirical estimator of the probability that a physician prefers the treatment of interest.
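The two preference-based proxies can be computed from a prescription history along the following lines; the physician identifiers and treatment sequence below are purely hypothetical, and the sketch is in Python although the paper's analyses use R:

```python
import numpy as np

# Hypothetical prescription history: one entry per patient in order of visit,
# grouped by physician; 1 = treatment of interest, 0 = comparator.
phys  = np.array([1, 1, 1, 1, 2, 2, 2])
treat = np.array([1, 0, 1, 1, 0, 0, 1])

z_star, pr = [], []
for d in np.unique(phys):
    t = treat[phys == d]
    for i in range(len(t)):
        if i == 0:
            # the first patient of a physician has no previous prescription
            z_star.append(np.nan)
            pr.append(np.nan)
        else:
            z_star.append(int(t[i - 1]))      # proxy Z*: previous patient's treatment
            pr.append(float(t[:i].mean()))    # proxy pr: share of all previous patients
print(z_star)
print(pr)
```

Note that pr uses the full prescribing history of each physician, whereas Z* uses only the single most recent prescription.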

IV Estimation of β t
Estimating β t in model (1) with an unobserved confounder X u amounts to estimating vector β and taking the corresponding component of treatment T. Below, we present two methods that can provide consistent estimation of β and then that of β t : the two-stage residual inclusion (2SRI) method and the generalized method of moment (GMM).
The two-stage residual inclusion method is a modified version of the two-stage least squares (2SLS) method used to estimate the parameters in linear models with instrumental variables. As mentioned by Greene [5], the first stage of the 2SLS method predicts the endogenous variable (the treatment here) using the instruments and other covariables (Eq. (2)). In the second stage, the endogenous variable is simply replaced by its prediction from the first stage. This method is called two-stage predictor substitution (2SPS) when the first stage is nonlinear. Unlike the 2SLS, the residuals of the first regression serve as an additional regressor in the second stage of the 2SRI method. This method also generalizes to nonlinear models, i.e. when both the first and second stages are nonlinear. The rationale of this approach is intrinsically related to the form of the true model: sometimes, the prediction equation of the outcome includes the error term of the auxiliary regression. An example is the case where the confounder is the only source of error in the auxiliary regression, as considered in [27]. In a linear model, the 2SPS and 2SRI approaches are equivalent.
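A minimal sketch of the 2SRI procedure with a logistic model at both stages may help fix ideas. The sketch below is in Python (the paper uses R's glm); the Newton-Raphson helper is a stand-in for a logistic regression fitter, and the simulated data and parameter values are illustrative, not those of the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
expit = lambda v: 1.0 / (1.0 + np.exp(-v))

def logit_fit(X, y, iters=25):
    """Logistic regression by Newton-Raphson (stand-in for R's glm)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = expit(X @ beta)
        beta += np.linalg.solve(X.T @ ((p * (1 - p))[:, None] * X), X.T @ (y - p))
    return beta

# simulated data with an endogenous treatment (illustrative parameters)
n = 20_000
z  = rng.binomial(1, 0.7, n)
x1 = rng.normal(-2, 1, n); x2 = rng.normal(-3, 1, n)
xu = rng.normal(0, 1, n)
t  = rng.binomial(1, expit(1.0 + 1.5 * z + 0.3 * x1 + 0.3 * x2 + xu))
y  = rng.binomial(1, expit(-1.0 + 1.0 * t + 0.2 * x1 + 0.2 * x2 + xu))

# stage 1: logistic regression of T on the instrument and exogenous covariables
W = np.column_stack([np.ones(n), z, x1, x2])
alpha_hat = logit_fit(W, t)
res1 = t - expit(W @ alpha_hat)          # first-stage (raw) residual

# stage 2: logistic regression of Y on T, X1, X2 AND the first-stage residual
X = np.column_stack([np.ones(n), t, x1, x2, res1])
beta_2sri = logit_fit(X, y)

# conventional fit omitting the residual, biased by the confounder Xu
beta_conv = logit_fit(np.column_stack([np.ones(n), t, x1, x2]), y)
print(beta_2sri[1], beta_conv[1])        # estimates of beta_t (true value 1.0)
```

Including the first-stage residual is what distinguishes 2SRI from 2SPS; as discussed below, the estimate is not exactly consistent when the confounder enters the first stage non-additively, but it typically reduces the confounding bias substantially.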
The GMM is an alternative method for obtaining a reliable estimator of the parameter β in a model with an endogenous variable. It is based on the classical exogeneity assumption on the error term in Eq. (1), namely E(e|T, X1, X2) = 0. This assumption does not hold in general because the confounder Xu is unobserved (i.e. E(e|T, X1, X2) ≠ 0). In this case, the treatment is endogenous and its coefficient βt cannot be consistently estimated. One supposes in general that there exist some observable instruments w1 such that E(e|w) = 0 with w = (w1, X1, X2). Typically, w corresponds to the vector of exogenous and endogenous variables with the endogenous regressors replaced by their corresponding instruments. Using the law of iterated expectations, the last condition implies

E(we) = 0.  (5)

The method of moments solves the empirical version of (5) in β, i.e. (1/n) Σ_{i=1}^n e_i w_i = 0, where n is the sample size, e_i is the i-th component of e and w_i the i-th row of w. In turn, the GMM minimizes the quadratic form [Σ_i w_i e_i]' Ω [Σ_i w_i e_i], where Ω denotes a weighting matrix. As discussed in [28], there are several choices for Ω leading to different estimators of β. The optimal approach is to define Ω as the inverse of the asymptotic covariance matrix (depending on β) of the sample moments. Some alternative procedures are also suggested by Hansen and colleagues [29].
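Under exact identification (as many moments as parameters), the empirical moment condition can simply be solved numerically rather than minimized. The Python sketch below (illustrative data and parameters, not the paper's; the paper uses the R gmm package) builds w from the instrument and exogenous covariables and solves (1/n) Σ e_i w_i = 0:

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(2)
expit = lambda v: 1.0 / (1.0 + np.exp(-v))

n = 20_000
z  = rng.binomial(1, 0.7, n)
x1 = rng.normal(-2, 1, n); x2 = rng.normal(-3, 1, n)
xu = rng.normal(0, 1, n)
t  = rng.binomial(1, expit(1.0 + 1.5 * z + 0.3 * x1 + 0.3 * x2 + xu))
y  = rng.binomial(1, expit(-1.0 + 1.0 * t + 0.2 * x1 + 0.2 * x2 + xu))

X = np.column_stack([np.ones(n), t, x1, x2])   # regressors of the outcome model
W = np.column_stack([np.ones(n), z, x1, x2])   # w: endogenous T replaced by Z

def gbar(beta):
    e = y - expit(X @ beta)       # residual of the outcome equation
    return W.T @ e / n            # sample moments (1/n) sum_i e_i w_i

sol = root(gbar, x0=np.zeros(4))  # exactly identified: 4 moments, 4 parameters
print(sol.success, sol.x[1])      # method-of-moments estimate of beta_t
```

With more instruments than parameters the system is overidentified and one would instead minimize the quadratic form with a weighting matrix, as the GMM does.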

Properties
The properties of the 2SRI are addressed by Terza in [27] in the case where the residual, which also acts as the unobserved confounder in the first-stage regression (Eq. (2)), is additive. Under this assumption and using nonlinear least squares regression in each stage, they show the consistency of the estimator β̂ from the second stage. Since the confounder is not additive in model (2), the first-stage estimate of the 2SRI will not be consistent, nor will β̂. The residual from the first stage is indeed an unknown function of the unobserved confounder. A bias that depends on the form of this unknown function therefore remains when the 2SRI method is applied to the structural model of Eqs. (1) and (2). The derivation of the covariance matrix of the 2SRI estimator follows the usual sandwich form of a two-step regression, A⁻¹ S (A⁻¹)', where the computation of the matrices A and S is given in Appendix A.
The GMM is a well documented estimation procedure. Both in linear and nonlinear models, several results on the estimator have already been established. In the literature on econometric analysis, the nonlinear GMM with an instrumental variable has received particular attention. In the pioneering work by Amemiya [30], the author demonstrated the consistency and derived the asymptotic distribution of the nonlinear two-stage least squares estimator (NL2SLS).
This result provided an important insight into how to handle nonlinear models with endogeneity. Later, Hansen [31] showed the asymptotic properties (consistency and asymptotic distribution) of the GMM to be a kind of generalization of the NL2SLS. More recently, Cameron and Trivedi [28] reviewed the method and gave details on the computation of the estimator's covariance matrix in some specific cases. Despite the various results on the GMM with an instrumental variable, its performance in terms of bias and variance depends on the validity and strength of the instrument and on the nature of the variables in the model. The computation of the covariance matrix of the GMM is given in [28] and implemented in dedicated software packages, of which [32] is a good example.
Below, we investigate the performances of these methods with numerical experiments.

Simulation design and data generation
A numerical experiment was conducted to investigate and compare several methods of IV estimation in the context of a dichotomous outcome and exposure with endogeneity in pharmacoepidemiology. In this experiment, we cover a wide range of possible scenarios. We choose values of the parameter α = (α0, αz, α1, α2) corresponding to given values of the correlation between the variable T* = α0 + PPαz + X1α1 + X2α2 + Xu and the Physician's prescribing Preference instrument PP. In fact, we keep α0, α1 and α2 fixed and only αz varies from one scenario to another. The computation of this correlation is given in Appendix B. It reflects the strength of the instrument when the confounder and other covariables are kept fixed. For each value of the instrument's strength, there are three levels of confounding measured by the standard deviation σu of the confounding variable Xu, σu ∈ {0.5, 1, 1.5}, which leads to a set of correlations between T* and Xu. We thus have nine scenarios (three instrument strengths crossed with three confounding levels).
For each scenario, we generate ns = 1000 Monte-Carlo samples of size n, with n = 10000, 20000 and 30000. The number of patients per physician is kept fixed and equal to 100; the confounder Xu and the covariates X1 and X2 are assumed to have the normal distributions N(0, σu), N(−2, 1) and N(−3, 1) respectively, and the physician's prescribing preference has the Bernoulli distribution B(0.7). We first simulate the covariables X1 and X2, the confounder Xu and the physician's prescribing preference, which is the same for all the patients of the same physician. Using the already fixed values of the parameters α, the probability p_i that patient i will be prescribed the drug of interest is calculated by inverting the logit function, i.e. p_i = expit(α0 + PP_i αz + X1i α1 + X2i α2 + X_ui). Treatment T_i of patient i is then generated as a Bernoulli realisation with parameter p_i. The same procedure as for the treatment is used to simulate the corresponding outcome y_i for each patient i. We fixed the parameter α such that the proportion of exposed patients ranges between 2 and 6%, and the prevalence of the event of interest was chosen to be smaller than 5%, to reflect a real-life situation of a new treatment (not frequently prescribed) and a rare adverse event. With the fixed value of β, we compute the probability F(β0 + T_i βt + X1i β1 + X2i β2 + X_ui βu) of the event for patient i and then simulate his outcome y_i. We also explored a more balanced situation in terms of exposure frequencies (between 26 and 45%).
Finally, proxy Z * of PP is the treatment given to the previous patient and the continuous instrument pr is the proportion of patients of the same physician who were previously prescribed the treatment of interest. More details on the simulation model and the data generating R code are given in Appendix D.
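The data-generating steps above can be sketched as follows in Python (the paper's generator is written in R and given in Appendix D; the α and β values below are illustrative placeholders, not the values used in the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
expit = lambda v: 1.0 / (1.0 + np.exp(-v))

n_phys, n_per = 100, 100                  # 100 patients per physician
n = n_phys * n_per
# physician's prescribing preference PP ~ B(0.7), shared by a physician's patients
pp = np.repeat(rng.binomial(1, 0.7, n_phys), n_per)
x1 = rng.normal(-2, 1, n)
x2 = rng.normal(-3, 1, n)
sigma_u = 1.0                             # confounding level, in {0.5, 1, 1.5}
xu = rng.normal(0, sigma_u, n)

# illustrative alpha and beta values (the paper tunes alpha so that exposure
# prevalence falls between 2 and 6%)
a0, az, a1, a2 = -2.0, 1.0, 0.3, 0.3
p_treat = expit(a0 + az * pp + a1 * x1 + a2 * x2 + xu)
t = rng.binomial(1, p_treat)              # treatment T_i ~ Bernoulli(p_i)

b0, bt, b1, b2, bu = -3.0, 0.7, 0.2, 0.2, 1.0
p_event = expit(b0 + bt * t + b1 * x1 + b2 * x2 + bu * xu)
y = rng.binomial(1, p_event)              # outcome y_i
print(t.mean(), y.mean())                 # exposure and event proportions
```

Each Monte-Carlo replication repeats this generation with the same parameter values, after which the proxies Z* and pr are derived from the within-physician treatment history.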

Estimation methods
For the true and conventional models, which do not assume endogeneity, the classical one-step regression method without an instrumental variable is used. The estimations are performed with the existing regression functions (glm and nls) of the stats package of the R statistical software (R Development Core Team 2008).
For the 2SRI method, a two-step regression is used following the procedure outlined in the section dedicated to IV estimation. Recall that the covariance matrices of β̂ from the second-step regression for both methods, as retrieved from the software output, are not valid, since their calculation ignores the fact that estimated parameters from the first-stage regression are included in the second stage. We therefore re-evaluated these covariance matrices using the sequential two-step estimation procedure, taking into account the fact that an estimated variable is used in the second step. The computation of these covariance matrices is given in Appendix A.
For the GMM, the R package gmm proposed by Chaussé [32] is a very helpful tool for computing the parameter estimates and their covariance. The user needs to implement the sample version of the moment condition function of Eq. (5) and, if possible, its gradient; otherwise, a numerical approximation of the gradient is used by the gmm function to perform the estimation.
From these estimations, we calculate the asymptotic covariance matrices and the corresponding confidence intervals whose levels are evaluated below.

Evaluation criteria
To evaluate the performances of the various methods, we consider several criteria: the percentage of relative bias (rB, in %), defined as rB = 100 × (β̄t − βt)/βt, with β̄t the Monte-Carlo mean of the estimates β̂t^(j); the asymptotic standard deviation, estimated by the square root of the Monte-Carlo mean of the variances, σ̂² = (1/ns) Σ_{j=1}^{ns} σ̂j², with σ̂j² the asymptotic variance of β̂t^(j); and the non-coverage probabilities Er_inf and Er_sup, computed from the confidence interval of βt in the j-th Monte-Carlo sample using the asymptotic distribution of β̂t^(j). The nonparametric bootstrap-based estimate of the variance and of the non-coverage probabilities is also investigated in these simulations and the results are analysed below. We complete these criteria with the equivalent of the first-stage F-statistics of linear regression (see [33]) testing instrument exclusion in the treatment choice model. Table 1 summarizes the performances of each method in terms of relative bias (rB), standard deviation (sd), square root of the mean squared error (rMSE) and non-coverage probability (pval = Er_inf + Er_sup). It presents the results related to instrument pr. There were only slight differences between the results with instrument pr and those with Z*, so the results related to instrument Z* are omitted. For some samples, the GMM fails to converge owing to singularity problems in the covariance matrix. The estimations from these samples are simply removed (cases marked '−' in Table 1) for all methods. For the other scenarios, infinite variance estimates or outlier coefficient estimates may be observed; the corresponding samples were dropped for the calculation of the criteria. Table 2 shows the number of samples leading to an outlier estimation (rB > 100% or infinite variance) among the 1000 simulated samples. Figure 1 complements the results in Table 1 by displaying the boxplot distributions of rB. Each series of letters a, b, c and d corresponds to the results for one instrument strength, with each letter corresponding to a method as detailed in the legend of Fig. 1.
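The Monte-Carlo criteria described above can be computed along these lines; the helper below is a Python sketch with hypothetical names, assuming a normal-approximation confidence interval at the nominal 95% level:

```python
import numpy as np

def evaluate(beta_hats, se_hats, beta_true, z=1.96):
    """Monte-Carlo performance criteria for an estimator of beta_t."""
    beta_hats = np.asarray(beta_hats)
    se_hats = np.asarray(se_hats)
    rB = 100 * (beta_hats.mean() - beta_true) / beta_true   # relative bias in %
    sd = beta_hats.std(ddof=1)                              # Monte-Carlo sd
    rMSE = np.sqrt(np.mean((beta_hats - beta_true) ** 2))
    lo, hi = beta_hats - z * se_hats, beta_hats + z * se_hats
    er_inf = np.mean(beta_true < lo)   # interval entirely above the true value
    er_sup = np.mean(beta_true > hi)   # interval entirely below the true value
    return rB, sd, rMSE, er_inf + er_sup                    # pval = Er_inf + Er_sup

# toy check: draws centred on the truth should give small rB and pval near 5%
rng = np.random.default_rng(4)
draws = rng.normal(0.7, 0.1, 1000)
print(evaluate(draws, np.full(1000, 0.1), 0.7))
```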
In Table 1, the sd values refer to the Monte-Carlo-based standard deviation. Except for the GMM estimator, for which outlier values of the asymptotic variance were observed, the Monte-Carlo-based standard deviation, the bootstrap-based estimate (not shown) and the asymptotic estimate were very close.

Results
As expected, the relative bias shows that the estimations from the true and conventional models are insensitive to the instrument strength, but the confounding level affects the estimation in the conventional model: the relative bias increases with the level of confounding. In the presence of a strong instrument, the 2SRI tends to improve the estimate as the level of confounding increases. This trend is reversed when the instrument is weak, i.e. the relative bias and the confounding level vary in the same direction. The percentage of relative bias of the GMM does not seem very sensitive to instrument strength: it changes only slightly as the strength of the instrument grows. However, this bias increases with the magnitude of confounding, which shows the impact of endogeneity on this method (see Fig. 1).
For the standard deviation (sd) and the square root of the mean squared error (rMSE), the asymptotic results (n = 30000) show that both criteria decrease when the level of confounding or the instrument strength grows, this being the case for all methods except the GMM, for which the rMSE decreases very slowly or remains almost constant in some cases. This trend confirms the already observed low sensitivity of the GMM to the strength of the instruments used in this simulation. Even though the 2SRI has a larger sd than the other methods in all scenarios, with an impact on the rMSE in several cases, the rMSE of the 2SRI method improves with high confounding and a strong instrument.
Concerning the non coverage probabilities (pval), the true model estimation and the 2SRI displayed an estimated non-coverage probability around the nominal level of 5% in almost all scenarios. Their values ranged from 4 to 6% and reached 7% in rare cases. The non-coverage probability was very large for the other methods, even with large samples: the results showed an overestimation of the coefficient of treatment for the GMM and conventional approaches. Even at low confounding levels, the conventional method yields very poor coverage probabilities which is coherent with what was observed for the relative bias.
We observe that the instrument-strength metric we use varies in the same direction as the F-statistic equivalents (Tables 3 and 4 of Appendix C). Finally, Table 5 in Appendix D summarizes the performances of each method for more balanced exposure frequencies and a large sample size (n = 30000). In general, performances are poorer than in the low-exposure-frequency situation. One may also note that numerical problems arise more often (see Table 6 of Appendix D). Nevertheless, the 2SRI is better in terms of relative bias and non-coverage probability than the conventional method and the GMM.
Overall, the results are satisfactory for the 2SRI approach which achieves a similar level of performance to the true model regarding estimation of the confidence interval in the imbalanced situation. We close this section by pointing out the strong numerical instability observed when computing the GMM estimator during these simulations. This could explain the modest performance displayed by the GMM in the simulation results.

Discussion
In this paper, we focus on the effectiveness of regression coefficient estimation in a context of endogeneity, particularly endogeneity due to unobserved confounding. We are interested in the coefficient of an endogenous treatment, which is the basis of risk assessment in pharmacoepidemiology. Linear models are often used in this context to model the probability of dichotomous events (see [7]). Through a simulation study, we investigate the behavior of parameter estimation in nonlinear models, specifically logistic regression, using some IV-based methods that could potentially be used to overcome the endogeneity issue. The simulation study also made it possible to assess two preference-based instrumental variables in pharmacoepidemiology. The results of the simulation study show that the 2SRI using nonlinear regression at each stage is an interesting alternative for estimating the coefficient of the endogenous treatment in a logistic regression model. It is very simple to implement and yields satisfactory results regarding the bias and the confidence interval estimate. It was found to yield the most accurate estimate of the non-coverage probabilities, and thus the most accurate confidence intervals, among the IV-based methods compared. However, the conventional approach behaved better than the 2SRI in some cases, especially when the confounding level was weak. We believe that in these cases the level of confounding is not high enough to require the use of an instrument in the estimation. However, to our knowledge there is still no way to assess the level of unmeasured confounding. For the GMM, the estimation procedure was remarkably unstable. That instability may be attributable to the dichotomous nature of the variables (outcome and exposure) in the context of pharmacoepidemiology with the preference-based instrument.
The GMM approach is therefore not to be recommended in this context unless another instrument has been shown to behave satisfactorily with this method. Concerning the instruments investigated in this study, the proportion of all previous patients of the same physician who were prescribed the treatment of interest proved to be a good proxy of the physician's preference instrument. This instrument was previously considered by Abrahamowicz and colleagues [22] for detecting a possible change point in the physician's preference, and their results also seemed satisfactory. This proxy of the physician's preference instrument is thus a credible alternative to the other well-known proxy based only on the treatment prescribed to the single previous patient of the same physician.
Even though this work throws light on the performances of IV estimators in the context of a nonlinear model with endogeneity, more work is needed to explore the behavior of these estimators in other contexts when the prevalence of exposure and/or outcome varies.

Conclusions
In observational studies, when assessing the effect of drug exposure on a dichotomous outcome, investigators could use appropriate 2SRI estimation to account for unmeasured confounding. This work showed that two logistic regressions, together with a physician's preference proxy for the IV, yield satisfactory results.

Appendix A: Asymptotic variance of the nonlinear 2SRI
Using the sequential two-step estimation procedure, the nonlinear 2SRI estimator β̂ minimizes the least squares criterion, where α̂ denotes the nonlinear least squares estimator of α from the first stage. If we set η̂_iβ = X_iα̂ β, the first-order condition in β is given by equation (9). Given that α̂ is a consistent estimator of α_0, the Taylor–Lagrange expansion of (9) around the true value (α_0, β_0) holds for some (α_c, β_c) between (α̂, β̂) and (α_0, β_0). Under the assumption E[∂Q/∂β |_{β_0}] = 0, the terms involving the residuals y_i − F_α(η̂_iβ) in this expansion, except the first, all tend to zero in probability, and we obtain an asymptotic expression for √n(β̂ − β_0). The latter is derived from the asymptotic approximation of √n(α̂ − α_0), using a similar Taylor–Lagrange expansion for the first-stage nonlinear least squares regression.
The previous expansion leads to a covariance matrix of β̂ of the form given in equation (10). Under the assumption of independence between observations, the matrices involved in this covariance matrix are expressed with p lim denoting the limit in probability, U_α = T − P_α and U_β = y − P_β, where P_α = r(wα) and P_β = F_α(Xβ). An estimate of this covariance matrix can be obtained by plugging in an estimator of each matrix involved in its expression. Let P̂_α = r(wα̂), P̂_β = F_α̂(Xβ̂), Û_α = T − P̂_α and Û_β = y − P̂_β; the corresponding estimators of the matrices in equation (10) then follow.
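The two-stage procedure itself can be sketched numerically. The following is a minimal illustration in Python (the paper's own code is in R); it uses two logistic regressions, as in the simulation design, with a hand-rolled Newton–Raphson fit so the example stays self-contained. The function names and the use of the raw first-stage residual are illustrative choices, not the authors' exact implementation.

```python
import numpy as np

def fit_logit(X, y, n_iter=25):
    """Logistic regression MLE via Newton-Raphson; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        # Newton step: beta += (X' W X)^{-1} X'(y - p)
        beta = beta + np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - p))
    return beta

def two_stage_residual_inclusion(y, t, z, X):
    """2SRI with two logistic regressions: stage 1 fits the treatment choice
    model T ~ Z + X; stage 2 fits the outcome model Y ~ T + X augmented with
    the stage-1 raw residual (the control function)."""
    n = len(y)
    ones = np.ones((n, 1))
    W1 = np.hstack([ones, z.reshape(n, 1), X])      # first-stage design matrix
    alpha = fit_logit(W1, t)
    resid = t - 1.0 / (1.0 + np.exp(-W1 @ alpha))   # raw residual T - F(W alpha)
    W2 = np.hstack([ones, t.reshape(n, 1), X, resid.reshape(n, 1)])
    return fit_logit(W2, y)  # last coefficient is the residual-inclusion term
```

The second-stage coefficient on T is the 2SRI estimate of the treatment effect; its naive standard error ignores the first-stage estimation, which is why the corrected asymptotic variance above (or a bootstrap) is needed in practice.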

Appendix B: Instrument strength and confounding level
We give below the computations of the instrument's strength and of the confounding level.

• Instrument strength
The strength of an instrument results from the correlation between it and the corresponding endogenous variable. In the model considered here, the strength of an instrument Z is given by Corr(T, Z). As the treatment has a causal link with the covariables Z, X_1, X_2 and X_u, we have E(T | Z, X_1, X_2, X_u) = F(α_0 + Zα_z + X_1α_1 + X_2α_2 + X_u). If we consider only the explanatory effect of the instrument on the treatment T and replace the other covariables and the confounder by their average effect, we have E(T | Z = z) = 1/(1 + A_z), where A_z = exp[−(α_0 + zα_z + E(X_1)α_1 + E(X_2)α_2 + E(X_u))]. Since T is dichotomous, its variance may then be written Var(T) = E(T)[1 − E(T)]. Considering a dichotomous instrument Z having the Bernoulli distribution B(p), we finally obtain E(T) = p/(1 + A_1) + (1 − p)/(1 + A_0). We also have E(Z) = p and Var(Z) = p(1 − p), and then E(Z)E(T) = p²/(1 + A_1) + p(1 − p)/(1 + A_0). Furthermore, E(ZT) = E_Z[E(ZT | Z, X_1, X_2, X_u)] = E_Z[Z E(T | Z, X_1, X_2, X_u)], which leads to E(ZT) = p/(1 + A_1). The covariance between Z and T is then given by Cov(Z, T) = E(ZT) − E(Z)E(T) = p(1 − p)[1/(1 + A_1) − 1/(1 + A_0)]. The correlation between Z and T is Corr(T, Z) = Cov(Z, T)/√[Var(Z)Var(T)]. Then, for a given instrument Z with p fixed, α_0, α_z, α_1 and α_2 can be chosen to reach a value of A_z that leads to a desired value of Corr(T, Z).
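The closed-form expression above can be checked by Monte Carlo simulation. Below is a sketch in Python (the paper uses R); the default means m1 = −2, m2 = 0 and mu = 0 match the covariable distributions of the simulation design and are otherwise illustrative.

```python
import numpy as np

def corr_TZ(p, a0, az, a1, a2, m1=-2.0, m2=0.0, mu=0.0):
    """Closed-form Corr(T, Z) when X1, X2 and X_u are replaced by their means
    (m1, m2, mu): A_z = exp(-(a0 + z*az + m1*a1 + m2*a2 + mu)) for z = 0, 1."""
    A1 = np.exp(-(a0 + az + m1 * a1 + m2 * a2 + mu))
    A0 = np.exp(-(a0 + m1 * a1 + m2 * a2 + mu))
    ET = p / (1.0 + A1) + (1.0 - p) / (1.0 + A0)                  # E(T)
    cov = p * (1.0 - p) * (1.0 / (1.0 + A1) - 1.0 / (1.0 + A0))   # Cov(Z, T)
    return cov / np.sqrt(p * (1.0 - p) * ET * (1.0 - ET))
```

Simulating Z ~ B(p) and T ~ B(F(α_0 + Zα_z + m1·α_1 + m2·α_2 + mu)) and computing the empirical correlation reproduces this value, which is how the α parameters can be tuned to target a desired instrument strength.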
Another criterion that could be used to quantify an instrument's strength is the correlation between T* = α_0 + Zα_z + X_1α_1 + X_2α_2 + X_u and the instrument Z. Since Cov(Z, T*) = α_z Var(Z) = α_z p(1 − p), under the assumptions on these variables we have Corr(T*, Z) = α_z √[p(1 − p)] / √[α_z² p(1 − p) + α_1²σ_1² + α_2²σ_2² + σ_u²], where σ_i² = Var(X_i) and σ_u² = Var(X_u).

• Confounding level
A straightforward calculation as above, using Cov(X_u, T*) = σ_u², leads to the following correlation between T* and X_u, which expresses the level of confounding: Corr(T*, X_u) = σ_u / √[α_z² p(1 − p) + α_1²σ_1² + α_2²σ_2² + σ_u²].
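Both index-based criteria follow from the same variance of T* and can be verified against simulation. A minimal sketch in Python (function name illustrative), assuming Z, X_1, X_2 and X_u are mutually independent:

```python
import numpy as np

def index_correlations(p, az, a1, a2, s1=1.0, s2=1.0, su=1.0):
    """Corr(T*, Z) and Corr(T*, X_u) for the linear index
    T* = a0 + Z*az + X1*a1 + X2*a2 + Xu, with Z ~ B(p) independent of
    X1, X2, Xu, whose standard deviations are s1, s2 and su."""
    var_T = az ** 2 * p * (1.0 - p) + a1 ** 2 * s1 ** 2 + a2 ** 2 * s2 ** 2 + su ** 2
    corr_z = az * np.sqrt(p * (1.0 - p)) / np.sqrt(var_T)  # instrument strength
    corr_u = su / np.sqrt(var_T)                           # confounding level
    return corr_z, corr_u
```

Because both correlations share the denominator √Var(T*), increasing α_z strengthens the instrument while mechanically lowering the apparent confounding level, which is why the simulation design varies α_z and σ_u jointly.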

Appendix C: F-statistics for each scenario
The following tables display the equivalent of the first-stage F-statistics in linear regression (see [33]) testing instrument exclusion in the treatment choice model.

Appendix D: Description of simulation model and parameters values, results of the second scenario
For all scenarios, the model generating the binary outcome is the index function Y_i = 1{Y*_i + ε_i > 0}, with Y*_i = β_0 + T_iβ_t + X_1iβ_1 + X_2iβ_2 + X_uiβ_u and ε_i following the standard logistic distribution. T_i is the observed binary treatment of individual i, X_1i and X_2i are characteristics of patient i, and X_ui is the unmeasured confounding factor. Besides the intercept β_0, the parameters β_t, β_1, β_2 and β_u relate to T, X_1, X_2 and X_u respectively. We fixed β_0 = −0.6, β_t = 3, β_1 = 1, β_2 = 1 and β_u = 1 to keep the prevalence of the event below 5%. The treatment choice for the i-th patient was generated from a Bernoulli model with success probability p_i, which depends on the patient's characteristics X_1 ∼ N(−2, 1) and X_2, and on the confounding factor X_u ∼ N(0, σ_u). The probability p_i is given by p_i = F(α_0 + Z_iα_z + X_1iα_1 + X_2iα_2 + X_ui), where F denotes the logistic distribution function. The standard deviation σ_u of the confounding factor takes the values 0.5, 1 and 1.5, corresponding to low, medium and high levels of confounding respectively. For a fixed level of confounding, we varied only the value of α_z over {1, 2, 3}, each element corresponding to an instrument strength: 1 for a "low", 2 for a "moderate" and 3 for a "high" instrument. All other parameters of the treatment choice model remain constant (α_0 = 0.2, α_1 = 2 and α_2 = 1.2). The data are generated using the R function sim2Logit2(). We investigated the performance of the methods in the context of rare exposures (2 to 6%) and rare events (less than 5%). To check whether the results obtained remain valid in the context of higher exposure (near 50%), we designed new simulations in which only the intercepts α_0 and β_0 are modified from the previous design; we held all other parameters constant and fixed α_0 = 5 and 3.

Table 3 Monte-Carlo mean of F-statistics in each scenario using the proportion of patients who received the same treatment as proxy of instrument

The prevalence of exposure then ranged between 26% and 45%, whereas that of the event remained below 6%. Table 5 presents the results of this second scenario over 500 Monte Carlo samples of size 30,000.
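The data-generating design described above can be sketched as follows. This is a Python analogue of the paper's R function sim2Logit2(), written under stated assumptions: the instrument Z is drawn here as a simple Bernoulli(0.5) variable for illustration (the paper uses prescription preference-based instruments), and X_2 is assumed N(0, 1) since its distribution is not restated in this appendix.

```python
import numpy as np

def simulate_scenario(n, alpha_z, sigma_u, alpha0=0.2, rng=None):
    """Generate one sample from the Appendix D design.
    Outcome: Y = 1{Y* + eps > 0}, eps ~ standard logistic;
    Treatment: T ~ Bernoulli(F(alpha0 + Z*alpha_z + 2*X1 + 1.2*X2 + Xu))."""
    if rng is None:
        rng = np.random.default_rng()
    F = lambda x: 1.0 / (1.0 + np.exp(-x))
    x1 = rng.normal(-2.0, 1.0, n)        # patient characteristic X1 ~ N(-2, 1)
    x2 = rng.normal(0.0, 1.0, n)         # X2: distribution assumed N(0, 1) here
    xu = rng.normal(0.0, sigma_u, n)     # unmeasured confounder X_u ~ N(0, sigma_u)
    z = rng.binomial(1, 0.5, n)          # illustrative dichotomous instrument
    t = rng.binomial(1, F(alpha0 + alpha_z * z + 2.0 * x1 + 1.2 * x2 + xu))
    ystar = -0.6 + 3.0 * t + x1 + x2 + xu  # beta0=-0.6, beta_t=3, beta1=beta2=beta_u=1
    y = (ystar + rng.logistic(0.0, 1.0, n) > 0).astype(int)
    return y, t, z, x1, x2, xu
```

Varying alpha_z over {1, 2, 3} and sigma_u over {0.5, 1, 1.5}, and replaying such samples in a Monte Carlo loop, reproduces the grid of scenarios of the simulation study.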