Semiparametric estimation of the attributable fraction when there are interactions under monotonicity constraints

Background The population attributable fraction (PAF) is the fraction of disease cases in a sample that can be attributed to an exposure. Estimating the PAF often involves the estimation of the probability of having the disease given the exposure while adjusting for confounders. In many settings, the exposure can interact with confounders. Additionally, the exposure may have a monotone effect on the probability of having the disease, and this effect is not necessarily linear. Methods We develop a semiparametric approach for estimating the probability of having the disease and, consequently, for estimating the PAF, controlling for the interaction between the exposure and a confounder. We use a tensor product of univariate B-splines to model the interaction under the monotonicity constraint. The model fitting procedure is formulated as a quadratic programming problem, and, thus, can be easily solved using standard optimization packages. We conduct simulations to compare the performance of the developed approach with the conventional B-splines approach without the monotonicity constraint, and with the logistic regression approach. To illustrate our method, we estimate the PAF of hopelessness and depression for suicidal ideation among elderly depressed patients. Results The proposed estimator exhibited better performance than the other two approaches in the simulation settings we tried. The estimated PAF attributable to hopelessness is 67.99% with 95% confidence interval: 42.10% to 97.42%, and is 22.36% with 95% confidence interval: 12.77% to 56.49% due to depression. Conclusions The developed approach is easy to implement and supports flexible modeling of possible non-linear relationships between a disease and an exposure of interest.

value indicates a stronger association between disease risk and the exposure level. In the extreme, a PAF equal to 1 implies no disease risk when there is no exposure. If the disease risk is higher in the absence of the exposure than in the presence of the exposure, the PAF then has no proper meaning. Thus, the definition of the PAF itself also suggests a monotone relationship between the disease risk and the exposure level (a justification is presented at the end of the next section). Incorporating the monotonicity assumption into the estimation of the PAF provides performance gains when there are no other covariates [6].
In many research fields, the exposure is thought to have a monotone effect on the probability of having the disease, i.e., the probability is a monotone function of the exposure. For instance, a dose-response curve is often thought to be non-decreasing [7,8]. One such example is the probability of suicidal ideation, which is thought to be an increasing function of both hopelessness and depression [9][10][11][12]. Hence, it is desirable to model the probability of suicidal ideation under the monotonicity constraint of both hopelessness and depression.
This situation also presents an analytic challenge. When there are other covariates, they can interact with the exposure, for instance, the interaction of two drugs [13]. In our example, hopelessness is a system of negative expectations concerning one's future life and can cause depression. On the other hand, current depression can influence one's hopelessness towards the future [14,15]. Thus, an interaction between hopelessness and depression can present to have a joint effect on suicidal ideation.
These examples highlight, how in many studies, the effect of the exposure can be complicated and is not necessarily linear, including the common analysis of drug interactions [16]. We bring together several past innovations to propose a novel analytic solution. [17] study the PAF when there are joint effects or interactions. [18] develop an approach of estimating logistic regression models with interactions and monotonicity constraints. The authors [18] apply the approach to estimating the PAF and achieve substantially more accurate estimates in some settings than the usual approach which uses logistic regression without monotonicity constraints.
Semiparametric approaches have been applied to estimate relative risk functions [19], to calculate odds ratios [20], and to estimate effect measures in the presence of interactions [21]. Herein, we use B-splines [22] (see also [23,24] and references therein), to develop a semiparametric approach to estimate the PAF in the presence of interactions with confounding. The model fitting procedure is formulated as a well studied quadratic programming problem, and, thus, can be easily solved using standard optimization packages. We implement the approach using the R function solve.QP in the package quadprog [25,26]. After a simulation study, we illustrate our new method by examining hopelessness and depression for suicidal ideation among elderly depressed patients from the PROSPECT (Prevention of Suicide in Primary Care Elderly: Collaborative Trial) study [27]. Specifically, we model the interaction between hopelessness and depression under the monotonicity constraint.

Methods
We develop the semiparametric approach to estimate PAF accounting for interactions under the monotonicity constraint using B-splines in the last two subsections. We compare the performance of the following three approaches: the approach we developed (monB), the conventional B-splines (conB) approach without the monotonicity constraint but with the same knots, and the logistic regression approach (logit). Comparisons are made through simulation studies and a case study of estimating the PAF for suicidal ideation attributable to hopelessness or depression. To save computation time, we use a small number of basis functions, quadratic B-splines with knots placed at quantiles of the distribution of unique predictor values [28, P. 24]. We use quartiles {0, 1/4, 1/2, 3/4 and 1} as the knot locations.

Simulations
Let Y denote presence (1) or absence (0) of the outcome, Z the exposure, X a confounder, and V an additional covariate. The two interacting covariates Z and X are simulated independently from the Uniform [ 0, 1] distribution where a 0 value is always part of the simulated z. The additional covariate V is simulated from a Bernoulli distribution with p = 1/2. The outcome is simulated from a Bernoulli distribution with the following probability functions, A: 0.1+ 0.4z+0.3x−0.2xz+0.2v; B: 2(0.1+2 log(z/2+1))x/3+0.3v; and C: 0.8 √ z + 0.1x 2 + 0.1v. Shapes of the models at a fixed v are provided in Figure S1 of the supplementary material.
We examined 1000 simulations for each sample size 100 and 200. We compared the absolute value of the bias, the variance, and the mean squared error (MSE) of estimating the PAF attributable to the exposure Z. The true PAF values from these models are respectively 0.3, 0.4404, and 0.4616. They are calculated from (4) where the integrals are computed using R functions integral and integral2 [29].

Illustrative case study
In the PROSPECT study, we focus on suicidal ideation four months after the beginning of the study. We consider the 592 patients in the study with no missing data. An event was observed if the score for suicidal ideation is greater than zero [27]. Following [30], we use the Beck Hopelessness Scale [31,BHS] to measure hopeless, and the Beck Depression Inventory score [12,BDI] to measure depression. The BHS and the BDI range from 0 to 19 and from 0 to 17, respectively, where a higher value means more hopelessness or depression. Figure 1 shows the sample average of suicidal ideation by BDI and BHS scores, while Fig. 2 shows the corresponding patient frequency. Many patients with low BDI and BHS scores experienced no suicidal ideation. Overall, the risk of suicidal ideation increases with BDI and BHS scores, though the patient frequency decreases. The figures also show that high BHS scores are associated with high BDI scores. The PAF for hopelessness is the proportion of suicide ideation that would be prevented if all patients' hopelessness was reduced to 0 on the BHS scale, while keeping BDI fixed. Similarly, the PAF for depression is the proportion of suicide ideation that would be prevented if all patients' depression was reduced to 0 on the BDI scale, while BHS fixed.

Semiparametric estimation of the probability
Suppose Y is distributed in the exponential family with mean μ [32]. The logit link function connects μ with the exposure and the confounder through a smooth function f [18] models f (z, x) parametrically assuming linearity as f (z, x) = β 0 +β 1 z+β 2 x+β 3 x z and estimates the coefficients under the constraint that β 1 z+β 3 x z ≤ β 1 z * +β 3 x z * when z ≤ z * at every x. We use a flexible semiparametric approach to model a possible non-linear f (z, x) by linear combinations of B-splines basis functions [22] under the constraint f (z, Let ψ(z) = (ψ 1 (z), . . . , ψ P (z)) be a set of B-spline basis functions . Let φ(x) = φ 1 (x), . . . , φ Q (x) be another set of B-spline basis functions. We model where B is the unknown coefficient matrix. Using Kronecker product ⊗ and the vectorization operator vec described in Section S1 of the supplementary material, we further obtain The maximum likelihood estimate of β without the constraint can be viewed as a modified iteratively weighted least squares problem [33,34]. Derivation in Section S2 of the supplementary material shows that the constraint f (z, x) ≤ f (z * , x) when z ≤ z * at every x can be expressed as Aβ ≥ 0, where the matrix A has a special pattern. The derivation in Section S3 of the supplementary material shows that the estimation procedure can be expressed as a quadratic programming problem. The approach is also capable of finding the estimate under the additional constraint that f (z, x) is monotone in x.

Estimation of the PAF
To be general, let X be a vector of measured covariates. Let Y 0 denote what the presence or absence would be if the exposure were to be eliminated. Let P be the probability measure. In particular, let P(Y 0 = 1) be the "hypothetical probability of disease in the same population but with all exposure eliminated" [35,36]. The PAF for the exposure Z is the proportion of disease that would be eliminated if Z = 0, To identify the PAF based on observed data, we make the assumption that all confounders of the diseaseexposure relationship are measured and contained in X. We assume consequently that the ignorable treatment assignment [3] holds: . Under this assumption, the PAF can be written as which is equal to the expression (2.2) for the PAF in [36].
The PAF is also written as where P(·|Y = 1) is the conditional probability given Y = 1 in the subpopulation of people with the disease [1,5,37]. We provide a proof of the equivalence between (4) and (5) in Section S4 of the supplementary material. From a random sample of the population of size n, i = 1, . . . , n, an estimate of the PAF then follows aŝ where I {Y i =1} is an indicator function [18]. Justification of the convergence of the estimate (6) to the PAF in (5) is provided by [6] in the scenario of no other covariates. A similar proof adjusting for the X can also be derived. Thus, an accurate estimate of the probability of the disease would provide a reasonable estimate of the PAF. The defined PAF in (3) as a proportion has no practical meaning if can result in an estimated PAF out of the [ 0, 1] range. Hence, an estimate of the probability with the monotonicity constraint ensures a reasonable estimate of the PAF. Table 1 summarizes the proportion of the times that the estimated PAF was within the [ 0, 1] range for the three approaches. As discussed, the monotonicity constraint ensures the estimated PAF is within [ 0, 1]. The lack of a constraint results in PAF estimates that are sometimes out of the range using the logistic regression and the conventional B-splines approaches. Table 2 shows the performance of the three approaches of estimating the PAF. In the comparison, results from the conventional B-splines approach are not shown due to its large scale. Instead, we show the results from this approach by censoring the original estimate at 0 if it is negative, or at 1 if it is bigger than 1. Similarly, we obtain the censored estimate from the logistic regression approach. In general, the censored estimate improves the original estimate. Overall, the developed approach outperformed the other approaches in the settings that we examined.

Case study
The estimated probability of suicidal ideation is shown in Fig. 3 separately by the three approaches. Both the logistic regression and the developed approach demonstrate a monotone relationship. Fitted probabilities numerically 0 or 1 occurred using the conventional B-splines approach. Table 3 summarizes the estimated PAF by the three approaches. The logistic regression produces lower PAF attributable to BHS and higher PAF attributable to BDI. Due to the numerical instability, the conventional B-splines estimated PAF is out of the [ 0, 1] range. We examine the numerical instability in the "Discussion" section.
We use the bootstrap method to obtain the 95% confidence intervals of the estimates. The lower bound is the 2.5% quantile of the 1000 bootstrap estimates and the upper bound is the 97.5% quantile. The lower bound of the logistic regression estimated PAF attributable to BDI is negative. In fact, 28 of the estimated PAF's are negative with the minimum being −32.45%. Among the 1000 estimates, two of the estimated PAF attributable to BHS are negative with the minimum being −4.27%.

Discussion
We used B-splines to develop a semiparametric estimate of the probability of an event under the monotonicity constraint and accounting for interactions. The approach is solved as a quadratic programming problem and can be easily implemented using the statistical software R. Using a boosting technique [38,39] implement a similar approach to estimate β under the same constraint Aβ ≥ 0 to solve the problem defined in equation (3) of the supplementary material. In the settings that have been tried, a comparison of fitting a univariate generalized linear model under the monotonicity constraint showed that the boosting algorithm is computationally intensive [23]. A summary of other approaches in the literature for estimation when there are monotonicity constraints and interactions is provided in Section S8 of the supplementary material.
Throughout our study, we placed the knots at the quartiles of the distribution of the unique predictors. We observed similar performance of the conventional Bsplines approach and the developed approach when the knots are placed at the tertiles {0, 1/3, 2/3, 1}. In the Table 2 Comparison of the absolute value of the bias (|Bias|), the variance and the MSE of estimating the PAF among the logistic regression approach (logit), the conventional B-splines approach (conB), and the developed approach (monB)  (Table  S3). Similar numerical instability was observed using the conventional B-splines approach ( Figure S2 and Table  S3). The estimated PAF is also out of the defined range between 0 and 1 using the conventional approach when knots are placed at the quantiles {0, 1/2, 1} (Table S3). Performance of the developed approach is robust to the knots placement.

Conclusions
We developed a semiparametric estimate of the probability of an event using B-splines. Our approach can model a monotone relationship between the response and covariates, and can account for interactions. We applied the approach to estimate the PAF, and compared the performance of the estimator with the logistic regression approach and the conventional approach without the monotonicity constraint. Simulation studies showed that the developed estimator outperforms the other two approaches.