We first present the lasso-based detection approaches. We then describe the detection approaches based on the propensity score in high dimension. Finally, we review implementations of the adaptive lasso proposed in the literature and present our own proposals based on it.
The logistic lasso
Let N denote the number of spontaneous reports (i.e. the number of observations) and P the total number of drug covariates. Let X denote the N×P binary matrix of drug exposures and let xi be a 1×P vector of covariates for the ith observation. Let y be the N-vector of binary responses that indicates the presence or absence of the AE of interest. For i∈{1,...,N}, the corresponding multiple logistic model is
$$ \text{logit}(\text{Pr}(y_{i} = 1 \mid \mathbf{x}_{i})) = \beta_{0} + \sum_{p=1}^{P} \beta_{p} \, x_{ip}, $$
(1)
where β0 is the intercept and β is the P-vector of regression coefficients associated with the drug covariates. Although we are not in the P≫N setting, P is typically very large, which can cause numerical problems for classical (unpenalized) regression. The penalized logistic lasso consists in estimating:
$$ \left(\widehat{\beta}_{0,\lambda}, \widehat{\boldsymbol{\beta}}_{\lambda}\right) = \underset{(\beta_{0}, \boldsymbol{\beta})}{\text{argmax}} \left\{ l\left((\beta_{0}, \boldsymbol{\beta}), \mathbf{y}, \mathbf{X}\right) - \text{pen}(\lambda) \right\}, $$
where l is the log-likelihood of model (1), λ is the regularization parameter and pen(λ) is defined as
$$ \text{pen}(\lambda) = \lambda \, \| \boldsymbol{\beta} \|_{1} = \lambda \sum_{p=1}^{P} | \beta_{p} |. $$
(2)
Thanks to the L1 penalty in (2), some coefficients of \(\widehat{\boldsymbol{\beta}}_{\lambda}\) are shrunk to exactly zero, so the covariates associated with these coefficients are not retained in the model. By controlling the amount of penalization, the λ parameter in the lasso regression is closely related to the number of non-zero estimated coefficients. Since the aim in pharmacovigilance is to detect deleterious associations with the outcome, we are only interested in covariates with a positive associated penalized coefficient in \(\widehat{\boldsymbol{\beta}}_{\lambda}\).
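To make the estimation concrete, the following is a minimal sketch of a penalized logistic lasso fit in Python with scikit-learn, retaining only the covariates with a positive penalized coefficient. The simulated X and y and the chosen λ are purely illustrative, and the mapping C = 1/(Nλ) translates the penalty of (2) into scikit-learn's parametrization; this is a sketch, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N, P = 1000, 50
X = rng.binomial(1, 0.05, size=(N, P))   # sparse binary drug exposure matrix (illustrative)
y = rng.binomial(1, 0.02, size=N)        # rare adverse event indicator (illustrative)

lam = 0.01                               # illustrative regularization parameter
# scikit-learn's L1-penalized logistic objective corresponds to C = 1 / (N * lambda)
fit = LogisticRegression(penalty="l1", solver="liblinear", C=1.0 / (N * lam))
fit.fit(X, y)

# only covariates with a positive penalized coefficient are candidate signals
signals = np.flatnonzero(fit.coef_.ravel() > 0)
```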
Penalization parameter selection
We considered three strategies for selecting the penalization parameter: cross-validation, BIC and permutations. One round of cross-validation involves partitioning the dataset into nf subsets, called folds: nf−1 folds are used as the training set, on which the model is estimated, and the remaining fold is used as a validation set, on which a prediction performance metric (e.g. area under the ROC curve or deviance) is computed. This procedure is repeated so that each fold is used exactly once as a validation set, and the mean and standard deviation of the performance metric are calculated over the nf resulting values. In the lasso regression context, cross-validation is performed for each tested λ value, and the selected λ is the one achieving the best average value of the chosen performance metric. In this work, we used the deviance and set nf to 5.
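A minimal sketch of this cross-validation strategy, reusing the illustrative X and y from the previous sketch; the λ grid is arbitrary and the deviance is approximated by scikit-learn's neg_log_loss scorer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

lambdas = np.logspace(-4, -1, 30)        # illustrative lambda grid
cv_fit = LogisticRegressionCV(
    Cs=1.0 / (X.shape[0] * lambdas),     # map each lambda to scikit-learn's C
    cv=5,                                # nf = 5 folds
    penalty="l1",
    solver="liblinear",
    scoring="neg_log_loss",              # deviance-based performance metric
)
cv_fit.fit(X, y)
lambda_cv = 1.0 / (X.shape[0] * cv_fit.C_[0])   # selected penalization parameter
```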
An alternative strategy for selecting the penalization parameter is to rely on model selection criteria such as the BIC [15, 27]. For each tested λ, we implemented the BIC as follows:
$$ \text{BIC}_{\lambda} = - 2 \, l_{\lambda} + \text{df}(\lambda) \, \ln(N), $$
(3)
where lλ is the log-likelihood of the classical multiple logistic regression model that includes the set of covariates with a non-zero coefficient in \(\widehat{\boldsymbol{\beta}}_{\lambda}\), and \(\text{df}(\lambda) = \#\{p : \widehat{\beta}_{\lambda,p} \neq 0\}\) is the number of such covariates. If different λs lead to the same subset of retained covariates, the non-penalized models resulting from these λs are identical, as is the BIC. Consequently, this approach selects the subset of covariates leading to the classical model that minimizes the BIC defined in (3), rather than selecting a particular λ.
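A minimal sketch of this BIC strategy under the same illustrative data: for each λ on a grid, the active set is extracted, an unpenalized logistic model is refitted on it, and the BIC of (3) is computed; duplicate subsets are skipped since they yield the same BIC.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def bic_for_subset(X, y, active):
    """Refit the classical (unpenalized) logistic model on the selected covariates."""
    refit = LogisticRegression(penalty=None)   # penalty=None needs scikit-learn >= 1.2
    refit.fit(X[:, active], y)
    ll = -log_loss(y, refit.predict_proba(X[:, active])[:, 1], normalize=False)
    return -2.0 * ll + len(active) * np.log(len(y))   # BIC of (3)

best_bic, best_subset = np.inf, None
seen = set()
for lam in np.logspace(-4, -1, 30):
    coef = LogisticRegression(penalty="l1", solver="liblinear",
                              C=1.0 / (len(y) * lam)).fit(X, y).coef_.ravel()
    active = tuple(np.flatnonzero(coef))
    if not active or active in seen:     # identical subsets give identical BICs
        continue
    seen.add(active)
    bic = bic_for_subset(X, y, list(active))
    if bic < best_bic:
        best_bic, best_subset = bic, active
```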
An approach based on permutations for selecting the penalization parameter in lasso regression was proposed by Sabourin et al. [27], following a suggestion of Ayers and Cordell [26]. Let π denote a permutation of {1,...,N} and let \(\mathbf{y}_{\pi_{l}} = (y_{\pi(1)},..., y_{\pi(N)})\), with 1≤l≤K, be a permuted version of the outcome y. A lasso regression is performed for each of these permutations by regressing \(\mathbf{y}_{\pi_{l}}\) on the original data set X. One then obtains \(\lambda_{max}(\mathbf{y}_{\pi_{l}})\), the smallest value of the penalty parameter such that no covariate is selected in the lasso regression on \(\mathbf{y}_{\pi_{l}}\). As in Sabourin et al. [27], we used the median value of \(\left(\lambda_{max}(\mathbf{y}_{\pi_{1}}),..., \lambda_{max}(\mathbf{y}_{\pi_{K}})\right)\) in a lasso regression performed with the original outcome y. In this work, we set K=20.
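A minimal sketch of this permutation strategy under the same illustrative data; it relies on scikit-learn's l1_min_c, which returns the smallest C for which the L1-penalized model can be non-empty, i.e. the counterpart of λmax in scikit-learn's parametrization.

```python
import numpy as np
from sklearn.svm import l1_min_c
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
K, N = 20, len(y)
# lambda_max of a permuted outcome: smallest lambda for which no covariate enters
lam_max = [1.0 / (N * l1_min_c(X, rng.permutation(y), loss="log"))
           for _ in range(K)]
lam_perm = np.median(lam_max)            # median over the K permutations

fit_perm = LogisticRegression(penalty="l1", solver="liblinear",
                              C=1.0 / (N * lam_perm)).fit(X, y)
```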
In the following, we refer to the approach involving cross-validation, BIC or permutation to choose the penalty parameter as lasso-cv, lasso-bic and lasso-perm, respectively.
Class-imbalanced subsampling lasso
To circumvent the penalization parameter selection issue in lasso regression, Meinshausen and Bühlmann proposed the stability selection algorithm [21]. Briefly, it consists of perturbing the data by repeatedly subsampling without replacement, fitting a lasso regression on each subsample, and selecting the covariates that occur in a large fraction of the active sets induced by the lasso regularization path. Ahmed et al. proposed a variation of this method, the CISL algorithm [15], to account for the large imbalance of the outcome that occurs in pharmacovigilance databases. In CISL, subsamples are drawn with replacement following a nonequiprobable sampling scheme that better represents the individuals who experienced the outcome of interest. A lasso regression is performed on each of these samples and the following quantity is computed:
$$ \widehat{\pi}^{b}_{p}= \frac{1}{E} \sum_{\eta = 1}^{E} \mathbbm{1} \left[ \widehat{\beta}_{p}^{\eta, b} > 0 \right], $$
(4)
where E is the maximum number of covariates selected across all the lasso regressions, η∈{1,...,E} is the number of covariates selected and \(\widehat{\beta}^{\eta, b}_{p}\) is the regression coefficient estimated by the logistic lasso for drug p on sample b∈{1,...,B}, for a model including η covariates. Thus, for each drug, an empirical distribution of \(\widehat{\pi}^{b}_{p}\) is obtained over the B samples. The drug covariate is then selected if a given quantile of the distribution of \(\widehat{\pi}^{b}_{p}\) is non-zero. In this work, we considered the covariate sets established with the 10% quantiles of these distributions, following Ahmed et al.'s recommendation.
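A minimal sketch of the CISL quantity (4) under simplifying assumptions: the regularization path is approximated by a decreasing λ grid, E is fixed in advance rather than derived from the fitted paths, and the case-favouring sampling probabilities and subsample size are illustrative choices, not the original scheme.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

B, E = 50, 10                            # number of subsamples and of path models
N, P = X.shape
# draw cases with higher probability than non-cases (weights are illustrative)
p_draw = np.where(y == 1, 4.0, 1.0)
p_draw /= p_draw.sum()
lambdas = np.logspace(-1, -4, 40)        # decreasing: walks down the lasso path

pi_hat = np.zeros((B, P))
for b in range(B):
    rng = np.random.default_rng(b)
    idx = rng.choice(N, size=N // 2, replace=True, p=p_draw)
    Xb, yb = X[idx], y[idx]
    sizes_seen = set()
    for lam in lambdas:
        beta = LogisticRegression(penalty="l1", solver="liblinear",
                                  C=1.0 / (len(yb) * lam)).fit(Xb, yb).coef_.ravel()
        eta = np.count_nonzero(beta)
        if 0 < eta <= E and eta not in sizes_seen:
            sizes_seen.add(eta)
            pi_hat[b] += (beta > 0) / E  # indicator of (4), positive constraint
        elif eta > E:
            break

# a drug is selected if the 10% quantile of its B pi_hat values is non-zero
selected = np.flatnonzero(np.quantile(pi_hat, 0.10, axis=0) > 0)
```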
Propensity score approaches
The propensity score (PS) is defined as the probability of being exposed to a drug of interest given the observed covariates [30]. It is a balancing score, which means that conditionally on the PS, treatment exposure and the observed covariates are independent, so it is possible to deal with measured confounding. Recently, this methodology was extended to exploit large healthcare databases. In this framework, covariate selection algorithms are used to automatically select potential confounders for inclusion in the PS estimation model of a given drug exposure [31].
In Courtois et al. [17], several PS-based approaches were proposed in the context of signal detection from spontaneous reporting data. These approaches consisted in estimating a PS for each drug reported in the database. The PSs were built by selecting, among all the other drugs, those to be included in the PS estimation model. Because of the large number of candidate covariates, covariate selection algorithms were used and compared. Here we used the lasso-bic approach presented earlier to select the set of covariates to be included in the PS logistic regression model. This procedure was repeated for all drugs in the database. Following Courtois et al., we accounted for these PSs in the final regression model through adjustment and through weighting, with two different weightings for the latter: Inverse Probability of Treatment Weighting (IPTW) [32] and Matching Weights (MW) [33]. We also investigated weight truncation with IPTW, which consists in replacing weights below the rth percentile of the weight distribution with the value of that percentile, and weights above the (100−r)th percentile with the value of that percentile [34]. Here we set r=2.5.
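A minimal sketch of the two weighting schemes and of weight truncation, assuming ps is an N-vector of estimated propensity scores for a given drug and z the corresponding binary exposure vector (both illustrative names).

```python
import numpy as np

def iptw(ps, z):
    """Inverse probability of treatment weights."""
    return z / ps + (1 - z) / (1 - ps)

def matching_weights(ps, z):
    """Matching weights: min(PS, 1-PS) over the probability of the received exposure."""
    return np.minimum(ps, 1 - ps) / (z * ps + (1 - z) * (1 - ps))

def truncate(w, r=2.5):
    """Cap weights at the r-th and (100 - r)-th percentiles of their distribution."""
    lo, hi = np.percentile(w, [r, 100 - r])
    return np.clip(w, lo, hi)
```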
For a given PS-based approach, each drug was evaluated using one-sided hypothesis testing. To account for multiple testing, we used the procedure proposed by Benjamini and Yekutieli [35], which controls the FDR under arbitrary dependence assumptions. We set the FDR level at 5%. In the following, we refer to the adjustment on the PS, and the weighting on the PS with IPTW, IPTW with truncation, and MW, as ps-adjust, ps-iptw, ps-iptwT and ps-mw, respectively.
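A minimal sketch of this multiple testing correction with statsmodels, assuming pvals holds the per-drug one-sided p-values and drug_names the corresponding labels (both illustrative names).

```python
from statsmodels.stats.multitest import multipletests

# pvals: per-drug one-sided p-values; drug_names: matching labels (assumed inputs)
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_by")
signals = [drug for drug, r in zip(drug_names, reject) if r]
```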
Adaptive lasso and extensions for signal detection
The adaptive lasso
As defined by Fan and Li [20], an optimal procedure in statistical learning should have the following oracle properties: (i) it identifies the right subset of true predictors, and (ii) it produces unbiased estimates. In their work, they showed that the lasso procedure does not enjoy these oracle properties: there are scenarios in which lasso variable selection is inconsistent and, with an equal penalty for all covariates, the lasso tends to overpenalize the relevant ones and to produce biased estimates for truly large coefficients. To overcome this drawback, Zou [22] proposed the adaptive lasso, in which adaptive weights (AWs) are used to penalize covariates differently in the L1 penalty:
$$ \text{pen}(\lambda) = \lambda \sum_{p=1}^{P} w_{p} \, | \beta_{p} |. $$
The penalty applied to covariate p is thus λp=λ×wp: the higher the weight wp, the more covariate p is penalized and the less likely it is to be included in the model. By assigning a higher penalty to small coefficients and a lower penalty to large ones, the adaptive lasso makes it possible to consistently select the right model and produce unbiased estimates. Zou showed that, under certain conditions on the AWs, the adaptive lasso enjoys the oracle properties.
To build the AWs, Zou used an initial consistent estimator of β∗, the P-vector of regression coefficients. To this end, he considered \(\widehat{\beta}^{mle}_{p}\), the maximum likelihood estimate for covariate p, and defined the associated penalty weight \(w_{p} = \frac{1}{|\widehat{\beta}^{mle}_{p}|^{\gamma}}\), with γ>0. However, in the high-dimensional context, finding a consistent estimate for constructing the AWs is non-trivial, since computing the maximum likelihood estimator is not feasible.
In the linear case, Bühlmann and van de Geer [23] proposed to use the penalized regression coefficients estimated by a lasso regression to determine these AWs, considering γ=1. In both the lasso and the adaptive lasso, the penalization parameter λ was selected through cross-validation. This two-stage procedure, which involves an initial lasso step with cross-validation, was proposed in the logistic case under the name of iterated lasso [24]. Denoting by \(\widehat{\boldsymbol{\beta}}^{lcv}\) the P-vector of lasso regression coefficients determined with lasso-cv, the AWs associated with the drug covariate p in the adaptive lasso stage are defined as:
$$ w^{lcv}_{p} = \begin{cases} \frac{1}{|\widehat{\beta}_{p}^{lcv}|} & \text{if } \widehat{\beta}_{p}^{lcv} \neq 0 \\ \infty & \text{if } \widehat{\beta}_{p}^{lcv} = 0 \end{cases} $$
Thus, a covariate that has not been selected with lasso-cv in the first stage is automatically excluded in the adaptive lasso stage. In the following, we refer to this approach as adapt-cv.
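A minimal sketch of adapt-cv using the standard rescaling trick: an adaptive lasso with weights wp is equivalent to a plain lasso on covariates divided by wp, with the estimated coefficients mapped back by the same division. The embedded cross-validation of Ballout et al. described below is omitted for brevity, and the sketch assumes stage 1 selects at least one covariate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# stage 1: lasso-cv coefficients define the adaptive weights w_p = 1 / |beta1_p|
stage1 = LogisticRegressionCV(Cs=20, cv=5, penalty="l1", solver="liblinear",
                              scoring="neg_log_loss").fit(X, y)
beta1 = stage1.coef_.ravel()
keep = np.flatnonzero(beta1 != 0)          # infinite weights exclude the others
X_tilde = X[:, keep] * np.abs(beta1[keep]) # X_p / w_p, since w_p = 1 / |beta1_p|

# stage 2: the adaptive lasso is a plain lasso on the rescaled covariates
stage2 = LogisticRegressionCV(Cs=20, cv=5, penalty="l1", solver="liblinear",
                              scoring="neg_log_loss").fit(X_tilde, y)
beta_adapt = np.zeros(X.shape[1])
beta_adapt[keep] = stage2.coef_.ravel() * np.abs(beta1[keep])  # map back to beta
```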
In the linear case, Huang et al. [25] showed that, under certain conditions, using univariate regression coefficients to determine the AWs has desirable properties. Denoting by \(\widehat{\beta}^{univ}_{p}\) the univariate coefficient associated with drug covariate p, we defined the following AWs for drug covariate p in the adaptive lasso stage:
$$ w^{univ}_{p} = \frac{1}{|\widehat{\beta}_{p}^{univ}|}. $$
As in the work of Huang et al., we chose the penalization parameter by cross-validation in the adaptive lasso stage. We refer to this approach as adapt-univ.
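A minimal sketch of the univariate weights used by adapt-univ, under the same illustrative data: one single-covariate logistic fit per drug.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# one unpenalized single-drug logistic fit per covariate (penalty=None: sklearn >= 1.2)
beta_univ = np.array([
    LogisticRegression(penalty=None).fit(X[:, [p]], y).coef_.ravel()[0]
    for p in range(X.shape[1])
])
w_univ = 1.0 / np.abs(beta_univ)          # w_p = 1 / |beta_univ_p|
```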
Following Ballout et al. [36], the optimal λ for the cross-validation-based adaptive lasso is obtained by deriving the adaptive weights within each training set: (i) directly for adapt-univ, or (ii) using an embedded cross-validation for adapt-cv. This optimal λ is then used on the full data to obtain the final adaptive lasso estimates.
Extending adaptive lasso for pharmacovigilance
Although the adaptive lasso is an appealing variable selection procedure, to our knowledge it has never been used for signal detection. Since the aim in pharmacovigilance is to select the right subset of drugs associated with an AE, we sought to develop a signal detection approach by enhancing the variable selection performance of this method. To this end, we first use the BIC as defined above, instead of cross-validation, to identify the final subset of covariates in the adaptive lasso stage. We also propose two new AWs that aim to under-penalize variables considered relevant by lasso-based variable selection methods, and to increase the penalty applied to, or even exclude, variables considered less relevant.
The first consists in using the BIC in the first stage. Denoting by \(\widehat{\boldsymbol{\beta}}^{lbic}\) the P-vector of unpenalized regression coefficients estimated in the first stage with lasso-bic, we define the following AWs for drug covariate p in the adaptive lasso stage:
$$ w^{lb}_{p} = \begin{cases} \frac{1}{|\widehat{\beta}_{p}^{lbic}|} & \text{if } \widehat{\beta}_{p}^{lbic} \neq 0 \\ \infty & \text{if } \widehat{\beta}_{p}^{lbic} = 0 \end{cases} $$
A covariate that has not been selected by the lasso-bic in the first stage is automatically excluded in the adaptive lasso stage.
The second proposed AWs are derived from the CISL approach. We first compute CISL with a non-zero constraint in the calculation of the quantity (4), instead of the original positive constraint:
$$ \widehat{\tau}^{b}_{p} = \frac{1}{E} \sum_{\eta = 1}^{E} \mathbbm{1} \left[ \widehat{\beta}_{p}^{\eta, b} \neq 0 \right]. $$
This quantity measures the proportion of the first E models of the lasso regularization path in which a variable is selected. We define the AW for covariate p according to the B-vector \(\widehat{\boldsymbol{\tau}}_{p}\) as:
$$ w^{cisl}_{p} = \begin{cases} \frac{1}{B} & \text{if } \widehat{\tau}^{b}_{p} > 0 ~ \forall b \in \{1,..., B\} \\ \infty & \text{if } \widehat{\tau}^{b}_{p} = 0 ~ \forall b \in \{1,..., B\} \\ 1 - \frac{1}{B} \sum_{b=1}^{B} \mathbbm{1} \left[ \widehat{\tau}^{b}_{p} > 0 \right] & \text{otherwise.} \end{cases} $$
Thus, the more often \(\widehat{\tau}^{b}_{p}\) is non-zero across the B subsamples, the smaller the associated AW.
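A minimal sketch of these proposed CISL-based weights, assuming tau_hat is the B×P matrix of the \(\widehat{\tau}^{b}_{p}\) values computed above (an illustrative name; it could be produced by the CISL sketch given earlier with the non-zero constraint in place of the positive one).

```python
import numpy as np

def cisl_weights(tau_hat):
    """CISL-based adaptive weights from the (B x P) matrix of tau_hat values."""
    B = tau_hat.shape[0]
    frac = (tau_hat > 0).mean(axis=0)     # share of subsamples with tau_b_p > 0
    w = 1.0 - frac                        # "otherwise" branch of the definition
    w[frac == 1.0] = 1.0 / B              # tau_b_p > 0 for every subsample
    w[frac == 0.0] = np.inf               # never selected: covariate excluded
    return w
```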
In the following, we refer to these approaches as adapt-bic and adapt-cisl.