
Applications of Bayesian shrinkage prior models in clinical research with categorical responses



Prediction and classification algorithms are commonly used in clinical research for identifying patients susceptible to clinical conditions such as diabetes, colon cancer, and Alzheimer’s disease. Developing accurate prediction and classification methods benefits personalized medicine. Building an excellent predictive model involves selecting the features most significantly associated with the outcome. These features can include several biological and demographic characteristics, such as genomic biomarkers and health history. Such variable selection becomes challenging when the number of potential predictors is large. Bayesian shrinkage models have emerged as popular and flexible methods of variable selection in regression settings. This work discusses variable selection with three shrinkage priors and illustrates its application to real-world clinical data, including the Pima Indians Diabetes, colon cancer, ADNI, and OASIS Alzheimer’s datasets.


A unified Bayesian hierarchical framework that implements and compares shrinkage priors in binary and multinomial logistic regression models is presented. The key feature is the representation of the likelihood by a Polya-Gamma data augmentation, which admits a natural integration with a family of shrinkage priors, specifically the Horseshoe, Dirichlet Laplace, and Double Pareto priors. Extensive simulation studies are conducted to assess performance under different data dimensions and parameter settings. Accuracy, AUC, Brier score, L1 error, cross-entropy, and ROC surface plots are used as evaluation criteria, comparing the priors with frequentist methods such as Lasso, Elastic-Net, and Ridge regression.


All three priors can be used for robust prediction on significant metrics, irrespective of the categorical response model chosen. Simulation studies achieved a mean prediction accuracy of 91.6% (95% CI: 88.5, 94.7) and 76.5% (95% CI: 69.3, 83.8) for the logistic and multinomial logistic models, respectively. The models can identify significant variables for disease risk prediction and are computationally efficient.


The models are robust enough to conduct both variable selection and prediction because of their high shrinkage properties and applicability to a broad range of classification problems.



Recent innovations in, and the availability of, structured electronic health records (EHRs), multisite longitudinal studies, and high-throughput sequencing (HTS) [1] data have made patient characteristics and genomic information accessible for statistical prediction analysis and clinical research. Medical decisions are modified upon the identification of essential covariates for a specific clinical outcome. Treatment recommendations are also age, population, and comorbidity specific, with HTS bringing a paradigm shift in large-scale personalized medicine. For example, in diabetes prediction, various biological and clinical factors such as sex, age, and obesity are used to predict the future risk of the disease. In oncology, for instance in colorectal cancer [2], breast cancer [3], and non-small cell lung cancer (NSCLC) [4], prognostic and predictive genes such as human epidermal growth factor receptor 2 (HER2), BRAF, and KRAS are essential not only for early detection but also for therapy selection, subgroup stratification, and controlled monitoring of the disease. In Alzheimer’s disease (AD), a combination of traditional risk factors, including age, education, hypertension, obesity, cognitive test scores, and cardiovascular disease [5], along with testing of the ApoE4 gene, can be informative in identifying whether a patient is vulnerable to this incurable disease or other forms of dementia.

A clinical prediction or classification model, followed by a variable selection strategy, plays a vital role in designing preventive measures against adverse outcomes.

Penalized regression

Variable selection methods can mitigate the noise introduced by irrelevant variables in the data. Most of these methods are frequentist approaches, such as Lasso [6] and Elastic-Net [7], that use an L1 or L2 penalty to shrink the coefficients of irrelevant variables. Bayesian regression methods that incorporate shrinkage properties as prior information are a reasonable alternative and are the focus of this work.

Public health researchers often encounter prediction problems with categorical responses, where the primary outcome must be predicted from clinical features, such as classifying a Covid-19 patient as “discharged”, “died”, or “remained in hospital” using individual patient-level data [8]. When the response variable is binary, logistic regression (LR) models enable assessing the association between independent variables and the response. When the response variable has more than two categories, generalizations of the logistic model, such as multinomial logistic regression (MLR) models, are common. This example illustrates categorical response models and their role as a natural and attractive approach in a diverse array of applications, ranging from spam detection and credit card fraud to predicting Alzheimer’s stage and tumor malignancy. The primary goal of categorical response models is to infer relationships between a categorical response and a set of explanatory variables. As in any form of regression, this is critical to understanding the causal dynamics of a complex system and to making accurate predictions on future data. Depending on the specific problem, the explanatory variables can take the form of demographic profiles, socio-economic data, or complex biomolecular entities. Modeling becomes particularly challenging when the number of such predictors grows large, a common feature in genomics. Traditionally, frequentist likelihood-based methods have used penalized regression techniques to address this “curse of dimensionality” and to deal with data having a considerable number of predictors in linear and multicategory response regression. These methods add penalty terms to the negative log-likelihood function and then optimize this penalized likelihood for parameter inference. The penalization leads to a desired “shrinking” behavior of the estimates: it pulls small coefficients towards zero while minimally affecting large coefficients, thus selecting only the most relevant variables. For an LR setup, the objective function to be minimized is of the form

$$l_{\lambda}^{LR}(\beta)=-\sum_{i=1}^{n}\left[y_{i} x_{i}^{T} \beta-\log\left(1+e^{x_{i}^{T} \beta}\right)\right]+\lambda\left(\sum_{j=1}^{p}|\beta_{j}|^{r}\right)^{\frac{1}{r}},\qquad r>0.$$

Here, y=(y1,y2,...,yn) is an n-dimensional vector representing the outcome variable; xi=(xi1,xi2,…,xip) are the covariates, with xij the observed value of the jth of the p predictors; and β=(β1,β2,...,βp)T is a p-dimensional parameter vector of regression coefficients. λ is the penalty parameter, which modulates the amount of shrinkage: large values of λ lead to more shrinkage, while λ=0 recovers the simple logistic likelihood without shrinkage. The generalized objective function for multiple categories is

$$l_{\lambda}^{MLR}(\beta)=-\sum_{i=1}^{N}\left[y_{i}\left(\sum_{k=0}^{K}x_{ik} \beta_{k}\right)-n_{i} \log\left(1+e^{\sum_{k=0}^{K}{x_{ik} \beta_{k}}}\right)\right]+\lambda\left(\sum_{k=1}^{K}\sum_{j=1}^{p}|\beta_{kj}|^{r}\right)^{\frac{1}{r}}.$$

Different values of r lead to various penalization techniques. For example, r=1 results in the well-known Least Absolute Shrinkage and Selection Operator (Lasso) [6] solution and r=2 results in the Ridge regression solution [9]. Elastic-Net [7] is another method that is a convex combination of Lasso and Ridge.
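As a concrete illustration, the three penalized fits just mentioned can be sketched with scikit-learn; this is a minimal sketch on synthetic data, and the penalty strength `C` (scikit-learn's inverse of λ) and solver settings are illustrative assumptions, not values from this paper.

```python
# Sketch: Lasso (r=1), Ridge (r=2), and Elastic-Net penalized logistic
# regression on synthetic data. Smaller C means stronger shrinkage.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, -2.0, 1.5] + [0.0] * (p - 3))  # 7 null coefficients
y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta_true))).astype(int)

models = {
    "lasso": LogisticRegression(penalty="l1", solver="saga", C=0.5, max_iter=5000),
    "ridge": LogisticRegression(penalty="l2", solver="saga", C=0.5, max_iter=5000),
    "enet": LogisticRegression(penalty="elasticnet", solver="saga",
                               l1_ratio=0.5, C=0.5, max_iter=5000),
}
coefs = {name: m.fit(X, y).coef_.ravel() for name, m in models.items()}

# The L1 penalty zeroes out some irrelevant coefficients exactly,
# while the L2 penalty only shrinks them towards zero.
n_zero_lasso = int(np.sum(coefs["lasso"] == 0.0))
```

In this sketch the Lasso fit typically sets several of the seven null coefficients exactly to zero, which is the "shrinking" behavior the text describes.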

Shrinkage priors

Bayesian variable selection models form a natural counterpart to penalization, with specialized priors assuming the role of the penalty terms. Some notable works in this area include [6, 10,11,12,13,14,15,16,17,18] among many others. A comprehensive overview of shrinkage priors for data applications is given in [19]. The historical development of shrinkage priors can be traced back to the spike and slab approach proposed in [20]. It is a hierarchical mixture model that uses a latent binary indicator to differentiate between large and small regression coefficients. It does so by assigning Gaussian priors with high and low variances to the coefficients, conditional on these indicators. The resulting marginal prior for a coefficient takes on a shape with a “spike” at 0 and a relatively flat “slab” over non-zero regions, lending the model its name. Stochastic search variable selection (SSVS) [21] identifies a subset of significant covariates with a hierarchical normal mixture model structure similar to the spike and slab prior; however, it uses continuous Gaussian distributions with higher and lower variance for included and excluded variables, respectively. The spike-and-slab prior has a bi-separation effect on the model coefficients, implying a computational complexity of order 2^p over the model space. SSVS is computationally intensive and cannot tackle a large set of predictors. Extensions incorporating non-conjugate and conjugate priors are detailed in [22]. A related class of variable selection (VS) models puts positive prior mass at 0. Inference in these models usually relies on Reversible Jump sampling techniques [23]. Though these models have the nice property of explicitly reducing coefficients to zero, they typically incur high computational costs and exhibit poor mixing. These issues necessitated a computationally efficient class of priors that allows sufficient shrinkage in high dimensions. This article focuses solely on such priors.
These priors are called global-local (GL) shrinkage priors [24]. Denoting the parameter vector as β=(β1,...,βp)T, the GL framework can be written hierarchically as follows

$$\beta_{j}\mid\lambda_{j},\tau \sim N(0,\lambda_{j}\tau); \quad \lambda_{j}\sim f(\cdot); \quad \tau\sim g(\cdot).$$

The global parameter τ controls the overall shrinkage of all the coefficients towards zero, while the local parameters λ=(λ1,...,λp) modulate the coefficient-specific shrinkage. It is desired that the distribution f(.) of the local parameters has heavy tails and that the distribution g(.) of the global parameter has substantial mass near zero. The Normal-Gamma prior [25], the Dirichlet Laplace (DL) prior [18], the Horseshoe prior [26], and the Double Pareto (DP) prior [17] are some of the GL priors. A review and comparison of shrinkage methods for VS are detailed in [27]. High-dimensional statistical modelling with Bayesian shrinkage priors has been extended in several arenas, such as longitudinal binary data with informative missingness [28] and joint modelling of clustered mixed outcomes with a uniform shrinkage prior [29]. The second piece of our proposed approach focuses on a convenient data augmentation representation of the logistic regression likelihood. Two data augmentation algorithms have gained popularity for binary responses: one for the probit regression model that uses truncated normal random variables [30], and another for logistic regression that uses Polya-Gamma (PG) random variables [31]. Here we focus on the latter scheme.
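To make the GL hierarchy concrete, the following minimal sketch draws from the prior under the Horseshoe's half-Cauchy choices for f and g (an illustrative assumption; other GL priors use different mixing distributions) and checks its two defining features: heavy mass near zero and heavy tails.

```python
# Sketch: prior draws from the global-local hierarchy
#   beta_j | lambda_j, tau ~ N(0, lambda_j^2 tau^2),
#   lambda_j ~ C+(0,1), tau ~ C+(0,1)  (Horseshoe choices).
import numpy as np

rng = np.random.default_rng(1)
p, draws = 5, 100_000

tau = np.abs(rng.standard_cauchy(draws))        # global scale, half-Cauchy
lam = np.abs(rng.standard_cauchy((draws, p)))   # local scales, half-Cauchy
beta = rng.standard_normal((draws, p)) * lam * tau[:, None]

# Most prior draws are tiny (shrinkage towards zero), yet very large
# values remain far more probable than under a Gaussian (heavy tails).
frac_near_zero = np.mean(np.abs(beta[:, 0]) < 0.1)
frac_huge = np.mean(np.abs(beta[:, 0]) > 100.0)
```

The large mass near zero shrinks noise coefficients, while the heavy tails leave genuinely large coefficients essentially unshrunk.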

In the following sections, the PG data augmentation is described under the LR and MLR models. Next, the data-augmented likelihood framework’s connection to the Horseshoe, DL, and DP priors is presented, embedding them in a fully Bayesian hierarchical model. Performance metrics evaluate the simulation results under different sample sizes, covariate dimensions, and parameter settings, followed by several data applications. The limitations and contributions appear in the Discussion section, which concludes the article with future research directions.

Bayesian binary regression and Polya-Gamma data augmentation

Consider the binary logistic regression model where Y1,Y2,...,Yn are independent Bernoulli random variables, \(\left\{x_{i}\in \mathbb{R}^{p}\right\}_{i=1}^{n}\) are the covariates, and \(\beta \in \mathbb {R}^{p}\) denotes the unknown regression coefficients. The corresponding likelihood is given by

$$\prod_{i=1}^{n}P(Y_{i} = y_{i}\mid \beta)=\prod_{i=1}^{n} \frac{\exp(x_{i}^{T}\beta y_{i})}{1+\exp(x_{i}^{T}\beta)}.$$

In the context of Bayesian analysis, the posterior of β given the data is \(\pi (\beta \mid Y) \propto \pi (\beta)\prod _{i=1}^{n} \frac {\exp (x_{i}^{T}\beta y_{i})}{1+\exp (x_{i}^{T}\beta)}\), where π(β) denotes the prior distribution for β. Unfortunately, this posterior is intractable due to the presence of the term \((1+\exp (x_{i}^{T}\beta))\) in its denominator [32]. Therefore, posterior sampling in this setup has traditionally relied on the Metropolis Hastings (MH) algorithm. As an alternative, the Data Augmentation (DA) algorithm uses latent variables to circumvent the difficulty. Before introducing the specific Polya-Gamma DA scheme used in this article, we provide the general structure of a DA algorithm. Suppose we need to sample a parameter of interest θ from an intractable density π(θ). The technique requires designing a suitable joint density π(θ,W) that satisfies two criteria: firstly, \(\int \pi (\theta, W) dW = \pi (\theta)\), and secondly, the corresponding conditional distributions π(θ∣W) and π(W∣θ) can be sampled from [33, 34]. In the context of Bayesian analysis, π(θ) typically refers to a posterior density for a parameter of interest θ. A major challenge in designing a DA is to construct an appropriate choice of π(θ,W) (see [30, 31]). A commonly used strategy is to build a conditional distribution π(W∣θ) so that π(θ,W)=π(W∣θ)π(θ) fulfills the requirements. The latent variable W is often termed the augmented random variable. In the current context, our parameter of interest is β with posterior density π(β∣Y), and we make use of the DA technique designed in [31], where the authors develop and utilize the Polya-Gamma (PG) distribution as the augmented random variable. As it will be required in later sections, we include the probability density function of the Polya-Gamma distribution, denoted PG(1,c) [31]:

$$\begin{array}{@{}rcl@{}} p(x\mid c)& =& \cosh\left(\frac{c}{2}\right)\exp\left(-\frac{c^{2}x}{2}\right) h(x) \text{ where }\\ h(w)& = & \sum_{k=0}^{\infty}(-1)^{k}\frac{2k+1}{\sqrt{2 \pi w}}\exp\left(-\frac{(2k+1)^{2}}{8w}\right),0< w<\infty. \end{array}$$

To deal with the intractability in Eq. 3, the DA scheme augments independent latent variables \(\{W_{i}\}_{i=1}^{n}\), where \(W_{i} \sim PG(1,|x_{i}^{T}\beta |)\). \(\{W_{i}\}_{i=1}^{n}\) are also assumed to be independent of \(\{Y_{i}\}_{i=1}^{n}\) for i=1,…,n. Then, the joint posterior \(\pi (\beta, \{W_{i}\}_{i=1}^{n}\mid \{Y_{i}\}_{i=1}^{n})\) satisfies the two criteria of a generic DA algorithm mentioned above. The integrability criterion holds trivially, and the random variables W1,…,Wn given \(\{Y_{i}\}_{i=1}^{n}, \beta\) are independent and follow PG distributions [31]. The posterior conditional of \(\beta \mid \{Y_{i}\}_{i=1}^{n}, \{W_{i}\}_{i=1}^{n}\) is multivariate normal if a multivariate normal prior is used for β [31]. The appearance of a normal posterior conditional for β is a key feature that allows more nontrivial priors for β to be used without much difficulty; specifically, in this manuscript we use the global-local priors discussed in detail in the next section. As mentioned earlier, Bayesian regression for binary responses has been recognized as a hard problem due to the likelihood’s unwieldy form. There have been several efforts towards improved versions of the DA algorithm [30] for probit regression; some notable works include [35] and [36]. However, these algorithms involve multiple layers of latent variables, making them considerably more complicated and restricting their usage. In contrast, the DA algorithm of [31] is free of these problems and computationally much less cumbersome.
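A rough feel for the PG distribution can be had from a truncated version of its infinite sum-of-gammas representation; this is an illustrative approximation only (production samplers, such as the exact accept-reject scheme of [31], do not truncate), and the function name and truncation level are our own choices.

```python
# Sketch: approximate PG(1, c) draws via the truncated representation
#   W ≈ (1 / (2 pi^2)) * sum_{k=1..K} g_k / ((k - 1/2)^2 + c^2/(4 pi^2)),
# with g_k ~ Exp(1). Truncation introduces a small downward bias.
import numpy as np

def rpg_approx(c, size, K=200, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    k = np.arange(1, K + 1)
    denom = (k - 0.5) ** 2 + c ** 2 / (4 * np.pi ** 2)
    g = rng.exponential(1.0, size=(size, K))
    return (g / denom).sum(axis=1) / (2 * np.pi ** 2)

# Check against the known mean E[W] = tanh(c/2) / (2c) for W ~ PG(1, c).
c = 1.0
w = rpg_approx(c, size=50_000, rng=np.random.default_rng(2))
expected = np.tanh(c / 2) / (2 * c)
```

The sample mean of the draws matches the closed-form PG mean up to truncation and Monte Carlo error.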

Logistic regression model with hierarchical prior structures

In this subsection, we give the details of each of the three prior distributions along with the corresponding posterior distributions. The original form of the Horseshoe prior [37] is represented as

$$\begin{array}{@{}rcl@{}} \\ \beta_{j}\mid\Lambda_{j}^{2}\tau^{2} \sim N_{p}\left(0,\Lambda_{j}^{2}\tau^{2}\right), j=1,2,...,p; \Lambda_{j}^{2} \sim C^{+}(0,1); \tau \sim C^{+}(0,1). \end{array}$$

Another computationally feasible hierarchical representation of the Horseshoe prior [14] that is used here is

$$\begin{array}{@{}rcl@{}} \\ \beta_{j}\mid\Lambda_{j}^{2}\tau^{2} \sim N_{p}\left(0,\Lambda_{j}^{2}\tau^{2}\right), j=1,2,...,p; \Lambda_{j}^{2}\mid\gamma_{j} \sim IG\left(\frac{1}{2},\frac{1}{\gamma_{j}}\right); \\ \tau^{2}\mid\xi\sim IG\left(\frac{1}{2},\frac{1}{\xi}\right); \gamma_{1},\gamma_{2},...,\gamma_{p},\xi\sim IG\left(\frac{1}{2},1\right). \end{array}$$

Here, the variance-covariance matrix Σ of the distribution of β is a diagonal matrix with elements \((\Lambda_{1}^{2}\tau^{2},\ldots,\Lambda_{p}^{2}\tau^{2})\). From Eqs. 3 and 4 and the above hierarchical prior structure (6), the full posterior distribution is given by:

$$\begin{array}{@{}rcl@{}} \pi\left(\beta,\Lambda_{j}^{2},w_{i},\gamma_{j},\tau^{2},\xi\mid Y\right) \propto \prod_{i=1}^{n} \frac{\exp(x_{i}^{T}\beta y_{i})}{(1+\exp(x_{i}^{T}\beta))}\left(1+\exp\left(x_{i}^{T}\beta\right)\right) h\left(w_{i}\right) \\ \exp\left(-\frac{x_{i}^{T}\beta}{2}-\frac{\left(x_{i}^{T}\beta\right)^{2}w_{i}}{2}\right) \frac{\exp(-\frac{1}{2}(\beta^{T}\Sigma^{-1}\beta))}{\sqrt{2\pi\Sigma}} \\ \prod_{j=1}^{p}\frac{\left(\Lambda_{j}^{2}\right)^{-(\frac{1}{2}+1)}\exp\left(\frac{-1}{\gamma_{j}\Lambda_{j}^{2}}\right)}{\gamma_{j}^{\frac{1}{2}}}\ \frac{(\tau^{2})^{-\left(\frac{1}{2}+1\right)}\exp\left(\frac{-1}{\tau^{2}\xi}\right)}{\xi^{\frac{1}{2}}} \\ \gamma_{j}^{-(\frac{1}{2}+1)}\exp\left(\frac{-1}{\gamma_{j}}\right)\ \xi^{-(\frac{1}{2}+1)}\exp\left(\frac{-1}{\xi}\right). \end{array}$$

where, h(wi) is defined in Eq. 4. The conditional distributions required for the analysis are as follows:

The conditional density of β given y, w is

$$\pi\left(\beta\mid \Sigma,W_{D},Y\right)\sim N_{p}\left(\left(X^{T}W_{D}X+\Sigma^{-1}\right)^{-1}X^{T}y^{*},\left(X^{T}W_{D}X+\Sigma^{-1}\right)^{-1}\right)$$

where, WD and Σ are diagonal matrices where the elements are (w1,w2,...,wn), \((\Lambda _{1}^{2}\tau ^{2},...,\Lambda _{p}^{2}\tau ^{2})\) respectively and, \(y^{*}=\left (y_{1}-\frac {1}{2},....,y_{n}-\frac {1}{2}\right)\).

In high-dimensional scenarios (p>n), where sampling of β can be difficult, the fast sampling technique for Gaussian scale-mixture priors [38] is used to draw from \(N_{p}(\mu,\hat {\Sigma })\) with \(\mu =\hat {\Sigma } A^{T}\alpha\) and \(\hat {\Sigma }=(A^{T}A+D^{-1})^{-1}\), where D∈Rp×p is diagonal, A∈Rn×p, and α∈Rn×1. The algorithm samples u∼N(0,D) and δ∼N(0,In), sets v=Au+δ, solves (ADAT+In)w=(α−v) for w, and finally obtains θ=u+DATw, which yields \(\theta \sim N(\mu,\hat{\Sigma})\). The conditional density of wi given xi,β is

$$\begin{array}{@{}rcl@{}} \pi\left(w_{i}\mid\beta\right)\sim PG\left(1,x_{i}^{T}\beta\right). \end{array}$$
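The fast Gaussian sampling steps referenced above can be sketched as follows; the dimensions and the helper name `fast_gaussian` are illustrative assumptions, and the check compares the empirical mean of the draws against the closed-form mean.

```python
# Sketch of the fast sampler of Bhattacharya et al. [38] for
#   theta ~ N(S A^T alpha, S),  S = (A^T A + D^{-1})^{-1},
# which only requires solving an n x n system (cheap when p >> n).
import numpy as np

def fast_gaussian(A, d, alpha, rng):
    """Draw theta ~ N(S A^T alpha, S) with D = diag(d)."""
    n, p = A.shape
    u = rng.standard_normal(p) * np.sqrt(d)    # u ~ N(0, D)
    delta = rng.standard_normal(n)             # delta ~ N(0, I_n)
    v = A @ u + delta
    M = (A * d) @ A.T + np.eye(n)              # A D A^T + I_n  (n x n)
    w = np.linalg.solve(M, alpha - v)
    return u + d * (A.T @ w)                   # theta = u + D A^T w

rng = np.random.default_rng(3)
n, p = 5, 50
A = rng.standard_normal((n, p))
d = rng.uniform(0.5, 1.5, p)
alpha = rng.standard_normal(n)

# Closed-form mean for comparison (direct p x p inversion).
S = np.linalg.inv(A.T @ A + np.diag(1.0 / d))
mu = S @ (A.T @ alpha)
draws = np.array([fast_gaussian(A, d, alpha, rng) for _ in range(20_000)])
```

The identity D Aᵀ(ADAᵀ+Iₙ)⁻¹ = (AᵀA+D⁻¹)⁻¹Aᵀ (a push-through/Woodbury identity) is what makes the two means agree.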

The conditional densities of the hyper-parameters are as follows

$$\begin{array}{@{}rcl@{}} \pi(\Lambda_{j}^{2}\mid \gamma_{j},\beta_{j},\tau^{2})\sim IG\left(1,\frac{1}{\gamma_{j}}+\frac{\beta_{j}^{2}}{2\tau^{2}}\right) \\ \pi(\gamma_{j}\mid \Lambda_{j}^{2})\sim IG\left(1,1+\frac{1}{\Lambda_{j}^{2}}\right) \\ \pi(\tau^{2}\mid \xi,\beta,\Lambda^{2})\sim IG\left(\frac{p+1}{2},\frac{1}{\xi}+\sum_{j=1}^{p}\frac{\beta_{j}^{2}}{2\Lambda_{j}^{2}}\right) \\ \pi\left(\xi\mid \tau^{2}\right)\sim IG\left(1,1+\frac{1}{\tau^{2}}\right). \end{array}$$

Here, all the posterior densities are in closed form and are standard distributions (Normal, Polya-Gamma, and Inverse-Gamma), making sampling from them straightforward. The hierarchical structure of the Dirichlet Laplace prior [18] is

$$\begin{array}{@{}rcl@{}} \beta_{j} \sim N_{p}\left(0,\psi_{j}\phi_{j}^{2}\tau^{2}\right); \psi_{j} \sim exp\left(\frac{1}{2}\right); \phi \sim Dir\left(a,a,...,a\right); \tau \sim G\left(pa, \frac{1}{2}\right). \end{array}$$

The conditional posterior distributions of β∣y and wi∣β remain the same as in Eqs. 8 and 9.

The conditional densities of the hyper-parameters, obtained similarly to Theorem 2.2 in [18], are as follows:

$$\begin{array}{@{}rcl@{}} \pi(\psi\mid \phi,\tau,\beta)\sim IG\left(\frac{\phi_{j}\tau}{|\beta_{j}|},1\right) \\ \pi(\tau\mid \phi,\beta)\sim GIG\left(pa-p,1,2\sum_{j=1}^{p}\frac{|\beta_{j}|}{\phi_{j}}\right). \end{array}$$

To sample π(ϕ∣β), sample Tj∼GIG(a−1,1,2|βj|) and set \(\phi _{j} =\frac {T_{j}}{T}\), \(T=\sum _{j=1}^{p}T_{j}\), where GIG(a,b,c) is the Generalized Inverse Gaussian distribution with density \(f(x; a, b,c) \propto x^{(c-1)} e^{\frac {-1}{2}(ax+\frac {b}{x})}\).
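This ϕ update can be sketched with SciPy's generalized inverse Gaussian sampler; we assume the giG(λ, ψ, χ) convention of [18] (density ∝ x^{λ−1} exp(−(ψx + χ/x)/2) with λ = a−1, ψ = 1, χ = 2|βj|), and the mapping to SciPy's `(p, b, scale)` parameterization is our own.

```python
# Sketch: one draw of the Dirichlet Laplace phi vector, via
# T_j ~ GIG(lambda = a-1, psi = 1, chi = 2|beta_j|), phi_j = T_j / sum T.
# scipy's geninvgauss(p, b, scale) corresponds to
#   p = lambda, b = sqrt(chi * psi), scale = sqrt(chi / psi).
import numpy as np
from scipy.stats import geninvgauss

def sample_phi(beta, a, rng):
    lam = a - 1.0
    chi = 2.0 * np.abs(beta)
    psi = 1.0
    t = geninvgauss.rvs(lam, np.sqrt(chi * psi),
                        scale=np.sqrt(chi / psi), random_state=rng)
    return t / t.sum()

rng = np.random.default_rng(4)
beta = np.array([2.0, -1.5, 0.05, 0.01])  # two large, two tiny coefficients
phi_draws = np.array([sample_phi(beta, a=0.8, rng=rng) for _ in range(500)])
phi_mean = phi_draws.mean(axis=0)
```

On average, the larger |βj| receive the larger ϕj, so less of their signal is shrunk away.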

The hierarchical structure of the Double Pareto prior [17] is

$$\begin{array}{@{}rcl@{}} \beta\mid\Lambda,\tau \sim N_{p}(0,D_{\tau}); w_{i}\mid x_{i},\beta \sim PG\left(1,x_{i}^{T}\beta\right); \\ \tau_{j}\mid\Lambda_{j} \sim exp\left(\frac{\Lambda_{j}^{2}}{2}\right); \Lambda_{j} \sim G(\zeta,\eta). \end{array}$$

Again, the conditional densities of β∣y and wi∣β remain the same as in (8) and (9). Here Σ=Dτ is a diagonal matrix with elements (τ1,τ2,...,τp). The conditional densities of the remaining hyper-parameters are as follows:

$$\begin{array}{@{}rcl@{}} \pi\left(\tau_{j}\mid \beta, \Lambda,y\right)\sim GIG\left(\frac{1}{2}, \Lambda_{j}^{2}, \beta_{j}^{2}\right)\\ \pi\left(\Lambda_{j}\mid\beta_{j},y\right)\sim Gamma\left(\zeta +1, \eta+ \left|\beta_{j}\right|\right). \end{array}$$

Extending hierarchical models and differential shrinkage

The strength of our methods is in no way limited to a common shrinkage prior across covariates. The framework can be applied to several variant models that allow borrowing strength across multiple responses and different (in some cases, user-defined) levels of sparsity among groups of covariates. In particular, it can be used for models where we choose not to shrink some covariates (for instance, demographic covariates) in contrast to the genomic covariates.

In terms of estimation, the hierarchical structure of the models ensures that the form of the posterior conditional of β remains the same across the model variants. Only the diagonal weight matrices appearing in the posterior means and variances would have a differential allocation of λs and constant variances, depending on which variables we choose to shrink. Similarly, the posterior sampling of the shrinkage hyper-parameters would involve only the corresponding subset. Details of the derivation are omitted since they are straightforward.

Bayesian multinomial logistic regression and Polya-Gamma data augmentation

This is an extension of binary regression to the case where the outcome variable has more than two classes. Let Yi, i=1,2,…,n be categorical random variables with K categories, where K≥2. The probability of the kth category is pik=P(Yi=k∣xi), where \(\sum _{k=1}^{K} p_{ik} = 1\). The multinomial logistic regression (MLR) model is given as

$$\begin{array}{@{}rcl@{}} \\ P(Y_{i} = k \mid x_{i}) &= p_{ik} &= \frac{e^{x_{i}^{T}\beta_{k}}}{1+\sum_{j=1}^{K-1} e^{x_{i}^{T}\beta_{j}}}, k = 1,2,\ldots K-1 \\ P(Y_{i} = K \mid x_{i}) &= p_{iK} &= \frac{1}{1+\sum_{j=1}^{K-1} e^{x_{i}^{T}\beta_{j}}}. \end{array}$$

Here, βk are the coefficients associated with the k-th category, and K is the baseline category with its coefficient βK constrained to zero. The Polya-Gamma data augmentation approach can be extended to multi-category response variables. Here, βk is updated conditional on the remaining coefficients β−k=(β1,β2,…,βk−1,βk+1,…,βK). The full conditional for βk given y and β−k can be expressed as the likelihood of a Bernoulli distribution.

$$f(\beta_{k} \mid y, \beta_{-k}) \propto f(\beta_{k}) \Pi_{i=1}^{n} p_{ik}^{I(y_{i}=k)}(1-p_{ik})^{1-I(y_{i}=k)}$$

where f(βk) is the prior for coefficient βk, and I(Yi=k)=1 when Yi=k, with probability \(p_{ik} = P(Y_{i}=k)= \frac {e^{x_{i}^{T}\beta _{k}}}{\sum _{j=1}^{K} e^{x_{i}^{T}\beta _{j}}}\). It can be re-written as \(p_{ik} = P(Y_{i}=k) = \frac {e^{x_{i}^{T}\beta _{k}-M_{ik}}}{1+ e^{x_{i}^{T}\beta _{k}-M_{ik}}} = \frac {e^{\psi _{ik}}}{1+e^{\psi _{ik}}}\), where \(M_{ik}= \log \sum _{j \neq k}e^{x_{i}^{T}\beta _{j}}\) and \(\psi _{ik} = x_{i}^{T}\beta _{k}-M_{ik}\). The sum \(\sum _{j \neq k}e^{x_{i}^{T}\beta _{j}}\) includes the reference category K; since βK=0, \(e^{x_{i}^{T}\beta _{K}} =1\), and hence \(M_{ik} = \log \sum _{j \neq k}e^{x_{i}^{T}\beta _{j}} = \log \left (1+\sum _{j \neq {k,K}}e^{x_{i}^{T}\beta _{j}}\right)\). Thus, the full conditional for βk given y and β−k is

$$\begin{array}{@{}rcl@{}} f(\beta_{k}\mid y,\beta_{-k}) \propto f(\beta_{k}) \Pi_{i=1}^{n} \frac{(e^{\psi_{ik}})^{I(y_{i}=k)}}{1+e^{\psi_{ik}}} \end{array}$$

which is a logistic regression likelihood, so PG data augmentation can be implemented to update each βk based on the binary indicator variable I(yi=k). After including the PG data augmentation and the prior βk∼N(0,Σk), Eq. 17 can be expressed as:

$$\begin{array}{@{}rcl@{}} f(\beta_{k}\mid y,\beta_{-k}) = \Pi_{i=1}^{n} \frac{(e^{\psi_{ik}})^{I(y_{i}=k)}}{1+e^{\psi_{ik}}} (1+e^{\psi_{ik}}) e^{-\psi_{ik}/2-\psi_{ik}^{2}w_{ik}/2}\\ h(w_{ik}) e^{-\frac{1}{2}\left(\beta_{k}^{T}\Sigma_{k}^{-1}\beta_{k}\right)}. \end{array}$$
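The reduction to a binary logistic form rests on the identity p_ik = e^{ψ_ik}/(1+e^{ψ_ik}); a small numerical check (with illustrative dimensions of our own choosing) confirms it.

```python
# Sketch: verify that the multinomial probability of class k equals the
# logistic function of psi_ik = x_i^T beta_k - M_ik, with
# M_ik = log sum_{j != k} exp(x_i^T beta_j) (reference category included).
import numpy as np

rng = np.random.default_rng(5)
p, K = 4, 3                   # K classes; baseline beta_K = 0
x = rng.standard_normal(p)
betas = np.column_stack([rng.standard_normal((p, K - 1)), np.zeros(p)])

eta = x @ betas                            # x^T beta_j for all classes j
p_ik = np.exp(eta) / np.exp(eta).sum()     # softmax probabilities

k = 0
M_ik = np.log(np.exp(np.delete(eta, k)).sum())
psi_ik = eta[k] - M_ik
p_via_psi = 1.0 / (1.0 + np.exp(-psi_ik))  # logistic function of psi_ik
```

Because the identity is exact, the two probabilities agree to machine precision, which is what lets the binary PG scheme update each βk in turn.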

Multinomial logistic regression with hierarchical prior structures

The hierarchical structures with the three priors are derived similarly to LR. Considering \(N(0,\Lambda _{t}^{2}\tau ^{2}), t = 1,2,\ldots,p\), priors for β1,β2,…,βK−1, the conditional densities for βk and wik are given as

$$\begin{array}{@{}rcl@{}} \beta_{k} \sim N(\mu_{k},\Sigma_{k}),\ \mu_{k} = \Sigma_{k} X^{T}W_{k} y_{k}^{*}, \ \Sigma_{k} = \left(\Sigma_{0}^{-1}+X^{T}W_{k}X\right)^{-1}\\ w_{ik} \sim PG(1, \psi_{ik}). \end{array}$$

where \(\Sigma _{0} =diag\left (\Lambda _{t}^{2}\tau ^{2}\right), W_{k} = diag(w_{ik}), y_{k}^{*} = \frac {I(y_{i}=k)-0.5}{w_{ik}}+M_{ik}\).

The conditional densities of the hyper-parameters in the case of the horseshoe prior are

$$\begin{array}{@{}rcl@{}} \pi(\Lambda_{tk}^{2}\mid \gamma_{tk},\beta_{tk},\tau^{2})\sim IG\left(1,\frac{1}{\gamma_{tk}}+\frac{\beta_{tk}^{2}}{2\tau^{2}}\right) \\ \pi(\gamma_{tk}\mid \Lambda_{tk}^{2})\sim IG\left(1,1+\frac{1}{\Lambda_{tk}^{2}}\right) \\ \pi(\tau^{2}\mid \xi,\beta_{k},\Lambda^{2})\sim IG\left(\frac{p+1}{2},\frac{1}{\xi}+\sum_{t=1}^{p}\frac{\beta_{tk}^{2}}{2\Lambda_{tk}^{2}}\right) \\ \pi\left(\xi\mid \tau^{2}\right)\sim IG\left(1,1+\frac{1}{\tau^{2}}\right) \end{array}$$

The conditional densities of the hyper-parameters in the case of the DL prior are

$$\begin{array}{@{}rcl@{}} \pi(\psi\mid \phi,\tau,\beta_{k})\sim IG\left(\frac{\phi_{tk}\tau}{|\beta_{tk}|},1\right) \\ \pi(\tau\mid \phi,\beta_{k})\sim GIG\left(pa-p,1,2\sum_{t=1}^{p}\frac{|\beta_{tk}|}{\phi_{tk}}\right). \end{array}$$

To sample π(ϕ∣βk), sample Ttk∼GIG(a−1,1,2|βtk|) and set \(\phi _{tk} =\frac {T_{tk}}{T}\), \(T=\sum _{t=1}^{p}T_{tk}\), where GIG(a,b,c) is as defined above.

The conditional densities of the hyper-parameters for the Double Pareto prior are as follows:

$$\begin{array}{@{}rcl@{}} \pi\left(\tau_{tk}\mid \beta_{k}, \Lambda,y\right)\sim GIG\left(\frac{1}{2}, \Lambda_{tk}^{2}, \beta_{tk}^{2}\right)\\ \pi\left(\Lambda_{tk}\mid\beta_{tk},y\right)\sim Gamma\left(\zeta +1, \eta+ \left|\beta_{tk}\right|\right). \end{array}$$

Here, βk=(β1k,β2k,…,βpk),t=1,2,…,p.

Simulation and results

Data were simulated from the logistic regression model under various settings of sample size (n), dimension (p), and effect sizes β=(β1,β2,…,βp). The continuous covariates are simulated with mean m and SD = 1. A total of 100 data sets were simulated for each condition. All computations were carried out in RStudio [39]. The length of the Markov Chain Monte Carlo (MCMC) simulation is 10,000 iterations, of which the first 6,000 are discarded as burn-in. The value of the parameter a in the DL prior is chosen as 0.8 for optimal results and convergence.

Post-processing of MCMC samples

After model fitting, variable selection was implemented by computing posterior credible intervals. Posterior credible intervals can be readily obtained from a Bayesian framework via MCMC samples and provide direct uncertainty measures irrespective of model complexity. Specifically, these intervals are constructed from the quantiles of the MCMC samples of the parameters and estimate probability intervals of the true posterior distribution. A variable is declared significant if and only if its posterior credible interval does not include zero. This yields a binary vector indicating which variables were selected, which is then matched against the simulation truth to compute performance metrics.
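The selection rule just described can be sketched as follows, using artificial draws in place of real MCMC output (the coefficient centers and spread are illustrative assumptions).

```python
# Sketch: flag a variable if the 95% credible interval of its MCMC
# draws excludes zero.
import numpy as np

rng = np.random.default_rng(6)
n_draws, p = 4000, 5
# Stand-in "MCMC output": two clearly non-null coefficients, three null.
centers = np.array([1.5, -2.0, 0.0, 0.0, 0.0])
samples = centers + 0.2 * rng.standard_normal((n_draws, p))

lo, hi = np.percentile(samples, [2.5, 97.5], axis=0)
selected = (lo > 0) | (hi < 0)      # interval excludes zero => selected
```

With these draws, the two non-null coefficients are selected and the three null ones are not, matching the rule in the text.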

Beyond performance measures, these credible intervals also provide a convenient measure of uncertainty when needed. The varying sizes and noise content of datasets are reflected in the varying lengths of the estimated intervals, which in turn allows us to place varying degrees of confidence in the inferred relationships. This can also inform the adequacy of the sample size and number of variables for future studies. Such intervals are readily obtained from MCMC samples, unlike with traditional methods such as Lasso.

Performance measures

Based on the binary estimates, sensitivity, specificity, and overall accuracy were computed for variable selection. In addition, two continuous error measures, MSE and Brier score, were computed. The Brier Score (BS) is defined as \(BS= \frac {1}{N}\sum _{i=1}^{N} \left (P_{i} -Y_{i}\right)^{2}\), where Pi is the predicted probability and Yi the actual outcome for instance i; the best achievable score is 0 and the worst is 1. The Mean Squared Error (MSE) is given by \({\frac {1}{p}\sum _{i=1}^{p}(\beta _{i}-\hat {\beta _{i}})^{2}}\).
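The two continuous error measures can be computed directly; the helper names and toy inputs below are illustrative.

```python
# Sketch: Brier score (on predicted probabilities vs outcomes) and
# MSE (on estimated vs true coefficients).
import numpy as np

def brier_score(p_hat, y):
    """Mean squared difference between predicted probability and outcome."""
    return np.mean((np.asarray(p_hat) - np.asarray(y)) ** 2)

def mse(beta_hat, beta_true):
    """Mean squared error of the coefficient estimates."""
    return np.mean((np.asarray(beta_hat) - np.asarray(beta_true)) ** 2)

bs = brier_score([0.9, 0.2, 0.8, 0.4], [1, 0, 1, 0])    # -> 0.0625
err = mse([1.1, -0.2, 0.0], [1.0, 0.0, 0.0])            # -> 0.05 / 3
```

Both measures are zero only for perfect predictions/estimates, and the Brier score is bounded above by 1 for probabilities.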


A prediction module was also added to assess how the model fares in classifying the responses. For this, a cross-validation approach was taken, dividing each simulated data set into training and test sets with an 80-20 allocation. The following measures were used to assess the quality of prediction: Accuracy (the proportion of correctly classified responses in the test dataset); Sensitivity (proportion of true positives); Specificity (proportion of true negatives); and AUC (Area Under the Receiver Operating Characteristic (ROC) Curve).
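A sketch of this prediction assessment on synthetic data, using scikit-learn for the 80-20 split and the four measures (the data-generating settings and the plain logistic fit are illustrative assumptions, not the Bayesian models of this paper):

```python
# Sketch: 80-20 split, then accuracy, sensitivity, specificity, AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n, p = 500, 6
X = rng.standard_normal((n, p))
beta = np.array([2.0, -1.5, 1.0, 0.0, 0.0, 0.0])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob = clf.predict_proba(X_te)[:, 1]
pred = (prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
acc = accuracy_score(y_te, pred)
sens = tp / (tp + fn)    # true positive rate
spec = tn / (tn + fp)    # true negative rate
auc = roc_auc_score(y_te, prob)
```

Note that accuracy, sensitivity, and specificity depend on the 0.5 classification threshold, whereas the AUC summarizes the ranking of predicted probabilities across all thresholds.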

Simulation scenarios and tables

Table 1 summarizes the simulation results and simulation settings for prediction and VS and Table 2 records the same for MLR.

Table 1 Prediction & Variable Selection Performance for LR with Shrinkage Priors
Table 2 Prediction Performance & Variable Selection for MLR with Shrinkage Priors

Each measure has three rows corresponding to the three priors. Figures 1 and 2 represent the VS and prediction performance across the simulation scenarios for LR and MLR, respectively. Two simulation scenarios (BS8 and BS9) reflect the differential shrinkage extension discussed earlier. In BS8, demographic variables (gender, age, weight, height) and their interactions were included with the remaining covariates; the coefficients for the demographic variables and their interactions were not penalized, i.e. they were given non-shrinkage priors. Another scenario (BS9) included the same four demographic variables along with four binary Bernoulli(0.2) variables proxying SNP variables (sn1, sn2, sn3, sn4) that arise in genomics applications. Interactions between (sn3, weight) and (sn3, age) were considered, and shrinkage was applied to all SNP main effects and interactions.

Fig. 1
figure 1

(a) Variable Selection and (b) Prediction Performance in LR with Shrinkage Priors across Simulation Scenarios

Fig. 2
figure 2

(a) Variable Selection and (b) Prediction Performance in MLR with Shrinkage Priors across Simulation Scenarios

The primary takeaway from the tables is that the performance metrics improve with a higher N/P ratio and a higher magnitude of the coefficients. In Table 1, for LR, a large part of these numerical measures lie in the range of 0.8-0.9, and the trend continues in the simulation settings with differential shrinkage and interaction coefficients (BS8 and BS9). Most of the non-zero interaction coefficients in these settings were correctly identified. Higher correlation among the variables, as in BS5, and a higher percentage of zero βs (BS6) do drop the performance to some extent. Overall, the proposed method does markedly better on specificity than sensitivity. Even in the p>n scenario, most of the performance metrics stayed above 0.8 for both prediction and variable selection. In Table 2, for MLR, a larger sample size, as before, drives better performance. The setting with both large N and large P (MS7), although performing well in variable selection, has the worst predictive performance of all. We intend to explore such settings in more depth in future work with novel shrinkage methods specially geared towards them. Overall, performance scores of 0.85-0.9 or above dominate the variable selection results. The predictive performance, in contrast, is moderate compared to its counterpart in Table 1 for LR, though the prediction accuracies remain above 0.8 in 4 out of the 7 simulation settings. A larger number of classes, as in MS6 (5 classes), pushes the predictive measures down to the range of 0.7-0.8. Lastly, the performances are very similar across all three priors for both variable selection and prediction, so the method is robust to prior choice.

Data application

The method is validated by applying it to standard data examples. For all case studies, the MCMC burn-in size is 6000 and the number of MCMC iterations is 10,000. The training and test sets are split in an 80/20 ratio. All data descriptions are in Table 3.

Table 3 Data Description

Pima Indians diabetes

The data has been analyzed in the literature related to shrinkage priors, including [44,45,46]. All three priors detect four variables by 95% credible intervals: the credible intervals of the coefficients for “pregnant,” “glucose,” “mass (BMI),” and “pedigree (family history)” do not include zero, hence these variables are significant. These results are on par with [44]; the “pressure” variable is additionally determined as significant by the DP prior. The three priors are also compared with Bayesian Lasso (BLasso) [44], Bayesian Elastic-Net (BElastic) [47], and frequentist methods such as Lasso, Elastic-Net, and Ridge. The three priors’ accuracy is similar to that of the other methods, but their BS is the lowest among all methods. Even though EN and Ridge have high accuracy, the frequentist methods have substantially higher BS and lower specificity values.
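Selection by 95% credible intervals amounts to flagging a coefficient whose posterior interval excludes zero. A sketch on simulated posterior draws (the draws and variable names here are hypothetical, not the Pima posteriors):

```python
# Variable selection by 95% credible intervals from posterior draws.
import random

def credible_interval(draws, level=0.95):
    """Equal-tailed interval from sorted posterior draws."""
    s = sorted(draws)
    lo_idx = int(((1 - level) / 2) * (len(s) - 1))
    hi_idx = int((1 - (1 - level) / 2) * (len(s) - 1))
    return s[lo_idx], s[hi_idx]

def select_variables(posterior, names, level=0.95):
    """Flag variables whose credible interval excludes zero."""
    selected = []
    for name, draws in zip(names, posterior):
        lo, hi = credible_interval(draws, level)
        if lo > 0 or hi < 0:
            selected.append(name)
    return selected

random.seed(0)
signal = [random.gauss(2.0, 0.3) for _ in range(2000)]  # clearly non-zero coefficient
noise = [random.gauss(0.0, 0.3) for _ in range(2000)]   # coefficient shrunk toward zero
print(select_variables([signal, noise], ["glucose", "spurious"]))  # ['glucose']
```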


Colon cancer data

This data comprises gene expression levels for 40 tumor and 22 normal colon tissues, measured for 6500 human genes using Affymetrix oligonucleotide arrays [48]. Samples were obtained from tumor tissue and adjacent unaffected parts of the colon of the same patients [49]. Out of these, the 2000 genes with the highest minimal intensity across the tissues were selected for classification purposes [48]. In the public domain, no demographic information, such as race, age, or sex distribution, is available for the two groups; otherwise, a study could have been conducted on the benefits of including one or all of these covariates in prediction. Lasso selects 12 genes. The priors are applied to the data with the lasso estimates as the initial values. Here, a k-means approach similar to the one applied to the prostate cancer data set in [18] is adopted. The |βi| are clustered by the k-means algorithm into two clusters at each MCMC iteration. For each iteration, the number of non-zero β’s is then estimated by the smaller of the two cluster sizes. A final estimate (F) of the number of non-zero signals is obtained by taking the highest frequency over all the MCMC iterations. The gene IDs are traced back for each iteration, and the F genes with the highest frequency are chosen. The Horseshoe prior selects 36 genes with the highest frequencies, among which one gene (gene ID: 1423) is included among the top 20 genes selected by the t-test and fold change [50]. Also, gene ID: 1325, selected by Lasso [51], appears in the set of 36 genes selected by Horseshoe. The DL prior selects 164 genes, of which gene ID: 138 is included in the top 2 genes. DP selects 67 genes, of which eight are in the set selected by Horseshoe. The MCMC Gibbs sampling mixing and convergence were assessed by trace plots and auto-correlation plots, with an average effective sample size of 23095.21 for Horseshoe, 39755.4 for DL, and 40129.35 for DP.
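The per-iteration k-means step can be sketched as follows, with a simple one-dimensional two-means routine (the draws are synthetic; the paper's implementation is in R):

```python
# Per-iteration signal counting: cluster |beta| into two groups; the
# smaller cluster (the large-|beta| group under sparsity) counts signals.
import random

def two_means_1d(values, iters=50):
    """Plain 1-D k-means with k=2; returns the two cluster sizes."""
    c1, c2 = min(values), max(values)
    for _ in range(iters):
        g1 = [v for v in values if abs(v - c1) <= abs(v - c2)]
        g2 = [v for v in values if abs(v - c1) > abs(v - c2)]
        if g1:
            c1 = sum(g1) / len(g1)
        if g2:
            c2 = sum(g2) / len(g2)
    return len(g1), len(g2)

def nonzero_count(abs_betas):
    """Estimated number of non-zero betas = smaller cluster size."""
    return min(two_means_1d(abs_betas))

random.seed(3)
draws = [abs(random.gauss(0, 0.05)) for _ in range(95)] + \
        [abs(random.gauss(3, 0.2)) for _ in range(5)]   # 5 true signals
print(nonzero_count(draws))  # 5
```

Repeating this over all MCMC iterations and taking the modal count gives the final estimate F described above.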
About 95% of the genes conform to the Geweke diagnostic criteria for all three priors. A thinning step at every 15 iterations did not significantly improve the results, so the posterior prediction analysis was performed on 40,000 iterations after burn-in. The MSEs for HS, DL, and DP are 0.001, 0.003, and 0.003, implying that the MCMC has converged well. Here, on the basis of the Brier score, the Horseshoe prior performs better than the other priors.


ADNI data

Alzheimer’s disease (AD) is a critical public health concern throughout the world and one of the most widespread neurodegenerative disorders [52]. AD is an irreversible brain disease that impairs thinking and memory. Several machine learning methods have helped predict AD from Mild Cognitive Impairment (MCI); here, the shrinkage priors are used to predict AD from MCI, and they achieve high prediction measures.

This dataset is obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. Its main mission was to test whether the combination of serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be used to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD).

The data used here is obtained from the R package ADNIMERGE [42], a subset of the “adnimerge” data that contains only baseline variable measurements (i.e., the first visit for each participant) that have a diagnosis. The “adnimerge” dataset merges several key variables from various case report forms and biomarker labs from all ADNI protocols (ADNI1, ADNIGO, and ADNI2). The integrated data consists of 113 variables and 14712 observations, including multiple observations per participant that represent multiple visits to the study site for evaluation. The participants are divided into five different classes, namely Cognitively Normal (CN = 4428), Early Mild Cognitive Impairment (EMCI = 2687), Late Mild Cognitive Impairment (LMCI = 4993), Subjective Memory Complaint (SMC = 938), and Alzheimer’s Disease (AD = 1654). After pre-processing, there were 911 observations and 23 variables, including the outcome variable. The demographic, cognitive assessment, and clinical assessment attributes included in the analysis are described in [53]. The missing values are discarded, and the data is normalized. The three priors, along with Bayesian Lasso and Bayesian EN, are used to predict AD from MCI comprising both early and late stages. The three priors achieve 85% accuracy and 93.5% specificity for the prediction of AD from the MCI stage and have results comparable with the other shrinkage priors. The results also indicate that the most distinguishing attributes for the prediction of AD include the 13-item AD Cognitive Scale score and Rey’s Auditory Verbal Learning Test scores (immediate response learning, forgetting, and percentage forgetting) for all three priors. The Functional Activities Questionnaire cognitive test was selected only by the DP prior.


OASIS data

This data set consists of a longitudinal collection of non-demented and demented older adults: 150 subjects aged 60 to 96. The three priors’ most frequently included variables are Age, SES, MMSE, gender, ASF, and nWBV. The three priors’ prediction efficiency shows that they outperform other methods with respect to BS and specificity. Table 4 and the circular bar chart in Fig. 3 give a detailed view of the prediction performance of the LR model. Figure 3 compares all four data sets on a single platform: for each data set, eight bars correspond to the eight shrinkage methods, and each stacked bar consists of the five measures of prediction. For ADNI and PIMA, all methods perform similarly, though the priors do not prove efficient in the high-dimensional Colon data. For the MLR model in the ADNI data, the three classes considered are EMCI, LMCI, and AD, and the collected data for these classes had a dimension of 1438×113. A 10-fold cross-validation was used for prediction analysis. The variables selected by the priors, including “ICV,” are real-life markers for AD prediction and are validated by previous literature. The five-category model did not have high performance in terms of accuracy; DL and DP provide better metrics than Horseshoe here. For the OASIS data, the categorical response variable is classified into three classes. The moderate (1) and mild CDR (2) ratings are combined since there were only four observations for the moderate class. The variables selected are Age, Education, Socio-Economic Status (SES), Mini-Mental State Exam (MMSE), and normalized Whole Brain Volume (nWBV). Accuracy and AUC are presented in parentheses for the three priors in Table 4. Figure 4, prepared with the R packages HUM [54] and plotly [55], provides details about the behavior of the shrinkage priors. The ROC surface for the Horseshoe prior is presented here; all the other priors follow similarly.

Fig. 3
figure 3

Circular Bar Chart comparing Prediction Metrics among data sets

Fig. 4
figure 4

ROC Surface Plot for Shrinkage priors in ADNI data

Table 4 Prediction Performance in Real Life Data for LR

The convergence metrics, trace plots, and ACF plots for all data sets, used to assess the sampling procedure, are presented in the supplementary file. The Geweke diagnostic (Z) is calculated, which compares the mean of the samples drawn from the end of an MCMC chain to the mean of the samples at the beginning of the chain using a z-test statistic. A cut-off of 2 is arbitrarily chosen for the absolute value of Z, and the percentage of |Z| ≤ 2 is reported for each of the three priors.
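The Geweke check described above can be sketched as a two-sample z-statistic comparing early and late segments of a chain. This simplified version uses the conventional first-10%/last-50% windows and naive variances; the standard diagnostic additionally uses spectral-density variance estimates to account for autocorrelation:

```python
# Simplified Geweke diagnostic: z-test between the mean of the first 10%
# and the last 50% of a chain (naive variances; windows are conventions,
# the paper does not state its exact windows).
import math
import random

def geweke_z(chain, first=0.1, last=0.5):
    a = chain[: int(first * len(chain))]
    b = chain[int((1 - last) * len(chain)):]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

random.seed(7)
stationary = [random.gauss(0, 1) for _ in range(5000)]  # a converged-looking chain
print(round(geweke_z(stationary), 2))
```

A chain still drifting toward its stationary distribution produces a large |Z|, which is what the cut-off of 2 screens for.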

Overall, on two out of four datasets, the proposed method with the Horseshoe prior outperformed the frequentist methods on specificity by as much as 10 percentage points. On the other two datasets, while it falls behind on specificity, it compares favorably on sensitivity. On the colon dataset, the method has the second-highest AUC, surpassed only by BLasso. Overall, the method is clearly competitive and shows superiority in at least one of the measures over the frequentist methods on all datasets.


Conclusions

Bayesian shrinkage models can be utilized as a practical and useful alternative classification approach and a plausible way to select genetic markers and risk factors. This is a detailed study over a wide variety of settings comparing the three most well-known shrinkage priors. All three priors were able to recognize patterns differentiating the binary classes to a highly accurate level, as shown by the excellent predictive performance of the binary LR model on the simulated data sets and fairly high predictive accuracies on a wide array of public health data sets. Some of these datasets were high dimensional, which exhibits the model’s power to scale up to these challenging scenarios. The algorithm is efficient, and the time taken to execute the simulations is relatively low; for example, a run with n=500 and p=50 takes about 673 seconds. This combination of computational power and predictive performance makes it a very reasonable method for practitioners who require quick high-dimensional analysis while retaining the advantages of Bayesian analysis. In the future, we will formally explore the comparison with priors such as Spike-and-Slab priors and schemes such as the MH (Metropolis-Hastings) algorithm; we posit that features such as indicator selection and an acceptance-rejection step would not compare favorably with the proposed algorithm. We also extended the model to a multinomial logistic model that handles multiple categories. An R package, “ShrinkageBayesGlm,” has been developed and will be publicly available soon. We expect that the coming years will witness its wider dissemination in public health research and that it will be useful for predicting occurrences of common disorders such as dementia, colon cancer, and diabetes. Computational advances, especially in high-dimensional cases [56], will continue expanding the scope of exact methods.

Availability of data and materials

The datasets used and/or analysed during the current study are available, and information about them is included in this article in Table 3.



Abbreviations

SSVS: Stochastic Search Variable Selection
DL: Dirichlet Laplace
DP: Double Pareto
EN: Elastic Net
BElastic: Bayesian Elastic Net
BLasso: Bayesian Lasso
LR: Logistic Regression
MLR: Multinomial Logistic Regression
DA: Data Augmentation
MCMC: Markov Chain Monte Carlo
AUC: Area Under Curve
ROC: Receiver Operating Characteristic Curve
MSE: Mean Squared Error
BS: Brier Score
VS: Variable Selection


References

1. Reuter JA, Spacek DV, Snyder MP. High-throughput sequencing technologies. Mol Cell. 2015; 58(4):586–97.
2. Newton K, Newman W, Hill J. Review of biomarkers in colorectal cancer. Color Dis. 2012; 14(1):3–17.
3. Krishnamurti U, Silverman JF. HER2 in breast cancer: a review and update. Adv Anat Pathol. 2014; 21(2):100–07.
4. Cappuzzo F, Gregorc V, Rossi E, Cancellieri A, Magrini E, Paties CT, Ceresoli G, Lombardo L, Bartolini S, Calandri C, et al. Gefitinib in pretreated non-small-cell lung cancer (NSCLC): analysis of efficacy and correlation with HER2 and epidermal growth factor receptor expression in locally advanced or metastatic NSCLC. J Clin Oncol. 2003; 21(14):2658–63.
5. Barnes DE, Lee SJ. Predicting Alzheimer’s risk: why and how? Alzheimers Res Ther. 2011; 3(6):33.
6. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B Methodol. 1996; 58(1):267–88.
7. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B Stat Methodol. 2005; 67(2):301–20.
8. Li Y, Horowitz MA, Liu J, Chew A, Lan H, Liu Q, Sha D, Yang C. Individual-level fatality prediction of COVID-19 patients using AI methods. Front Public Health. 2020; 8:566.
9. Hoerl AE, Kennard RW. Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 1970; 12(1):55–67.
10. Bae K, Mallick BK. Gene selection using a two-level hierarchical Bayesian model. Bioinformatics. 2004; 20(18):3423–30.
11. Bhadra A, Datta J, Polson NG, Willard B, et al. The horseshoe+ estimator of ultra-sparse signals. Bayesian Anal. 2017; 12(4):1105–31.
12. Griffin J, Brown P, et al. Hierarchical shrinkage priors for regression models. Bayesian Anal. 2017; 12(1):135–59.
13. Ishwaran H, Rao JS, et al. Spike and slab variable selection: frequentist and Bayesian strategies. Ann Stat. 2005; 33(2):730–73.
14. Makalic E, Schmidt DF. A simple sampler for the horseshoe estimator. IEEE Sig Process Lett. 2015; 23(1):179–82.
15. Park T, Casella G. The Bayesian Lasso. J Am Stat Assoc. 2008; 103(482):681–86.
16. Piironen J, Vehtari A. On the hyperprior choice for the global shrinkage parameter in the horseshoe prior. In: Artificial Intelligence and Statistics; 2017. p. 905–13.
17. Armagan A, Dunson DB, Lee J. Generalized Double Pareto shrinkage. Stat Sin. 2013; 23(1):119.
18. Bhattacharya A, Pati D, Pillai NS, Dunson DB. Dirichlet–Laplace priors for optimal shrinkage. J Am Stat Assoc. 2015; 110(512):1479–90.
19. Van Erp S, Oberski DL, Mulder J. Shrinkage priors for Bayesian penalized regression. J Math Psychol. 2019; 89:31–50.
20. Mitchell TJ, Beauchamp JJ. Bayesian variable selection in linear regression. J Am Stat Assoc. 1988; 83(404):1023–32.
21. George EI, McCulloch RE. Variable selection via Gibbs sampling. J Am Stat Assoc. 1993; 88(423):881–89.
22. George EI, McCulloch RE. Approaches for Bayesian variable selection. Stat Sin. 1997:339–73.
23. Green PJ, Hastie DI. Reversible jump MCMC. Genetics. 2009; 155(3):1391–403.
24. Polson NG, Scott JG. Shrink globally, act locally: sparse Bayesian regularization and prediction. Bayesian Stat. 2010; 9:501–38.
25. Griffin JE, Brown PJ, et al. Inference with normal-gamma prior distributions in regression problems. Bayesian Anal. 2010; 5(1):171–88.
26. Carvalho CM, Polson NG, Scott JG. The horseshoe estimator for sparse signals. Biometrika. 2010; 97(2):465–80.
27. O’Hara RB, Sillanpää MJ, et al. A review of Bayesian variable selection methods: what, how and which. Bayesian Anal. 2009; 4(1):85–117.
28. Wang C, Daniels M, Scharfstein DO, Land S. A Bayesian shrinkage model for incomplete longitudinal binary data with application to the breast cancer prevention trial. J Am Stat Assoc. 2010; 105(492):1333–46.
29. Chen H-C, Wehrly TE. Approximate uniform shrinkage prior for a multivariate generalized linear mixed model. J Multivar Anal. 2016; 145:148–61.
30. Albert JH, Chib S. Bayesian analysis of binary and polychotomous response data. J Am Stat Assoc. 1993; 88(422):669–79.
31. Polson NG, Scott JG, Windle J. Bayesian inference for logistic models using Pólya–Gamma latent variables. J Am Stat Assoc. 2013; 108(504):1339–49.
32. Choi HM, Hobert JP, et al. The Polya-Gamma Gibbs sampler for Bayesian logistic regression is uniformly ergodic. Electron J Stat. 2013; 7:2054–64.
33. Van Dyk DA, Meng X-L. The art of data augmentation. J Comput Graph Stat. 2001; 10(1):1–50.
34. Hobert JP, Marchev D, et al. A theoretical comparison of the data augmentation, marginal augmentation and PX-DA algorithms. Ann Stat. 2008; 36(2):532–54.
35. Holmes CC, Held L, et al. Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Anal. 2006; 1(1):145–68.
36. Frühwirth-Schnatter S, Frühwirth R. Data augmentation and MCMC for binary and multinomial logit models. In: Statistical Modelling and Regression Structures. Springer; 2010. p. 111–32.
37. Carvalho CM, Polson NG, Scott JG. Handling sparsity via the horseshoe. Proc Twelfth Int Conf Artif Intell Stat. 2009; 5:73–80.
38. Bhattacharya A, Chakraborty A, Mallick BK. Fast sampling with Gaussian scale mixture priors in high-dimensional regression. Biometrika. 2016.
39. R Core Team. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing; 2020.
40. Leisch F, Dimitriadou E. mlbench: Machine Learning Benchmark Problems. R package version 2.1-3. 2021.
41. Silva PD. HiDimDA: an R package for supervised classification of high-dimensional data. In: 1ères Rencontres R; 2012.
42. The ADNI team. ADNIMERGE: Alzheimer’s Disease Neuroimaging Initiative. R package version 0.0.1. 2020.
43. Marcus DS, Fotenos AF, Csernansky JG, Morris JC, Buckner RL. Open access series of imaging studies: longitudinal MRI data in nondemented and demented older adults. J Cogn Neurosci. 2010; 22(12):2677–84.
44. Makalic E, Schmidt D. High-dimensional Bayesian regularised regression with the bayesreg package. arXiv:1611.06649. 2016.
45. Bové DS, Held L, et al. Hyper-g priors for generalized linear models. Bayesian Anal. 2011; 6(3):387–410.
46. Ghosh J, Li Y, Mitra R, et al. On the use of Cauchy prior distributions for Bayesian logistic regression. Bayesian Anal. 2018; 13(2):359–83.
47. Huang A, Liu D. EBglmnet: a comprehensive R package for sparse generalized linear regression models. Bioinformatics. 2016.
48. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci. 1999; 96(12):6745–50.
49. Sakellariou A, Sanoudou D, Spyrou G. Combining multiple hypothesis testing and affinity propagation clustering leads to accurate, robust and sample size independent classification on gene expression data. BMC Bioinformatics. 2012; 13(1):270.
50. Pepe MS, Longton G, Anderson GL, Schummer M. Selecting differentially expressed genes from microarray experiments. Biometrics. 2003; 59(1):133–42.
51. Algamal Z. An efficient gene selection method for high-dimensional microarray data based on sparse logistic regression. Electron J Appl Stat Anal. 2017; 10(1):242–56.
52. Bäckman L, Jones S, Berger A-K, Laukka EJ, Small BJ. Cognitive impairment in preclinical Alzheimer’s disease: a meta-analysis. Neuropsychology. 2005; 19(4):520.
53. Shahbaz M, Ali S, Guergachi A, Niazi A, Umer A. Classification of Alzheimer’s disease using machine learning techniques. In: DATA; 2019. p. 296–303.
54. Novoselova N, Della Beffa C, Wang J, Li J, Pessler F, Klawonn F. HUM calculator and HUM package for R: easy-to-use software tools for multicategory receiver operating characteristic analysis. Bioinformatics. 2014; 30(11):1635–36.
55. Sievert C. Interactive Web-Based Data Visualization with R, Plotly, and Shiny. Chapman and Hall/CRC; 2020.
56. Lee SY, Pati D, Mallick BK. Continuous shrinkage prior revisited: a collapsing behavior and remedy. arXiv preprint arXiv:2007.02192. 2020.



Acknowledgements

The authors would like to acknowledge Dr. Craig McClain and Dr. S. Srivastava for funding to conduct the research.

ADNI data

Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer's Association; Alzheimer's Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health. The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer's Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

*Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at the ADNI website.

OASIS data

Authors would like to acknowledge the following for the OASIS data

• OASIS: Cross-Sectional: Principal Investigators: D. Marcus, R. Buckner, J. Csernansky, J. Morris; P50 AG05681, P01 AG03991, P01 AG026276, R01 AG021910, P20 MH071616, U24 RR021382

• OASIS: Longitudinal: Principal Investigators: D. Marcus, R. Buckner, J. Csernansky, J. Morris; P50 AG05681, P01 AG03991, P01 AG026276, R01 AG021910, P20 MH071616, U24 RR021382

• OASIS-3: Principal Investigators: T. Benzinger, D. Marcus, J. Morris; NIH P50AG00561, P30NS09857781, P01AG026276, P01AG003991, R01AG043434, UL1TR000448, R01EB009352. AV-45 doses were provided by Avid Radiopharmaceuticals, a wholly owned subsidiary of Eli Lilly.


Funding

Dr. S. Rai was partly supported by the Wendell Cherry Chair in Clinical Trial Research Fund, multiple National Institutes of Health (NIH) grants (5P20GM113226, PI: McClain; 1P42ES023716, PI: Srivastava; 5P30GM127607-02, PI: Jones; 1P20GM125504-01, PI: Lamont; 2U54HL120163, PI: Bhatnagar/Robertson; 1P20GM135004, PI: Yan; 1R35ES0238373-01, PI: Cave; 1R01ES029846, PI: Bhatnagar; 1R01ES027778-01A1, PI: States; 1P30ES030283, PI: States), and a Kentucky Council on Postsecondary Education grant (PON2 415 1900002934, PI: Chesney).

Author information




A.B. has contributed to the methodology, data collection, analysis, and writing of the manuscript. S.P. and R.M. have contributed equally to methodology and analysis. S.R. has contributed to developing ideas and valuable comments. All authors have contributed to the final preparation of the manuscript.

Corresponding author

Correspondence to Shesh Rai.

Ethics declarations

Consent for publication

Not applicable

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1

The additional tables and figures are presented in the Supplementary file.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Bhattacharyya, A., Pal, S., Mitra, R. et al. Applications of Bayesian shrinkage prior models in clinical research with categorical responses. BMC Med Res Methodol 22, 126 (2022).

Keywords
  • Shrinkage priors
  • Logistic regression
  • Horseshoe
  • Dirichlet Laplace
  • MCMC
  • Polya-Gamma
  • Multinomial
  • ADNI
  • Pima
  • Data augmentation