Applications of Bayesian shrinkage prior models in clinical research with categorical responses

Background Prediction and classification algorithms are commonly used in clinical research for identifying patients susceptible to clinical conditions such as diabetes, colon cancer, and Alzheimer’s disease. Developing accurate prediction and classification methods benefits personalized medicine. Building an excellent predictive model involves selecting the features that are most significantly associated with the outcome. These features can include several biological and demographic characteristics, such as genomic biomarkers and health history. Such variable selection becomes challenging when the number of potential predictors is large. Bayesian shrinkage models have emerged as popular and flexible methods of variable selection in regression settings. This work discusses variable selection with three shrinkage priors and illustrates its application to clinical data such as Pima Indians Diabetes, Colon cancer, ADNI, and OASIS Alzheimer’s real-world data. Methods A unified Bayesian hierarchical framework that implements and compares shrinkage priors in binary and multinomial logistic regression models is presented. The key feature is the representation of the likelihood by a Polya-Gamma data augmentation, which admits a natural integration with a family of shrinkage priors, specifically focusing on Horseshoe, Dirichlet Laplace, and Double Pareto priors. Extensive simulation studies are conducted to assess the performances under different data dimensions and parameter settings. Measures of accuracy, AUC, brier score, L1 error, cross-entropy, and ROC surface plots are used as evaluation criteria comparing the priors with frequentist methods as Lasso, Elastic-Net, and Ridge regression. Results All three priors can be used for robust prediction on significant metrics, irrespective of their categorical response model choices. Simulation studies could achieve the mean prediction accuracy of 91.6% (95% CI: 88.5, 94.7) and 76.5% (95% CI: 69.3, 83.8) for logistic regression and multinomial logistic models, respectively. The model can identify significant variables for disease risk prediction and is computationally efficient. Conclusions The models are robust enough to conduct both variable selection and prediction because of their high shrinkage properties and applicability to a broad range of classification problems. Supplementary Information The online version contains supplementary material available at (10.1186/s12874-022-01560-6).


Introduction
accessible for statistical prediction analysis and clinical research. Medical decisions are modified upon the identification of essential covariates for a specific clinical outcome. Treatment recommendations are also age, population, and comorbidity specific, with HTS bringing a paradigm shift in large-scale personalized medicine. For example, in diabetes prediction, various biological and clinical factors such as sex, age, obesity are responsible for predicting the future risk of the disease. In oncology, for instance, in colorectal cancer [2], breast cancer [3], non-small cell lung cancer(NSCLS) prognostic [4], and predictive genes such as human epidermal growth factor receptor 2 (HER2), BRAF, KRAS are not only essential for early detection but also therapy selection, subgroup stratification, and controlled monitoring of the disease. In Alzheimer's disease (AD), a combination of traditional risk factors including age, education, hypertension, obesity, cognitive test scores, cardiovascular disease [5] along with testing of ApoE4 gene can be informative in identifying if a patient is vulnerable to the non-curable disease or any other forms of dementia.
A clinical prediction or classification model, followed by a variable selection strategy, plays a vital role in designing preventive measures from adverse outcomes.

Penalized regression
Variable selection methods can mitigate noise imposed by irrelevant variables in the data. Most of these methods are frequentist approaches such as Lasso [6], Elastic-Net [7] that uses L1 or L2 penalty to shrink the coefficients of irrelevant variables. Bayesian regression methods that incorporates shrinkage properties as prior information is a reasonable alternative approach, and is the ultimate focus of this research work.
Public health researchers often encounters prediction problems with categorical responses. Depending on the clinical features the primary outcome needs to be predicted, such as the classification of a Covid-19 patient as "discharged" or "died" or "remained in hospital" with individual-patient level data [8]. When the response variable is binary, logistic regression (LR) models enable assessing the association between an independent variable(s) and the response variable. When the response variable has more than two categories, generalizations of the logistic model, such as multinomial logistic regression (MLR) models, are common. The above example provides a better understanding of categorical response models and their contribution as a natural and attractive approach in a diverse array of applications, ranging from spam detection, credit card fraud to predicting Alzheimer's stage and tumor malignancy. The categorical response models' primary goal is to infer relationships between a categorical response and a set of explanatory variables.
As in any form of regression technique, this is critical to understanding the causal dynamics of a complex system and making accurate predictions on future data. Depending on a specific problem, the explanatory variables or predictors can be of the form of demographic profiles, socio-economic data, or complex biomolecular entities. Modeling becomes particularly challenging when the number of such predictors grows large, a common feature in genomics. Traditionally, frequentist likelihoodbased methods have been using penalized regression techniques to address this "curse of dimensionality" and to deal with data having a considerable number of predictors in linear and multicategory response regression. These methods briefly add penalty terms to the negative log-likelihood function and then optimize this penalized likelihood for parameter inference. The penalization leads to a desired "shrinking" behavior of the estimates. Specifically, it pulls the small coefficients towards zero while minimally affecting the large coefficients, thus selecting only the most relevant variables. For a LR set up, the objective function that needs to be minimized is of the form Here, y=(y 1 ,y 2 ,...,y n ) is an n-dimensional vector representing the outcome variable; x i =(x i1 ,x i2 ,…,x ip ) are the covariates with x ij are the observed values on the p predictors and (β 1 ,β 2 ,...,β p ) T is a p-dimensional parameter vector of regression coefficients. λ is the penalty parameter which modulates the amount of shrinkage. Large values of λ lead to more shrinkage, while λ=0 leads to the simple logistic likelihood without shrinkage. The generalized objective function for multiple categories is Different values of r lead to various penalization techniques. For example, r=1 results in the well-known Least Absolute Shrinkage and Selection Operator (Lasso) [6] solution and r=2 results in the Ridge regression solution [9]. Elastic-Net [7] is another method that is a convex combination of Lasso and Ridge.

Shrinkage priors
Bayesian variable selection models form a natural counterpart to penalization, with some specialized priors assuming the penalty terms' role. Some notable works in this area include [6, 10-18] among many. A comprehensive overview of shrinkage priors for data applications is present in [19]. The historical development of shrinkage priors can be traced back to the spike and slab approach (1) proposed in [20]. It is a hierarchical mixture model that essentially uses a latent binary indicator to differentiate between the large and small regression coefficients. It does so by assigning Gaussian priors with high and low variances to the coefficients, conditional on these indicators. The resultant marginal prior for a coefficient takes on a shape with a "spike" at 0 and a relatively flat "slab" at non-zero regions, lending the model its name. Stochastic search variable selection (SSVS) [21] is used for identifying a subset of significant covariates with a hierarchical normal mixture model structure, similar to the spike and slab prior. However, it uses continuous Gaussian distribution to include and exclude variables and lower variance, respectively. The spike-and slab prior has a bi-separation effect on the model coefficients, thus bringing a computational complexity of 2 p . SSVS is computationally intensive and cannot tackle a large set of predictors. Inclusions of non-conjugate and conjugate priors are detailed in this extension of the work [22]. A related class of variable selection (VS) models put the positive prior mass at 0. Inference in these models usually relies on Reversible Jump sampling techniques [23]. Though these models have the nice property of explicitly reducing coefficients to zero, they typically incur high computational costs and exhibit poor mixing. These issues necessitated using a computationally efficient class of priors, which allowed sufficient shrinkage in high dimensions. The focus solely is on such priors for this article. These priors are called global-local(GL) shrinkage priors [24]. Denoting the parameter set as β=(β 1 ,...,β p ) T , the GL framework can be written hierarchically as follows β j ∼N(0,λ j τ) λ j ∼f(.) and τ∼g(.). The global parameter τ controls the overall shrinkage of all the coefficients towards zero while the local parameters λ=(λ 1 ,...,λ p ) modulate the coefficient-specific shrinkage. It is desired that the distribution of the local parameter f(.) has heavy tails and the distribution of the global parameter g(.) has substantial mass at point zero. The Normal-Gamma prior [25], the Dirichlet Laplace (DL) prior [18], Horseshoe prior [26], and the Double Pareto (DP) prior [17] are some of the GL priors. A review and comparison of shrinkage methods for VS are detailed in [27]. High-dimensional statistical modelling with Bayesian shrinkage prior method has been extended in several arenas such as longitudinal binary data with informative missingness [28], and for joint modelling of clustered mixed outcomes with uniform shrinkage prior [29]. The second piece of our proposed approach focuses on a convenient data augmentation representation of the logistic regression likelihood. Two data augmentation algorithms that have gained popularity for binary responses which are for the probit regression model that uses truncated normal random variables [30] and another for the logistic regression that uses Polya-Gamma (PG) random variables [31]. Here we focus on the latter scheme.
In the following sections, the PG Data Augmentation is described under LR and MLR models. Next, the dataaugmented likelihood framework's connection to the Horseshoe, DL, and DP priors is presented, embedding them in a fully Bayesian hierarchical model. Performance metrics evaluate the simulation results with different sample sizes, covariate dimensions, and parameter settings with several data applications. The limitations and contributions are in the Discussion section where the article is concluded with future research directions.

Bayesian binary regression and polya-Gamma data augmentation
Consider the binary logistic regression model where Y 1 ,Y 2 ,...,Y n are i.i.d. Bernoulli random variables with variables x i ∈ R p n i=1 as the covariates and β ∈ R p denotes the unknown regression coefficients. The corresponding likelihood is given by In the context of the Bayesian analysis, the posterior of β given the data is given by where π(β) denotes the prior distribution for β. Unfortunately, the above posterior becomes intractable due to the presence of the term (1 + exp(x T i β)) in its denominator [32]. Therefore, posterior sampling in this setup traditionally relied on the Metropolis Hastings (MH) algorithm. As an alternative, the Data Augmentation (DA) algorithm utilizes the latent variables to circumvent the difficulty. Before introducing to the specific Polya-Gamma DA scheme that we used in this article, we provide the general structure of a DA algorithm. Suppose, we need to sample a parameter of interest θ from an intractable density π(θ). The technique requires to design a suitable joint density π(θ,W) in such a way that it satisfies the following two criteria ; firstly, π(θ, W )dW = π(θ) and secondly, the corresponding conditional distributions, π(θ|W) and π(W|θ) are possible to sample from [33,34]. In the context of the Bayesian analysis, π(θ) typically refers to a posterior density for a parameter of interest θ. A major challenge for designing a DA is to construct an appropriate choice for π(θ,W) (see [30], [31]). A commonly used strategy is to build a conditional distribution π(W|θ) so that π(θ,W)=π( W|θ)π(θ) fulfills the requirements. The latent variable W is often termed as the augmented random variable. In the current context, our parameter of interest is β with the posterior density π(β|Y) while we make use of the DA technique designed in [31] where the authors develops and . utilizes the Polya-Gamma (PG) distribution as the choice for the augmented random variable. As we will require in the later sections, we include the probability density function of the Polya-Gamma distribution (denoted by PG(1,c)) [31], as following To deal with the intractability in Eq. 3, the DA scheme augments independent latent variables, ) satisfies the two criteria related to a generic DA algorithm that are mentioned above. The integrability criterion trivially holds whereas the the random variables W 1 ,…,W n given {Y i } n i=1 , β are independent and follows PG distribution [31]. The posterior condi- becomes to be the multivariate normal if the multivariate normal prior is used for β [31]. The appearance of normal distribution as a posterior for β is a key feature which provides a way to utilize more nontrivial priors for β without much difficulty. Specifically, in this manuscript we use the Global-Local priors that we discuss in details in the next section. As mentioned earlier, Bayesian regression for binary responses has been recognized as a hard problem due to the likelihood's unwieldy form. There have been several efforts towards an improved version of the DA algorithm [30] for probit regression. Some notable works include [35] and [36]. However, these algorithms are more imprecise versions of [30] making it significantly difficult with multiple layers of latent variables and restricting its usage. In contrast, the DA algorithm by [31] is free of these problems and computationally much less cumbersome.

Logistic regression model with hierarchical prior structures
In this subsection, we include the details of each of the three priors distributions along with the corresponding posterior distributions. The original form of the horseshoe prior [37] is represented as Another computationally feasible hierarchical representation of the Horseshoe prior [14] that is used here is Here, the variance covariance matrix Σ of the distribution of β is a diagonal matrix with elements (Λ 1 τ 2 ,…Λ p τ 2 ). From Eqs. 3,(4) and the above hierarchical prior structure (6), the full posterior distribution is given by: where, h(w i ) is defined in Eq. 4. The conditional distributions required for the analysis are as follows: The conditional density of β given y, w is where, W D and Σ are diagonal matrices where the elements are (w 1 ,w 2 ,...,w n ), (� 2 1 τ 2 , ..., � 2 p τ 2 ) respectively and, y * = y 1 − 1 2 , ...., y n − 1 2 . In high-dimensional scenarios where sampling of β can be difficult for p>n case, fast sampling technique with Gaussian scale-mixture priors [38] is used where the mean and variance of a Gaussian distribution is in the respective form: The algorithm involves sampling u∼N(0,D), and δ∼N(0,I n ); with V=Au+δ getting the inverse of (ADA T +I n )w=(α−v), and finally obtaining θ=u+DA T w,θ∼N(μ,Σ). The conditional density of w i given x i ,β is The conditional density of the hyper-parameters are as follows Here, all the posterior densities are in the closed form, and follow simple probability densities such as Normal, Polya-Gamma and Inverse-Gamma making sampling from them trivial. The hierarchical structure of the Dirichlet Laplace prior [18] is The conditional posterior distributions remain same for β|y i and w i |β is similar to that of Eqs. 8 and 9.  The conditional density of the hyper-parameters as obtained similar to Theorem 2.2 in [18] are as follows: . The hierarchical structure of Double Pareto prior [30] is Again, the conditional densities of β|y i and w i |β remains same as (8) and (9). Here Σ=D τ is a diagonal matrix with elements (τ 1 ,τ 2 ,...,τ p ). The conditional density of rest of the hyper-parameters are as follows:

Extending hierarchical models and differential shrinkage
The strength of our methods are in no way limited to a common shrinkage prior across covariates. In fact, this could be applied to several different variant models which allow for borrowing strength across multiple responses and different (in some cases, user-defined) levels of sparsity among groups of covariates. In particular, this could be used for models where we can choose not to shrink some covariates (for instance, some demographic covariates) in contrast to the genomic covariates.
In terms of estimation procedures, the hierarchical structure of the models would allow the form of the posterior conditional of β would remain the same in the model variants. Only the diagonal weight matrices appearing in the posterior means and variances would have some differential allocation of λs and constant variances depending on which variable that we choose to shrink. Similarly, the posterior sampling of the shrinkage hyper-parameters would be a subset. Details of the derivation are omitted since these are straight forward.

Bayesian multinomial logistic regression and polya-Gamma data augmentation
This is the extension of the binary regression, where the outcome variable has more than two classes. Let Y i , i=1,2,…,n be a categorical random variable with (12) k categories, where k≥2. The probability for the k th category is p ik =P(Y i =k|x i ), where K k=1 p ik = 1 . The multinomial logistic regression (MLR) model is given as Here, β k are the coefficients associated with k-th category, and K is the baseline category with its coefficient β K constrained to zero. The Polya-Gamma data augmentation approach can be extended for multi-category response variables. Here, β k is updated conditional on the remaining β −k =(β 1 ,β 2 ,…,β k−1 ,β k+1 ,…,β K ). The full conditional for β k given y and β j≠k can be expressed as a likelihood of the Bernoulli distribution.
It can be re-written as where, Thus, the full conditional for β k given y and β −k is which is a logistic regression likelihood, thus PG data augmentation can be implemented to update each of β k 's based on the binary indicator variable I(y i =k). The above Eq. 17 after including PG data augmentation and prior for β k ∼N(0,Σ k ) can be expressed as:

Multinomial logistic regression with hierarchical prior structures
The hierarchical structure with three priors was calculated similarly to LR. Considering N (0, � 2 t τ 2 ), t = 1, 2, . . . , p , for β 1 ,β 2 ,…,β K−1 , the conditional density for β k and w ik are given as The conditional density of the hyper-parameters in the case of horseshoe prior is The conditional density of the hyper-parameters in case of DL prior is . The conditional densities of the hyper-parameters are as follows: Here, β k =(β 1k ,β 2k ,…,β pk ),t=1,2,…,p.

Simulation and results
Data was simulated from the logistic regression model under various settings of sample size (n), dimensions (p), and effect sizes β=(β 1 ,β 2 ,…,β p ). The continuous covariates are simulated with mean m and sd=1. A total of 100 data sets were simulated for each of the conditions. All computations are carried out in RStudio [39]. The length of the Markov Chain Monte Carlo (MCMC) simulation is 10000, and 6000 iterations are discarded in the burn-in step. The value of parameter a in DL prior is chosen as 0.8 for optimal results and convergence.

Post-processing of mCMC samples
After model fitting variable selection was implemented by computing their posterior credible intervals. Posterior credible intervals can be readily obtained from a Bayesian framework via MCMC samples, and can provide direct uncertainty measures irrespective of model complexity. Specifically, these intervals are constructed by the quantiles of the MCMC samples of the parameters and estimate the probability intervals of the true posterior distribution. The variables are declared to be significant if and only if these posterior credible interval do not (20) include zero. Thus, it will be a binary vector of the variables that were selected from estimation. These binary estimates were then matched against the simulation truth to compute some performance metrics.
Other than performance measures, these credible interval techniques would also provide a convenient measure of uncertainty, in case we need to use them. The varying sizes and noise content of datasets gets reflected in the varying lengths of the estimated prediction intervals that in turn allows us to place varying degrees of confidence in our inferred relationships. This could also inform us about the desired adequacy of sample size and number of variables for future studies. This can be readily obtained from MCMC samples unlike other traditional methods such as Lasso.

Performance measures
Based on the binary estimates, sensitivity, specificity and overall accuracy were computed for variable selection. In addition, two continuous error measures, MSE and Brier scores,were computed. The Brier Score (BS) is defined as here, P i is the probability of prediction and Y i is the actual outcome at that instance. The best score achievable is 0 and the worst is 1; The Mean Squared Error (MSE) is given by 1

Prediction
A prediction module was also added that would assess how the model fares in classifying the responses. For this, a cross-validation approach and divided each simulated data set into training and test data on a 80-20 allocation percentage. The following measures were used for assessing the quality of our prediction. Accuracy (the proportion of correctly classified responses in the test dataset); Sensitivity (proportion of true positives); Specificity (proportion of true negatives); AUC (Area Under the Receiver Operating Characteristic (ROC) Curve). Each measure has three rows corresponding to the three priors. Figures 1 and 2 represents the VS and prediction performance across the simulation scenarios for LR and MLR, respectively. The two simulation scenarios (BS8 and BS9) reflect the extension discussed in the earlier sectionnamely, the differential shrinkage option. In BS8, some demographic variables (gender, age, weight, height) as well as their interactions were included with the remaining covariates. The coefficients for the demographic variables and their interactions were not penalized, i.e. given nonshrinkage priors. Another simulation scenario (BS9) was      explored that included the same four demographic variables along with four binary variables ∼Ber(1,0.2). These proxied SNP variables-(sn1,sn2,sn3,sn4)-that arise in genomics applications. Interactions between (sn3,weight), and (sn3,age) were considered. Shrinkage was applied to all the main effects of SNPs and the interactions. The primary takeaway from the results in the tables is that performance metrics reach increasingly higher levels with higher N/P ratio and higher magnitude of the coefficients. In Table 1, for LR, a large part of these numerical measures lie in the range of 0.8-0.9 and the trend also continues in the simulation settings with differential shrinkage and interaction coefficients (BS8 and BS9). Most of the non-zero interaction coefficients in these settings were correctly identified. Higher correlation among the variables, as in BS5, and higher percentage of zero β ′ s (BS6) does drop the performance to some extent. Overall, the proposed method does markedly better on specificity than sensitivity. Even in the p>n scenario, most of our performance metrics stayed above 0.8 for both prediction and variable selection. In Table 2, for MLR, larger sample size, as before drives better performance. Both large N and large P (MS7), although performing well in variable selection, has the worst predictive performance of all. We intend to explore such settings in more depth in future work with novel shrinkage methods specially geared towards such settings. Overall, performance scores in the range of 0.85-0.9 or above dominate the results in variable selection. The predictive performance, in contrast, is moderate compared to the counterpart in Table 1 for LR, though the prediction accuracies remain above 0.8 in 4 out of the 7 simulation settings. It is seen that a larger number of classes, as in MS6 (5 classes), push down the predictive measures to the range of 0.7-0.8. Lastly, the performances are very similar across all three priors for variable selection as well as prediction, so the method is robust to prior choices.

Data application
The method is validated by applying it to standard data examples. For all case studies, MCMC burn-in size is 6000, and no. of MCMC iterations is 10000. The training and test set size is of ratio 80/20. All data descriptions are in Table 3.

Pima indians diabetes
The data has been analyzed in the literature related to shrinkage priors, including [44][45][46]. The three priors can detect four variables by 95% credible intervals. The credible intervals of the coefficients with variable names "pregnant, " "glucose, " "mass (BMI), " and "pedigree (family history)" do not include zero; hence they are significant. These results are at par with [44]; the "pressure" variable is also determined as significant by DP prior. These three priors are also compared with Bayesian-Lasso(BLasso) [44], Bayesian-Elastic-Net(BElastic) [47] and frequentist methods such as Lasso, Elastic-Net and Ridge. Here, the three priors' accuracy is similar, but BS is the least among all methods. Even though EN and Ridge have high accuracy, the frequentist methods have substantially high BS and low specificity values.

Colon
This data comprises of gene expression levels for 40 tumors and 22 normal colon tissues measured for 6500 human genes using the Affymetrix oligonucleotide arrays [48]. Samples are obtained from tumor tissue and adjacent unaffected parts of the colon of the same patients [49]. Out of these, 2000 genes with the highest minimal intensity across the tissues were selected for classification purposes [48]. In the public domain, there is no demographic information available for the two groups such as race, age, sex distribution; else, a study could have been conducted about the benefits of inclusion of one or all these covariates in prediction. Lasso selects 12 genes. The priors are applied to the data with the lasso estimates as the initial values. Here, an approach of k-means similar to [18] with prostate cancer data set is adopted. The |β i | are clustered by K-means algorithm at each MCMC iteration into two clusters. For each iteration, the number of non-zero β's is then estimated by the smaller cluster size out of the two clusters. A final estimate (F) of the number of non-zero signals is obtained by taking the highest frequency over all the MCMC iterations. The F-number of gene IDs are traced back for each iteration and the first F genes with the highest frequency are chosen. In the case of Horseshoe prior, it selects 36 genes with the highest frequencies, among which one gene (gene ID: 1423) is included among the top 20 genes selected by the t-test and fold change [50]. Also, the gene ID: 1325 selected by Lasso [51]

ADNI
Alzheimer's disease (AD) is a critical public health concern throughout the world and one of the most widespread neurodegenerative disorder [52]. AD is an irreparable brain disease, which impairs thinking and memory. Several machine learning methods have helped predict AD from Mild Cognitive Impairment (MCI); here, the shrinkage priors are used to predict AD from MCI, and they achieve high prediction measures. This dataset is obtained from Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The main mission was to test if the combination of serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be used to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD).
The data used here is obtained from the R package ADNIMERGE [42] which is a subset of the "adnimerge" data that contains only baseline variable measurements (i.e., the first visit for each participant) that has a diagnosis. The "adnimerge" datas et merges several key variables from various case report forms and biomarker lab from all ADNI protocols (ADNI1, ADNIGO, and ADNI2). The integrated data consists of 113 variables and 14712 observations, which include multiple observations per participant, representing multiple visits to the study's site for participant evaluation The participants is divided into five different classes, namely, Cognitively Normal (CN =4428), Early Mild Cognitive Impairment (EMCI =2687), Late Mild Cognitive Impairment (LMCI =4993), Subjective Memory Complaint (SMC =938), and Alzheimer's Disease (AD =1654). After pre-processing, there were 911 observations, and 23 variables

OASIS
This data set consists of a longitudinal collection of Non-demented and Demented Older Adults, 150 subjects aged 60 to 96. The three priors' most frequently included variables are Age, SES, MMSE, gender, ASF, and nWBV. The three priors' prediction efficiency shows that they outperform other methods concerning BS and specificity. Table 4, and the circular bar chart Fig. 3 gives a detailed view of the prediction performance of LR model. Figure 3 is efficient in describing the comparison of all the four data sets in a single platform. For each data set, eight bars correspond to   Table 4. Figure 4 is prepared by the R packages HUM [54], and plotly [55] that provides details about the behavior of the shrinkage priors. The ROC surface for horseshoe prior is presented here; all the other priors follow similarly. The convergence metrics, trace plots, and acf plots for all data sets to assess the sampling procedure are presented in the supplementary file. The Geweke diagnostic (Z) is calculated that compares the mean of the samples drawn from the end of a chain of MCMC output to the mean of the samples at the beginning of the chain using a z-test statistic. A cut-off of 2 is arbitrarily chosen for the absolute value of Z. The percentage of |Z|<=2 is reported for each of the three priors.
Overall, it is seen that on two out of four datasets, our proposed method with the horseshoe prior outperformed the frequentist methods on specificity by as much as 10 percentage points. On the other two datasets, while we are behind on the specificity, we compare much favorably on sensitivity. The colon dataset has the second highest AUC from our method only superseded by Blasso. Overall, our method is clearly competitive and  22:126 shows superiority in at least one of the measures over the frequentist methods in all datasets.

Discussion
Bayesian shrinkage models can be utilized as a practical and useful alternative classification approach and a plausible way to select genetic markers and risk factors. This is a detailed study on a wide variety of settings comparing the three most well-known shrinkage priors. All of the three priors were able to recognize patterns differentiating the binary classes to a highly accurate level. This was shown by the excellent predictive performance of the binary LR model on the simulated data sets and fairly high predictive accuracies on a wide array of public health data sets. Some of these datasets were high dimensional, which exhibits the model's power to scale up to these challenging scenarios. The algorithm is efficient, and the time taken to execute the simulations is relatively low, such as algorithm with n=500;p=50 takes about 673 seconds. This combination of computational power and predictive performance makes it a very reasonable method of use for practitioners who require quick high dimensional analysis, retaining the advantages of Bayesian analysis. In future we would formally explore the comparison with priors such as Spike and Slab priors and schemes ase the MH algorithm. We posit that features such as selections of indicators and acceptance-rejection step would not favorably compare with the proposed algorithm. We also extended the model to a multinomial logistic model that handles multiple categories. An R package "ShrinkageBayesGlm" is developed to be publicly available soon. We expect that the coming years will witness its wider dissemination among public health research and will be useful for predicting occurrences of common disorders as dementia, colon cancer, diabetes, etc. Computational advances, especially in high-dimensional cases [56] will continue expanding the scope of exact methods.