A Bayesian latent class extension of naive Bayesian classifier and its application to the classification of gastric cancer patients
BMC Medical Research Methodology volume 23, Article number: 190 (2023)
Abstract
Background
The Naive Bayes (NB) classifier is a powerful supervised algorithm widely used in Machine Learning (ML). However, its effectiveness relies on a strict assumption of conditional independence, which is often violated in real-world scenarios. To address this limitation, various studies have explored extensions of NB that tackle the issue of non-conditional independence in the data. These approaches can be broadly categorized into two main categories: feature selection and structure expansion.
In this particular study, we propose a novel approach to enhancing NB by introducing a latent variable as the parent of the attributes. We define this latent variable using a flexible technique called Bayesian Latent Class Analysis (BLCA). As a result, our final model combines the strengths of NB and BLCA, giving rise to what we refer to as NB-BLCA. By incorporating the latent variable, we aim to capture complex dependencies among the attributes and improve the overall performance of the classifier.
Methods
Both Expectation-Maximization (EM) algorithm and the Gibbs sampling approach were offered for parameter learning. A simulation study was conducted to evaluate the classification of the model in comparison with the ordinary NB model. In addition, real-world data related to 976 Gastric Cancer (GC) and 1189 Non-ulcer dyspepsia (NUD) patients was used to show the model's performance in an actual application. The validity of models was evaluated using the 10-fold cross-validation.
Results
The presented model was superior to ordinary NB in all the simulation scenarios, with higher classification sensitivity and specificity in test data. In the real-world application, the accuracy of the NB-BLCA model using Gibbs sampling was 87.77 (95% CI: 84.87-90.29). The corresponding accuracy was 77.22 (95% CI: 73.64-80.53) for the NB-BLCA model using the EM algorithm and 74.71 (95% CI: 71.02-78.15) for the ordinary NB classifier.
Conclusions
When considering the modification of the NB classifier, incorporating a latent component into the model offers numerous advantages, particularly within medical and health-related contexts. By doing so, the researchers can bypass the extensive search algorithm and structure learning required in the local learning and structure extension approach. The inclusion of latent class variables allows for the integration of all attributes during model construction. Consequently, the NB-BLCA model serves as a suitable alternative to conventional NB classifiers when the assumption of independence is violated, especially in domains pertaining to health and medicine.
Background
The Naive Bayes (NB) classifier is a well-established supervised algorithm in the field of Machine Learning (ML). Its simplicity and effectiveness in classification tasks have made it widely adopted across various domains [1, 2]. However, the NB classifier is built upon a fundamental assumption of conditional independence, wherein all feature pairs are considered mutually independent given the class variable [3]. In practical real-world scenarios, this assumption is frequently violated, resulting in a reduction in the algorithm's performance [4].
In the context of health and medical domains, the features employed in analysis often originate from diverse aspects related to the subjects under study [5]. These features can encompass symptoms in diagnostic scenarios or risk factors in the context of risk assessment. Consequently, the dependence among these features, even within a specific class, becomes inevitable. This dependency violates the assumption of conditional independence and calls for alternative approaches to effectively model and classify the data.
The issue of non-conditional independence in data has been addressed by various studies, proposing extensions of the Naive Bayes (NB) classifier [6]. These approaches can be classified into two major categories. Firstly, some studies focused on altering the features through subset selection or assigning weights to them [7,8,9,10,11]. These approaches involve a search strategy to identify the most relevant features that optimize the classification performance of NB. Feature selection methods aim to identify critical variables based on their contribution to classification and eliminate less influential ones [12]. Alternatively, feature weighting algorithms retain all variables in the model while assigning them importance weights [13,14,15]. However, these algorithms heavily rely on the characteristics of the observed data, and their results can vary accordingly. Moreover, these methods are computationally demanding, because the underlying search problems are NP-hard, that is, at least as difficult to solve as the hardest problems in the complexity class NP [13].
In an alternative approach, some studies have proposed expanding the structure of the Naive Bayes (NB) classifier to accommodate conditional dependence among the features. Examples of such methods include the Augmented Naive Bayes (ANB) [16, 17], Tree Augmented Naive Bayes (TAN) [18], extended Tree Augmented Naive Bayes (eTAN) [19], k-dependence Bayesian classifier [20], and Averaged One-Dependence Estimators (AODE) [21]. These algorithms share a common feature of augmenting the relationship set by introducing additional arcs between features. However, as more relationships are added to the original NB structure, the computational complexity increases. Hence, the challenge lies in balancing the added relationships against the computational complexity. Consequently, the search algorithms employed in this context face the same issue of being NP-hard [22].
An appealing alternative approach in extending the structure involves incorporating a latent variable into the model. By introducing a latent variable, we can effectively capture the correlation between features and enforce conditional independence within the structure [23,24,25]. The utilization of latent variables holds particular relevance in health and medical applications, especially in cases where the underlying causal mechanisms of diseases remain unknown. Additionally, latent variables find application in situations where the direct cause of a disease is not directly measurable, but certain observable variables can provide valuable insights into it [5]. Real medical data often involves complex interactions and relationships among various factors that influence health outcomes. The inclusion of latent variables provides a mechanism to capture these hidden factors, which may not be directly observable or measured [26, 27]. By incorporating latent variables into our models, we can account for unobserved factors that impact the observed features, leading to a more comprehensive understanding of the underlying mechanisms and improved predictive accuracy.
Defining a latent variable in the context of Naive Bayes (NB) requires careful consideration. Firstly, the placement of the latent variable within the structure determines its relationship with the features and class. For example, Langseth and Nielsen (2006) proposed a hierarchical NB model where class variables serve as the root, attributes act as leaf nodes, and multiple latent variables act as parents to the leaf nodes [28]. Calders and Verwer (2010) presented an NB model for discrimination-free classification, incorporating a single latent variable as the parent of the class variable [29]. Similarly, Alizadeh et al. (2021) introduced a multi-independent latent component extension of NB, featuring a latent variable as the parent of attributes and also linked to the class variable [23].
Additionally, defining the latent variable(s) requires careful consideration. The latent variable should encapsulate all relevant information from the attributes while assisting the NB structure in maintaining the assumption of conditional independence. Striking a balance between capturing the dependencies in the data and preserving the conditional independence assumption is essential in defining the latent variable(s).
This study introduces a novel approach by incorporating a latent variable as the parent of attributes, similar to the model proposed by Calders and Verwer. However, our proposed model offers reduced complexity compared to the previous approach. The latent variable is defined using Bayesian Latent Class Analysis (BLCA), providing flexibility in modeling. As a result, our final model combines elements of both Naive Bayes (NB) and BLCA, and we refer to it as NB-BLCA. To learn the model's parameters, we provide two options: the Expectation-Maximization (EM) algorithm and the Gibbs sampling approach. A comprehensive simulation study is conducted to assess the classification performance of the proposed model. Furthermore, we apply the model to real-world data, specifically in classifying patients as either GC or NUD based on their attributes. By employing the NB-BLCA model, we aim to enhance classification accuracy while effectively capturing latent dependencies within the data, contributing to improved decision-making in healthcare settings.
Material and methods
Naïve Bayesian classifier
Suppose in a classification problem, the levels of the target variable \(C\) indicate the different classes. For instance, \(C\) could be a disease status indicator whose levels denote the presence or absence of the disease. Another example could be the physician-diagnosed stage of GC patients. In such examples, we are interested in exploring the predictive power of a set of attributes \((X_{1},\dots ,{X}_{m})\) for accurately detecting the levels of \(C\). In an NB classifier framework, we assume the attributes \((X_{1},\dots ,{X}_{m})\) are conditionally independent given the class variable \(C\). Therefore, we aim to find the level \(c\) of the class variable \(C\) which maximizes the posterior probability of this variable given the observed values of attributes:
$$\widehat{c}=\underset{c}{\mathrm{arg\,max}}\;P\left(C=c|{X}_{1},\dots ,{X}_{m}\right) \tag{1}$$
Using the Bayes rule for this posterior probability, we have:
$$P\left(C=c|{X}_{1},\dots ,{X}_{m}\right)=\frac{P\left(C=c\right)P\left({X}_{1},\dots ,{X}_{m}|C=c\right)}{P\left({X}_{1},\dots ,{X}_{m}\right)} \tag{2}$$
As we mentioned before, the primary assumption of NB is conditional independence between attributes given the class variable. Therefore equation (2) could be rewritten as:
$$P\left(C=c|{X}_{1},\dots ,{X}_{m}\right)=\frac{P\left(C=c\right)\prod_{j=1}^{m}P\left({X}_{j}|C=c\right)}{P\left({X}_{1},\dots ,{X}_{m}\right)} \tag{3}$$
In equation (3), the denominator is constant for all the possible values of the class variable \(C\). Hence we can eliminate it and find the best class according to the formula below:
$$\widehat{c}=\underset{c}{\mathrm{arg\,max}}\;P\left(C=c\right)\prod_{j=1}^{m}P\left({X}_{j}|C=c\right) \tag{4}$$
Therefore we allocate each subject to the level of the class variable that maximizes this expression for the subject's observed attributes.
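The allocation rule above can be illustrated with a minimal sketch for binary attributes. The function names, the toy data, and the Laplace smoothing are our own illustrative assumptions and are not part of the original study.

```python
import math

def fit_bernoulli_nb(X, y, smoothing=1.0):
    """Estimate P(C=c) and P(X_j=1 | C=c) from binary rows X and labels y."""
    n, m = len(y), len(X[0])
    prior, cond = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        prior[c] = len(rows) / n
        # Laplace smoothing (an added assumption) avoids zero probabilities.
        cond[c] = [
            (sum(r[j] for r in rows) + smoothing) / (len(rows) + 2 * smoothing)
            for j in range(m)
        ]
    return prior, cond

def predict_bernoulli_nb(x, prior, cond):
    """Return the class maximizing log P(c) + sum_j log P(x_j | c)."""
    best_class, best_score = None, -math.inf
    for c, p_c in prior.items():
        score = math.log(p_c)
        for x_j, p in zip(x, cond[c]):
            score += math.log(p if x_j == 1 else 1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical toy data: three binary attributes, two classes.
X = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1], [0, 1, 1], [0, 0, 0]]
y = [1, 1, 1, 0, 0, 0]
prior, cond = fit_bernoulli_nb(X, y)
print(predict_bernoulli_nb([1, 1, 0], prior, cond))
```

Working on the log scale leaves the maximizing class unchanged and avoids numerical underflow when the number of attributes is large.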
Bayesian latent class analysis
BLCA is a model-based clustering method that identifies unobserved homogeneous subgroups within the total population using the Bayesian paradigm [30, 31]. This study introduces a version of Bayesian Latent Class Analysis (BLCA) specifically tailored for binary attributes while accommodating a multinomially distributed class variable. The method can be generalized to multinomial attributes or predictors by using binary indicator variables, which is a common practice in statistical applications such as regression. With this approach, a factor variable with q levels is represented by q-1 binary indicators, each taking the value 1 for one specific level of the original variable and 0 for the other levels. The last level is omitted to avoid redundancy. However, it is important to note that the binary version of BLCA often suffices for many health and medical applications.
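The indicator coding described above can be sketched as follows. The helper name `dummy_code` and the example levels are hypothetical and serve only to illustrate the q-1 coding.

```python
def dummy_code(values, levels):
    """Encode a factor with q levels as q-1 binary indicators.

    The last entry of `levels` is the omitted reference level, so it is
    represented by a row of zeros.
    """
    kept = levels[:-1]
    return [[1 if v == level else 0 for level in kept] for v in values]

# Hypothetical three-level attribute coded with two indicators.
stages = ["I", "III", "II", "I"]
indicators = dummy_code(stages, levels=["I", "II", "III"])
print(indicators)
```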
Suppose the attributes of \(N\) individuals are expressed as \({\varvec{X}}=({{\varvec{X}}}_{1},\dots ,{{\varvec{X}}}_{N})\), where each \({{\varvec{X}}}_{i}\) is an M-dimensional binary vector, and the individuals come from G sub-populations. The sub-populations are typically referred to as classes or components. We therefore have two sets of parameters: a G-dimensional vector \({\varvec{\tau}}=({\tau }_{1},\dots ,{\tau }_{G})\) of the proportions of each class, and a matrix \({\varvec{\theta}}\) with dimension \(G\times M\) of the item probabilities for all classes. All elements of \({\varvec{\tau}}\) are greater than or equal to 0 with \(\sum_{g=1}^{G}{\tau }_{g}=1\), and \({\theta }_{gm}\) is the probability of \({X}_{im}=1\) given membership of group \(g\), for any individual \(i\in \{1,\dots ,N\}\) in the study. Hence, we have \(P\left({X}_{im}|{\theta }_{gm}\right)={\theta }_{gm}^{{X}_{im}}{(1-{\theta }_{gm})}^{1-{X}_{im}}\) for \({X}_{im}\in \{0,1\}\), according to the definition of the Bernoulli distribution.
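If we make a naïve Bayes assumption of conditional independence of observations given the group membership, we can write \(P\left({{\varvec{X}}}_{i}|{{\varvec{\theta}}}_{g}\right)={\prod }_{m=1}^{M}P({X}_{im}|{\theta }_{gm})\), and the distribution of all the \({{\varvec{X}}}_{i}\) is the finite mixture:
$$P\left({\varvec{X}}|{\varvec{\theta}},{\varvec{\tau}}\right)=\prod_{i=1}^{N}\sum_{g=1}^{G}{\tau }_{g}P\left({{\varvec{X}}}_{i}|{{\varvec{\theta}}}_{g}\right) \tag{5}$$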
The actual values of the parameters \({\varvec{\theta}}\) and \({\varvec{\tau}}\) are unknown, and we place prior distributions on them. Therefore, the direct calculation of equation 5 is not feasible. In application, we introduce a set \({\varvec{Z}}=\left({{\varvec{Z}}}_{1},\dots ,{{\varvec{Z}}}_{N}\right)\) where each \({{\varvec{Z}}}_{i}=({Z}_{i1},\dots ,{Z}_{iG})\) is a vector representing the actual class membership of \({{\varvec{X}}}_{i}\). In this manner, \({Z}_{ig}=1\) if individual \(i\) belongs to subgroup \(g\) and 0 otherwise. The new task is to find the values of \({\varvec{Z}}\) that maximize the posterior probability of class membership.
The complete density of observed variables \({{\varvec{X}}}_{i}\) and missing values \({{\varvec{Z}}}_{i}\) is:
$$P\left({{\varvec{X}}}_{i},{{\varvec{Z}}}_{i}|{\varvec{\tau}},{\varvec{\theta}}\right)=\prod_{g=1}^{G}{\left[{\tau }_{g}P\left({{\varvec{X}}}_{i}|{{\varvec{\theta}}}_{g}\right)\right]}^{{Z}_{ig}}$$
Using the Bayes theorem leads to the posterior probability of \({{\varvec{Z}}}_{i}\), class membership for observation \(i\), as:
$$P\left({{\varvec{Z}}}_{i}|{{\varvec{X}}}_{i},{\varvec{\tau}},{\varvec{\theta}}\right)=\prod_{g=1}^{G}{\left[\frac{{\tau }_{g}P\left({{\varvec{X}}}_{i}|{{\varvec{\theta}}}_{g}\right)}{\sum_{h=1}^{G}{\tau }_{h}P\left({{\varvec{X}}}_{i}|{{\varvec{\theta}}}_{h}\right)}\right]}^{{Z}_{ig}}$$
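For fixed parameter values, the posterior membership probabilities follow directly from the Bernoulli item probabilities and the class proportions. The sketch below assumes hypothetical values for \({\varvec{\tau}}\) and \({\varvec{\theta}}\); the function names are ours.

```python
def bernoulli_likelihood(x, theta_g):
    """P(X_i | theta_g) under conditional independence of the M binary items."""
    p = 1.0
    for x_m, t in zip(x, theta_g):
        p *= t if x_m == 1 else 1.0 - t
    return p

def membership_posterior(x, tau, theta):
    """Posterior probability of each latent class for one observation x."""
    weighted = [tau_g * bernoulli_likelihood(x, theta_g)
                for tau_g, theta_g in zip(tau, theta)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Hypothetical parameters: G = 2 classes, M = 2 items.
tau = [0.5, 0.5]
theta = [[0.9, 0.8], [0.2, 0.1]]
post = membership_posterior([1, 1], tau, theta)
print(post)
```

An observation with both items present is assigned almost entirely to the first class here, because that class has high item probabilities for both attributes.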
The drawback of unknown actual values for the parameters \({\varvec{\theta}}\) and \({\varvec{\tau}}\) still exists. An iterative approach that updates the prior information on these parameters at each step according to the observed data is proposed to approximate the posterior distribution. In this regard, we assume a conjugate \(Beta({\alpha }_{gm},{\beta }_{gm})\) prior distribution for each item probability \({\theta }_{gm}\), and a \(Dirichlet({\varvec{\delta}})\) prior distribution for the class proportions \({\varvec{\tau}}\). Note that the hyperparameters \({\alpha }_{gm}\) and \({\beta }_{gm}\) of the Beta prior distributions govern the item response probability of attribute \(m\) in class \(g\). In the same manner, the hyperparameter \({\varvec{\delta}}=({\delta }_{1},\dots ,{\delta }_{G})\) governs the share of each class in the total sample.
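Supposing these prior distributions for \({\varvec{\theta}}\) and \({\varvec{\tau}}\) we have:
$$p\left({\varvec{\tau}}|{\varvec{\delta}}\right)\propto \prod_{g=1}^{G}{\tau }_{g}^{{\delta }_{g}-1},\qquad p\left({\theta }_{gm}|{\alpha }_{gm},{\beta }_{gm}\right)\propto {\theta }_{gm}^{{\alpha }_{gm}-1}{\left(1-{\theta }_{gm}\right)}^{{\beta }_{gm}-1}$$
for each \(g\in \{1,\dots ,G\}\) and \(m\in \{1,\dots ,M\}\). Combined with the complete-data density, these assumptions lead to the joint posterior distribution of \({\varvec{\tau}}\) and \({\varvec{\theta}}\) as:
$$p\left({\varvec{\tau}},{\varvec{\theta}}|{\varvec{X}},{\varvec{Z}}\right)\propto \prod_{i=1}^{N}\prod_{g=1}^{G}{\left[{\tau }_{g}P\left({{\varvec{X}}}_{i}|{{\varvec{\theta}}}_{g}\right)\right]}^{{Z}_{ig}}\;p\left({\varvec{\tau}}|{\varvec{\delta}}\right)\prod_{g=1}^{G}\prod_{m=1}^{M}p\left({\theta }_{gm}|{\alpha }_{gm},{\beta }_{gm}\right)$$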
In the following parts, we present two well-known iterative approaches for parameter estimation. These are the EM algorithm and Gibbs sampling method.
The EM algorithm for BLCA
This algorithm is iterative and refines the parameter estimates until convergence is achieved. Each iteration consists of two steps. In the first step (E-step), the algorithm calculates the expectation of the log-posterior with respect to the unobserved class memberships, given the data and the current parameter values. In the second step (M-step), the algorithm determines the parameter values that maximize the expectation function obtained in the previous step [32]. To initiate the algorithm, an initial guess of the parameter values is required for the first iteration. Each iteration is guaranteed not to decrease the objective function, but in general the algorithm converges to a local maximum that can depend on the initial values, so running it from several starting points is advisable. The number of iterations required for convergence may vary depending on the specific dataset and initial values chosen.
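By iteratively performing these two steps, the algorithm refines the parameter estimates until a satisfactory level of convergence is achieved [33]. If we denote the values of the parameters \({\varvec{\tau}}\) and \({\varvec{\theta}}\) at step \(t\) by \({{\varvec{\tau}}}^{(t)}\) and \({{\varvec{\theta}}}^{(t)}\), respectively, the expected function in the E-step for a BLCA is:
$$Q\left({\varvec{\theta}},{\varvec{\tau}}|{{\varvec{\theta}}}^{(t)},{{\varvec{\tau}}}^{(t)}\right)=E\left[\mathrm{log}\,p\left({\varvec{\theta}},{\varvec{\tau}}|{\varvec{X}},{\varvec{Z}}\right)|{\varvec{X}},{{\varvec{\theta}}}^{(t)},{{\varvec{\tau}}}^{(t)}\right]$$
In the M-step, we update the parameters as follows:
$$\left({{\varvec{\theta}}}^{(t+1)},{{\varvec{\tau}}}^{(t+1)}\right)=\underset{{\varvec{\theta}}\in \Theta ,\,{\varvec{\tau}}\in \mathrm{T}}{\mathrm{arg\,max}}\;Q\left({\varvec{\theta}},{\varvec{\tau}}|{{\varvec{\theta}}}^{(t)},{{\varvec{\tau}}}^{(t)}\right)$$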
Here \(\Theta\) and \(\mathrm{T}\) are the parameter spaces for \({\varvec{\theta}}\) and \({\varvec{\tau}},\) respectively. For the item response probabilities and class proportions, we have \(\Theta ={[0,1]}^{G\times M}\) and \(\mathrm{T}={[0,1]}^{G}\) subject to \(\sum_{g=1}^{G}{\tau }_{g}=1\).
It has been shown that the practical formulations for these steps are [34]:
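E-step:
$${Z}_{ig}^{(t+1)}=\frac{{\tau }_{g}^{(t)}P\left({{\varvec{X}}}_{i}|{{\varvec{\theta}}}_{g}^{(t)}\right)}{\sum_{h=1}^{G}{\tau }_{h}^{(t)}P\left({{\varvec{X}}}_{i}|{{\varvec{\theta}}}_{h}^{(t)}\right)}$$
M-step:
$${\theta }_{gm}^{(t+1)}=\frac{\sum_{i=1}^{N}{X}_{im}{Z}_{ig}^{(t+1)}+{\alpha }_{gm}-1}{\sum_{i=1}^{N}{Z}_{ig}^{(t+1)}+{\alpha }_{gm}+{\beta }_{gm}-2},\qquad {\tau }_{g}^{(t+1)}=\frac{\sum_{i=1}^{N}{Z}_{ig}^{(t+1)}+{\delta }_{g}-1}{N+\sum_{h=1}^{G}{\delta }_{h}-G}$$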
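The two steps can be sketched in a few lines of code. This is a minimal illustration under our own assumptions, not the implementation used in the study: the updates are the posterior-mode (MAP) updates implied by the Beta and Dirichlet priors, all hyperparameters are set to a common value of 2 so that the estimates stay strictly inside (0, 1), the starting values are deterministic, and a fixed number of iterations replaces a convergence check.

```python
def em_blca(X, G, alpha=2.0, beta=2.0, delta=2.0, n_iter=50):
    """MAP-EM for a binary latent class model (illustrative sketch)."""
    N, M = len(X), len(X[0])
    tau = [1.0 / G] * G
    # Deterministic, asymmetric starting values (an illustrative choice).
    theta = [[(g + 1) / (G + 1)] * M for g in range(G)]
    Z = []
    for _ in range(n_iter):
        # E-step: expected class memberships given current parameters.
        Z = []
        for x in X:
            w = []
            for g in range(G):
                p = tau[g]
                for m in range(M):
                    p *= theta[g][m] if x[m] == 1 else 1.0 - theta[g][m]
                w.append(p)
            s = sum(w)
            Z.append([v / s for v in w])
        # M-step: posterior modes under Beta and Dirichlet priors.
        for g in range(G):
            n_g = sum(Z[i][g] for i in range(N))
            tau[g] = (n_g + delta - 1.0) / (N + G * delta - G)
            for m in range(M):
                ones = sum(Z[i][g] * X[i][m] for i in range(N))
                theta[g][m] = (ones + alpha - 1.0) / (n_g + alpha + beta - 2.0)
    return tau, theta, Z

# Hypothetical data with two clearly separated response patterns.
X = [[1, 1, 1]] * 10 + [[1, 1, 0]] * 2 + [[0, 0, 0]] * 10 + [[0, 0, 1]] * 2
tau, theta, Z = em_blca(X, G=2)
print(tau)
```

For data with many attributes, the product of item probabilities should be accumulated on the log scale to avoid underflow.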
The Gibbs sampling for BLCA
As we already mentioned, the joint posterior distribution of the parameters \({\varvec{\tau}}\) and \({\varvec{\theta}}\) and the unobserved class membership \({\varvec{Z}}\) cannot be calculated directly. However, determining the class membership of samples is possible when the parameter values are known. Gibbs sampling is a Markov Chain Monte Carlo (MCMC) method that simplifies such problems: instead of using the joint distribution, it iteratively draws samples from the conditional distributions using the Markov property. These samples reflect the properties of the true joint posterior distribution [35].
The following steps are the practical approach for handling a BLCA using the Gibbs sampling:
1- Set initial values for parameters \({\varvec{\tau}}\) and \({\varvec{\theta}}\) and randomly assign each observation to a class. The initial values affect the convergence speed of the algorithm. One approach is random initialization, which allows exploration of different parts of the parameter space and helps avoid biases that may arise from fixed initial values. Users may also conduct sensitivity analyses by running the algorithm several times with different initializations to assess the stability of the results.
2- Considering the conjugate prior of Beta distribution, generate elements of \({{\varvec{\theta}}}^{(t)}\) randomly from the following distribution:
$$\theta_{gm}^{(t)}\sim Beta(\sum_{i=1}^NX_{im}Z_{ig}^{(t-1)}+\alpha_{gm},\sum_{i=1}^NZ_{ig}^{\left(t-1\right)}\left(1-X_{im}\right)+\beta_{gm})$$
3- Considering the conjugate prior of Dirichlet distribution, generate elements of \({{\varvec{\tau}}}^{(t)}\) randomly from the following distribution:
$${{\varvec{\tau}}}^{(t)}\sim Dirichlet(\sum_{i=1}^{N}{Z}_{i1}^{\left(t-1\right)}+{\delta }_{1},\dots ,\sum_{i=1}^{N}{Z}_{iG}^{(t-1)}+{\delta }_{G})$$
4- Using the generated parameter values, assign each individual to a class by drawing from a multinomial distribution whose probabilities are the posterior probabilities of class membership given the observed attributes \({{\varvec{X}}}_{i}\):
$${{\varvec{Z}}}_{i}^{(t)}\sim Multinomial(1,\frac{{\tau }_{1}^{\left(t\right)}P\left({{\varvec{X}}}_{i}|{{\varvec{\theta}}}_{1}^{\left(t\right)}\right)}{\sum_{h=1}^{G}{\tau }_{h}^{\left(t\right)}P\left({{\varvec{X}}}_{i}|{{\varvec{\theta}}}_{h}^{\left(t\right)}\right)},\dots ,\frac{{\tau }_{G}^{\left(t\right)}P({{\varvec{X}}}_{i}|{{\varvec{\theta}}}_{G}^{(t)})}{\sum_{h=1}^{G}{\tau }_{h}^{\left(t\right)}P({{\varvec{X}}}_{i}|{{\varvec{\theta}}}_{h}^{(t)})})$$
5- Repeat steps 2 to 4 until convergence is reached.
After running the Gibbs sampling, like all other MCMC methods, it is essential to check if the chain converged using the statistical criteria and trace plots. In addition, burn-in and thinning are necessary [36].
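The five steps above can be sketched as follows. This is an illustrative sketch under our own assumptions rather than the code used in the study: common hyperparameter values are used for all classes and items, the Dirichlet draw is obtained by normalizing Gamma draws, only burn-in is applied (no thinning), and convergence diagnostics and the handling of label switching are omitted.

```python
import random

def gibbs_blca(X, G, alpha=1.0, beta=1.0, delta=1.0,
               n_iter=60, burn_in=20, seed=1):
    """Gibbs sampler for a binary latent class model (illustrative sketch)."""
    rng = random.Random(seed)
    N, M = len(X), len(X[0])
    # Step 1: random initial class assignment.
    labels = [rng.randrange(G) for _ in range(N)]
    draws = []
    for t in range(n_iter):
        counts = [labels.count(g) for g in range(G)]
        # Step 2: item probabilities from their Beta full conditionals.
        theta = []
        for g in range(G):
            row = []
            for m in range(M):
                ones = sum(X[i][m] for i in range(N) if labels[i] == g)
                row.append(rng.betavariate(ones + alpha,
                                           counts[g] - ones + beta))
            theta.append(row)
        # Step 3: class proportions from the Dirichlet full conditional,
        # drawn as normalized Gamma variates.
        gam = [rng.gammavariate(counts[g] + delta, 1.0) for g in range(G)]
        total = sum(gam)
        tau = [v / total for v in gam]
        # Step 4: class memberships from their multinomial full conditionals.
        for i in range(N):
            w = []
            for g in range(G):
                p = tau[g]
                for m in range(M):
                    p *= theta[g][m] if X[i][m] == 1 else 1.0 - theta[g][m]
                w.append(p)
            labels[i] = rng.choices(range(G), weights=w)[0]
        # Step 5: keep the draws obtained after the burn-in period.
        if t >= burn_in:
            draws.append((tau, theta))
    return draws, labels

# Hypothetical data with two response patterns.
X = [[1, 1, 1]] * 10 + [[0, 0, 0]] * 10
draws, labels = gibbs_blca(X, G=2)
print(len(draws))
```

Posterior summaries, such as the mean of the retained draws of \({\varvec{\tau}}\), should only be interpreted after the convergence checks described above.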
NB-BLCA
In this study, we present an extension of the NB classifier that uses BLCA to impose the conditional independence assumption on the structure of the model. Both NB and BLCA rely on the naïve assumption of conditional independence given the class variable. In contrast to NB, which only requires this assumption for efficient classification, the BLCA model estimates its parameter values so as to satisfy it. The NB classifier and our proposed model are depicted in Fig. 1, parts A and B, respectively. In this figure, the latent classes of BLCA are denoted by \({L}_{i}\) \((i=1,\dots ,K)\) to differentiate them from the classes of the primary outcome \(C\). Remember that the latent class \(L\) is unobserved, but the class variable \(C\) is observable.
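In the NB-BLCA model, the only child nodes of the class variable \(C\) are the latent class variables \({L}_{i}\). Therefore the posterior density in equation 3 could be reformed to:
$$P\left(C=c|{L}_{1},\dots ,{L}_{K}\right)=\frac{P\left(C=c\right)\prod_{i=1}^{K}P\left({L}_{i}|C=c\right)}{P\left({L}_{1},\dots ,{L}_{K}\right)}$$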
As the latent class variables \({L}_{i}\) come from a mixture distribution with parameters \(({\varvec{\tau}},{\varvec{\theta}},{\varvec{Z}}\)), the calculation of this posterior probability is not straightforward. However, the generalized forms of the EM algorithm and Gibbs sampling in the previous sections enable us to predict class membership \(C\) due to information about the latent class assignment \({L}_{i}\) concluded from the observed attributes.
Adjusting EM algorithm for NB-BLCA
In order to explain the EM algorithm for an NB-BLCA, we should define the following parameters:
The parameter \(q\left(c\right)\) is the probability of seeing the level \(c\) of the class variable. Hence, it is subject to constraints \(q(c)\ge 0\) and \(\sum q(c)=1\) for all the possible levels of this variable.
The parameter \({q}_{i}(l|c)\) for any \(i=1,\dots ,K\) is the probability of latent class \(i\) taking value \(l\), conditioned on the class \(c\). This parameter is subject to constraints \({q}_{i}(l|c)\ge 0\) and \(\sum {q}_{i}\left(l|c\right)=1\) for all levels of class and latent class variables.
The practical formulations of the EM algorithm are presented in Fig. 2. The algorithm estimates the latent class memberships using the attributes and then estimates the posterior probability of class membership of the target variable.
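To make the role of the parameters \(q(c)\) and \({q}_{i}(l|c)\) concrete, the sketch below evaluates the class posterior they imply for a given set of latent class levels. It does not reproduce the algorithm of Fig. 2; the function name, the data layout, and all numerical values are hypothetical.

```python
def class_posterior(latent_levels, q_class, q_latent):
    """Posterior of C given latent levels, proportional to q(c) * prod_i q_i(l_i | c).

    q_class maps each class c to q(c); q_latent[i][c][l] holds q_i(l | c).
    """
    unnorm = {}
    for c, q_c in q_class.items():
        score = q_c
        for i, level in enumerate(latent_levels):
            score *= q_latent[i][c][level]
        unnorm[c] = score
    total = sum(unnorm.values())
    return {c: s / total for c, s in unnorm.items()}

# Hypothetical values: one latent variable (K = 1) with two levels.
q_class = {"NUD": 0.6, "GC": 0.4}
q_latent = [{"NUD": [0.7, 0.3], "GC": [0.2, 0.8]}]
posterior = class_posterior([1], q_class, q_latent)
print(posterior)
```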
Adjusting Gibbs sampling for NB-BLCA
The Gibbs sampler simplifies a complex joint posterior distribution into a set of steps, including generating samples from the conditional distributions. We explained how to generate latent class membership samples for a BLCA problem in 5 steps. The added task of generating samples for the NB part of NB-BLCA is quickly done by adding an extra step. The sample generation could be done from a multinomial (if the class variable has more than two categories) or binomial distribution (the class variable only includes two levels). The practical formulations of the Gibbs sampler are presented in Fig. 3.
Simulation study
We conducted a simulation study to evaluate the predictive performance of our model compared to a simple NB model. Furthermore, we included two alternative approaches that have been suggested to improve the correct classification of NB when the conditional assumption is violated. These approaches are Averaged one-dependence estimators (AODE), proposed by Webb et al. [21], and Hill-climbing tree augmented naive Bayes (TAN-HC), proposed by Keogh and Pazzani [37].
To generate the datasets, we utilized the Iterative Proportional Fitting Procedure (IPFP), originally proposed by Deming and Stephan in 1940 as an algorithm aimed at minimizing the Pearson chi-squared statistic [38]. The details of this method, as described by Suesse et al. [39], can be found in the 'mipfp' R package developed by Barthélemy and Suesse [40]. Using this method we were able to simulate multivariate Bernoulli distributions assuming the Hypothetical Marginal Probabilities (HMP) of each variable and a matrix that includes the Odds Ratio (OR) of all pairs of variables.
The elements of the HMP vector were randomly generated from a uniform distribution between 0 and 1 (\({\mathrm{HMP}}_{i}\sim U(0,1)\)) for each iteration. Similarly, the elements of the paired OR matrix were randomly generated from a uniform distribution within the range of 0.25 and 4 (\({OR}_{ij}\sim U\left(0.25,4\right)\) for \(i\ne j\)). To reduce computational complexity, we generated the feature variables in batches of 5 dimensions. Consequently, for scenarios involving only 5 features, we generated a single batch. For scenarios with 10 features, we generated 2 batches, and so on.
The response class variable \(Z\) was generated using a logistic regression approach. We assumed a regression coefficient of 2 (\(\beta =2\)) for all feature variables and applied the inverse logit transformation to their linear combination to calculate the probability of belonging to class 1. Additionally, a random error term from a Gaussian distribution with mean parameter 0 and standard deviation parameter 4 was added to this linear combination. The intercept coefficient (\(\alpha\)) of the logistic regression served as a tuning parameter for specifying the marginal probability of the class variable.
Finally, the values of the response variable were generated from a Binomial distribution, taking into account the calculated probabilities.
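The response-generation step can be sketched as follows. This is an illustration under stated assumptions: the feature matrix is filled with independent Bernoulli draws as a placeholder, because the IPFP step that produces correlated features is not reproduced here, and the intercept value of -5 is a hypothetical choice rather than a value reported in the study.

```python
import math
import random

def simulate_response(X, intercept, coef=2.0, noise_sd=4.0, seed=0):
    """Generate a binary response from a noisy logistic model."""
    rng = random.Random(seed)
    y, probs = [], []
    for x in X:
        # Linear predictor with a common coefficient and Gaussian error.
        eta = intercept + coef * sum(x) + rng.gauss(0.0, noise_sd)
        p = 1.0 / (1.0 + math.exp(-eta))  # inverse logit
        probs.append(p)
        y.append(1 if rng.random() < p else 0)  # Bernoulli draw
    return y, probs

# Placeholder features: 200 subjects, 5 independent binary attributes.
feature_rng = random.Random(42)
X = [[1 if feature_rng.random() < 0.5 else 0 for _ in range(5)]
     for _ in range(200)]
y, probs = simulate_response(X, intercept=-5.0)
print(sum(y) / len(y))
```

Varying `intercept` shifts the marginal probability of the class variable, which is how it acts as a tuning parameter in the design above.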
We assumed marginal probabilities of 0.3, 0.5, and 0.7 for the class variable to explore their effect on the model's performance. To assess the impact of sample size on the model's performance, we considered samples consisting of 500, 1000, and 2000 subjects. Furthermore, we generated scenarios with 5, 10, and 20 feature variables.
For all algorithms, we used 70% of the randomly selected data as a training dataset, while the remaining 30% was used to evaluate algorithm performance. The validity of the algorithms was measured by calculating the mean values of sensitivity (recall), specificity, positive predictive value (precision), negative predictive value, and precision across 1000 replicates.
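These measures are all functions of the four cells of the confusion matrix computed on the test data. A minimal sketch, with a hypothetical function name and toy labels:

```python
def classification_metrics(y_true, y_pred):
    """Confusion-matrix measures for binary labels coded 0/1.

    Assumes every denominator is non-zero, i.e. both classes occur
    among the true labels and among the predictions.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / len(y_true),
    }

# Hypothetical test-set labels and predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```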
Real-world data application
In this section, we used multicenter hospital-based data to demonstrate the application of the model in a real-world example. The data relate to 976 GC and 1189 NUD patients referred to the National Cancer Institute of Iran (NCII) from July 2003 to January 2020. Trained technicians interviewed each participant at the time of recruitment using a structured questionnaire, after the participant had agreed to enrol in the study. The questionnaire includes 64 attributes in five subdomains: demographic variables, dietary habits, self-reported medical status, narcotics use, and socioeconomic status (SES) indicators. All the predictors were recoded into binary variables, and the list, including their names and levels, is available in Supplementary Table 1.
We fitted the NB classifier, NB-BLCA using the EM algorithm, and NB-BLCA using the Gibbs sampler to the data. A random sample of 70% of the subjects was selected to train the models. The models' validity and predictive ability were explored using the remaining 30% of subjects. The same measures as in the simulation section were calculated and reported.
Results
In the simulation study, we compared the sensitivity, specificity, positive predictive value, negative predictive value, and precision of the ordinary Naive Bayes (NB) classifier, NB-BLCA, and other alternative models. Tables 1, 2 and 3 present these performance metrics for different scenarios, considering varying marginal probabilities of the class variables (0.3, 0.5, and 0.7) and different numbers of predictors.
When the marginal probability of the class variable is set to 0.3 and the number of predictors is low (5 attributes), the sensitivity of all models is relatively lower, failing to exceed 50%. However, as the sample size increases, the sensitivity improves. Even in the scenario with the highest sample size of 2000, the sensitivity remains below 50%. This indicates that all algorithms are sensitive to the lower rate of events in the data. It is worth noting that both increasing the number of predictors and the marginal probability of the class variables enhance the sensitivity of the models.
In all scenarios, except for the marginal probability of the class variable 0.7 when the number of predictors is 5, the precision of our proposed model (NB-BLCA) is higher compared to the other approaches. This indicates that our model performs better in terms of correctly identifying positive instances among the predicted ones.
When the marginal probability of the class variable is low (0.3) and the number of predictors is less than 20, the superiority of our model is based on higher specificity. Increasing the number of predictors also leads to a greater increase in the sensitivity of our model compared to the other approaches. This trend is observed consistently across the different scenarios (as shown in Tables 2 and 3).
Similar to many classification algorithms, the performance of NB, AODE, TAN, and our proposed model is influenced by the prevalence of the outcome, with a lower rate of events having a significant impact on the sensitivity of these models.
Overall, the results demonstrate that the performance of the models is affected by the marginal probability of the class variable, the number of predictors, and the prevalence of the outcome. Our proposed model (NB-BLCA) shows favorable precision and specificity, particularly in scenarios with low marginal probability and a smaller number of predictors.
These findings highlight the importance of considering these factors when applying classification algorithms and emphasize the potential benefits of our proposed model in handling such scenarios.
In Table 4, we present the results of comparing the models' predictions for the real-world data (classification of patients into GC or NUD groups). All models showed a significant improvement in prediction accuracy (P-value < 0.001). Among the models, the NB-BLCA model utilizing the Gibbs sampler achieved the highest accuracy, 87.77 (95% CI: 84.87-90.29). Notably, this confidence interval did not overlap with the intervals of the other two models, indicating a statistically significant increase in prediction accuracy.
Additionally, the Gibbs sampler-based NB-BLCA model demonstrated a higher Kappa value compared to the other approaches. This indicates that, after correcting for the agreement expected by chance, the model achieved 76% of the maximum possible agreement between the predicted and true classes. Furthermore, when performing McNemar's test for the NB classifier, the result was not significant (p-value = 0.74), suggesting that the NB approach did not yield a substantial improvement.
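For reference, Cohen's Kappa compares the observed agreement between predicted and true classes with the agreement expected by chance from the marginal totals. The sketch below uses hypothetical confusion-matrix counts, not the counts behind Table 4.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's Kappa from the four cells of a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement from the products of the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (observed - expected) / (1.0 - expected)

# Hypothetical counts for illustration only.
kappa = cohen_kappa(tp=3, fp=2, fn=1, tn=4)
print(round(kappa, 3))
```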
While the NB-BLCA model had a lower specificity (74.87) compared to NB (77.12), it exhibited a significantly higher sensitivity. The increased sensitivity indicates a better ability to correctly identify positive cases. Overall, the NB-BLCA model employing the Gibbs sampler outperformed the other two alternatives in terms of prediction accuracy and various performance metrics.
Discussion
We presented a modified version of the ordinary NB classifier, called NB-BLCA, which can enhance prediction performance. We also suggested two methods for parameter estimation: Gibbs sampling and the EM algorithm. Our findings on the real-world GC data show that the Gibbs sampler yields significantly higher prediction accuracy than the EM algorithm, which underscores its value for modeling and prediction in real-world scenarios involving GC patients. The simulation study, in turn, revealed that NB-BLCA based on the EM algorithm was superior to the ordinary NB classifier in all the predefined scenarios. We should admit, however, that our model is structurally more complex than the standard NB classifier, so the usual trade-off between complexity and accuracy applies. Nevertheless, attention to the properties of each algorithm facilitates the fitting procedure and leads to more accurate results.
When adjusting the naive Bayesian classifier for violations of the conditional independence assumption, latent variable models emerge as one of the most suitable solutions [4, 41]. The conditional independence assumption often fails to capture complex relationships and dependencies among features, leading to suboptimal performance. Latent variable models offer a powerful framework for overcoming this limitation. By introducing latent variables, these models can capture the hidden dependencies among features even when the conditional independence assumption is violated [3]. The latent variables also allow more flexible and expressive modeling, enabling the representation of intricate interactions among features [3].
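The contrast between the two assumptions can be written out explicitly. The notation below (class label $y$, binary attributes $x_j$, latent class $c$, class weights $\pi_c$, item probabilities $\theta_{jc}$) is assumed here for illustration; it is the generic latent class form and not necessarily the exact NB-BLCA parameterization.

```latex
% Naive Bayes: attributes are independent given the class label y
P(x_1,\dots,x_J \mid y) = \prod_{j=1}^{J} P(x_j \mid y)

% Latent class model: attributes are independent only given the latent class c
P(x_1,\dots,x_J) = \sum_{c=1}^{C} \pi_c \prod_{j=1}^{J} \theta_{jc}^{x_j}\,(1-\theta_{jc})^{1-x_j},
\qquad \sum_{c=1}^{C} \pi_c = 1
```

Because the second expression sums over $c$, the attributes are marginally dependent even though they are independent within each latent class. This is how a latent variable can absorb dependence that plain NB ignores.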
One key advantage of latent variable models is their ability to handle missing data and incomplete feature sets [42]. By incorporating latent variables, these models can effectively impute missing values, mitigating the impact of incomplete information on classification accuracy. This is particularly valuable in real-world scenarios where data may be incomplete or contain missing values [43]. Furthermore, latent variable models provide a means to account for unobserved or latent factors that may influence the observed features [44]. By capturing these latent factors, the models can better explain the underlying data distribution and improve classification performance.
Another benefit of latent variable models is their ability to offer principled probabilistic inference [45]. This allows for robust uncertainty quantification and provides richer insights into the model's predictions. By understanding the uncertainty associated with the predictions, decision-makers can make more informed choices based on the level of confidence or uncertainty in the classification results.
In summary, when the conditional independence assumption of the naive Bayesian classifier is violated, latent variable models offer a well-suited solution. By incorporating latent variables, these models capture hidden dependencies, handle missing data, account for unobserved factors, and offer principled probabilistic inference. These properties make them a valuable tool for improving classification performance when the assumption is not met.
The Gibbs sampler is one of the most efficient and well-known MCMC algorithms. It is a special case of Metropolis-Hastings sampling in which the proposed values are always accepted. It relies on the Markov property and draws random samples from the univariate conditional posterior distributions rather than from a joint distribution that is expensive to sample [35, 46]. The Gibbs sampler therefore reaches a solution more quickly and at lower computational cost. However, successive samples from this approach remain highly correlated. Thinning, that is, systematically retaining only every k-th point of the generated chain, has been suggested to address this [47]; spacing the retained samples apart weakens their dependence so that they behave approximately as independent draws. Another drawback of MCMC methods is the effect of poorly chosen initial values on the convergence of the chain. Fortunately, in most cases the chain corrects itself with each scan, so the later samples reflect the actual posterior distribution [48]. The remaining task is therefore to discard the early part of the chain as burn-in. References typically suggest a basic rule of discarding the first 1000 to 5000 samples [49]. Others propose a more conservative approach: choosing a starting value close to the distribution mode, obtained from a likelihood-based model [50]. Taking these considerations into account when tuning the sampler helps ensure chain convergence.
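The roles of burn-in and thinning can be sketched with a small Gibbs sampler for a plain latent class model with binary items. This is an illustrative sketch, not the authors' implementation: the function name `gibbs_lca`, the flat Beta(1, 1) and Dirichlet(1, ..., 1) priors, and the simulated data are all assumptions.

```python
import numpy as np


def gibbs_lca(X, n_classes=2, n_iter=3000, burn_in=1000, thin=5, seed=0):
    """Gibbs sampler for a latent class model with binary items.

    Assumed priors: Dirichlet(1, ..., 1) on class weights, Beta(1, 1) on item
    probabilities. Returns the draws kept after burn-in and thinning.
    """
    rng = np.random.default_rng(seed)
    n, n_items = X.shape
    pi = np.full(n_classes, 1.0 / n_classes)
    theta = rng.uniform(0.25, 0.75, size=(n_classes, n_items))
    kept_pi, kept_theta = [], []
    for it in range(n_iter):
        # 1) Draw each subject's latent class from its full conditional.
        log_p = np.log(pi) + X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
        log_p -= log_p.max(axis=1, keepdims=True)
        p = np.exp(log_p)
        p /= p.sum(axis=1, keepdims=True)
        z = (p.cumsum(axis=1) > rng.random((n, 1))).argmax(axis=1)
        # 2) Draw class weights from their Dirichlet conditional (conjugate update).
        counts = np.bincount(z, minlength=n_classes)
        pi = rng.dirichlet(1.0 + counts)
        # 3) Draw item probabilities from their Beta conditionals (conjugate update).
        for c in range(n_classes):
            members = X[z == c]
            ones = members.sum(axis=0)
            draw = rng.beta(1.0 + ones, 1.0 + len(members) - ones)
            theta[c] = np.clip(draw, 1e-9, 1 - 1e-9)
        # Discard the burn-in scans, then keep every `thin`-th scan.
        if it >= burn_in and (it - burn_in) % thin == 0:
            kept_pi.append(pi.copy())
            kept_theta.append(theta.copy())
    return np.array(kept_pi), np.array(kept_theta)


# Simulated binary data with two latent groups (purely illustrative).
rng = np.random.default_rng(1)
in_first_group = rng.random(300) < 0.4
item_prob = np.where(in_first_group[:, None], 0.8, 0.2)
X = (rng.random((300, 6)) < item_prob).astype(int)

draws_pi, draws_theta = gibbs_lca(X, n_iter=600, burn_in=200, thin=4)
print(draws_pi.shape)  # (100, 2): 400 post-burn-in scans, every 4th kept
```

Each step samples from a univariate or conjugate conditional, so no draw is ever rejected. In practice the kept draws should still be checked for convergence and for label switching between classes before they are summarized.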
As in our setting, the EM algorithm is widely used for mixture distributions [51, 52]. However, the method is not without drawbacks. For instance, it is not guaranteed to reach the global optimum. In addition, estimates become unstable when the true parameter value lies near the boundary of the parameter space. Parametric bootstrap sampling and refitting the model can help in these situations [30]. Hence, we restarted the whole EM procedure ten times in both the simulation study and the real-world data example. This approach is not straightforward when sampling from low-probability groups. To overcome this problem, likelihood sampling and logic sampling methods have been proposed [53]. Fortunately, owing to its appropriate prior distributions, the Gibbs sampler does not suffer from this issue. In this study, the Beta and Dirichlet priors are proper and conjugate for the parameters of interest [54].
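The restart strategy can be sketched as follows: run EM from several random starting points and keep the run with the highest log-likelihood. The code fits a plain latent class model with binary items; the function names, the clipping constants that keep estimates off the boundary, and the simulated data are assumptions for illustration, not the authors' code.

```python
import numpy as np


def log_joint(X, pi, theta):
    """Log of pi_c * P(x | class c) for every subject and class."""
    return np.log(pi) + X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T


def log_likelihood(X, pi, theta):
    lj = log_joint(X, pi, theta)
    m = lj.max(axis=1, keepdims=True)
    return float((m[:, 0] + np.log(np.exp(lj - m).sum(axis=1))).sum())


def em_once(X, n_classes, rng, n_iter=200, tol=1e-8):
    """One EM run from a random starting point."""
    pi = rng.dirichlet(np.ones(n_classes))
    theta = rng.uniform(0.2, 0.8, size=(n_classes, X.shape[1]))
    prev = -np.inf
    for _ in range(n_iter):
        # E-step: posterior class membership probabilities.
        lj = log_joint(X, pi, theta)
        lj -= lj.max(axis=1, keepdims=True)
        resp = np.exp(lj)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: weighted proportions, clipped away from 0 and 1.
        nk = np.maximum(resp.sum(axis=0), 1e-12)
        pi = nk / nk.sum()
        theta = np.clip((resp.T @ X) / nk[:, None], 1e-6, 1 - 1e-6)
        cur = log_likelihood(X, pi, theta)
        if cur - prev < tol:
            break
        prev = cur
    return cur, pi, theta


def em_with_restarts(X, n_classes=2, n_restarts=10, seed=0):
    """Keep the best of several EM runs to reduce the risk of a poor local optimum."""
    rng = np.random.default_rng(seed)
    runs = [em_once(X, n_classes, rng) for _ in range(n_restarts)]
    return max(runs, key=lambda run: run[0])


# Simulated binary data with two latent groups (purely illustrative).
data_rng = np.random.default_rng(1)
in_first_group = data_rng.random(300) < 0.4
X = (data_rng.random((300, 6)) < np.where(in_first_group[:, None], 0.8, 0.2)).astype(int)

best_ll, best_pi, best_theta = em_with_restarts(X, n_classes=2, n_restarts=10)
print(best_ll, best_pi)
```

Restarts do not guarantee the global optimum; they only make it less likely that the reported solution is a poor local one.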
The NB-BLCA model requires specifying the number of latent class variables and the number of levels of each. In many medical and health applications, data gathering starts after the risk factors, influential predictors, and related domains have been determined [5]. A specialist can therefore guide the identification of the required latent variables. However, this is not a general rule, especially in data mining applications, where further development seems necessary. The number of levels of each latent variable, on the other hand, depends on the data. As in principal component analysis (PCA) and Exploratory Factor Analysis (EFA), the best number of levels can be chosen using a scree plot [55]. In this manner, the AIC and BIC criteria (for both Gibbs sampling and the EM algorithm) and the DIC (for Gibbs sampling) can guide the selection.
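Choosing the number of levels by information criteria amounts to fitting the model for several candidate values and comparing penalized log-likelihoods. The sketch below assumes a plain latent class model with binary items, so the parameter count is not that of the full NB-BLCA model, and the log-likelihood values are hypothetical.

```python
import math


def n_free_parameters(n_classes, n_items):
    """(C - 1) free class weights plus C * J item probabilities for binary items."""
    return (n_classes - 1) + n_classes * n_items


def aic(loglik, k):
    return -2.0 * loglik + 2.0 * k


def bic(loglik, k, n_obs):
    return -2.0 * loglik + k * math.log(n_obs)


# Hypothetical maximized log-likelihoods for 1 to 4 latent classes,
# with 6 binary items and 300 observations.
logliks = {1: -1250.0, 2: -1100.0, 3: -1095.0, 4: -1092.0}
bic_scores = {c: bic(ll, n_free_parameters(c, 6), 300) for c, ll in logliks.items()}
best_n_classes = min(bic_scores, key=bic_scores.get)
print(best_n_classes, bic_scores)
```

With these illustrative numbers the log-likelihood barely improves beyond two classes, so the penalty term dominates and BIC selects two classes. This is the same "elbow" that a scree plot of the criterion would show.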
Conclusion
The addition of a latent component to the NB classifier model offers numerous advantages when compared to other modification attempts. Firstly, it aligns well with the nature of the data, particularly within medical and health contexts. Furthermore, incorporating the latent component allows us to bypass the extensive search algorithm and structure learning required in the local learning and structure extension approach. By utilizing latent class variables, all attributes are incorporated into the model building process, unlike attribute selection approaches that may ignore certain variables and result in the loss of information. As a result, the NB-BLCA model emerges as a suitable alternative to ordinary NB classifiers, particularly when the assumption of independence is violated, especially in the domains of health and medicine.
Availability of data and materials
The datasets used or analyzed during the current study are available from the corresponding author upon reasonable request.
Abbreviations
- NB: Naïve Bayes
- BLCA: Bayesian Latent Class Analysis
- EM: Expectation Maximization
- AIC: Akaike Information Criterion
- BIC: Bayesian Information Criterion
- DIC: Deviance Information Criterion
- NCII: National Cancer Institute of Iran
- GC: Gastric Cancer
- NUD: Non-ulcer dyspepsia
- ML: Machine Learning
- ANB: Augmented Naive Bayes
- TAN: Tree Augmented Naive Bayes
- eTAN: extended Tree Augmented Naive Bayes
- AODE: Averaged One-Dependence Estimators
References
Langarizadeh M, Moghbeli F. Applying naive bayesian networks to disease prediction: a systematic review. Acta Informatica Medica. 2016;24(5):364.
Salma A, Silfianti W. Sentiment analysis of user reviews on covid-19 information applications using naive bayes classifier, Support Vector Machine, and K-Nearest Neighbor. Int Res J Adv Eng Sci. 2021;6(4):158–62
Bishop CM, Nasrabadi NM. Pattern recognition and machine learning. Springer; 2006;4(4):738–838.
Kelly A, Johnson MA: Investigating the statistical assumptions of Naïve Bayes classifiers. In: 2021 55th annual conference on information sciences and systems (CISS): 2021: IEEE; 2021: 1-6.
Rabe-Hesketh S, Skrondal A. Classical latent variable models for medical research. Stat Methods Med Res. 2008;17(1):5–32.
Wickramasinghe I, Kalutarage H. Naive Bayes: applications, variations and vulnerabilities: a review of literature with code snippets for implementation. Soft Computing. 2021;25(3):2277–93.
Langley P, Sage S. Induction of selective Bayesian classifiers. Elsevier; 1994. p. 399–406.
Abraham R, Simha JB, Iyengar S. Medical datamining with a new algorithm for feature selection and naive Bayesian classifier. 10th International Conference on Information Technology (ICIT 2007). 2007;44–9.
Dey Sarkar S, Goswami S, Agarwal A, Aktar J. A novel feature selection technique for text classification using Naive Bayes. Int Sch Res Notices. 2014;2014:717092.
Liu Y. A comparative study on feature selection methods for drug discovery. J Chem Inf Comp Sci. 2004;44(5):1823–8.
Ratanamahatana CA, Gunopulos D. Feature selection for the naive bayesian classifier using decision trees. Appl Artif Intell. 2003;17(5–6):475–87.
Novakovic J: The impact of feature selection on the accuracy of naïve bayes classifier. In: 18th Telecommunications forum TELFOR: 2010; 2010: 1113-1116.
Chen L, Wang S: Automated feature weighting in naive bayes for high-dimensional data classification. In: Proceedings of the 21st ACM international conference on Information and knowledge management: 2012; 2012: 1243-1252.
Lee C-H, Gutierrez F, Dou D: Calculating feature weights in naive bayes with kullback-leibler measure. In: 2011 IEEE 11th International Conference on data mining: 2011: IEEE; 2011: 1146-1151.
Niño-Adan I, Manjarres D, Landa-Torres I, Portillo E. Feature weighting methods: A review. Expert Syst Appl. 2021;184:115424.
Jing Y, Pavlović V, Rehg JM: Efficient discriminative learning of bayesian network classifier via boosted augmented naive bayes. In: Proceedings of the 22nd international conference on Machine learning: 2005; 2005: 369-376.
Zhang H, Ling CX. An improved learning algorithm for augmented naive Bayes. Pacific-Asia Conference on Knowledge Discovery and Data Mining. Heidelberg: Springer Berlin Heidelberg; 2001. p. 581–6.
Long Y, Wang L, Sun M. Structure extension of tree-augmented naive bayes. Entropy. 2019;21(8):721.
Campos CPd, Cuccu M, Corani G, Zaffalon M. Extended tree augmented naive classifier. European Workshop on Probabilistic Graphical Models. Utrecht: Springer International Publishing; 2014. p. 176–89.
Duan Z, Wang L. K-dependence Bayesian classifier ensemble. Entropy. 2017;19(12):651.
Webb GI, Boughton JR, Wang Z. Not so naive Bayes: aggregating one-dependence estimators. Machine learning. 2005;58(1):5–24.
Bielza C, Larranaga P. Discrete Bayesian network classifiers: A survey. ACM Computing Surveys (CSUR). 2014;47(1):1–43.
Alizadeh SH, Hediehloo A, Harzevili NS. Multi independent latent component extension of naive bayes classifier. Knowl Based Syst. 2021;213:106646.
Banerjee A, Shan H: Latent Dirichlet conditional naive-Bayes models. In: Seventh IEEE International Conference on Data Mining (ICDM 2007): 2007: IEEE; 2007: 421-426.
Harzevili NS, Alizadeh SH. Mixture of latent multinomial naive Bayes classifier. Appl Soft Computing. 2018;69:516–27.
Miettunen J, Nordström T, Kaakinen M, Ahmed A. Latent variable mixture modeling in psychiatric research–a review and application. Psychol Med. 2016;46(3):457–67.
Bauer GR, Mahendran M, Walwyn C, Shokoohi M: Latent variable and clustering methods in intersectionality research: systematic review of methods applications. Social psychiatry and psychiatric epidemiology 2022:1-17.
Langseth H, Nielsen TD. Classification using hierarchical naive Bayes models. Mach learn. 2006;63(2):135–59.
Calders T, Verwer S. Three naive Bayes approaches for discrimination-free classification. Data Min Knowl Discov. 2010;21(2):277–92.
Li Y, Lord-Bessen J, Shiyko M, Loeb R. Bayesian latent class analysis tutorial. Multivariate Behav Res. 2018;53(3):430–51.
Asparouhov T, Muthén B: Using Bayesian priors for more flexible latent class analysis. In: proceedings of the 2011 joint statistical meeting, Miami Beach, FL: 2011: American Statistical Association Alexandria, VA; 2011.
McLachlan G, Krishnan T. The EM Algorithm and Extensions. Wiley; 2007. p. 382.
Gupta MR, Chen Y: Theory and use of the EM algorithm. Foundations and Trends® in Signal Processing 2011, 4(3):223-296.
White A, Murphy TB. BayesLCA: An R package for Bayesian latent class analysis. J Stat Softw. 2014;61(13):1–28.
Carlo CM. Markov chain monte carlo and gibbs sampling. Lecture Notes EEB. 2004;581:540.
Christensen R, Johnson W, Branscum A, Hanson TE. Bayesian ideas and data analysis: an introduction for scientists and statisticians. Boca Raton: CRC Press, Taylor and Francis Group; 2011.
Keogh EJ, Pazzani MJ. Learning the structure of augmented Bayesian classifiers. Int J Artif Intell Tools. 2002;11(04):587–601.
Deming WE, Stephan FF. On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Ann Math Stat. 1940;11(4):427–44.
Suesse T, Namazi-Rad M-R, Mokhtarian P, Barthelemy J. Estimating cross-classified population counts of multidimensional tables: an application to regional Australia to obtain pseudo-census counts. 2015.
Barthélemy J, Suesse T. mipfp: An R package for multidimensional array fitting and simulating multivariate Bernoulli distributions. J Stat Softw. 2018;86:1–20.
Zhang NL, Nielsen TD, Jensen FV. Latent variable discovery in classification models. Artif Intell Med. 2004;30(3):283–99.
Rubin DB. Inference and missing data. Biometrika. 1976;63(3):581–92.
Little RJA, Rubin DB. Statistical analysis with missing data. NJ: Wiley; 2020. p. 793.
Tipping ME, Bishop CM. Mixtures of probabilistic principal component analyzers. Neural computation. 1999;11(2):443–82.
Ghahramani Z, Beal M. Propagation algorithms for variational Bayesian learning. Advances in neural information processing systems. 2000;13.
Chopin N, Singh SS. On particle Gibbs sampling. Bernoulli. 2015;21(3):1855–83.
Besag J, Green P, Higdon D, Mengersen K. Bayesian computation and stochastic systems. Statistical science. 1995;1:3–41.
Ekvall KO, Jones GL. Convergence analysis of a collapsed Gibbs sampler for Bayesian vector autoregressions. Electron J Stat. 2021;15(1):691–721.
Jones GL, Hobert JP. Sufficient burn-in for Gibbs samplers for a hierarchical random effects model. Ann Stat. 2004;32(2):784–817.
Boissy J, Giovannelli J-F, Minvielle P. An insight into the Gibbs sampler: keep the samples or drop them? IEEE Signal Process Lett. 2020;27:2069–73.
Arcidiacono P, Jones JB. Finite mixture distributions, sequential likelihood and the EM algorithm. Econometrica. 2003;71(3):933–46.
Hathaway RJ. Another interpretation of the EM algorithm for mixture distributions. Stat Probab Lett. 1986;4(2):53–6.
Vermunt JK. Latent class modeling with covariates: Two improved three-step approaches. Political Analy. 2010;18(4):450–69.
Diaconis P, Khare K, Saloff-Coste L. Gibbs sampling, conjugate priors and coupling. Sankhya A. 2010;72(1):136–69.
Zhu M, Ghodsi A. Automatic dimensionality selection from the scree plot via the use of profile likelihood. Comput Stat Data Anal. 2006;51(2):918–30.
Acknowledgments
This study is a part of the research process supported by Tarbiat Modares University to achieve a Ph.D. degree.
Funding
Not applicable.
Author information
Contributions
AK and KG contributed to the study conception and design. AK, KG, and AS performed the analysis. MM, ME, and SS collected the data and described the clinical results. KG wrote the first draft of the manuscript, and all authors commented on previous versions. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
All methods were carried out following relevant guidelines and regulations. This study was approved by the ethics committee of the School of Medical Sciences, Tarbiat Modares University, under the approval ID IR.MODARES.REC.1399.154. All participants provided written informed consent that their data collected as part of the study could be used in research. All the patients were followed until the event or until they chose to stop participating.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1:
S-Table 1. List of questionnaire binary attributes with the categories used in the Real-world data example.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Gohari, K., Kazemnejad, A., Mohammadi, M. et al. A Bayesian latent class extension of naive Bayesian classifier and its application to the classification of gastric cancer patients. BMC Med Res Methodol 23, 190 (2023). https://doi.org/10.1186/s12874-023-02013-4
DOI: https://doi.org/10.1186/s12874-023-02013-4