 Research
 Open Access
 Published:
Analysis of zero inflated dichotomous variables from a Bayesian perspective: application to occupational health
BMC Medical Research Methodology volume 21, Article number: 277 (2021)
Abstract
Background
Zeroinflated models are generally aimed to addressing the problem that arises from having two different sources that generate the zero values observed in a distribution. In practice, this is due to the fact that the population studied actually consists of two subpopulations: one in which the value zero is by default (structural zero) and the other is circumstantial (sample zero).
Methods
This work proposes a new methodology to fit zero inflated Bernoulli data from a Bayesian approach, able to distinguish between two potential sources of zeros (structural and nonstructural).
Results
The proposed methodology performance has been evaluated through a comprehensive simulation study, and it has been compiled as an R package freely available to the community. Its usage is illustrated by means of a real example from the field of occupational health as the phenomenon of sickness presenteeism, in which it is reasonable to think that some individuals will never be at risk of suffering it because they have not been sick in the period of study (structural zeros). Without separating structural and nonstructural zeros one would be studying jointly the general health status and the presenteeism itself, and therefore obtaining potentially biased estimates as the phenomenon is being implicitly underestimated by diluting it into the general health status.
Conclusions
The proposed methodology is able to distinguish two different sources of zeros (structural and nonstructural) from dichotomous data with or without covariates in a Bayesian framework, and has been made available to any interested researcher in the form of the bayesZIB R package (https://cran.rproject.org/package=bayesZIB).
Background
In general, zeroinflated models are aimed to addressing the problem that arises from having two different sources that generate the zero values observed in a distribution. In practice, this is due to the fact that the population studied actually consists of two subpopulations: one in which the value zero is by default (structural zero) and the other is circumstantial (sample zero). An example could be the study of sickness presenteeism (SP), i.e. attending work while sick [1]. If it is not previously restricted, the population is made up, among others, of workers who are zero because they have never been sick (structural zeros) and workers who, having been sick, did not attend their work place (sample zeros). Note that the difference is important: roughly the first zero informs us exclusively about the status of health, the second about the exercise of the right to take a sick leave.
The most commonly used zeroinflated models are those that are related to count ing variables, where it is assumed that the zero value has a dichotomous source that determines whether or not the subject is at risk of suffering the event of interest and another source, only for the individuals at risk, that corresponds to the num ber of episodes (counts) that have been experienced by each individual at risk. In this context, the most common available models would be the well known Zero Inflated Poisson (ZIP) and Negative Binomial (ZINB). A good introduction to the mathematical properties of these models can be found in [2], and they have been used in many fields such as quality control ([3]), epidemiology ([4]) or medicine ([5]) among many others. Some guidelines on how to proceed when dealing with count outcomes potentially overdispersed or zeroinflated have been published recently ([6, 7]), based on classical procedures like Vuong’s test ([8]) to check for overdis persion ([9]) and zeroinflation ([10]), although these guides cannot be applied to the case studied here due to the dichotomous nature of the outcome. In general, zeroinflated models can be expressed as
where g is the structural zero probability and f (0) is the zero probability of an appropriate distribution (Poisson, negative binomial or Bernoulli as in our case).
In practice, zeroinflated models with both dichotomous sources (a mixture of two Bernoulli random variables, one with probability of success ω and the other with probability of success p) have received far less attention. This is due, in large part, to the fact that the resulting distribution is once again a Bernoulli with probability.
of success ω · p, so that the proportion of structural zeros (1 − ω) and sample zeros (1 − p) are indistinguishable from the point of view of frequentist statistics. However, from the Bayesian perspective and using known reasonable information about these proportions, it is possible to distinguish the two sources of zeros and estimate ω and p.
Some authors have recently suggested, in other areas such as the classification or identification of images, the usage of Bernoullimixture models, based on numerical algorithms such as ExpectationMaximization (EM) to estimate the parameters [11, 12], given the complexity of the likelihood functions involved. In these cases, however, the inclusion of covariates or adjustment variables is virtually impossible. Also in other areas there are some recent developments in a similar line, such as [13].
In this article we illustrate the use of Zero Inflated Bernoulli (ZIB) models by means of a real dataset on SP, and the results obtained are compared with those of adjusted logistic regressions on the total population or only in those individuals at risk. In the literature, the SP registry is carried out in a selfreported way, asking about the episodes in the last year and later, recorded in a dichotomized way (no SP: 0 episodes; yes SP: 1 or more episodes). The justification for this dichotomization is fundamentally based on two aspects: one, the possible memory bias; second, the excessive influence of workers who report a very high number of episodes.
Methods
Let Y be the variable that indicates occurrence of the phenomenon under study. The proposed model has a probability function defined by
where ω is the probability of exposure and p is the probability of occurrence of the phenomenon of interest among exposed individuals, as shown in Fig. 1. According to this scheme, the proportion of structural zeros will be 1 − ω and the proportion of nonstructural zeros will be ω · (1 − p).
To overcome the impossibility of models without covariates based on the frequen tist approach to differentiate between structural and nonstructural zeros, in this work a model within the Bayesian framework is proposed. In this context, we as sume that the prior distribution of the parameter of the first Bernoulli ω is uniform between 0 and 0.5 while the prior distribution of the probability of success in the second Bernoulli p is uniform between 0.5 and 1. In this way, the proposed model will be able to distinguish the two sources of zeros. Obviously, these hypotheses can be modified based on prior knowledge of the parameters that govern the phenomenon under investigation by making simple changes to the posterior distributions defined in Eq. (6) and recalculating the marginals shown in Eq. (5). This distinction is not necessary if covariates are included in the model since the covariates allow the origin of the zeros to be distinguished. To ensure that the estimates are kept within the appropriate parameter space, the logit link, commonly used in logistic regression, has been used.
where X_{1},. .., X_{k} are the covariates that have a hypotetical impact over the zero inflated part and Z_{1},. .., Z_{m} are the covariates that might have an influence over the non zero inflated part. The parameters θ_{i}, i = 0,. .., k and β_{j}, j = 0,. .., m are assumed to follow a normal distribution with mean 0 and variance σ^{2} and \({\sigma}_{\theta}^2\) and \({\sigma}_{\beta}^2\) respectively, modeled as hyperparameters.
The models proposed to analyse the data described in the following section and in the simulation study have been written in the programming language Stan, within the R environment [14] and are freely available from the authors as a package called bayesZIB [15]. To the best of our knowledge, this is the only package available in R able to fit zeroinflated Bernoulli regression models. The use of the package is very similar to other packages that implement zeroinflated models, such as pscl [16], to facilitate the interpretation of the results, while more advanced users could easily adapt the code to their specific requirements. If necessary, appropriate priors for the parameters ω and p can be defined in the function bayesZIB using the argument priors (only uniforms with different parameters are implemented so far in the package).
No covariates
In the particular case in which the interest is in estimating the proportion of struc tural (1ω) and sample (1p) zeros without accounting for the effect of any covariate, the posterior distributions of ω and p can be obtained analytically assuming some a priori knowledge of their distributions. As mentioned before, one could set ω to be uniform distributed on [0, 0.5] and p to be uniform distributed on [0.5, 1]. Be cause the observations are Bernoulli(p · ω) distributed, the likelihood function can be written as
where m is the frequency of occurrence of the phenomenon of interest and n is the total number of observations. From here, the joint posterior could be obtained as
From here the posterior marginal distributions of the two parameters can be obtained as
where F is the beta distribution function with parameters m + 1 and n − m + 1, implemented in the R function pbeta.
The methods used to analyse the real data example in the following sections are in accordance with relevant guidelines and regulations, in particular with the Inter national Labour Organization criteria, also used to define the target population in the European Working Conditions Survey [17] or the EU Labour Force Survey [18]. Participation in the considered study was voluntary and confidential, and informed consent was obtained from all subjects in order to be included. The data were anal ysed anonymously and all procedures were approved by the Ethics Committee on Animal and Human Experimentation of the Autonomous University of Barcelona (CEEAH/3445).
Results
This section presents the results of the analyses using the proposed methodology over a real data set and they are compared to the most common alternatives. The performance of the method is also studied by means of a comprehensive simulation study, with and without covariates.
Real data
In the database used to exemplify the use of the proposed methodology, we have a total of n = 1564 workers. Among these, it is known that 946 (around 61%) were not at risk of being presenteeist because they were not ill on any day during the study period. These observations correspond to the concept of structural zeros (1 − ω = 0.61), and an estimate of their proportion can be obtained by using zero inflated models, even taking into account the values of the variables used as explanations in the regression model. The proportion of presenteeists among those exposed is p = 0.70. Globally, a total of m = 430 workers experienced the event of interest. In the following subsections CI is used as an abbreviation of the confidence interval for frequentist analyses, and CrI is used for credibility interval when referring to the proposed Bayesian model.
Including the whole population
Taking all the population into consideration (n = 1564), i. e., including individuals at risk and not at risk (those who were not at risk during the study period), we fit a Bayesian zero inflated Bernoulli model, where the proportion of structural zeros 1 − ω is greater than 0.5 (prior uniform for ω at [0, 0.5]) and the proportion of sample zeros 1 − p is less than 0.5 (prior uniform for p at [0.5, 1]). This information is extracted from [19]. In this case, without using covariates, the model allows estimating the values of ωˆ = 0.37 (95% CrI: 0.27–0.49) and pˆ = 0.74 (95% CrI: 0.55–0.99). Here ωˆ and pˆ indicate the median of the marginal posterior of ω and p respectively. The a priori and a posteriori marginal distributions of both parameters are shown in Fig. 2. The different shapes between marginal priors and posteriors show that the models learn from the data.
On the other hand, analysing these data as is traditional in the literature, using a logistic regression model without taking into account that there are subjects who have not been at risk, the proportion obtained from presentists is 0.27 (95% CI: 0.25–0.30), a value with a controversial interpretation since it is significantly underestimating the proportion of presenters if the subjects who have not been at risk of being present are excluded, since it is ultimately an estimate of ωˆ · pˆ, being impossible to identify the two parameters.
Additionally, the proposed model allows incorporating covariates in both pro cesses. To illustrate how it works, we will consider here the selfperceived general state of health (categorized as good or bad) and the feeling of being replaceable, which is an item included in the vulnerability dimension of the Employment Pre cariousness Scale [20], with categories “Always”, “Sometimes” and “Never”. The hypothesis is that the general state of health would be related to the risk of be ing present (zero inflated part of the model) and the feeling of being replaceable would be related to presenting the phenomenon once a worker is exposed (non zero inflated part in the model), so the model is including one covariate in each part (k = m = 1). The model was fitted using 5 Markov chains, 5000 iterations in each chain (half used for warmup and half for inference), a target average acceptance probability of 0.999 and a maximum allowed treedepth of 25. Notice that these technical values might need to be changed depending on the analysed data. As can be seen in the next section, the results of the model shown in Table 1 largely co incide with the standard logistic analyses reported in Table 2, particularly in the direction and impact of associations. All R codes used in this paper are available as Supplementary Material.
Similarly to other regression models, the effect of never having the feeling of being replaceable over the odds of attending work when sick for someone who is at risk compared to workers who always have that sensation can be quantified by e^{− 1.04} = 0.35.
Excluding healthy population
If the information about which subjects are actually exposed to the phenomenon of interest is available (an ideal but unusual situation in practice), unexposed subjects could be excluded and a logistic regression model could be adjusted to the exposed individuals. Using the same explanatory variables described in the previous section, the corresponding coefficients and their 95% confidence intervals are described in Table 2.
Simulation study
In order to check the performance of the proposed methodology, 100 random samples were generated for each considered sample size (n = 500, 1500), and combination of parameters. The zero inflated part was build upon the logistic regression model logit(P (X = 1)) = θ_{0} + θ_{1} · x_{1} + θ_{2} · x_{2}, where x_{1} and x_{2} are two independent covariates, each following a standard normal distribution. The non zero inflated part was build upon the logistic regression model logit(P (Y = 1  X = 1)) = β_{0} + β_{1} · x_{3} + β_{2} · x_{4}, where x_{3} and x_{4} are two independent covariates, each with a standard normal distribution. To cover different effect magnitudes, the following values for each parameter were considered:

β_{0} = 0.5, 1, 2

β_{1} = 2, 3, 4

β_{2} = 3

θ_{0} = − 0.5, − 1, − 2

θ_{1} = − 2, − 3, − 4

θ_{2} = − 3
It is important to notice that this is an extreme situation, in which we consider that all the mass probability of the parameter distributions is concentrated in one point, the “true” value of the parameter.
For each random sample, the posterior marginal distributions of the parameters have been summarised by their median and percentiles 2.5 and 97.5%.
Tables 3 and 4 show, for each combination of parameters, the average estimates and upper and lower limits of the 95% credibility intervals. As no relevant differ ences were observed regarding sample sizes, Tables 3 and 4 shows only the results corresponding to n = 1500. The results corresponding to n = 500 are available as Supplementary Material. It can be seen that in all cases the original parameters used to generate the simulations can be properly recovered by the fitted models.
The R code used for the simulation is available as Supplementary Material. An additional simulation was conducted to evaluate the performance of the proposed methodology when there are no covariates involved, the details and results of this simulation can also be find in the Supplementary Material (Appendix A, Table S2).
Discussion
The proposed methodology is able to distinguish two different sources of zeros (structural and nonstructural) from dichotomous data in a Bayesian framework by assuming priors with different parameters on proportion of structural and non structural zeros. Furthermore, since it is freely available as an R package, it is easily usable for any researcher who needs to adjust this type of data and easily modifiable for more advanced users who need to adapt the model to their context, for example with different choices of the prior distributions of ω and p.
The approach used to analyse the SP is an important topic. Some studies include all working population to estimate SP, whilst other exclude “healthy” workers. As result, different conclusions in terms of prevalence and associated factors are obtained [21]. SP is an outcome resulting from mixing two phenomena, i.e. health status and exercise of rights. Health status plays a role regarding the fact of being exposed; and, among the exposed, the lack of the exercise of the right to take a sick leave determines SP. Using the proposed ZIB approach one could describe, in a single analysis, both phenomena: first, which factors are associated to the exposure to presenteeism (to be “sick”, factors related to health status), and after that, which factors increase the probability of being presenteeist among the exposed workers.
The simulation study shows that, even with relatively small sample sizes the model is capable of producing reasonable estimates for the parameters involved in both the zero inflated and non zero inflated processes. As expected, the credibility intervals length diminishes with sample size while their coverage grows.
Conclusions
The proposed method is a reliable alternative for the analysis of zero inflated di chotomous outcomes, as shown by the simulation study, and can be very useful in situations when there are two potential and indistinguishable sources of zeros. If there are covariates to be included in the model, the method is able to use them in order to identify the subpopulation at risk, and the Bayesian strategy assures that the two sources of zeros may be detected even when there are no covariates by utilizing different priors for the probability of success of each Bernoulli variable. The proposed model has been compiled in the form of the bayesZIB R package [15], so it is publicly available for any researcher facing this issue.
Availability of data and materials
The dataset analysed during the current study is available in the GitHub repository, https://github.com/dmorinya/BayesZIB. R codes used to analyse the real data and to generate the data used in the simulation study are available as supplementary material.
Abbreviations
 CI:

Confidence interval
 CrI:

Credible interval
 EM:

ExpectationMaximization
 SP:

Sickness presenteeism
 ZIB:

Zero inflated Bernoulli
 ZINB:

Zero inflated negative binomial
 ZIP:

Zero inflated Poisson
References
 1.
Johns G. Presenteeism in the workplace: a review and research agenda. J Organ Behav. 2010;31(4):519–42. https://doi.org/10.1002/job.630.
 2.
Cameron AC, Trivedi PK. Regression analysis of count data. New York: Cambridge University Press; 1998. https://doi.org/10.1017/CBO9780511814365.
 3.
Lambert D. Zeroinflated poisson regression, with an application to defects in manufacturing. Technometrics. 1992;34(1):1. https://doi.org/10.2307/1269547.
 4.
Abiodun G, Makinde O, Adeola A, Njabo K, Witbooi P, DjidjouDemasse R, et al. A dynamical and zeroinflated negative binomial regression modelling of malaria incidence in Limpopo Province, South Africa. Int J Environ Res Public Health. 2019;16(11). https://doi.org/10.3390/IJERPH16112000.
 5.
Zhao H, Pan Y, Wang C, Guo Y, Yao N, Wang H, et al. The effects of metal exposures on Charlson comorbidity index using zeroinflated negative binomial regression model: NHANES 20112016. Biol Trace Elem Res. 2021;199(6):2104–11. https://doi.org/10.1007/S12011020023314.
 6.
PerumeanChaney SE, Morgan C, McDowall D, Aban I. Zeroinflated and overdispersed: what’s one to do? J Stat Comput. 2013;83(9):1671–83. https://doi.org/10.1080/00949655.2012.668550http://dx.doi.org.sire.ub.edu/10.1080/00949655.2012.668550.
 7.
Paulo Favero L, de Freitas Souza R, Belfiore P, Luiz Corrêa H, Haddad MF, Paulo L, et al. Count data regression analysis: concepts, overdispersion detection, zeroinflation identification, and applications with R. Pract Assess Res Eval. 2021;26. https://doi.org/10.7275/44nncj68.
 8.
Vuong QH. Likelihood ratio tests for model selection and nonnested hypotheses. Econometrica. 1989;57(2):307. https://doi.org/10.2307/1912557.
 9.
Cameron AC, Trivedi PK. Regressionbased tests for overdispersion in the Poisson model. J Econ. 1990;46(3):347–64. https://doi.org/10.1016/03044076(90)90014K.
 10.
Desmarais BA, Harden JJ. Testing for zero inflation in count models: bias correction for the Vuong test. Stata J. 2013;13(4):810–35. https://doi.org/10.1177/1536867X1301300408https://doiorg.sire.ub.edu/10.1177/1536867X1301300408.
 11.
Barbu A, Wu T, Wu YN. Learning mixtures of bernoulli templates by tworound EM with performance guarantee. Electron J Stat. 2015;8:3004–30. https://doi.org/10.1214/14EJS981.1305.0319v6.
 12.
Diop A, Diop A, Dupuy JF. Simulationbased inference in a zeroinflated Bernoulli regression model. Commun Stats: Simul Comput. 2016;45(10):3597–614. https://doi.org/10.1080/03610918.2014.950743.
 13.
Sun Z, Rosen O, Sampson AR. Multivariate Bernoulli mixture models with application to postmortem tissue studies in schizophrenia. Biometrics. 2007;63(3):901–9. https://doi.org/10.1111/j.15410420.2007.00762.x.
 14.
R Core Team: R: A Language and environment for statistical computing. R Foundation for statistical computing, Vienna, Austria (2021). R Foundation for statistical computing. https://www.Rproject.org/.
 15.
Moriña Soler D, Puig P, Navarro A. bayesZIB: Bayesian zeroinflated Bernoulli regression model. In: R package version 0.0.2; 2021.
 16.
Zeileis A, Kleiber C, Jackman S. Regression models for count data in R. J Stat Softw. 2008;27(8)1–25.
 17.
Eurofound. EWCS 2015 – Methodology [Internet]. https://www.eurofound.europa.eu/surveys/europeanworkingconditionssurveys/sixtheuropeanworkingconditionssurvey2015/ewcs2015methodology. Accessed 10 Sep 2021.
 18.
Eurostat. EU labour force survey  methodology [Internet]. http://ec.europa.eu/eurostat/statisticsexplained/index.php/EU_labour_force_survey__methodology. Accessed 10 Sep 2021.
 19.
Navarro A, SalasNicás S, Moncada S, Llorens C, MolineroRuiz E. Prevalence, associated factors and reasons for sickness presenteeism: a crosssectional nationally representative study of salaried workers in Spain, 2016. BMJ Open. 2018;8(7):021212. https://doi.org/10.1136/bmjopen2017021212.
 20.
Vives A, Amable M, Ferrer M, Moncada S, Llorens C, Muntaner C, et al. Employment precariousness and poor mental health: evidence from Spain on a new social determinant of health. J Environ Public Health. 2013;2013:978656. https://doi.org/10.1155/2013/978656.
 21.
Navarro A, Salas‐Nicás S, Llorens C, Moncada S, Molinero‐Ruíz E, Moriña D. Sickness presenteeism: are we sure about what we are studying? A research based on a literature review and an empirical illustration. Am J Ind Med. 2019;62(7). https://doi.org/10.1002/ajim.22982.
Acknowledgements
Not applicable.
Funding
This work was partially supported by grant RTI2018–096072BI00 from the Spanish Ministry of Science and Innovation and by the Spanish State Research Agency, through the Severo Ochoa and María de Maeztu Program for Centers and Units of Excellence in R&D (CEX2020–001084M). The funding sources had no role in the data collection, analysis or interpretation of the results.
Author information
Affiliations
Contributions
DM, PP and AN developed the statistical model. DM and PP derived the described properties, and DM implemented the model in R software and conducted the analyses. AN provided the context on occupational health and sickness presenteeism. All authors have read and approved the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
All procedures were approved by the Ethics Committee on Animal and Human Experimentation of the Autonomous University of Barcelona (CEEAH/3445). Participation in the study was voluntary and informed consent was obtained from all subjects.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
Additional tables and code. Additional tables obtained from the simulation study and R code to reproduce the analyses.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Moriña, D., Puig, P. & Navarro, A. Analysis of zero inflated dichotomous variables from a Bayesian perspective: application to occupational health. BMC Med Res Methodol 21, 277 (2021). https://doi.org/10.1186/s12874021014272
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12874021014272
Keywords
 Presenteeism
 Bayesian methods
 Zeroinflation
 Simulation study
 Bernoulli mixture models