Analysis of zero inflated dichotomous variables from a Bayesian perspective: application to occupational health

Background Zero-inflated models are generally aimed to addressing the problem that arises from having two different sources that generate the zero values observed in a distribution. In practice, this is due to the fact that the population studied actually consists of two subpopulations: one in which the value zero is by default (structural zero) and the other is circumstantial (sample zero). Methods This work proposes a new methodology to fit zero inflated Bernoulli data from a Bayesian approach, able to distinguish between two potential sources of zeros (structural and non-structural). Results The proposed methodology performance has been evaluated through a comprehensive simulation study, and it has been compiled as an R package freely available to the community. Its usage is illustrated by means of a real example from the field of occupational health as the phenomenon of sickness presenteeism, in which it is reasonable to think that some individuals will never be at risk of suffering it because they have not been sick in the period of study (structural zeros). Without separating structural and non-structural zeros one would be studying jointly the general health status and the presenteeism itself, and therefore obtaining potentially biased estimates as the phenomenon is being implicitly underestimated by diluting it into the general health status. Conclusions The proposed methodology is able to distinguish two different sources of zeros (structural and non-structural) from dichotomous data with or without covariates in a Bayesian framework, and has been made available to any interested researcher in the form of the bayesZIB R package (https://cran.r-project.org/package=bayesZIB). Supplementary Information The online version contains supplementary material available at 10.1186/s12874-021-01427-2.


Background
In general, zero-inflated models are aimed to addressing the problem that arises from having two different sources that generate the zero values observed in a distribution. In practice, this is due to the fact that the population studied actually consists of two subpopulations: one in which the value zero is by default (structural zero) and the other is circumstantial (sample zero). An example could be the study of sickness presenteeism (SP), i.e. attending work while sick [1]. If it is not previously restricted, the population is made up, among others, of workers who are zero because they have never been sick (structural zeros) and workers who, having been sick, did not attend their work place (sample zeros). Note that the difference is important: roughly the first zero informs us exclusively about the status of health, the second about the exercise of the right to take a sick leave.
The most commonly used zero-inflated models are those that are related to count-ing variables, where it is assumed that the zero value has a dichotomous source that determines whether or not the subject is at risk of suffering the event of interest and another source, only for the individuals at risk, that corresponds to the number of episodes (counts) that have been experienced by each individual at risk. In this context, the most common available models would be the well known Zero-Inflated Poisson (ZIP) and Negative Binomial (ZINB). A good introduction to the mathematical properties of these models can be found in [2], and they have been used in many fields such as quality control ( [3]), epidemiology ( [4]) or medicine ( [5]) among many others. Some guidelines on how to proceed when dealing with count outcomes potentially overdispersed or zeroinflated have been published recently ( [6,7]), based on classical procedures like Vuong's test ( [8]) to check for overdis-persion ( [9]) and zero-inflation ( [10]), although these guides cannot be applied to the case studied here due to the dichotomous nature of the outcome. In general, zero-inflated models can be expressed as where g is the structural zero probability and f (0) is the zero probability of an appropriate distribution (Poisson, negative binomial or Bernoulli as in our case).
In practice, zero-inflated models with both dichotomous sources (a mixture of two Bernoulli random variables, one with probability of success ω and the other with probability of success p) have received far less attention. This is due, in large part, to the fact that the resulting distribution is once again a Bernoulli with probability.
of success ω · p, so that the proportion of structural zeros (1 − ω) and sample zeros (1 − p) are indistinguishable from the point of view of frequentist statistics. However, from the Bayesian perspective and using known reasonable information about these proportions, it is possible to distinguish the two sources of zeros and estimate ω and p.
Some authors have recently suggested, in other areas such as the classification or identification of images, the usage of Bernoulli-mixture models, based on numerical algorithms such as Expectation-Maximization (EM) to estimate the parameters [11,12], given the complexity of the likelihood functions involved. In these cases, however, the inclusion of covariates or adjustment variables is virtually impossible. Also in other areas there are some recent developments in a similar line, such as [13].
In this article we illustrate the use of Zero Inflated Bernoulli (ZIB) models by means of a real dataset on SP, and the results obtained are compared with those of adjusted logistic regressions on the total population or only in those individuals at risk. In the literature, the SP registry is carried out in a self-reported way, asking about the episodes in the last year and later, recorded in a dichotomized way (no SP: 0 episodes; yes SP: 1 or more episodes). The justification for this dichotomization is fundamentally based on two aspects: one, the possible memory bias; second, the excessive influence of workers who report a very high number of episodes.

Methods
Let Y be the variable that indicates occurrence of the phenomenon under study. The proposed model has a probability function defined by where ω is the probability of exposure and p is the probability of occurrence of the phenomenon of interest among exposed individuals, as shown in Fig. 1. According to this scheme, the proportion of structural zeros will be 1 − ω and the proportion of non-structural zeros will be ω · (1 − p).
To overcome the impossibility of models without covariates based on the frequen-tist approach to differentiate between structural and non-structural zeros, in this work a model within the Bayesian framework is proposed. In this context, we as-sume that the prior distribution of the parameter of the first Bernoulli ω is uniform between 0 and 0.5 while the prior distribution of the probability of success in the second Bernoulli p is uniform between 0.5 and 1. In this way, the proposed model will be able to distinguish the two sources of zeros. Obviously, these hypotheses can be modified based on prior knowledge of the parameters that govern the phenomenon under investigation by making simple changes to the posterior distributions defined in Eq. (6) and recalculating the marginals shown in Eq. (5). This distinction is not necessary if covariates are included in the model since the covariates allow the origin of the zeros to be distinguished. To ensure that the estimates are kept within the appropriate parameter space, the logit link, commonly used in logistic regression, has been used.
where X 1 ,. .., X k are the covariates that have a hypotetical impact over the zero inflated part and Z 1 ,. .., Z m are the covariates that might have an influence over the non zero inflated part. The parameters θ i , i = 0,. .., k and β j , j = 0,. .., m are assumed to follow a normal distribution with mean 0 and variance σ 2 and σ 2 θ and σ 2 β respectively, modeled as hyperparameters.
The models proposed to analyse the data described in the following section and in the simulation study have been written in the programming language Stan, within the R environment [14] and are freely available from the authors as a package called bayesZIB [15]. To the best of our knowledge, this is the only package available in R able to fit zero-inflated Bernoulli regression models. The use of the package is very similar to other packages that implement zero-inflated models, such as pscl [16], to facilitate the interpretation of the results, while more advanced users could easily adapt the code to their specific requirements. If necessary, appropriate priors for the parameters ω and p can be defined in the function bayes-ZIB using the argument priors (only uniforms with different parameters are implemented so far in the package).

No covariates
In the particular case in which the interest is in estimating the proportion of struc-tural (1-ω) and sample (1-p) zeros without accounting for the effect of any covariate, the posterior distributions of ω and p can be obtained analytically assuming some a priori knowledge of their distributions. As mentioned before, one could set ω to be uniform distributed on [0, 0.5] and p to be uniform distributed on [0.5, 1]. Be-cause the observations are Bernoulli(p · ω) distributed, the likelihood function can be written as where m is the frequency of occurrence of the phenomenon of interest and n is the total number of observations. From here, the joint posterior could be obtained as From here the posterior marginal distributions of the two parameters can be obtained as where F is the beta distribution function with parameters m + 1 and n − m + 1, implemented in the R function pbeta.
The methods used to analyse the real data example in the following sections are in accordance with relevant guidelines and regulations, in particular with the International Labour Organization criteria, also used to define the target population in the European Working Conditions Survey [17] or the EU Labour Force Survey [18]. Participation in the considered study was voluntary and confidential, and informed consent was obtained from all subjects in order to be included. The data were analysed anonymously and all procedures were approved by the Ethics Committee on Animal and Human Experimentation of the Autonomous University of Barcelona (CEEAH/3445).

Results
This section presents the results of the analyses using the proposed methodology over a real data set and they are compared to the most common alternatives. The performance of the method is also studied by means of a comprehensive simulation study, with and without covariates.

Real data
In the database used to exemplify the use of the proposed methodology, we have a total of n = 1564 workers. Among these, it is known that 946 (around 61%) were not at risk of being presenteeist because they were not ill on any day during the study period. These observations correspond to the concept of structural zeros (1 − ω = 0.61), and an estimate of their proportion can be obtained by using zero inflated models, even taking into account the values of the variables used as explanations in the regression model. The proportion of presenteeists among those exposed is p = 0.70. Globally, a total of m = 430 workers experienced the event of interest. In the following subsections CI is used as an abbreviation of the confidence interval for frequentist analyses, and CrI is used for credibility interval when referring to the proposed Bayesian model.

Including the whole population
Taking all the population into consideration (n = 1564), i. e., including individuals at risk and not at risk (those who were not at risk during the study period), we fit a Bayesian zero inflated Bernoulli model, where the proportion of structural zeros 1 − ω is greater than 0.5 (prior uniform for ω at [0, 0.5]) and the proportion of sample zeros 1 − p is less than 0.5 (prior uniform for p at [0.5, 1]). This information is extracted from [19]. In this case, without using covariates, the model allows On the other hand, analysing these data as is traditional in the literature, using a logistic regression model without taking into account that there are subjects who have not been at risk, the proportion obtained from presentists is 0.27 (95% CI: 0.25-0.30), a value with a controversial interpretation since it is significantly underestimating the proportion of presenters if the subjects who have not been at risk of being present are excluded, since it is ultimately an estimate of ωˆ · pˆ, being impossible to identify the two parameters.
Additionally, the proposed model allows incorporating covariates in both pro-cesses. To illustrate how it works, we will consider here the self-perceived general state of health (categorized as good or bad) and the feeling of being replaceable, which is an item included in the vulnerability dimension of the Employment Precariousness Scale [20], with categories "Always", "Sometimes" and "Never". The hypothesis is that the general state of health would be related to the risk of be-ing present (zero inflated part of the model) and the feeling of being replaceable would be related to presenting the phenomenon once a worker is exposed (non zero inflated part in the model), so the model is including one covariate in each part (k = m = 1). The model was fitted using 5 Markov chains, 5000 iterations in each chain (half used for warmup and half for inference), a target average acceptance probability of 0.999 and a maximum allowed treedepth of 25. Notice that these technical values might need to be changed depending on the analysed data. As can be seen in the next section, the results of the model shown in Table 1 largely co-incide with the standard logistic analyses reported in Table 2, particularly in the direction and impact of Similarly to other regression models, the effect of never having the feeling of being replaceable over the odds of attending work when sick for someone who is at risk compared to workers who always have that sensation can be quantified by e − 1.04 = 0.35.

Excluding healthy population
If the information about which subjects are actually exposed to the phenomenon of interest is available (an ideal but unusual situation in practice), unexposed subjects could be excluded and a logistic regression model could be adjusted to the exposed individuals. Using the same explanatory variables described in the previous section, the corresponding coefficients and their 95% confidence intervals are described in Table 2.

Simulation study
In order to check the performance of the proposed methodology, 100 random samples were generated for each considered sample size (n = 500, 1500), and combination of parameters. The zero inflated part was build upon the logistic regression model logit(P (X = 1)) = θ 0 + θ 1 · x 1 + θ 2 · x 2 , where x 1 and x 2 are two independent covariates, each following a standard normal distribution. The non zero inflated part was build upon the logistic regression model logit(P (Y = 1 | X = 1)) = β 0 + β 1 · x 3 + β 2 · x 4 , where x 3 and x 4 are two independent covariates, each with a standard normal distribution. To cover different effect magnitudes, the following values for each parameter were considered: It is important to notice that this is an extreme situation, in which we consider that all the mass probability of the parameter distributions is concentrated in one point, the "true" value of the parameter.
For each random sample, the posterior marginal distributions of the parameters have been summarised by their median and percentiles 2.5 and 97.5%. Tables 3 and 4 show, for each combination of parameters, the average estimates and upper and lower limits of the 95% credibility intervals. As no relevant differ-ences were observed regarding sample sizes, Tables 3 and 4 shows only the results corresponding to n = 1500. The results corresponding to n = 500 are available as Supplementary Material. It can be seen that in all cases the original parameters used to generate the simulations can be properly recovered by the fitted models.
The R code used for the simulation is available as Supplementary Material. An additional simulation was conducted to evaluate the performance of the proposed methodology when there are no covariates involved, the details and results of this simulation can also be find in the Supplementary Material (Appendix A, Table S2).

Discussion
The proposed methodology is able to distinguish two different sources of zeros (structural and non-structural) from dichotomous data in a Bayesian framework by assuming priors with different parameters on proportion of structural and non-structural zeros. Furthermore, since it is freely available as an R package, it is easily usable for any researcher who needs to adjust this type of data and easily modifiable for more advanced users who need to adapt the model to their context, for example with different choices of the prior distributions of ω and p.
The approach used to analyse the SP is an important topic. Some studies include all working population to estimate SP, whilst other exclude "healthy" workers. As result, different conclusions in terms of prevalence and associated factors are obtained [21]. SP is an outcome resulting from mixing two phenomena, i.e. health status and exercise of rights. Health status plays a role regarding the fact of being exposed; and, among the exposed,   Table 3 Simulation study results including covariates (I) β 0 β 1 θ 0 θ 1 β 0 (95% CrI) β 1 (95% CrI) β 2 (95% CrI) θ 0 (95% CrI) θ 1 (95% CrI) θ 2 (95% CrI)   the lack of the exercise of the right to take a sick leave determines SP. Using the proposed ZIB approach one could describe, in a single analysis, both phenomena: first, which factors are associated to the exposure to presenteeism (to be "sick", factors related to health status), and after that, which factors increase the probability of being presenteeist among the exposed workers. The simulation study shows that, even with relatively small sample sizes the model is capable of producing reasonable estimates for the parameters involved in both the zero inflated and non zero inflated processes. As expected, the credibility intervals length diminishes with sample size while their coverage grows.