 Research
 Open access
 Published:
Zeroaugmented betaprime model for multilevel semicontinuous data: a Bayesian inference
BMC Medical Research Methodology volume 22, Article number: 283 (2022)
Abstract
Semicontinuous data characterized by an excessive proportion of zeros and rightskewed continuous positive values appear frequently in medical research. One example would be the pharmaceutical expenditure (PE) data for which a substantial proportion of subjects investigated may report zero. Twopart mixedeffects models have been developed to analyse clustered measures of semicontinuous data from multilevel studies. In this study, we propose a new flexible twopart mixedeffects model with skew distributions for nested semicontinuous cost data under the framework of a Bayesian approach. The proposed model specification consists of two mixedeffects models linked by the correlated random effects: Part I) a model on the occurrence of positive values using a generalized logistic mixed model; and Part II) a model on the magnitude of positive values using a linear mixed model where the model errors follow skew distributions including betaprime (BP). The proposed method is illustrated with pharmaceutical expenditure data from a multilevel observational study and the analytic results are reported by comparing potential models under different skew distributions. Simulation studies are conducted to assess the performance of the proposed model. The DIC_{3}, LPML, WAIC, and LOO as the Bayesian model selection criteria and measures of divergence used to compare the models.
Introduction
Semicontinuous data featured with an excessive proportion of zeros and rightskewed positive values arise frequently in health economics and health services research [1]. Examples include alcohol consumption, householdlevel consumption of food items, medical cost, and substance abuse symptom scales. Statistical models with normality assumptions ignoring the skewness and the spike at zero are not suitable for this type of data and may lead to substantial bias and incorrect statistical inferences. Twopart (zeroaugmented) models, originating in econometrics [2, 3], have been developed extensively in the last three decades to analyse this type of data and have been applied to scientific fields other than economics such as clinical research and health services. In twopart models, we view a semicontinuous variable as the result of two processes: one binomial process determining whether the positive value occurs and one continuous process determining the actual value given it is nonzero. Therefore, a twopart model consists of two components, with the first component (i.e., Part I) modelling the probability of a response being positive using the probit or logistic regression, and the second component (i.e., Part II) modelling the conditional mean of the positive values (given positive values occurred) using the continuous regression. According to the data structures, various methods have been developed for analysing crosssectional and longitudinal semicontinuous data [3,4,5,6,7]. Olsen and Schafter [6] first extended the twopart models developed by Duan et al. [3] and Manning et al. [8] for crosssectional data to the longitudinal setting by introducing correlated randomeffects into the logit and lognormal components, respectively, and applied them to longitudinal alcohol data. In the twopart mixedeffects models, the binomial process is typically modelled with mixedeffects logistic or probit regression, and the continuous process is naturally modelled via linear mixed models (LMMs). The randomeffects in the two components are generally assumed to be correlated through a multivariate normal distribution structure. Ignoring the betweencomponent association mistakenly can yield biased estimates in the second part of the model [9]. The correlated randomeffects can capture not only the betweencomponent association but also the withinsubject correlation among repeated measurements collected from the same individual and nested data. A betweencomponent correlation means that the process giving rise to the positive values is related to the magnitude of the observed value given that a positive response occurred. For example, in a data collection of selfreported daily drinks (DDD) where zero represents no daily drinks and the continuous positive values reflect the mean of drinks per day, a positive correlation suggests that an individual with high odds of drinking tended to drink more alcohol [10].
For the positive part of a semicontinuous variable, LMMs with a normality assumption were used by Husted et al. [11] and Su et al. [9]. However, the positive part of a semicontinuous variable is often rightskewed. The logarithmic transformation was the most commonly used approach to correct the skewness [6, 7] and other monotone increasing functions such as BoxCox transformation that would make the positive component approximately normal were also explored [12, 13]. The limitations with data transformation in Part II include reduced information, difficulty in interpreting the results and possible heteroscedasticity [13, 14]. An alternative approach is to use generalized linear mixed models (GLMMs) with distributions in the exponential family that can model skewed data, such as LogNormal, LogSkewNormal [15], Gamma [13], Inverse Gamma, Inverse Gaussian [16], Beta [17], Bridge [18], Generalized Gamma family, and Weibull distributions [19]. It is noted that GLMMs often involve complicated iterative procedures in estimation which may lead to intensive computation burden and nonconvergence issues. It would be most effective to use a flexible distribution to model the right skewed positive values in twopart models. Recently, studies have been presented using the betaprime (BP) distribution to fit longtail semicontinuous responses [10, 20, 21]. There is very limited research on the application of this skew distribution in a twopart mixedeffects model [22]. This study is an extension of Kamyari et al. (2021), where random effects are added to the linear predictor terms by using real twolevel data.
Parameter estimations in twopart modelling could be computationally difficult. For twopart models with independent randomeffects, maximum likelihood estimates (MLE) can be derived by fitting separate mixedeffects model to each part [1]. For the correlated twopart mixedeffects model with lognormal distribution on the positive values, Olsen and Shafter [6] and Tooze et al. [7] developed different maximum likelihood approaches. Several authors have proposed Bayesian approaches to fit the twopart models [17, 23,24,25,26,27]. For example, Cooper et al. [23] used a Bayesian approach via Markov Chain Monte Carlo (MCMC) to fit a probitlognormal correlated twopart model on medical cost data.
As a result, In this study, we propose a twopart mixedeffects model with a logistic mixed model on the occurrence of positive values and a GLMM with BP distribution on the continuous positive values using the Bayesian approach via MCMC procedure with application to a threelevel pharmaceutical expenditure data. The data used for this study was extracted from the Iranian pharmaceutical expenditure (PE2018) survey. The survey was a crosssectional study that had been conducted by the National Center for Health Insurance Research, Iran Health Insurance Organization. PE2018 is a dataset of yearly pharmaceutical expenditure per person conducted in 429 cities in Iran.
The rest of the article is organized as follows. At the beginning of Methods, we describe the BP regression model. In Model specification, we present the twopart mixed effects model for responses with BP distribution. In Numerical study, we apply the proposed methodologies to real data and report the analysis results. In Simulations, simulation studies are conducted to assess the performance of the proposed models. Finally, we conclude the article with a discussion in Discussion.
Methods
Betaprime regression model
The BP distribution [28, 29] is also known as inverted beta distribution or beta distribution of the second kind, often the model of choice for fitting semicontinuous data where the response variable is measured continuously on the positive real line (Y > 0) because of the flexibility it provides in terms of the variety of shapes it can accommodate. The probability density function (PDF) of a BP distributed random variable Y parameterized in terms of its mean μ and a precision parameter ψ is given by
where B denote the beta function, μ > 0, ψ > 0, E(Y) = μ, and Var(Y) = (μ(1 + μ))/ψ.
Figure 1 displays some plots of the density function in Eq. (1) for some parameter values. It is evident that the distribution is very flexible and it can be an interesting alternative to other distributions with positive support. Figure 1 shows that for a fixed value of the mean μ, higher values of ψ lead to a reduction of Var(Y), and vice versa. If Y has PDF as in Eq. (1), we denote Y~BP(μ, ψ). Next, to connect the covariate vector X_{k}, k = 1, …, m to the random sample Y_{1}, Y_{2}, …, Y_{m} of Y, we use a suitable link function g_{1} that maps the mean interval (0, +∞) onto the real line. This is given as \({g}_1\left({\mu}_k\right)={\boldsymbol{X}}_{\boldsymbol{k}}^{\prime}\boldsymbol{\beta}\), where β is the vector of regression parameters, and the first element of X_{k} is 1 to accommodate the intercept. The precision parameter ψ_{k} is either assumed constant [30, 31] or regressed onto the covariates [30, 32] via another link function g_{2}, such that \({g}_2\left({\psi}_k\right)={\boldsymbol{Z}}_{\boldsymbol{k}}^{\prime}\boldsymbol{\gamma}\), where Z_{k} is a covariate vector (not necessarily similar to X_{k}) and γ is the corresponding vector of regression parameters. Similar to X_{k}, Z_{k} also accommodates an intercept. The link functions g_{1} : ℝ^{y > 0} ⟶ ℝ and g_{2} : ℝ^{y > 0} ⟶ ℝ must be strictly monotone, positive and at least twice differentiable, such that \({\mu}_k={g}_1^{1}\left({\boldsymbol{X}}_{\boldsymbol{k}}^{\prime}\boldsymbol{\beta} \right)\) and \({\psi}_k={g}_2^{1}\left({\boldsymbol{Z}}_{\boldsymbol{k}}^{\prime}\boldsymbol{\gamma} \right)\), with \({g}_1^{1}(.)\) and \({g}_2^{1}(.)\) being the inverse functions of g_{1}(.) and g_{2}(.), respectively. We can estimate the parameters of the BP regression model defined in Eq. (1) using the gamlss function in the R (≥ 3.3.0) language [33] with a package of the same name [34].
Model specification
In this section, we present our model for the yearly pharmaceutical expenditure record in three levels. However, our model can be adapted easily to more complicated settings.
In the threelevel pharmaceutical expenditure record data, level 3 is the province; level 2 is the city level that nested within province; and level 1 is the subject level that is nested in cities. There are two types of correlations at different levels in the pharmaceutical expenditure data. The first type exists at the province level, where cost records of the same province are correlated. For each subject, there also exists another correlation within each city. Thus, a threelevel random effects twopart model is more appropriate for the analysis of the pharmaceutical expenditure data. We are interested in modelling a threelevel semicontinuous pharmaceutical expenditure data, characterized by a large portion of zeros and continuous positive values. We define notations as follows. Suppose that we observe cost record y_{ijk} for the kth subject of city j within province i, where i = 1, 2, ..., n, j = 1, 2, ..., n_{i}, and k = 1, 2, ..., m_{ij}. The total number of cities is \(J=\sum_{i=1}^n{n}_i\), and the total number of subjects is \(N=\sum_{i=1}^n\sum_{j=1}^{n_i}{m}_{ij}\). Let ω_{ijk} = I(y_{ijk} > 0) denote the indicator of y_{ijk} being nonzero. Define by X_{ijk} the covariate vectors for the fixed effect. Let p_{1i} and p_{2i} be the correlated random effects in the province level with joint density (p_{1i}, p_{2i}) for parts I and II of our proposed model, respectively. Similarly, define c_{1ij} and c_{2ij} to be the correlated random effects with joint density (c_{1ij}, c_{2ij}) in the city level. In this paper, we assume that
with Σ_{1} and Σ_{2} being positive definite matrices. We also assume that (p_{1i}, p_{2i}) and (c_{1ij}, c_{2ij}) are independent for all i’s and j’s. Denote by e_{ijk} the error term for the positive value of Y_{ijk}. We assume that \({e}_{ijk}\sim N\left(o,{\sigma}_e^2\right)\) is independent of random effects p_{1i}, p_{2i}, c_{1ij}, and c_{2ij}. Define τ_{ijk} = P(ω_{ijk} = 1 p_{1i}, c_{1ij}) to be the probability of nonzero value for Y_{ijk}.
To obtain interpretable covariate effects on the marginal mean, we propose the following marginalized twopart model that parameterizes the covariate effects directly in terms of the marginal mean, μ_{ijk} = E(Y_{ijk}), on the original (i.e., untransformed) data scale. The marginalized twopart model with random (cluster) effects for the zero and the continuous components, respectively, specifies the linear predictors.
Part I:
Part II:
where X_{1N × (r + 1)} and X_{2N × (q + 1)} have full rank r and q for the zero and the continuous components, respectively; α_{(r + 1) × 1} and β_{(q + 1) × 1} are the corresponding vectors of regression coefficients. As seen in relations (3) and (4), the mixing probability and mean of the component of the continuous parts are linked to the independent variables through logit and logarithmic link functions. The vectors p_{1} = (a_{11}, a_{12}, …, a_{1m})^{′} and p_{2} = (a_{21}, a_{22}, …, a_{2m})^{′} denote random effects of the third level in the components of logistic and continuous, respectively, whereas \({\boldsymbol{c}}_1={\left({b}_{111},\dots, {b}_{11{n}_1},\dots, {b}_{1m1},\dots, {b}_{1m{n}_m}\right)}^{\prime }\) and \({\boldsymbol{c}}_2={\left({b}_{211},\dots, {b}_{21{n}_1},\dots, {b}_{2m1},\dots, {b}_{2m{n}_m}\right)}^{\prime }\) are the random effects of the second level. For simplicity of interpretation and mathematical calculations, the random effects (p_{1}, p_{2}) and (c_{1}, c_{2}) are assumed to be independent and normally distributed with mean zero and variances \({\sigma}_{p_1}^2,{\sigma}_{p_2}^2,{\sigma}_{c_1}^2\) and \({\sigma}_{c_2}^2\), respectively [35, 36]. The error terms \({e}_{ijk}\sim N\left(0,{\sigma}_e^2\right)\) are also assumed to be normal distribution and independent of the random effects at both levels 2 and 3.
Again and according to the data structure in this study (threelevel data), \({\boldsymbol{X}}_{1 ijk}^{\prime }\) is the vector of covariates for the kth measurement at the jth city (level2) at the ith province (level3) for the binary part and \({\boldsymbol{X}}_{2 ijk}^{\prime }\) is the vector of covariates for the kth measurement at the jth city (level2) at the ith province (level3) for the continuous part. The two parts might have common covariates or completely different ones. α is the vector of model coefficients corresponding to the binary part and β is the vector of coefficients corresponding to the continuous part conditional on the values being nonzero. The model can be easily extended to include higherlevel random effects.
The conditional PDF for y_{ijk} is expressed as:
Generally, the estimation of parameters α, β, ψ, Σ_{1} and Σ_{2} is based on the likelihood function of data given as:
where the loglikelihood for the binary part is
and the loglikelihood for the continuous part is
In this likelihood function (Eq. 5), ϕ(p_{1i}, p_{2i}) and ϕ(c_{1ij}, c_{2ij}) represents the bivariate normal distribution for the random effects with mean vector of zeros and variance–covariance matrix Σ_{1} and Σ_{2} for zero and nonzero part respectively. As can be seen from Eq. (5), the likelihood function involves the integral with respect to the multivariate normal PDF. Parameter estimation in the proposed models can be computationally difficult as the likelihood function depends on analytically intractable integrals of a nonlinear function with respect to the multivariate normal distribution of randomeffects.
Bayesian inferential framework
The parameters in part I and II were individually estimated within a Bayesian inferential framework with MCMC sampling of the posteriors.
Let Θ = (α, β, ψ, Σ_{1}, Σ_{2}) be the collection of unknown population parameters in models (2), (3) and (4). To complete the Bayesian formulation, we specify mutually independent prior distributions for all the unknown parameters as follows:
where by considering the information available from literature [1, 16, 37] and range of the parameters Normal (N), Inverse Gamma (IG), and Inverse Wishart (IW) distributions are chosen to simplify computations.
Let the observed data \(\mathfrak{D}=\left\{\left({\omega}_{ij},{y}_{ij},{x}_{1 ij},{x}_{2 ij}\right);i=1,\dots, n;j=1,\dots, {n}_i\right\}\), f(.) be a density function, f(. .) be a conditional density function and h(.) be a prior density function. We assume that the parameters in Θ are independent of each other; that is:
After specifying the models for the observed data and prior distributions of the unknown model parameters, we can draw samples for the parameters based on their posterior distributions under the Bayesian framework. Therefore, the joint posterior density of Θ, conditional on \(\mathfrak{D}\), can be determined by
The integral in (9) has a high dimension and does not have a closed solution. Analytic approximations to the integrals may not be accurate enough. So, the direct calculation of the posterior distribution of Θ based on the observed data \(\mathfrak{D}\) is prohibitive [1]. As an alternative, posterior computation of Θ can proceed using a MCMC procedure via Gibbs sampling or MetropolisHastings (MH) algorithm. While the Gibbs sampler relies on conditional distributions [23, 38,39,40] the MetropolisHastings sampler uses a full joint density distribution to generate a candidate draws [38, 41]. Certainly, there is a large body of work on other computational approaches to sampling (slice sampling, adaptive rejection sampling, Hamiltonian Monte Carlo, etc.); covering such methods is beyond the scope of this study. In an initial review of the software, we concluded that it would be faster to use the OpenBUGS software. Moreover, due to the high volume of data (calculations) and the time limitation, we could not check the performance of other software. However, OpenBUGS was chosen because of its generality and simplicity. The associated OpenBUGS code is available in Additional file 1: Appendix A.1.
Model complexity and fit
There are a variety of methods to select the model that best fits the data. However, in this research article, we focus on the log pseudo marginal likelihood (LPML) and a modified observed deviance information criterion, denoted here by DIC_{3}. In addition, we use of two emerging model selection methods, namely leaveoneout crossvalidation (LOOCV) and widely available information criterion (WAIC), due to their fully Bayesian nature.
The Bayesian Deviance Information Criterion (DIC_{3}) [42] is used to compare the models fitted. It is defined by
where \(\overline{D}\left(\theta \right)=2E\left\{\log \left[p\left(y\uptheta \right)\right]y\right\}\) is the posterior mean deviance taken as Bayesian measure of fit, \(p\left(y\uptheta \right)=\prod_{i=1}^np\left({y}_i\uptheta \right)\), E{log[p(y θ)] y} is the posterior expectation of log[p(y θ)] and p_{D} is the effective number of parameters representing model complexity. The DIC_{3} is a natural generalization of the Akaike Information Criterion (AIC) [43] and interpreted as a Bayesian measure of fit penalized for increased model complexity. The DIC_{3} was developed to solve the problem of determining the ‘effective’ number of parameters (p_{D}) in complex nonnested hierarchical models and its computation has been coded into the latest version of WinBUGS (1.4). As for the usual DIC, minimum DIC_{3} estimates the model that will make the best shortterm predictions [42]. Note that the DIC_{3} is only comparable across models with exactly the same observed data.
The LPML [44] is another measure for comparing models that derived from the conditional predictive ordinate (CPO) statistics and is one of the most widely used model selection criteria available in WinBUGS. It is derived from the posterior predictive distribution. For our proposed models, a closed form of the CPO_{i} is not available. However, a Monte Calro estimate of CPO_{i} can be obtained by using a single MCMC sample from the posterior distribution of θ, through a harmonicmean approximation proposed by [45], as \(\hat{CPO_i}={\left\{\frac{1}{K}\sum_{i=1}^Kg{\left({y}_i{\uptheta}^{(i)}\right)}^{1}\right\}}^{1}\), where θ^{(1)}, …, θ^{(K)} is a post burnin sample of size K from the posterior distribution from θ, and g is the marginal distribution of Y (integrated over the random effects). A summary statistic of the CPO_{i} is the LPML, defined by \(LPML=\sum_{i=1}^n\log \left(\hat{CPO_i}\right)\). Larger values of LPML indicate better fit.
The WatanabeAkaike (or widely applicable) information criterion (WAIC) [46, 47] is closely related to the more widely known DIC measure, which is based on a deviance. The WAIC is a more fully Bayesian approach for estimating outofsample expectation. In general, the WAIC is defined as:
The deviance term in DIC is \({\log}\left(p\left(y\overset{\sim }{\theta}\right)\right)\) where \(\overset{\sim }{\theta }\) is a point estimate of θ. For WAIC, this term is replaced by the log pointwise predictive density (LPPD), defined as:
Just like DIC, there are variants of WAIC which depend on how p_{WAIC} is defined. Gelman, Hwang, and Vehtari also propose \({p}_{WAIC2}=\sum_{i=1}^n{Var}_{post}\left[\log p\left({y}_i\theta \right)\right]\) as a penalty term, where p_{WAIC2} is “the variance of individual terms in the log predictive density summed over the n data points” [48]. Although DIC is a commonly used measure to compare Bayesian models, WAIC has several advantages over DIC, including that it closely approximates Bayesian crossvalidation, it uses the entire posterior distribution and it is invariant to parameterisation [49].
Exact crossvalidation requires refitting the model with different training sets. Approximate leaveoneout crossvalidation (LOOCV) can be computed easily using importance sampling [50]. The Bayesian LOO estimate of outofsample predictive fit is
where p_{post}(θ y_{−i}) is the posterior distribution based on the data without the ith data point. Unlike LPPD that uses data point i for both the computation of posterior distribution and the prediction, here LPPD_{LOO} only uses it for prediction, and hence there is no need for a penalty term to correct the potential bias introduced by using data twice [51].
The question regarding the real data is whether the data had better support a true model. To that end, we fit each model using OpenBUGS and compute DIC and LPML for each. We also export the joint posterior distributions from OpenBUGS into R and compute WAIC and LOO with “loo” package [52].
Numerical study
Specific models and implementation
In this section, we apply the zeroaugmented gamma with random effects and zeroaugmented betaprime with random effects to analyze the multilevel pharmaceutical expenditure dataset previously described, where response (y_{ijk}) is the total pharmaceutical expenditure ($USD) for all drugs prescribed during a 1 year period related to the subject k (k = 1, …, 29,354) that nested within city j (j = 1, …, 429) that are nested within province i (i = 1, …, 31). From now on, the zeroaugmented gamma regression model and zeroaugmented betaprime regression model with multilevel random effects, will be called ZAGRE model and ZABPRE model, respectively.
Figure 2(ad), shows the quintiles of adjusted pharmaceutical expenditure and counts of drugs by provinces and cities in Iran in 2018. Variation in pharmaceutical expenditure and counts of drugs among clusters (provinces and cities) is well shown. The PE2018 dataset contains 16.1% of observations with no cost for drugs during the 2018 year. In addition, there is accentuated asymmetry in the empirical distribution of the positive responses, which is confirmed by the sample skewness and sample quartiles (Table 1). These results proposed a skew distribution as a candidate for fitting the pharmaceutical expenditure.
Here we model τ_{ijk} and μ_{ijk} as follows:
where \({\boldsymbol{X}}_{1 ijk}^{\prime }\) and \({\boldsymbol{X}}_{2 ijk}^{\prime }\) are the matrices of population effects related to subject k, containing an intercept and the following six covariates: Total_{k} (total inpatients expenditure ($) per year), ICPD_{k} (insurance coverage for prescription drugs per year), NDPP_{k} (number of drugs per prescription), NOP_{k} (number of prescriptions), Age_{k} (in year), and Sex_{k} (1 = male, 0 = female); α and β are coefficient vectors for the mean and zeropart regression, respectively; p_{1i} and p_{2i} be the correlated random effects in the province level with joint density (p_{1i}, p_{2i}) for parts I and II of our model, respectively. Similarly, define c_{1ij} and c_{2ij} to be the correlated random effects with joint density (c_{1ij}, c_{2ij}) in the city level.
In the absence of historical data/experiment, our prior choices follow the specifications described in Model specification. Thus, we consider the following independent (weak) priors for the MCMC sampling:
We generate two parallel independent MCMC runs of size 200,000 – each of them with widely dispersed initial values – and discard the first 100,000 iterations (burnin samples) for later computing of posterior estimates. We consider a lag of size 100 to eliminate potential problems due to autocorrelation and monitor the convergence of the MCMC chains using trace plots and the R statistic [53], which indicates convergence of about 1. To improve convergence, we divide the response (Y_{ijk}) by 100.
Estimation and model comparison
We consider significant those effects whose 95% equaltail credible intervals (CI) do not include zero (Table 2 and Fig. 3). Except NDPP and sex in continuous part of ZAGRE, 95% equaltail CIs show that, all other variables were significant in two parts of models (Fig. 3). Therefore, the final ZAGRE and ZABPRE models have, respectively, the following systematic setting:
The posterior estimates of parameters of model (Eq. 10) shown in Table 2, are quite close in both ZAGRE and ZABPRE models while estimates of variance components differ between them. However, 95% equaltail CI for \({\sigma}_{p_1{p}_2}\) and \({\sigma}_{c_1{c}_2}\) includes zero in both models, indicating no correlation between variances in level 2 and level 3. Posterior standard deviations of variance components are a bit larger under the ZAGRE model. Also, it is important note that the meaning of parameter ψ differs between the ZAGRE and ZABPRE models. In ZAGRE model, ψ represents the dispersion parameter, while in ZABPRE model, it represents the invariance of Y_{ijk}, conditioned on the random effects.
We use DIC_{3}, LPML, WAIC, and LOO as the Bayesian model selection criteria and measures of divergence discussed previously to compare the ZAGRE and ZABPRE to fit the PE2018. Except in computational time, The ZABPRE model performs better according to all other criteria, because it has the smaller DIC_{3}, WAIC, and LOOCV and greater LPML (Table 2). Based on those results, we select the ZABPRE as our best model.
We also conducted a sensitivity analysis on the prior assumptions for the dispersion parameter (ψ) and the fixed effects precision parameter. In particular, we allowed that the dispersion ψ ∼ Gamma (k, k) with k ∈ {0.001, 0.1} and the normal precision on the fixed effects to be 0.1, 0.25 and 0.001. We checked the sensitivity in the posterior estimates of β by changing one parameter at a time and refitting both models. Although slight changes were observed in parameter estimates and model comparison values, the results appeared to be robust and did not change our conclusions regarding the best model, inference, and sign of the fixedeffects.
From these findings, we further report the results in detail only for the best ZABPRE model in the following Section.
Results for the ZABPRE model
We use the ZABPRE model to interpret parameters effects on the mean of positive expenditure (μ_{ijk}) and the probability of noncost (τ_{ijk}) by considering individual effects as zero. To measure effects directly on μ_{ijk} and τ_{ijk}, we take the antilogarithm of log(μ_{ijk}) and logit(τ_{ijk}) in Eq. (10), obtaining
We use the posterior means in Table 2 as estimates of the parameters. From Eq. (11), parameter β_{i} represents the rate of change in the logarithm of the mean of the positive expenditure as each of total, ICPD, NDPP, NOP, and age increases one unit. Therefore, increasing NOP variable of subject k in the original scale by one, the log (μ_{ijk}) increases by 0.22, where exp(0.22) = 1.25 is the increasing value of response variable in its original scale. Parameters α_{0}, α_{1}, α_{2} and α_{3} contribute to the calculation of τ_{ijk} in Eq. (11). Here, α_{0} represents the effect of being a female respondent with age and NDPP set to their respective mean. Setting NDPP and age variables to zero – which implies they are set to their respective mean in the original scale – the probability of no consumption is 1 − τ_{ijk} = 1 − exp (1.645)/(1 + exp(1.645) ) = 0.16 if subject k is female and 1 − τ_{ijk} = 1 − exp (1.645 − 0.183)/(1 + exp(1.645 − 0.183) ) = 0.19 if subject k is male. Overall, females tend to declare a larger expenditure since the estimate of α_{3} is negative. Parameters α_{1} (0.117) and α_{2} (−0.010) represent the effect of NDPP and age variables in logit (τ_{ijk}). In particular, as NDPP variable increase by one unit, with every additional pharmaceutical item, the odds of having a positive expenditure increase by 12.41%. In addition, with each year growing in age, the Odds of having a positive cost decreases by 1%.
In order to evaluate the predictive performance of our best model, we generate 2000 replicates of 𝐘, say, Y^{∗} = (Y^{(1)}, …, Y^{(2000)} )^{T}. The ijkth element of the lth replicate \({Y}_{ijk}^{(l)}\) is generated through the ZABPRE(\({\mu}_{ijk}^{(l)},{\phi}^{(l)},{\tau}_{ijk}^{(l)}\)) model, where \({\mu}_{ij k}^{(l)}={{\log}}^{1}\left(\sum_{\theta =1}^q{\boldsymbol{\beta}}_{\theta}^{(l)}{x}_{\theta_i}+{p}_i^{(l)}+{c}_{ij}^{(l)}\right)\) and \({\tau}_{ij k}^{(l)}={logit}^{1}\left(\sum_{\gamma =1}^r{\boldsymbol{\alpha}}_{\gamma}^{(l)}{x}_{\gamma_i}+{p}_i^{(l)}+{c}_{ij}^{(l)}\right)\). The values of \({\boldsymbol{\alpha}}^{(l)}=\left({\alpha}_0^{(l)},\dots, {\alpha}_3^{(l)}\right)\), \({\boldsymbol{\beta}}^{(l)}=\left({\beta}_0^{(l)},\dots, {\beta}_6^{(l)}\right)\) and ϕ^{(l)} are post burnin samples of size 2000 from the posterior distribution of all parameters. Figure 4 (above panel) presents the histogram of PE2018 placed with the plot of the ZABPRE and ZAGRE predictive posterior density. In this figure, it is also quite clear that with a slightly more computational time, the ZABPRE model provides an adequate fit to the PE2018 data.
Finally, to evaluate the adequacy of the loglink function used to model the conditional nonzero mean μ, we follow the suggestion given in [54] as depicted in Fig. 4 (below panel). We divide the values of the linear predictor μ_{ijk} into 10 intervals, with each interval containing a similar number of observations. Then, for each group, we build a boxplot of the posterior predictive mean (black boxplot) and a boxplot of the nonzero observed values (gray boxplot). In Fig. 4, we observe no evidence of link misspecification for the nonzero mean μ_{ijk}, because the shapes of the fitted and observed trends are similar.
Simulations
In this section, we propose a simulation study to illustrate the performance of the proposed method. Our goals for the simulation study were: 1) to investigate the behaviour of Bayesian estimates based on the empirical mean squared error (MSE), relative bias and percentage of times that the 95% credible intervals (CI) contains the true parameter value and 2) to investigate if the DIC_{3} and LPML Bayesian criteria properly select the best model. We conduct the simulation study considering 100 datasets generated from ZAGRE and ZABPRE models, considering different sample sizes n, say n = 50, 100, 150 and 200. For each dataset of size n, we model the location parameter μ_{ijk} through log(μ_{ijk}) = β_{0} + β_{1}x_{ijk} + p_{i} + c_{ij}, with \({p}_i\sim N\left(0,{\sigma}_p^2\right)\) and \({c}_{ij}\sim N\left(0,{\sigma}_c^2\right)\), i = 1, …, n, j = 1, …, n_{i}, k = 1, …, m_{ij}. To keep the simulation simple and fast, τ is considered constant across observations. We generate independent explanatory variables X_{ijk} from a Bernoulli distribution with a parameter equal to 0.8 and set β_{0} = 2, β_{1} = 1.5, τ = 0.2, \({\sigma}_p^2=1.8\), \({\sigma}_c^2=2.3\), ψ = 1.0 for the ZAGRE model and β_{0} = 2, β_{1} = 1.5, τ = 0.2, \({\sigma}_p^2=1.8\), \({\sigma}_c^2=2.3\), ψ = 0.1 for the ZABPRE model. We consider the following independent noninformative priors β_{k}~N(0,100), ψ~Gamma(0.01, 0.01), \({\sigma}_p^2\sim IGamma\left(0.01,0.01\right)\), \({\sigma}_c^2\sim IGamma\left(0.01,0.01\right)\) and τ~U(0, 1).
For each dataset of size n, we calculate Bayesian estimates with 500 points from the posterior distribution. These points are based on two parallel independent MCMC runs of size 100,000 each, discarding the first 50,000 points to eliminate the effect of the initial values. To avoid correlation among observations, we consider a thinning of size 100, obtaining 500 points from the posterior distribution.
To study the frequentist properties of Bayesian estimates, we calculate the relative bias, the MSE and the 95% coverage probability (CP). Let 𝜽 = { α, β, ψ, 𝜎^{2}} be the true vector of parameters and 𝜃_{s} an element of 𝜽. Let \({\hat{\theta}}_s\) be the posterior mean of 500 points from the posterior distribution of 𝜃_{s} based on dataset i of size n, i = 1, …, 100, n = 50, 100, 150, 200. The relative bias, the MSE and the 95% CP for \({\hat{\theta}}_s\) are defined as follows:
where I is the indicator function such that θ_{s} lies in the interval \(\left[{\hat{\theta}}_{is, LCL},{\hat{\theta}}_{is, UCL}\right]\), with \({\hat{\theta}}_{is, LCL}\) and \({\hat{\theta}}_{is, UCL}\) as the estimated lower and upper 95% CIs, respectively. Fig. 5A and B present a visual comparison of the parameters β_{0} and β_{1} under ZAGRE and ZABPRE generated data for varying sample sizes, where the dotted and black lines represent the ZAGRE and the ZABPRE fitted model, respectively.
As expected, these figures reveal that if we use a true (ZAGRE and ZABPRE) model to fit zero augmentedskew data, relative bias and MSE, for parameters β_{0} and β_{1}, tend to decrease as sample size increases indicating that the Bayesian estimates possess good consistency properties. In addition, both CP of β_{0} and β_{1} tend to be around 95% as the sample size increases when the true model is considered. However, when we do not fit the data by their respective true model (model misspecification), the relative bias and the MSE tend to be smaller in ZABPRE model. Moreover, as expected in both cases we can see that the performance of the CP gets worse when a misspecified model is considered. For the sake of completeness, the MSE, relative bias and CP for all the parameters (β_{0}, β_{1}, τ, σ^{2}) are presented in Table B1 (Additional file 1: Appendix B). It can be seen from this table that the Bayesian estimates of the mixture proportion τ are highly robust to model misspecification, and this behaviour is independent of the sample size. The dispersion parameters (σ^{2} and ψ) in both models are not comparable, because they are in different scale. In all results, relative bias of \({\sigma}_c^2\) are greater than \({\sigma}_p^2\) and it shows that the dispersion of responses is more at the level2 than the level3. In the comparison of the two models, the differences in MSE values of both \({\sigma}_p^2\) and \({\sigma}_c^2\) are absolutely more in misspecified model. The results of coverage probability of sigmas show that the coverage level in both models are acceptable, however, in true models, the values of the CP are slightly larger than the misspecified model.
Summary of model performance in simulations are presented in Table 3. Table 3 presents the averages of the Bayesian model comparison criteria. We calculate the LPML, DIC_{3}, WAIC, LOO, and convergence rate using 100 samples of size n = 100 each. All criteria favoured the true (simulated) model.
Discussion
In this article, we proposed a Bayesian mixture model with random effects for modelling semicontinuous data augmented by zeros. We suggest the Gamma and BP distributions in continuous part of the models. A simulation study and real data analysis are conducted to compare the ZAGRE and ZABPRE on the multilevel semicontinuous data and results demonstrated that the ZABPRE performs better on the zeroaugmented multilevel semicontinuous data.
Our flexible class contains the zeroaugmented versions of the two parametric exponential family of distributions, such as Gamma, betaprime, inverse Gaussian, Weibull, lognormal, and Tweedie. Our model is able to simultaneously accommodate zeros and positive outcomes, rightskewness, within subject correlation because of nested measurements and betweensubject heterogeneity. One of the differentials of this study was the inclusion of random effects in the analysis of factors related to semicontinuous data using the betaprime distribution that were not considered in before studies and statistical packages [10, 20,21,22], and this is our major contribution.
One of the advantages of the Bayesian approach compared to the classical approach is the estimates in the part I. Where, the maximum likelihood estimator of a probability of nonzero value, when zero response is observed, does not perform well on the boundary of the parameter space [37]. For a simple BP model, the Maximum Likelihood estimation is available using GAMLSS. However, the MLE results in our data did not reach convergence for some parameters by adding random effects. In this research, using Bayesian statistics with Gibbs and MetropolisHasting sampling, this problem is avoided. In the future, it would be interesting to continue the study of various different MCMC methods and hopefully apply them to health cost data.
Simulation studies reveal good consistency properties of the Bayesian estimates as well as high performance of the model selection techniques to pick the appropriately fitted model. We also apply our model to a dataset from yearly pharmaceutical expenditure data conducted in 429 cities in Iran (PE2018) to illustrate how the procedures can be used to evaluate model assumptions and obtain unbiased parameter estimates. Although our modelling is primarily motivated from the PE2018, it can be easily applied to other datasets and distributions, because the models considered in this article have been fitted using standard available software packages, like R and OpenBUGS (code available in Additional file 1: Appendix A). This makes our approach easily accessible to practitioners of many fields of research.
Although the zeroaugmented positive model considered here has shown great flexibility to deal with zeroaugmented clustered data, its robustness can be seriously affected by the presence of heavy tails in the random effects, obscuring important features among individual variation. Liu et al. [13] and Bandyopadhyay et al. [55] proposed a remedy to accommodate skewness in the random effect simultaneously, using skewnormal/independent distributions. We suppose that our method can be used under the zeroaugmented positive model and should yield satisfactory results at the expense of additional complexity in implementation. Another useful extension of the proposed model involves the possibility of heteroscedasticity of ψ by allowing the dependence of g(ψ) on covariates, with g(.) being an appropriate link function, as proposed in [56]. An indepth investigation of such extensions is beyond the scope of the present research article, but it is an interesting topic for further research.
Availability of data and materials
The data that support the findings of this study are available from the corresponding author reasonable request.
Abbreviations
 PE:

Pharmaceutical expenditure
 BP:

Betaprime
 LMMs:

Linear mixed models
 DDD:

Daily drinks data
 GLMMs:

Generalized linear mixed models
 MCMC:

Markov Chain Monte Carlo
 MH:

MetropolisHastings
 LPML:

Log pseudo marginal likelihood
 DIC:

Deviance Information Criterion
 CPO:

Conditional predictive ordinate
 WAIC:

Watanabeakaike information criterion
 LOOCV:

Leaveoneout crossvalidation
 CR:

Convergence rate
 ZAGRE:

Zeroaugmented gamma regression model with random effects
 ZABPRE:

Zeroaugmented betaprime regression model with random effects
 NCHIR:

National Center for Health Insurance Research
References
Xing D, Huang Y, Chen H, Zhu Y, Dagne GA, Baldwin J. Bayesian inference for twopart mixedeffects model using skew distributions, with application to longitudinal semicontinuous alcohol data. Stat Methods Med Res. 2017;26(4):1838–53.
Cragg JG. Some statistical models for limited dependent variables with application to the demand for durable goods. Econom J Econom Soc. 1971;39(5):829–44.
Duan N, Manning WG, Morris CN, Newhouse JP. A comparison of alternative models for the demand for medical care. J Bus Econ Stat. 1983;1(2):115–26.
Hall DB, Zhang Z. Marginal models for zero inflated clustered data. Stat Model. 2004;4(3):161–80.
Moulton LH, Curriero FC, Barroso PF. Mixture models for quantitative HIV RNA data. Stat Methods Med Res. 2002;11(4):317–25.
Olsen MK, Schafer JL. A twopart randomeffects model for semicontinuous longitudinal data. J Am Stat Assoc. 2001;96(454):730–45.
Tooze JA, Grunwald GK, Jones RH. Analysis of repeated measures data with clumping at zero. Stat Methods Med Res. 2002;11(4):341–55.
Manning WG, Morris CN, Newhouse JP, Orr LL, Duan N, Keeler EB, et al. A twopart model of the demand for medical care: preliminary results from the health insurance study. Heal Econ Heal Econ. 1981;137:103–23.
Su L, Tom BDM, Farewell VT. Bias in 2part mixed models for longitudinal semicontinuous data. Biostatistics. 2009;10(2):374–89.
SantosNeto M, RibeiroBezerra T, Bourguignon M, de Castro M. Package “BPmodel” Title BetaPrime Regression Model. R package version 1.1.2; 2021.
Husted JA, Tom BD, Farewell VT, Schentag CT, Gladman DD. A longitudinal study of the effect of disease activity and clinical damage on physical function over the course of psoriatic arthritis: Does the effect change over time? Arthritis Rheum. 2007;56(3):840–9.
Kipnis V, Midthune D, Buckman DW, Dodd KW, Guenther PM, KrebsSmith SM, et al. Modeling data with excess zeros and measurement error: application to evaluating relationships between episodically consumed foods and health outcomes. Biometrics. 2009;65(4):1003–10.
Liu L, Strawderman RL, Cowen ME, Shih YCT. A flexible twopart random effects model for correlated medical costs. J Health Econ. 2010;29(1):110–23.
Duan N. Smearing estimate: a nonparametric retransformation method. J Am Stat Assoc. 1983;78(383):605–10.
Smith VA, Preisser JS, Neelon B, Maciejewski ML. A marginalized twopart model for semicontinuous data. Stat Med. 2014;33(28):4891–903.
RodriguesMotta M, Galvis Soto DM, Lachos VH, Vilca F, Baltar VT, Junior EV, et al. A mixedeffect model for positive responses augmented by zeros. Stat Med. 2015;34(10):1761–78.
Hatfield LA, Boye ME, Carlin BP. Joint modeling of multiple longitudinal patientreported outcomes and survival. J Biopharm Stat. 2011;21(5):971–91.
Su L, Tom BDM, Farewell VT. A likelihoodbased twopart marginal model for longitudinal semicontinuous data. Stat Methods Med Res. 2015;24(2):194–205.
Jaffa MA, Gebregziabher M, Garrett SM, Luttrell DK, Lipson KE, Luttrell LM, et al. Analysis of longitudinal semicontinuous data using marginalized twopart model. J Transl Med. 2018;16(1):1–15.
Tulupyev A, Suvorova A, Sousa J, Zelterman D. Beta prime regression with application to risky behavior frequency screening. Stat Med. 2013;32(23):4044–56.
Bourguignon M, SantosNeto M, de Castro M. A new regression model for positive random variables with skewed and long tail. Metron. 2021;79(1):33–55.
Kamyari N, Soltanian AR, Mahjub H, Moghimbeigi A. Diet, nutrition, obesity, and their implications for COVID19 mortality: Development of a marginalized twopart model for semicontinuous data. JMIR Public Heal Surveill. 2021;7(1):e22717.
Cooper NJ, Lambert PC, Abrams KR, Sutton AJ. Predicting costs over time using Bayesian Markov chain Monte Carlo methods: an application to early inflammatory polyarthritis. Health Econ. 2007;16(1):37–56.
Ghosh P, Albert PS. A Bayesian analysis for longitudinal semicontinuous data with an application to an acupuncture clinical trial. Comput Stat Data Anal. 2009;53(3):699–706.
Neelon B, O’Malley AJ, Normand ST. A Bayesian twopart latent class model for longitudinal medical expenditure data: assessing the impact of mental health and substance abuse parity. Biometrics. 2011;67(1):280–9.
Neelon BH, O’Malley AJ, Normand SLT. A Bayesian model for repeated measures zeroinflated count data with application to outpatient psychiatric service use. Stat Model. 2010;10(4):421–39.
Zhang M, Strawderman RL, Cowen ME, Wells MT. Bayesian inference for a twopart hierarchical model: An application to profiling providers in managed health care. J Am Stat Assoc. 2006;101(475):934–45.
Keeping ES. Introduction to statistical inference. Princeton: D. Van Nostrand Company, Inc.; 1962.
McDonald JB. Some generalized functions for the size distribution of income. In: Modeling income distributions and Lorenz curves: Springer; 2008. p. 37–55.
Bourguignon M, SantosNeto M, de Castro M. A new regression model for positive data. arXiv Prepr arXiv180407734; 2018.
Ferrari S, CribariNeto F. Beta regression for modelling rates and proportions. J Appl Stat. 2004;31(7):799–815.
Smithson M, Verkuilen J. A better lemon squeezer? Maximumlikelihood regression with betadistributed dependent variables. Psychol Methods. 2006;11(1):54.
Core Team R. R: a language and environmental for statistical computing. Vienna: R Foundation for Statistical Computing; 2017.
Stasinopoulos DM, Rigby RA. Generalized additive models for location scale and shape (GAMLSS) in R. J Stat Softw. 2008;23:1–46.
Li X, Hedeker D. A threelevel mixedeffects location scale model with an application to ecological momentary assessment data. Stat Med. 2012;31(26):3192–210.
Liu L, Ma JZ, Johnson BA. A multilevel twopart random effects model, with application to an alcoholdependence study. Stat Med. 2008;27(18):3528–39.
RodriguesMotta M, Forkman J. Bayesian Analysis of Nonnegative Data Using DependencyExtended TwoPart Models. J Agric Biol Environ Stat. 2022;27(2):201–21.
Davidian M, Giltinan DM. Nonlinear models for repeated measurement data: Routledge; 2017.
Huang Y, Wu H. A Bayesian approach for estimating antiviral efficacy in HIV dynamic models. J Appl Stat. 2006;33(2):155–74.
Sahu SK, Dey DK, Branco MD. A new class of multivariate skew distributions with applications to Bayesian regression models. Can J Stat. 2003;31(2):129–50.
Ntzoufras I. Bayesian Modeling Using Winbugs. Canada: Wiley; 2009.
Spiegelhalter DJ, Best NG, Carlin BP, Van Der Linde A. Bayesian measures of model complexity and fit. J R Stat Soc Ser B Stat Methodol. 2002;64(4):583–639.
Akaike H. Information Theory and an Extension of the Maximum Likelihood Principle. In: Selected Papers of Hirotugu Akaike.; 1998. p. 199–213.
Carlin BP, Louis TA. Bayesian methods for data analysis: CRC Press; 2008.
Dey DK, Chen MH, Chang H. Bayesian Approach for Nonlinear Random Effects Models. Biometrics. 1997;53(4):1239 Available from: http://www.jstor.org/stable/2533493.
Watanabe S, Opper M. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. J Mach Learn Res. 2010;11(12):1–28.
Watanabe S. A widely applicable Bayesian information criterion. J Mach Learn Res. 2013;14(27):867–97.
Gelman A, Hwang J, Vehtari A. Understanding predictive information criteria for Bayesian models. Stat Comput. 2014;24(6):997–1016.
Vehtari A, Gelman A, Gabry J. Practical Bayesian model evaluation using leaveoneout crossvalidation and WAIC. Stat Comput. 2017;27(5):1413–32.
Gelfand AE. Model determination using samplingbased methods. Markov Chain Monte Carlo Pract. 1996;4:145–61.
Yong L. LOO and WAIC as model selection methods for polytomous items. arXiv Prepr arXiv180609996; 2018.
Gabry MJ. Package ‘loo’; 2022.
Gelman A, Rubin DB. Inference from iterative simulation using multiple sequences. Stat Sci. 1992;7(4):457–72.
Hatfield LA, Boye ME, Hackshaw MD, Carlin BP. Multilevel Bayesian models for survival times and longitudinal patientreported outcomes with many zeros. J Am Stat Assoc. 2012;107(499):875–85.
Bandyopadhyay D, Lachos VH, AbantoValle CA, Ghosh P. Linear mixed models for skewnormal/independent bivariate responses with an application to periodontal disease. Stat Med. 2010;29(25):2643–55.
FigueroaZúñiga JI, ArellanoValle RB, Ferrari SLP. Mixed beta regression: A Bayesian perspective. Comput Stat Data Anal. 2013;61:137–47.
Acknowledgments
We are thankful to the National Center for Health Insurance Research (NCHIR), which has provided the data for this article. Authors would like to thank the reviewers for the very helpful comments, which lead to considerable improvement of the paper.
Funding
The Abadan University of Medical Sciences, under Grant numbers 1453, supported this work.
Author information
Authors and Affiliations
Contributions
NK, AS, HM, AM and MS: contributed to the conception, design, and data collection; NK and MS: contributed to the sampling, data gathering, and data assessments. NK, AS, AM, HM and MS: contributed to the statistical analysis and drafting of the manuscript; and AS: supervised the study. All authors read and approved the final version of the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
The Research Ethics Committee of Abadan University of Medical Sciences approved this study with a specific code IR.ABADANUMS.REC.1401.048. Written informed consent was obtained from all subjects prior to participation in this project.
Consent for publication
Not applicable.
Competing interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1:
Appendix A.1. OpenBUGS code. Appendix B. Results of the simulations. Table B1. Relative bias, MSE and CP for parameter estimates with different sample sizes for the ZAGRE and ZABPRE models.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Kamyari, N., Soltanian, A.R., Mahjub, H. et al. Zeroaugmented betaprime model for multilevel semicontinuous data: a Bayesian inference. BMC Med Res Methodol 22, 283 (2022). https://doi.org/10.1186/s12874022017360
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12874022017360