Impact of communities, health, and emotional-related factors on smoking use: comparison of joint modeling of mean and dispersion and Bayes’ hierarchical models on add health survey

Background The analysis of correlated binary data is commonly addressed through the use of conditional models with random effects included in the systematic component as opposed to generalized estimating equations (GEE) models that addressed the random component. Since the joint distribution of the observations is usually unknown, the conditional distribution is a natural approach. Our objective was to compare the fit of different binary models for correlated data in Tabaco use. We advocate that the joint modeling of the mean and dispersion may be at times just as adequate. We assessed the ability of these models to account for the intraclass correlation. In so doing, we concentrated on fitting logistic regression models to address smoking behaviors. Methods Frequentist and Bayes’ hierarchical models were used to predict conditional probabilities, and the joint modeling (GLM and GAM) models were used to predict marginal probabilities. These models were fitted to National Longitudinal Study of Adolescent to Adult Health (Add Health) data for Tabaco use. Results We found that people were less likely to smoke if they had higher income, high school or higher education and religious. Individuals were more likely to smoke if they had abused drug or alcohol, spent more time on TV and video games, and been arrested. Moreover, individuals who drank alcohol early in life were more likely to be a regular smoker. Children who experienced mistreatment from their parents were more likely to use Tabaco regularly. Conclusions The joint modeling of the mean and dispersion models offered a flexible and meaningful method of addressing the intraclass correlation. They do not require one to identify random effects nor distinguish from one level of the hierarchy to the other. Moreover, once one can identify the significant random effects, one can obtain similar results to the random coefficient models. We found that the set of marginal models accounting for extravariation through the additional dispersion submodel produced similar results with regards to inferences and predictions. Moreover, both marginal and conditional models demonstrated similar predictive power. Electronic supplementary material The online version of this article (doi:10.1186/s12874-017-0303-y) contains supplementary material, which is available to authorized users.


Background
The standard logistic regression model is commonly used in the analysis of uncorrelated binary response observations with several covariates [1]. It is a member of the generalized linear models (GLM) where the canonical link is the logit, and the random component is binomial. Its ability to appeal to the odds of occurrence of an event has gained great interest and popularity in fitting binary data. However, when correlated observations are modeled with a standard logistic regression, the results are not necessarily efficient or reliable. In such cases, we can use the generalized estimating equation (GEE) model or a generalized linear mixed model (GLMM), or as emphasized in this paper, resort to the joint modeling of the mean and the dispersion. This additional modeling of the dispersion (beyond the generalized linear model for the mean) allows us to correct for the inflated standard errors due to the intraclass correlation and thereby provides more reliable and efficient estimates. The additional submodel has other benefits, as there are cases when one may be only interested in modeling the dispersion. Smyth and Verbyla [2] among others proposed joint modeling of the mean and dispersion parameters for certain distributions. Taylor and Verbyla [3], Wu, Zhang and Xu [4] proposed a unified procedure for selecting significant covariates in the mean and the dispersion sub models.
In survey research, one often encounters correlated data due to the designed nature of the study, longitudinal data or clustered data. Such data usually referred to collectively as repeated measures, often emanates in healthcare, education, psychology, sociology, and other related areas. As an example, in an obesity study among adolescents, students were sampled from within schools, and schools were sampled from within school districts. In such a situation, the observations were correlated due to the hierarchical structure of the design. Thus, to fit standard logistic regression models to such data will require one to violate of the usual assumption of independence. As such one may declare covariates as significant when in fact, they are not. In fact, it is not appropriate to model correlated data with any generalized linear model, as such models assume that the dispersion parameter is fixed [5]. In other words, the generalized linear model addresses location as a centering of the data, but the spread is not assumed to vary. However, in most of these cases when the distribution of response distribution belongs to the exponential family the spread is often related to the mean. On the other hand, it is well known that in most cases in the analysis of correlated data, that fixed dispersion or dispersion tied directly to the mean is less common than one assumes when confronted with survey data. However, the joint modeling of the mean and the dispersion is an approach where covariates are used to model both the mean and the dispersion simultaneously. In fact, the estimators of the regression parameters in the mean submodel is more efficient when the dispersion is correctly addressed [6].
In this paper, we modeled smoking based on a set of covariates. We compared different approaches to modeling these correlated data. These methods included marginal models based on the joint modeling of the mean and the dispersion and conditional models based on the hierarchical structure of the data with the use of random effects. The basis for the joint modeling is to directly address the correlation in the data [6]. In this paper, the joint modeling of the mean and the dispersion includes the generalized linear model (GLM) and generalized additive model (GAM). These are marginal models.
We fitted, compared, and validated several logistic regression models as we modeled smoking, as defined by having at least one cigarette every day for 30 days.
Smoking is detrimental as it can cause lung cancer, COPD (chronic obstructive pulmonary disease), heart disease, stomach cancer, as well as sudden infant death syndrome for women. Studies related to smoking have found that race and gender accounted for a differential role in the regular smoking habit of adults, while other studies have found an interactional effect between smoking, drinking, and income [7]. One study found that advertisements through social media such as movies and television have been linked with adolescent smoking initiation [8]. Yet another study found that patients with mental illness have a higher incidence of smoking than the general population [9].

Data
We obtained 4,484 observations from the fourth wave of Add Health survey [10], which was initially designed based on a two-level clustered design with an in-home questionnaire survey. Respondents were selected randomly from 132 different communities and were followed from 1995 to 2009. We selected variables based on social demographics, health status, and psychological status. A description of chosen predictors is presented in Table 1.
This sample consisted of respondents aged between 25 and 34 years, with 46.01% being male, and 32.4% being non-Caucasians. Approximately 45.2% of the respondents smoked cigarettes regularly (that is, at least one cigarette every day for 30 days), whereas 75% drank socially. 81% had both smoked and drunk at least once. Among those individuals who had both used alcohol or smoked, 20.9% had experienced smoking first, 23.1% had experienced alcohol first and 12.5% first experienced smoking and drink alcohol at the same time. On the health aspect, 37.5% of the sample had their last routine checks more than one year ago, 21.4% had no health insurance, and 30.6% had routine sports time. Regarding emotional satatus, 10.3% had psychological consulting experience, 13.5% had childhood mistreatment experience from their parents, 29.1% had negative or neutral attitude towards their future, and 28.8% felt isolated from the social life. Job satistfaction was average d at 27.7%.

Methods
We started with an unconditional hierarchical logistic regression model to obtain the intraclass correlation coefficient (ICC) 1 . The ICC measures the closeness of the individuals within a cluster as it pertains to smoking [11]. We found that ICC to be sufficiently large to warrant a correlated model [12]. We then conducted an analysis of smoking based on four (two marginal and two conditional) models appropriate for correlated binary data. We identified and addressed similarities and differences with these models as they addressed and accounted for the intraclass correlation. The two hierarchical models, one frequentist and one Bayes, and the two joint mean and dispersion models, one generalized linear model (GLM) and the other generalized additive model (GAM), were utilized. We compared the fit of these models as they pertained to their predictive ability on smoking. We stayed clear of addressing their mathematical or computational differences [13]. Our comparisons were made based on withholding a portion of the data for validation. We randomly selected 75% for the training dataset and 25% for validation dataset. We did this four different times, choosing a different 25% each time.

Hierarchical logistic regression models
We presented two conditional logistic regression or hierarchical logistic regression models for analysis. The dependency among observations was accounted for through adjustment to the systematic component with the addition of random intercepts and random slopes. This method of adjustment to the systematic component requires additional distributional assumptions as it now consists of random effects in addition to the fixed effects. This approach results in modeling the conditional mean rather than the marginal mean, thus, are referred to as subject-specific models [14]. Essentially, the subject-specific model consists of a product of a set of distributions, one set based on the conditional mean given the random effects and the other for the assumed distribution of the random effects. In our example, a subject -specific model tells us about the probability that the individual smoked given the community (random effect) [13].
Consider modeling smoking as the outcome as a Bernoulli random variable Y ij , which takes on the value of one (if smoking) with probability P ij and covariates X ijk for the i th individual in j th community for the k th covariate, i = 1,2,....I; j = 1,2,....J; k = 1,2,.....K. Let the overall differential cluster random effects be denoted as γ 0j (random intercept in the model) and assumed to be distributed as a normal random variable with mean zero and variance δ 0 2 . This addresses the unmeasured impact of the communities on the individual. The random intercept γ 0j represents the combined effect of all omitted covariates that causes individuals to be more prone to smoking. Our initial exploration of the data and some complimentary research suggested a differential effect across communities with regard to arrest. Thus, let the coefficient for arrest show a differential rate from community to community (random slope in the model). Random slope is denoted by γ 1j and assumed to be distributed normally with mean zero and variance δ 1 2 . Finally, the hierarchical logistic What was the total household income before taxes and deductions?
How many of these children are still living?
Have you ever been Arrested?

Gender
Are you white?
Health Do you have health insurance?
How long ago did you last have a routine check-up?
Do you have more than 5 drinks or drink alcohol more than 3 days a week during the past 12 months?
Do you have prescription drug misuse?
On the average, do you use a physical fitness or recreation center in your neighborhood per week?
In the past seven days, how many hours did you watch television or videos, including VHS, DVDs or music videos?
How old were you when you first had an Alcoholic drink?
Emotional How important (if at all) is your religious faith to you?
Before your 18th birthday, did a parent or other adult caregiver say things that really hurt your feelings or made you feel like you were not wanted or loved or hit you with a fist, kick you, or throw you down on the floor, into a wall, or down stairs more than 10 times?
In the past 12 months, did you spend on volunteer or community service work?
Are you optimistic about your future?
Do you feel isolated from others?
Do you satisfy with your current job?
regression model with 19 covariates and random intercept and random slope is: where X ijk represents the k th covariate measure for the i th person in the j th cluster (community) β i are the regression coefficients associated with covariate X ijk β 0 is the overall fixed effects and Z ij1 is the covariate arrest in this case associated with the random slope coefficient of arrest, with conditional distribution Y ij | X ijk γ 0j , γ 1j~B in (n i , p ij ) where Bin denotes the binomial distribution, the random intercept for communities γ 0j~N (0, δ 0 2 ) and random slope γ 1j~N (0, δ 1 2 ) where N denotes the normal distribution and the correlation of the random effects intercept and slope parameters is σ γ0j,γ1j . The correlation is due to the fact that observations in the same cluster share similar effects. The random slope γ 1j represents that there are differential rates of change with subjects in each cluster (community) as it relates to the covariate arrest Z ijk . We referred to this as frequentist hierarchical logistic regression model as opposed to the Bayes' hierarchical logistic regression model with its additional set of prior distributions on the parameters. We fitted the frequentist hierarchical model with PROC GLIMMIX in SAS, which is presented in Additional file 1: Appendix.

Bayes' hierarchical logistic regression model
Bayesian hierarchical logistic regression model is presented as: differs from the frequentist hierarchical model [1] merely in the set of prior distributional assumptions attached to the parameters in the Bayes' logistic regression model. The Bayes' logistic regression model requires that there are distributional assumptions specified for the unknown β i parameters, as well as the covariance parameters δ 0 2 δ 1 2 and σ γ0j,γ1j associated with the distribution for the random effects γ 0j , γ 0j~N (0, δ 0 2 ) and γ 1j the random slope γ 1j~N (0, δ 1 2 ) respectfully. The β's are assumed to be normally distributed. The Bayes' model requires that we incorporate any prior information on the unknown parameters, β = β 0 , β 1 , …, β K ; δ 0 2 , δ 1 2 and σ γ0j,γ1j together with the information we obtained from the observed data. Thus we concentrate on the resulting posterior distribution from which we seek posterior modes. In fact, the prior knowledge is represented through the distributional assumptions. It allows updating the knowledge regarding the unknown parameter distribution as if its prior information is known. We used the prior information through the distribution Pr(β) which is based on initial beliefs to estimate and make inferences. It is customary to assume that the normal distribution is the most appropriate prior. Thus, we obtained the posterior distribution, as proportional (∝) to the product of the likelihood function and the prior distribution on, which we concentrate. So that the probability is followed through the posterior, Since the data Y have distribution, Pr(Y) as constant relative to β. If the prior distributions were chosen to be uniform instead of normal, then the estimates for equations [1] and [2] are equivalently both maximum likelihood estimates. We obtained estimates from the posterior distribution, [2.3]. Such posterior inference typically requires simulation techniques such as Gibbs sampling and the Metropolis Hasting sampling as a MCMC method to obtain estimates. In fact, it generates new values from a proposed distribution that determines how to select new parameter values based on the current values. The Bayesian procedure produces consistent and efficient estimates. It has also been shown that if we were to choose conjugate priors, it guarantees that a posterior distribution possesses the same property as the prior distribution [15]. We used MLWin to fit these Add Health data with Bayes' hierarchical logistic regression model. MLwiN is a powerful program designed for fitting multilevel models. It provides features such as variance function window to calculate the residual variance at any level.

Joint modeling of the mean and dispersion
The subject-specific models concentrated modeling the conditional mean through the use of random effects to address the correlation. However, there is added value to address the correlation through jointly modeling of the mean and the dispersion [7]. This joint modeling consists of two interlinked models, one for the mean and one for the dispersion [16]. This differs from the random coefficient models and its approach to addressing correlation. It is a reflection of the double exponential family that allows us to model the mean parameter while making use of a second parameter that addresses the variance but independently of the mean [17]. Each submodel is considered to be a generalized linear model with three components: random component, systematic component, and link component [18]. The joint modeling of the mean and dispersion consists of starting initially with the mean submodel. Then, the residuals from the mean submodel or a function of them are used as the response (random component) in the dispersion submodel. The systematic component in the dispersion submodel is allowed if needed, to have the same or a subset of covariates as in the mean submodel or a totally different set of covariates not yet used. The link function in the dispersion submodel follows similar procedure though not necessarily the same function as the mean submodel. We used the same model checking techniques for both the mean and the dispersion submodel [19]. We considered two different joint modeling of both the mean and the dispersion submodel (Double Joint Modeling Mean and Dispersion), one based on GLM and the other based on GAM.

Double GLM Joint Modeling mean and dispersion
The joint modeling of the mean and the dispersion, also referred to as double generalized linear models (DGLM) consists of three parts: a function for the variance, a GLM submodel for the mean, and a GLM submodel for the dispersion [2]. Thus, we need to obtain an estimation of the mean function and estimation of the variance function [20].
In particular, let us assume that the binary random variable Y i is Bernoulli with probability p i , which depends on a set of covariates X = (X 1 ,X 2 , … X K ). The mean submodel is first fitted with logit link. The deviances from the mean submodel d i have mean ϕ i with variance V di (ϕ i ) representing the random component. The systematic component of the dispersion submodel consists of the vector of covariates Z = (Z 1 , Z 2 , … Z t ) with link component as log. In our model we have one covariate in the dispersion submodel as arrest. In summary, we have two interlinked generalized linear models consisting of the mean submodel, measuring smoking use Yi~Ber(p i ), where p i represents probability of smoking with logit link, and the dispersion submodel based on d i such that d i~Dd (ϕ i ,V di (ϕ i )) and log link We used PROC QLIM and Macro HPGLIM in SAS to fit these models, as presented in Additional file 1: Appendix.

Double GAM Joint Modeling mean and dispersion
In the joint modeling of the mean and dispersion one can replace those GLM with GAM. As such we still present two submodels, one for mean and one for the dispersion, but in each we have a generalized additive model instead of GLM. A general additive model is similar to generalized linear model as it relates the mean of the random response variable Y and a set of covariates X i , …, X p , [21]. The generalized additive model (GAM) distinguishes itself from the generalized linear model in that where γ i (X i ), i = 1, …,K; are smooth functions. These smooth functions γ i , are not given a parametric form but instead are estimated in a nonparametric or semiparametric form. The GAM consists of a random component with response Y i , a systematic component that is additive in the function of covariates, and a link function relating the response in terms of the mean with a combination of the covariates. Whereas GLM has as its systematic component the linear predictor of the form Σβ i X i the GAM instead, uses the additive component, as a sum of smooth functions Σγ i (X i ). The function γ i is determined or estimated based on a nonparametric technique usually influenced by the examination of actual plots. The GAM requires one to determine the smoothing parameter through the generalized cross validation (GCV) function in nonparametric regression methods [22]. The functions γ (⋅) and f (⋅) are additive smooth functions, unspecified linear functions or nonparametric methods, for both mean and dispersion submodels.
The GAM is known to be more flexible than the parametric methods as an exploratory analysis in identifying the relationship between the response and the covariates. Also it is more flexible as opposed to other models when detecting quadratic or higher order power in a piecemeal fashion [21]. In addition, the GAM can also be used more efficiently to uncover nonlinear effects among the covariates. While the generalized linear model is used for estimation and inferences for the regression parameters, the GAM is seen as an exploratory method based on a nonparametric approach while making use of an apparent response-covariate relationship. In summary, we have the systematic component for the mean additive submodel logit (p i ) = ∑ i = 1 p γ i (X i ) and the systematic component for dispersion additive submodel log(deviance i ) = ∑ i = 1 k f i (X i ). The log (deviance) is assumed to be distributed as a normal distribution in the dispersion with identity link function [23]. We used PROC GAM in SAS to fit these models, as presented in Additional file 1: Appendix.

Results
We fitted two logistic regression models with random effects, frequentist hierarchical logistic regression model and Bayes' hierarchical logistic regression model. We also fitted two joint modeling of the mean and dispersion models, one based on the generalized linear model, and the other on generalized additive modeling. Results of all models are presented in Table 2.
In general, we concluded that individuals who had higher income level were less likely to have smoked. People who had kids were about 1.23 times more likely to have smoked regularly. If one had alcohol or drug misuse, they were about 3.26 times more likely to smoke. Individuals who had been arrested or had spent more time watching TV or playing video games were about 1.96 times more likely to smoke than those with no such habits. Non-Caucasian were about 1.58 more likely to smoke than Caucasian. People not practicing religion were about 1.34 times more likely to smoke than those who were religious. People with lower education level were about more than 1.32 times likely to smoke. Individuals who drank alcohol early in life were more than 1.52 times more likely to be a regular smoker. People who experienced mistreatment from their parents during childhood were more than 1.70 times more likely to smoke regularly. Job satisfaction had no significant impact on smoking use. Finally, people who felt isolated from social activities had no impact.

Results of standard logistic regression
As a point of reference regarding the extra variation or overdispersion present in the data due to its hierarchical nature, we fitted a standard logistic regression with 19 covariates. The goodness-of-fit test for overdispersion (Hosmer-Lemeshow, X 2 = 19.97 , p-value = 0.011) suggested that the model is not a good fit and overdispersion is clearly present. This may be in part due to ignoring the hierarchical structure of the design. Initially, we fitted an unconditional logistic regression model with random effects (community) and no covariates and obtained an ICC value of 0.146. Thus, the intraclass correlation is large enough to suggest using a correlated model to address the hierarchical structure [12].

Results of frequentist hierarchical and bayes' hierarchical logistic regression model
A frequentist hierarchical logistic regression model with random intercept and random slope based on the differential rate due to the covariate arrested, resulted in the estimate of the variance for the intercept as 0.099 and random slope of 0.116 with its standard errors of 0.062 and 0.071 respectively. The covariance estimate between intercept and slope is 0.028. Thus, resulting in a standardized value of 0.099/0.062 for p-value of 0.023 for intercept and 0.116/0.071 for p-value of 0.025 and suggesting that the rate of change as it pertains to arrest varied across communities. Covariates, alcohol, kid, arrested, race, drug, TV time, mistreatment and religion showed significant positive effects on the probability of smoking, while covariates religion, public work, starting age, education, income, gender, sports showed significant negative effects on the probability of smoking. Thus, one expected that the probability that an individual who smoked was impacted by factors, although unidentifiable, but were related to the communities and history with the law as it depended on community. Mistreatment, gender, attitude towards the future, social relation and whether or not they had a job had no impact on the smoking behavior. We fitted a Bayes' hierarchical logistic regression model with random effects to measure in community random slope in arrested. A Bayes' hierarchical logistic regression model fitted with MLWin gave random effects in community and arrested as significant with σ 2 d intercept ¼ 0:009 and σ 2 d arrested ¼ 0:118 and σ d int=arr ¼ 0:030 with standard errors 0.002, 0.035, and 0.009, respectively. This suggested that our random effects due to community and due to arrest across communities differed significantly.

Comparison of hierarchical models
The hierarchical models which tell about the conditional means (probabilities) gave similar results though for social relation, routine check, insurance, and attitude towards the future they differed. In particular, the models agreed with the set of covariates we categorized under demographics or health. When difference appeared they occurred mainly with Bayes' and double GLM joint modeling model. In those cases, the Bayes' hierarchical showed some marginal significance while the frequentist did not identify any trace of significance with these covariates. Bayes' results have a tendency to have smaller p-values. This may be due to the fact that the Bayes' method uses a prior distribution on the parameters.

Results of joint modeling of mean and dispersion
A joint modeling of the mean and the dispersion with a generalized linear model for each submodel with covariate arrest in the dispersion sub-model provided a good fit to the data. The mean sub-model consisted of covariates, alcohol, kid, arrested, race, drug, TV time, mistreatment, social relation and job showing significant positive effects on the probability of smoking. The covariates religion, public work, starting age, education, income, insurance, sports, routine check, attitude future showed significant negative effects on the probability of smoking, Table 2.
The joint modeling of the mean and the dispersion with generalized additive submodels was fitted to the data. The smoothing effects were significant (p = 0.0001) with smallest GCV value of 3.96, based on estimating the additive predictors by using a B-spline smoother with 3 degrees of freedom including a strong quadratic pattern.

Comparative dispersion submodel
In both the mean and the dispersion submodel for both the GLM and GAM, covariate arrest had a significant impact (p < 0.001) on the dispersion (see Table 3).
In the next section, we examined the predictive probability for each of these models based on the validation set and the training set in four different datasets. We used this in our comparisons as we took a closer look at the predictive ability of the models.

Accuracy comparison of four models
We used the area under the Receiver Operating Characteristic (ROC) curves for both the training dataset and the validation dataset as a measure of fit for our four logistic regression models. We compared the four models based on their predicted probabilities in the training dataset and the validation dataset (see Table 4). We repeated the process four times, with a different 25% of the data omitted for validation each time. The joint modeling of the mean and the dispersion with GAMs had the best predicted probabilities among validation dataset. However, the four models did not show marked differences in their predicted probabilities.

Discussion
We fitted four binary logistic regression models, frequentist hierarchical, Bayes' hierarchical model, GLM joint modeling of mean and dispersion, and GAM joint modeling mean and dispersion to model smoking. Two of these models (the joint mean and dispersion submodel) addressed the marginal probabilities while the other two hierarchical models (Frequentist and Bayes') addressed the conditional probabilities. The conditional probability models found arrest as a covariate with differential rates across communities while the marginal model also found that arrest is a key variable in explaining the dispersion. Overall, the models did not perform markedly different and gave similar results based on the training dataset and the validation dataset. The marginal models were similar with their predictability in the validation datasets and the training datasets.
We acknowledge the limitation of using a single dataset and encourage future research to conduct similar studies on a variety of correlated data. In the meantime, we are confident about the applicability of our conclusion to other research domains. Even though we did not perform a simulation study, we found similar patterns with hierarchical models with the UCLA dataset and recent research [12]. Similar results were also seen in the fit of the conditional models. The consistency of the models and the fit with validation and training datasets led us to believe in the generalizability of this paper to broader receivers.

Conclusion
We fitted both conditional and marginal models to study smoking behavior for adolescents who had eventually become adults. While these models are addressing two different questions, we did not detect significant differences in both model performances based on training or validation. Further investigation showed that most communities were alike with a few showing that different random effect. By wave 4 these adolescents were moved away from their original communities so over time the community impact was negligent or independent of the smoking behavior.  The Add Health data files do not contain respondent identifiers or any links to identifiers. The data do contain ID numbers which are necessary to allow researchers to link data across the waves and to friends and partners (also de-identified).
Author details