We started with an unconditional hierarchical logistic regression model to obtain the intraclass correlation coefficient (ICC)^{Footnote 1}. The ICC measures the closeness of the individuals within a cluster as it pertains to smoking [11]. We found that ICC to be sufficiently large to warrant a correlated model [12]. We then conducted an analysis of smoking based on four (two marginal and two conditional) models appropriate for correlated binary data. We identified and addressed similarities and differences with these models as they addressed and accounted for the intraclass correlation. The two hierarchical models, one frequentist and one Bayes, and the two joint mean and dispersion models, one generalized linear model (GLM) and the other generalized additive model (GAM), were utilized. We compared the fit of these models as they pertained to their predictive ability on smoking. We stayed clear of addressing their mathematical or computational differences [13]. Our comparisons were made based on withholding a portion of the data for validation. We randomly selected 75% for the training dataset and 25% for validation dataset. We did this four different times, choosing a different 25% each time.

### Hierarchical logistic regression models

We presented two conditional logistic regression or hierarchical logistic regression models for analysis. The dependency among observations was accounted for through adjustment to the systematic component with the addition of random intercepts and random slopes. This method of adjustment to the systematic component requires additional distributional assumptions as it now consists of random effects in addition to the fixed effects. This approach results in modeling the conditional mean rather than the marginal mean, thus, are referred to as subject-specific models [14]. Essentially, the subject-specific model consists of a product of a set of distributions, one set based on the conditional mean given the random effects and the other for the assumed distribution of the random effects. In our example, a subject -specific model tells us about the probability that the individual smoked given the community (random effect) [13].

Consider modeling smoking as the outcome as a Bernoulli random variable Y_{ij}, which takes on the value of one (if smoking) with probability P_{ij} and covariates X_{ijk} for the i^{th} individual in j^{th} community for the k^{th} covariate, i = 1,2,....I; j = 1,2,....J; k = 1,2,.....K. Let the overall differential cluster random effects be denoted as γ_{0j}(random intercept in the model) and assumed to be distributed as a normal random variable with mean zero and variance δ
^{2}_{0}
. This addresses the unmeasured impact of the communities on the individual. The random intercept γ_{0j} represents the combined effect of all omitted covariates that causes individuals to be more prone to smoking. Our initial exploration of the data and some complimentary research suggested a differential effect across communities with regard to *arrest*. Thus, let the coefficient for *arrest* show a differential rate from community to community (random slope in the model). Random slope is denoted by γ_{1j} and assumed to be distributed normally with mean zero and variance δ
^{2}_{1}
. Finally, the hierarchical logistic regression model with \( 19 \) covariates and random intercept and random slope is:

$$ \mathrm{logit}\ \left({\mathrm{p}}_{\mathrm{ij}}\ \Big|\ {\mathrm{X}}_{\mathrm{ij}\mathrm{k}}\ {\upgamma}_{0\mathrm{j}},\ {\upgamma}_{1\mathrm{j}}\right) = {\upbeta}_0+{\upbeta}_{1\ }{\mathrm{X}}_{\mathrm{ij}1} + {\upbeta}_{2\ }{\mathrm{X}}_{\mathrm{ij}2}+\cdots + {\upbeta}_{\mathrm{K}\ }{\mathrm{X}}_{\mathrm{ij}\mathrm{K}}+\kern0.5em {\upgamma}_{0\mathrm{j}} + {\upgamma}_{1\mathrm{j}}\ {\mathrm{Z}}_{\mathrm{ij}1} $$

(1)

where X_{ijk} represents the k^{th} covariate measure for the i^{th} person in the j^{th} cluster (community) β_{i} are the regression coefficients associated with covariate X_{ijk} β_{0} is the overall fixed effects and Z_{ij1} is the covariate *arrest* in this case associated with the random slope coefficient of arrest, with conditional distribution Y_{ij}| X_{ijk} γ_{0j}, γ_{1j} ~ *Bin* (n_{i}, p_{ij}) where *Bin* denotes the binomial distribution, the random intercept for communities γ_{0j} ~ *N*(0, δ
^{2}_{0}
) and random slope γ_{1j} ~ *N*(0, δ
^{2}_{1}
) where *N* denotes the normal distribution and the correlation of the random effects intercept and slope parameters is σ_{γ0j,γ1j}. The correlation is due to the fact that observations in the same cluster share similar effects. The random slope γ_{1j} represents that there are differential rates of change with subjects in each cluster (community) as it relates to the covariate *arrest* Z_{ijk}. We referred to this as frequentist hierarchical logistic regression model as opposed to the Bayes’ hierarchical logistic regression model with its additional set of prior distributions on the parameters. We fitted the frequentist hierarchical model with PROC GLIMMIX in SAS, which is presented in Additional file 1: Appendix.

### Bayes’ hierarchical logistic regression model

Bayesian hierarchical logistic regression model is presented as:

$$ \mathrm{logit}\ \left({\mathrm{p}}_{\mathrm{i}\mathrm{j}}\ \Big|\ {\mathrm{X}}_{\mathrm{i}\mathrm{j}\mathrm{k}}\ {\upgamma}_{0\mathrm{j}},\ {\upgamma}_{1\mathrm{j}}\right) = {\upbeta}_0+{\upbeta}_{1\ }{\mathrm{X}}_{\mathrm{i}1} + {\upbeta}_{2\ }{\mathrm{X}}_{\mathrm{i}2}+\cdots + {\upbeta}_{\mathrm{K}\ }{\mathrm{X}}_{\mathrm{i}\mathrm{K}}+\kern0.5em {\upgamma}_{0\mathrm{j}} + {\upgamma}_{1\mathrm{j}}\ {\mathrm{Z}}_{\mathrm{i}\mathrm{j}1} $$

(2)

differs from the frequentist hierarchical model [1] merely in the set of prior distributional assumptions attached to the parameters in the Bayes’ logistic regression model. The Bayes’ logistic regression model requires that there are distributional assumptions specified for the unknown β_{i} parameters, as well as the covariance parameters δ
^{2}_{0}
δ
^{2}_{1}
and σ_{γ0j,γ1j} associated with the distribution for the random effects γ_{0j}, γ_{0j} ~ *N*(0, δ
^{2}_{0}
) and γ_{1j} the random slope γ_{1j} ~ *N*(0, δ
^{2}_{1}
) respectfully. The β’s are assumed to be normally distributed. The Bayes’ model requires that we incorporate any prior information on the unknown parameters, **β** = β_{0}, β_{1}, …, β_{K}; δ
^{2}_{0}
, δ
^{2}_{1}
and σ_{γ0j,γ1j} together with the information we obtained from the observed data. Thus we concentrate on the resulting posterior distribution from which we seek posterior modes. In fact, the prior knowledge is represented through the distributional assumptions. It allows updating the knowledge regarding the unknown parameter distribution as if its prior information is known. We used the prior information through the distribution Pr(**β**) which is based on initial beliefs to estimate and make inferences. It is customary to assume that the normal distribution is the most appropriate prior. Thus, we obtained the posterior distribution, as proportional (∝) to the product of the likelihood function and the prior distribution on, which we concentrate. So that the probability is followed through the posterior,

$$ \Pr\ \left[\boldsymbol{\upbeta} \kern0.75em {\upgamma}_{0\mathrm{j}},\ {\upgamma}_{1\mathrm{j}},\ {\updelta}_0^2,{\updelta}_1^2\Big|\mathrm{Y}\right] \propto \left[ \Pr \left[\mathrm{Y}\Big|\boldsymbol{\upbeta},\ {\upgamma}_{0\mathrm{j}},\ {\upgamma}_{1\mathrm{j}},\ {\updelta}_0^2,{\updelta}_1^2\right]\ast \Pr \left(\boldsymbol{\upbeta},\ {\upgamma}_{0\mathrm{j}},\ {\upgamma}_{1\mathrm{j}},\ {\updelta}_0^2,{\updelta}_1^2\right)\right] $$

(3)

Since the data Y have distribution, Pr(Y) as constant relative to **β**. If the prior distributions were chosen to be uniform instead of normal, then the estimates for equations [1] and [2] are equivalently both maximum likelihood estimates. We obtained estimates from the posterior distribution, [2.3]. Such posterior inference typically requires simulation techniques such as Gibbs sampling and the Metropolis Hasting sampling as a MCMC method to obtain estimates. In fact, it generates new values from a proposed distribution that determines how to select new parameter values based on the current values. The Bayesian procedure produces consistent and efficient estimates. It has also been shown that if we were to choose conjugate priors, it guarantees that a posterior distribution possesses the same property as the prior distribution [15]. We used MLWin to fit these Add Health data with Bayes’ hierarchical logistic regression model. MLwiN is a powerful program designed for fitting multilevel models. It provides features such as *variance function window* to calculate the residual variance at any level.

### Joint modeling of the mean and dispersion

The subject-specific models concentrated modeling the conditional mean through the use of random effects to address the correlation. However, there is added value to address the correlation through jointly modeling of the mean and the dispersion [7]. This joint modeling consists of two interlinked models, one for the mean and one for the dispersion [16]. This differs from the random coefficient models and its approach to addressing correlation. It is a reflection of the double exponential family that allows us to model the mean parameter while making use of a second parameter that addresses the variance but independently of the mean [17]. Each submodel is considered to be a generalized linear model with three components: random component, systematic component, and link component [18]. The joint modeling of the mean and dispersion consists of starting initially with the mean submodel. Then, the residuals from the mean submodel or a function of them are used as the response (random component) in the dispersion submodel. The systematic component in the dispersion submodel is allowed if needed, to have the same or a subset of covariates as in the mean submodel or a totally different set of covariates not yet used. The link function in the dispersion submodel follows similar procedure though not necessarily the same function as the mean submodel. We used the same model checking techniques for both the mean and the dispersion submodel [19]. We considered two different joint modeling of both the mean and the dispersion submodel (Double Joint Modeling Mean and Dispersion), one based on GLM and the other based on GAM.

### Double GLM Joint Modeling mean and dispersion

The joint modeling of the mean and the dispersion, also referred to as double generalized linear models (DGLM) consists of three parts: a function for the variance, a GLM submodel for the mean, and a GLM submodel for the dispersion [2]. Thus, we need to obtain an estimation of the mean function and estimation of the variance function [20].

In particular, let us assume that the binary random variable Y_{i} is Bernoulli with probability \( {\mathrm{p}}_{\mathrm{i}} \), which depends on a set of covariates **X** = (X_{1},X_{2}, … X_{K}). The mean submodel is first fitted with logit link. The deviances from the mean submodel \( {\mathrm{d}}_{\mathrm{i}} \) have mean ϕ_{i} with variance V_{di} (ϕ_{i}) representing the random component. The systematic component of the dispersion submodel consists of the vector of covariates **Z** = (Z_{1}, Z_{2}, … Z_{t}) with link component as log. In our model we have one covariate in the dispersion submodel as *arrest*. In summary, we have two interlinked generalized linear models consisting of the mean submodel, measuring smoking use *Yi ~ Ber*(*p*
_{
i
}), where *p*
_{
i
} represents probability of smoking with logit link,

$$ {\upeta}_{\mathrm{i}} = \mathrm{logit}\ \left({\mathrm{p}}_{\mathrm{i}}\right) = \log \left(\frac{{\mathrm{p}}_{\mathrm{i}}}{{1\hbox{-} \mathrm{p}}_{\mathrm{i}}}\right) = {\upbeta}_0 + {\upbeta}_1{\mathrm{X}}_1+\cdots +{\upbeta}_{\mathrm{K}}{\mathrm{X}}_{\mathrm{K}} $$

(4)

and the dispersion submodel based on \( {\mathrm{d}}_{\mathrm{i}} \) such that d_{i} ~ D_{d} (ϕ_{i},V_{di}(ϕ_{i})) and log link

$$ {\mathrm{n}}_{\mathrm{di}} = \log \left({\upphi}_{\mathrm{i}}\right) = {\upgamma}_0 + {\upgamma}_1{\mathrm{Z}}_1. $$

We used PROC QLIM and Macro HPGLIM in SAS to fit these models, as presented in Additional file 1: Appendix.

### Double GAM Joint Modeling mean and dispersion

In the joint modeling of the mean and dispersion one can replace those GLM with GAM. As such we still present two submodels, one for mean and one for the dispersion, but in each we have a generalized additive model instead of GLM. A general additive model is similar to generalized linear model as it relates the mean of the random response variable Y and a set of covariates X_{i}, …, X_{p}, [21]. The generalized additive model (GAM) distinguishes itself from the generalized linear model in that

$$ \mathrm{E}\left[\mathrm{Y}\right] = \kern0.5em {\upgamma}_0+\kern0.5em {\upgamma}_1\left({\ \mathrm{X}}_1\right)+\dots + {\upgamma}_K\left({\ \mathrm{X}}_K\right) $$

(5)

where γ_{i} (X_{i}), i = 1, …,K; are smooth functions. These smooth functions γ_{i}, are not given a parametric form but instead are estimated in a nonparametric or semi-parametric form. The GAM consists of a random component with response Y_{i}, a systematic component that is additive in the function of covariates, and a link function relating the response in terms of the mean with a combination of the covariates. Whereas GLM has as its systematic component the linear predictor of the form Σβ_{i}X_{i} the GAM instead, uses the additive component, as a sum of smooth functions Σγ_{i}(X_{i}). The function γ_{i} is determined or estimated based on a nonparametric technique usually influenced by the examination of actual plots. The GAM requires one to determine the smoothing parameter through the generalized cross validation (GCV) function in nonparametric regression methods [22]. The functions γ (⋅) and f (⋅) are additive smooth functions, unspecified linear functions or non-parametric methods, for both mean and dispersion submodels.

The GAM is known to be more flexible than the parametric methods as an exploratory analysis in identifying the relationship between the response and the covariates. Also it is more flexible as opposed to other models when detecting quadratic or higher order power in a piecemeal fashion [21]. In addition, the GAM can also be used more efficiently to uncover nonlinear effects among the covariates. While the generalized linear model is used for estimation and inferences for the regression parameters, the GAM is seen as an exploratory method based on a nonparametric approach while making use of an apparent response-covariate relationship. In summary, we have the systematic component for the mean additive submodel logit (p_{i}) = ∑
^{p}_{i = 1}
γ_{i}(X_{i}) and the systematic component for dispersion additive submodel log(deviance_{i}) = ∑
^{k}_{i = 1}
f_{i}(X_{i}). The log (deviance) is assumed to be distributed as a normal distribution in the dispersion with identity link function [23]. We used PROC GAM in SAS to fit these models, as presented in Additional file 1: Appendix.