Alternative regression models to assess increase in childhood BMI

Background Body mass index (BMI) data usually have skewed distributions, for which common statistical modeling approaches such as simple linear or logistic regression have limitations. Methods Different regression approaches to predict childhood BMI by goodness-of-fit measures and means of interpretation were compared including generalized linear models (GLMs), quantile regression and Generalized Additive Models for Location, Scale and Shape (GAMLSS). We analyzed data of 4967 children participating in the school entry health examination in Bavaria, Germany, from 2001 to 2002. TV watching, meal frequency, breastfeeding, smoking in pregnancy, maternal obesity, parental social class and weight gain in the first 2 years of life were considered as risk factors for obesity. Results GAMLSS showed a much better fit regarding the estimation of risk factors effects on transformed and untransformed BMI data than common GLMs with respect to the generalized Akaike information criterion. In comparison with GAMLSS, quantile regression allowed for additional interpretation of prespecified distribution quantiles, such as quantiles referring to overweight or obesity. The variables TV watching, maternal BMI and weight gain in the first 2 years were directly, and meal frequency was inversely significantly associated with body composition in any model type examined. In contrast, smoking in pregnancy was not directly, and breastfeeding and parental social class were not inversely significantly associated with body composition in GLM models, but in GAMLSS and partly in quantile regression models. Risk factor specific BMI percentile curves could be estimated from GAMLSS and quantile regression models. Conclusion GAMLSS and quantile regression seem to be more appropriate than common GLMs for risk factor modeling of BMI data.


Background
The prevalence of childhood obesity increased dramatically during the last decades in industrialized countries [1,2]. This increase in prevalence seems rather to be due to a shift of the upper part of the body mass index (BMI) distribution than to a shift of the entire BMI distribution as for example observed in the NHANESIII survey from 1988 to 1994 [3]. This increased positive skewness could be due to exposure to obesogenic environmental determinants among a subpopulation with a high degree of susceptibility. TV watching, formula feeding, smoking in pregnancy, maternal obesity or parental social class are well known environmental, constitutional or sociodemographic risk factors [4,5]. However, it remains unknown if these factors affect the entire BMI distribution or only parts of it. A recent descriptive study reported an effect of several risk factors for childhood obesity on upper BMI percentiles, while the middle part of the BMI distribution was virtually unaffected. However, this study did not adjust for potential confounders [6].
In the literature most authors used linear or logistic regression to model effects on body mass index (BMI) measures. However, BMI data are usually positively skewed, and therefore a transformation of the response variable and/or other regression methods might be more appropriate. Possible approaches include lognormal or Box Cox power transformations of the BMI prior to linear regression modeling, gamma regression, quantile regression or GAMLSS models.
Quantile regression has been applied in various BMIrelated studies [7][8][9]. Several risk factors for increased adult body size had different effects on specific quantiles. Comparisons between different regression models were discussed, but not quantified by model fit criteria such as Akaike Information Criterion (AIC) [10].
The aim of our study was to compare generalized linear models, GAMLSS models and quantile regression models among BMI data on 4967 preschoolers in order to identify the best approach for obesity risk factor analysis. Additionally, we aimed to assess the effect of different risk factors on the BMI distribution (change of mean, variance, skewness or kurtosis) that might have implications for preventive measures (population based approach vs. targeted approach).

Data
Data on 7026 children participating in the school entry health examination in Bavaria, Southern Germany, were collected between September 2001 and August 2002. Children's age ranged from 54 to 88 months. Parental questionnaires on sociodemographic, lifestyle and other risk factors for obesity were distributed together with the invitation to the compulsory school entry examination. Children's weight and height were measured in light clothing and with calibrated balances and fixed stadiometers during the examination. The study has been described in detail elsewhere [4].
Sex and age were considered as confounders, while explanatory variables with previously reported associations to childhood body composition were a priori considered as exposures (abbreviations in brackets). These exposure variables included maternal smoking in pregnancy (PS), amount of watching TV (TV), breast feeding (BF), daily meal frequency (MF), highest graduation of either parent (elementary/secondary/at least A-level) (PG), maternal BMI (MB) and child's weight gain from birth to 2 years of life (WG) [4,5,11]. The sample was confined to cases with complete information on these variables leaving data of 4967 children for the analyses.

Statistical methods
Simple linear regression uses an identity link and models the relationship between a dependent variable Y i , independent variables (z 1 , ..., z m ) with m as total number of covariates included, and residuals (ε 1 , ..., ε n ) for the individual i, i = 1, ..., n. The model can be denoted as Generalized linear models (GLM) allow a more flexible modeling [12] of the linear predictor η i = g(μ i ) which can be denoted as The link function g(.) can be specified e.g. by • the identity link g(μ) = μ, resulting in the simple linear regression model, • the log link g(μ) = log(μ) yielding loglinear regression, • the Box Cox power link [13] • or the inverse link g(μ) = μ -1 .
The inverse link function is the natural link function for the normal gamma distribution and was used in this study to perform gamma regression. One approach for model selection is the Generalized Akaike Information Criterion (GAIC) with c = 2 for the 'classical' Akaike Information Criterion (AIC) [10], and c = log(n) for the Bayes Information Criterion (BIC) [14]. The GAIC includes the log likelihood containing the relevant parameter vector (e. g. μ) and a penalty term c × p for the number of parameters and p = m + f with f for the extra degrees of freedom needed for special model fitting techniques (e. g. splines). A statistical model is considered as better fitting if its GAIC is smaller than the GAIC of another statistical model.
Generalized Additive Models for Location, Scale and Shape (GAMLSS) offer an approach to model data with consideration of μ as location parameter as well as σ as scale parameter, and the skewness parameter ν and the kurtosis parameter ζ as shape parameters. A GAMLSS model is based on independent observations y i for i = 1, ..., n and monotone link functions g k (.), relating the parameters μ, σ, ν and ζ to the J k explanatory variables [15,16] through semiparametric predictors. The common choice of the link functions is: A multiplicative rather than an additive model for μ can be obtained by setting g 1 (μ) = log(μ). Calculations with GAMLSS in this study use the Box Cox t (BCT) distribution, which is defined as with z assumed to follow a t distribution with ζ degrees of freedom (ζ > 0). Under this assumption it is possible to perform likelihood calculations.
Additionally, cubic and penalized splines were considered to model continuous covariates [17,18]. The model selection can also be performed by GAIC because GAMLSS represents a general framework of regression models, including the class of GLMs [19]. The authors of GAMLSS used values for c in the range of 2 to 3 to calculate the GAIC [19].
In contrast to the above mentioned distribution based methods, quantile regression estimates conditional quantile functions. It can be used to obtain information about specific quantiles of the underlying distribution.
Quantile regression for the sample quantile τ works by minimizing with the so-called check function [20] In (3), the predictor in equation (1) is taken as η = Q τ with Q τ being the modeled τ quantile.
The comparison of quantile regression and generalized linear models is a major challenge due to the inapplicability of the GAIC in quantile regression. To compare GAMLSS and quantile regression, we plotted estimated values of the 90 th and 97 th BMI percentiles for weight gain in the first two years, while the other covariates were considered at their mean values (if continuous) or their modes (if categorical). We similarly calculated the estimated percentiles for each category of meal frequency, holding the other variables fixed accordingly.

Results
The overall mean of the BMI of the 4967 children was 15.34 kg/m 2 with a median of 15.08 kg/m 2 . The data included 2585 males (vs. 2382 females), 417 (vs. 4550) children whose mother had smoked in pregnancy, 384 children with more than 2 TV hours per day (vs. 4583 in 3 lower categories), 1197 (vs. 3770) children who had never been breastfed, 816 children with 3 daily meals at maximum (vs. 4151 with 4 or more meals), and 1466 children whose parents had only an elementary school degree or less (vs. 3501 in other categories). In addition to these categorical covariates, we considered the metric variables children's age in months with a mean of 72.86 (SD 4.77), the maternal BMI (in kg/m 2 ) which ranged from 15.9 to 49.5 (mean 23.44, SD 3.99), and the children's weight gain (in kg) in the first 2 years of life, ranging from 5.5 to 15.3 (mean 9.45, SD 1.40).

LL
L y y f y Figure 1 shows univariate non-parametric kernel density estimates of the children's BMI distributions with regard to underlying risk factors. Maternal BMI and weight gain in the first 2 years were categorized by common cut points (Maternal BMI > 25 kg/m 2 , weight gain ≥ 10 kg [4]). When present, most risk factors seemed to increase BMI values of upper BMI regions: For example, there was a higher proportion of children with a BMI > 18 in non-breastfed compared to breastfed children, although the distribution curves of both strata were of almost identical shape for BMI values of < 18.
Simple linear models assessing the impact of certain risk factors might be limited under such varying key characteristics of the density distributions with and without underlying risk factors due to their intense assumptions.
In the multivariable regression analyses, we considered the following a priori defined interaction terms with reported or assumed interrelations: a) sex as confounder with every covariate except age, b) weight gain in the first 2 years with parental education [4], c) weight gain in the first 2 years with breast feeding [21] and d) maternal smoking in pregnancy with breastfeeding [22].
Full multivariable linear, loglinear, gamma and linear regression models with Box Cox power transformed BMI values included all covariates and all a priori defined interaction terms. The backward elimination procedure yielded models without any interaction term and without parents' graduate, maternal smoking in pregnancy or breastfeeding for all 4 GLM models, with η = μ for LR, for example.
We chose c = 3 in equation (2) for the GAIC because this factor yielded stable and plausible results in a univariate preanalysis (data not shown). We decided not to fit the multivariable GAMLSS model by considering all covariates from the beginning and starting the fitting process due to the high computational demand of this approach. Instead, we calculated separate univariate GAMLSS models for all covariates and thereafter combined the resulting models to a multivariable model in terms of a pre-selecting forward selection procedure. During the fitting process of univariate models, we considered the strict parameter hierarchy for GAMLSS models in four steps, according to the suggestion of the GAMLSS authors [23]: first a model for μ should be fitted, after that for σ, followed by ν and ζ. If a parameter term did not reduce the GAIC(3), it was not considered for the univariate model of the respective covariate. For example, ν and ζ did not enhance the fit of the univariate model for the variable watching TV, yielding (table 1): Cubic and penalized splines up to three degrees of freedom were considered in models of the continuous covariates age, maternal BMI and weight gain in the first 2 years. Parameters that were not significant anymore in the combined multivariable model were excluded from the final multivariable model. Apart from age, increase (or decrease) in the location parameter μ for covariates was always associated with significant increase (or decrease) in the scale parameter σ.
The final multivariable GAMLSS model yielded the same significant covariates as the GLM methods using backward selection, with exception of breastfeeding for which the scale parameter σ was significant in the GAMLSS (tables 1 and 2). The a priori defined interaction terms were not significant in any considered model.
An overview on significant variables in respective models and differences across models is shown in table 2. The covariates TV watching, meal frequency, maternal BMI and weight gain in the first two years of life were significantly associated with child's BMI regardless of the method or chosen link. In contrast, parental education was not significant in any multivariable model. Its influ-Univariate density distributions of children's BMI with regard to underlying risk factors Figure 1 Univariate density distributions of children's BMI with regard to underlying risk factors. Maternal BMI and weight gain in the first two years were divided up into two categories. The risk factors seem to produce a slightly right-skewed distribution for exposed in comparison to non-exposed children, whereas the confounder variable sex does not.

Discussion and conclusion
In our study, GAMLSS showed a much better fit examining obesity risk factors compared to GLM models by GAIC. The same explanatory variables had significant associations to body composition across all GLM models, although models contained either additive (linear regression) or multiplicative components (loglinear regression, Box Cox regression and gamma regression).
In general, GAMLSS offers a flexible approach due to the large number of implemented distribution families. With GAMLSS, it is possible to assess the effect of specific parameters on the outcome variable distribution. For example, we observed that some variables did not only affect the mean, but additionally the scale of the BMI distribution. Additionally, interdependencies of considered parameters can be examined by GAMLSS. We observed that an increase (decrease) of the mean (μ) was mostly associated with an increase (decrease) of the scale (σ). The scale parameter σ in the distribution used (BCT) in GAMLSS is an approximative centile based coefficient of variation measure [16]. Therefore risk factors of over- "+" denoting significant variables, "0" non-significant variables and, in case of GAMLSS, "(+)" variables only significant for the σ term and " [0]" variables only significant in the univariate models.5  * "never" as reference † "1-3/day" as reference ‡ Parameter only significant in the respective univariate model § Splines used for parameter estimationμσνς weight seem to affect both, the BMI itself and its variation. For example, children with a high weight gain in the first 2 years of life had higher BMI values as well as a higher coefficient of variation in BMI compared to those with a low infant weight gain. Thus, low infant weight gain might be a better predictor for underweight than is high infant weight gain for overweight. A change of the skewness term ν, however, did not improve the goodness of fit for modeling the skewed BMI distribution. This might be due to a sufficient consideration of skewness by a change of both parameters μ and σ.
Quantile regression allows additional interpretation, e.g. of risk factors affecting only parts of the distribution [7]. While GAMLSS models consider the entire BMI distribution, quantile regression directly examines possible associations between explanatory variables and certain predefined percentiles. Logistic regression is in principal based on a similar idea, but in case of overweight, for example, it has to deal with a big loss of information due to transformation of the continuous BMI to a binary variable. Quantile regression, in contrast, uses the whole information of the data. Furthermore, the interpretations of logistic and quantile regression differ. For example, logistic regression assesses the odds ratio for overweight in relation to certain risk factors, whereas quantile regression quantifies the linear impact of risk factors on overweight children.
Values for the 90 th and 97 th BMI percentiles in respect to weight gain in the first two years (in kg), estimated by GAMLSS (dark lines) and quantile regression (grey lines), with fixed values for all other covariates    In our study, the variables TV watching, maternal BMI and weight gain in the first 2 years of life were directly and meal frequency was inversely significantly associated with body composition in every examined model type. However, the strength of the associations was of different magnitude across model types (table 4).
In our study breastfeeding seemed to have a protective effect on the upper percentiles of the BMI estimated by quantile regression (e.g. -0.41 for the 90 th percentile, s. table 3), although generalized regression models and GAMLSS did not assess breastfeeding as being significantly associated with the mean BMI (although it was a significant predictor of σ). The latter is in accordance with a recent study on mean BMI and DXA derived fat mass measures [24]. Additionally, different aspects might be detected by modeling different quantiles, for example quantiles referring to underweight.
We confined our sample to cases with complete information in all variables. Since underreporting with respect to pregnancy smoking and high values of maternal BMI is well-known, this might have led to underestimation of the effects of the corresponding covariates on childhood BMI. However, such an underestimation is likely to similarly affect all examined statistical approaches and therefore be of minor relevance for assessment of the appropriate approach. It might be of interest, however, to compare how sensitive the statistical models are to several methods of missing data imputation such as multiple imputation. However, this question leads deeply into other statistical methodology and is therefore beyond the scope of our study.
GAMLSS and quantile regression have recently been compared, along with many other methods, in a WHO study to identify standard reference values for child growth [25]. Four out of five construction methods taken under further examination were GAMLSS methods with different distribution functions: Box Cox t (like in this study), Box Cox power exponential [26], Box Cox normal [27] and Johnson's SU (sinh -1 normal) [28]. The other considered method used modulus-exponential-normal distribution [29]. The authors finally calculated reference values by GAMLSS with Box Cox power exponential distribution, using AIC and GAIC(3) in parallel for model selection [30]. This indicates that GAMLSS is a very appropriate method for constructing reference curves which are based on estimated percentile curves.
In our study, a comparison of GAMLSS and quantile regression by estimated values of the 90 th and 97 th percentiles with respect to certain covariates (weight gain and meal frequency) showed similar results for both methods at the 90 th percentile, while the estimated 97 th percentile was slightly higher in the quantile regression model. Since implementation of percentile curves is existent only for univariate models in the gamlss package, some computational effort was necessary to gain the respective GAMLSS curves with fixed effects of other covariates. Furthermore, it might be worthwhile to consider nonlinear quantile regression (20) in future studies.
The statistical model that should be used, largely depends on the observed data and on the aim of the study. GAMLSS models provide exact modeling of continuous outcomes, e.g. for the calculation of standard reference values. While GLMs provide helpful information on mean response changes, GAMLSS additionally provides information on distribution parameters like scale or skewness.
On the other hand, quantile regression can be used to model specific parts of the BMI distribution such as the 90 th or 97 th percentile and should be preferred to logistic regression if the original scale of the outcome variable was continuous and a GLM or GAMLSS cannot answer the research question.