Prediction intervals for future BMI values of individual children - a non-parametric approach by quantile boosting
- Andreas Mayr^{1},
- Torsten Hothorn^{2} and
- Nora Fenske^{2}
https://doi.org/10.1186/1471-2288-12-6
© Mayr et al; licensee BioMed Central Ltd. 2012
- Received: 26 April 2011
- Accepted: 25 January 2012
- Published: 25 January 2012
Abstract
Background
The construction of prediction intervals (PIs) for future body mass index (BMI) values of individual children based on a recent German birth cohort study with n = 2007 children is problematic for standard parametric approaches, as the BMI distribution in childhood is typically skewed depending on age.
Methods
We avoid distributional assumptions by directly modelling the borders of PIs by additive quantile regression, estimated by boosting. We highlight conditional coverage as the appropriate concept for assessing the accuracy of PIs. As conditional coverage can hardly be evaluated in practical applications, we conduct a simulation study before fitting child- and covariate-specific PIs for future BMI values and BMI patterns for the present data.
Results
The results of our simulation study suggest that PIs fitted by quantile boosting cover future observations with the predefined coverage probability and outperform the benchmark approach. For the prediction of future BMI values, quantile boosting automatically selects informative covariates and adapts to the age-specific skewness of the BMI distribution. The lengths of the estimated PIs are child-specific and increase, as expected, with the age of the child.
Conclusions
Quantile boosting is a promising approach to construct PIs with correct conditional coverage in a non-parametric way. It is particularly suitable for the prediction of BMI patterns depending on covariates, since it provides an interpretable predictor structure, has inherent variable selection properties and can even account for longitudinal data structures.
Keywords
- Body Mass Index
- Quantile Regression
- Prediction Interval
- Coverage Probability
- Conditional Coverage
Background
Childhood obesity is increasingly becoming a problem of epidemic dimensions in modern societies [1, 2]. The body mass index (BMI) has proved to be a reliable measure to assess childhood obesity and can also be seen as an indicator for obesity in adulthood [3, 4]. Therefore, the prediction of future BMI values for individual children may be used as a warning bell for clinicians, parents and children. Predicting future BMI values raises awareness of problems to come - as long as they are still avoidable - and can thus lower the risk of later obesity.
In this setting, we focus on obtaining reliable predictions for future BMI values of children. Prediction intervals (PIs) offer information on the expected variability by providing not only a point prediction but a covariate-specific interval which covers the future BMI for this individual child with high probability. We construct child-specific prediction intervals for the LISA study, a recent German birth cohort study with 2007 children [5]. Data include up to ten BMI values per child from birth until the age of 10, as well as variables that are discussed to be potential early childhood risk factors for later obesity, such as breastfeeding, maternal BMI gain and smoking during pregnancy, parental overweight, socioeconomic factors, and weight gain during the first two years [6, 7]. In our analysis, we first construct PIs for the children's BMI at approximately the age of four, relying on data available for the children at the age of two. In a second step, we explore the longitudinal structure of the present data and construct PIs for child-specific BMI patterns from two up to ten years.
Predicting child-specific BMI values is a great challenge from two different perspectives. From the epidemiological perspective, it is difficult to predict BMI values as they depend on factors which are hard to measure, such as physical activity, healthy nutrition, and lifestyle habits. From the statistical point of view, the distribution of BMI values is typically skewed and the degree of skewness depends on the children's age, see e.g. Beyerlein et al. [8]. This makes standard strategies to construct PIs, which rely on distributional and homoscedasticity assumptions, problematic.
In these standard parametric approaches, a point prediction for the future BMI value is first estimated based on mean regression models with Gaussian distributed errors; then a symmetric PI is constructed around that point based on distributional assumptions. For predicting BMI values, however, these approaches are problematic for two reasons: the model assumptions for the point prediction might not be fulfilled, and the length of the PI depends on an assumed fixed variance which does not reflect the reality of an age-specific BMI skewness [9]. One possibility to overcome these problems would be the use of more sophisticated parametric approaches, for example generalized additive models for location, scale and shape ("GAMLSS" [10]). GAMLSS model up to four parameters of the conditional response distribution and could therefore take age-specific skewness into account. This model class has already been used for constructing PIs in combination with boosting [11]. However, the construction of PIs based on GAMLSS depends entirely on the assumed distribution, and the interpretation of covariate effects with respect to the interval borders is not straightforward.
We avoid making distributional assumptions here by developing a new approach to constructing non-parametric prediction intervals based on quantile boosting. Instead of constructing intervals around a point prediction, the new approach directly models the interval borders by additive quantile regression [12]. The borders are fitted as BMI quantiles conditional on the child-specific covariate combination. We use quantile boosting for the estimation [13], which offers the advantage of flexible and interpretable covariate effects and an intrinsic variable selection property (which is particularly useful in high-dimensional data settings). The size of the resulting PIs is not fixed but depends on covariates - in longitudinal settings it might also depend on child-specific effects (corresponding to "random effects" in linear and additive mixed models).
During the work on this paper, we found a severe pitfall in the validation of prediction intervals. The appropriate measure for validating PIs is conditional coverage, not the more intuitive sample coverage. Conditional coverage, however, can hardly be evaluated in practice in almost any data setting. The only way to demonstrate the correctness of PIs is therefore an empirical evaluation with simulated data. Thus, in a first step we evaluate the correctness of our approach in a set of simulation studies before applying quantile boosting to predict future BMI values.
Methods
Prediction intervals by conditional quantiles
The resulting PI should cover a new observation y _{new} with probability (1 - α) while its length depends on x _{new}. There might be combinations of covariates that allow for a very precise prediction for y _{new}, resulting in a narrow interval, whereas wide intervals imply that for a given x _{new} the prediction is less precise. As the estimates ${\widehat{q}}_{\tau}\left(x\right)$ depend on a training sample (y _{1}, x _{1}), ..., (y _{ n }, x _{ n }), which are realizations of random variables Y and X, the boundaries of the intervals themselves can be seen as random variables. This is analogous to confidence intervals, which usually should cover unknown but fixed parameters. The boundaries of confidence intervals depend on the underlying sample and thus differ from sample to sample. Yet, over repeated samples, they cover the true parameter with a probability of 1 - α. Prediction intervals are constructed in the same way, but they cover a future realization of a random variable, which itself is random. As a result, the length of a prediction interval for y _{new} is always larger than the length of a confidence interval for the expected mean of Y. Prediction intervals do not only take into account the sampling error made by the estimation based on a sample, but also the unexplained variability of Y given X = x. In conclusion, as long as Y|X = x is not deterministic, the length of the corresponding PI - in contrast to a confidence interval - does not reduce to 0, not even for infinitely large sample sizes.
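In symbols, the construction described in this section can be sketched as follows. This is a standard way of writing a quantile-based PI in the notation of the text, not necessarily the exact display used by the authors; the equal split of α between both tails is an assumption that matches the quantiles τ _{1} = 0.025 and τ _{2} = 0.975 used later for 95% PIs.

$$
\widehat{\mathrm{PI}}_{(1-\alpha)}\left(x_{\text{new}}\right)
= \left[\,\widehat{q}_{\alpha/2}\left(x_{\text{new}}\right),\;
         \widehat{q}_{1-\alpha/2}\left(x_{\text{new}}\right)\right],
\qquad
\text{length}\left(x_{\text{new}}\right)
= \widehat{q}_{1-\alpha/2}\left(x_{\text{new}}\right) - \widehat{q}_{\alpha/2}\left(x_{\text{new}}\right).
$$

Written this way, both borders and hence the length are functions of x _{new}, which is what allows narrow intervals for some covariate combinations and wide intervals for others.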
Conditional coverage vs. sample coverage
We stated that a correctly specified prediction interval PI_{(1 - α)}(x _{new}) covers a new observation y _{new} with probability π := (1 - α). To validate a method for fitting PIs, we obviously need a certain number of new observations: from a single observation (y _{new}, x _{new}) it is impossible to verify whether PI_{(1 - α)}(x _{new}) is correct. It either covers y _{new} or not, and neither outcome proves anything, at least if α is not 0. Yet, even if we have a certain number of new observations, there still exist two different interpretations of the coverage probability π:
Sample coverage
where I{·} is an indicator function.
Conditional coverage
Although sample coverage is the more intuitive interpretation of PIs, conditional coverage better reflects what we really expect from a PI. For example, after constructing a 95% PI for the BMI of a child at the age of four, given all information available for the child as a two-year-old, we expect the future BMI of this particular child with its exact measures to be covered with a probability of 95%. In frequentist language, the BMI of 95% of children with exactly the same measures should be covered by the interval. The coverage should hold for every child and every possible combination of covariates, not only on average over all children.
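The two interpretations can be placed side by side in formula form. The following is a sketch in the notation of the text, with n _{new} denoting an assumed number of new observations (y _{ i }, x _{ i }); it is meant to illustrate the distinction and may differ in detail from the authors' own displays.

$$
\text{Sample coverage:}\qquad
\frac{1}{n_{\text{new}}}\sum_{i=1}^{n_{\text{new}}}
I\left\{\, y_i \in \mathrm{PI}_{(1-\alpha)}(x_i) \,\right\} \;\approx\; 1-\alpha
$$

$$
\text{Conditional coverage:}\qquad
P\left(\, Y_{\text{new}} \in \mathrm{PI}_{(1-\alpha)}(x_{\text{new}}) \;\middle|\; X = x_{\text{new}} \right) \;=\; 1-\alpha
\quad \text{for every } x_{\text{new}}
$$

Conditional coverage implies sample coverage on average, but not the other way round: an interval can over-cover for some covariate values and under-cover for others while the average over a sample is still close to 1 - α.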
This finding leads to a severe problem, at least for multivariate prediction settings with several continuous covariates: for every combination of covariates, only one response observation will be available under almost any practical circumstances. We will only find one child for each combination of covariates (not even twins will have exactly the same measures), which is obviously not enough to verify the correct conditional coverage of a fitted PI.
Therefore, to demonstrate the correctness of a method fitting accurate prediction intervals, it is necessary to use artificial simulated data sets to evaluate the conditional coverage in (4) for a selected set of covariate combinations. Here, we will conduct a simulation study to evaluate if quantile boosting is a correct method to fit accurate conditional prediction intervals in potentially high-dimensional data settings before we apply this approach to predict future BMI values of children.
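The difference between the two coverage concepts can be made concrete with a small simulation. The sketch below is purely illustrative and not taken from the study: it assumes a toy heteroscedastic model, Y | X = x ~ N(0, (0.5 + x)^2) with x uniform on (0, 1), and a deliberately naive interval that ignores x. All names (`draw_y`, `conditional_coverage`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_y(x, rng):
    """Toy model (an assumption): Y | X = x ~ N(0, (0.5 + x)^2)."""
    return rng.normal(loc=0.0, scale=0.5 + np.asarray(x))

# Naive 95% interval that ignores x: marginal 2.5% and 97.5% quantiles of y.
x_fit = rng.uniform(0.0, 1.0, size=100_000)
lower, upper = np.quantile(draw_y(x_fit, rng), [0.025, 0.975])

def covered(y):
    """Boolean array: is each y inside the fixed interval?"""
    return (y >= lower) & (y <= upper)

# Sample coverage: average over new observations whose x varies.
x_new = rng.uniform(0.0, 1.0, size=100_000)
sample_coverage = covered(draw_y(x_new, rng)).mean()

# Conditional coverage: many new responses at one fixed covariate value.
def conditional_coverage(x0, n_rep=100_000):
    return covered(draw_y(np.full(n_rep, x0), rng)).mean()

conditional = {x0: conditional_coverage(x0) for x0 in (0.1, 0.5, 0.9)}

print(f"sample coverage: {sample_coverage:.3f}")
for x0, cov in conditional.items():
    print(f"conditional coverage at x = {x0}: {cov:.3f}")
```

In this toy setting the naive interval attains a sample coverage close to 95%, yet it over-covers where the conditional variance is small and under-covers where it is large. With real data, where each covariate combination is observed only once, the second kind of check is not available, which is why simulated data are needed.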
Quantile boosting
The index i = 1, ..., n denotes the individual, and q _{ τ }(x _{ i }) stands for the τ-quantile of the response y _{ i } conditional on its specific covariate vector x _{ i } = (x _{ i1}, ..., x _{ ip })^{⊤}. The quantile-specific additive predictor η _{ τi } is composed of an intercept β _{ τ0} and a sum of different effects of the p covariates on the quantile function. The functions f _{ τ1}, ..., f _{ τp } comprise linear effects, i.e. f _{ τj }(x _{ ij }) = β _{ τj } x _{ ij }, as well as non-linear effects whose functional form is not specified in advance. In fact, the additive predictor could also contain a wide variety of additional covariate effects, e.g. varying coefficient terms or spatial effects, as described in [13]. Note that contrary to classical regression, there is no specific distributional assumption for the response in (5). The only restriction is that the response must be continuous.
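Based on the description in this paragraph, the additive quantile regression model can be sketched as follows. The display is assembled from the symbols defined in the text (q _{ τ }, η _{ τi }, β _{ τ0}, f _{ τj }) and should be read as a reconstruction rather than the authors' exact formula.

$$
q_{\tau}(x_i) \;=\; \eta_{\tau i}
\;=\; \beta_{\tau 0} + \sum_{j=1}^{p} f_{\tau j}\left(x_{ij}\right),
\qquad i = 1, \ldots, n .
$$

Every term carries the index τ: the intercept and all covariate effects are allowed to differ between quantiles, so the lower and upper border of a PI can react differently to the same covariate.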
Standard approaches for solving the optimization problem in (6) rely on linear programming [14, 15]. Quantile regression forest [12] is a recent approach to conducting quantile regression and adapts random forest [16] to estimate the whole conditional distribution function. Since this approach is based on regression trees, the resulting estimates ${\widehat{q}}_{\tau}\left(x\right)$ - in contrast to the additive modelling approach presented here - can only be described as black-box predictions. Nevertheless, we will use quantile regression forest as benchmark in our simulation study.
We will use gradient boosting for the estimation of the additive quantile regression model in (5), and call our approach quantile boosting in the following. Quantile boosting [13] was introduced as a method to flexibly estimate additive quantile regression models. It is an adaptation of component-wise functional gradient descent boosting [17] and aims at minimizing an empirical risk criterion, as given in (6). In case of quantile regression, the appropriate loss is the check function (7).
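For reference, the empirical risk and the check function mentioned here have the following standard form in quantile regression. The notation follows the text; the factor 1/n is a convention that does not change the minimizer, and the display is a sketch rather than a verbatim copy of the authors' equations.

$$
\widehat{\eta}_{\tau} \;=\; \underset{\eta_{\tau}}{\operatorname{argmin}}\;
\frac{1}{n}\sum_{i=1}^{n} \rho_{\tau}\left(y_i - \eta_{\tau i}\right),
\qquad
\rho_{\tau}(u) \;=\; u\left(\tau - I\{u < 0\}\right) \;=\;
\begin{cases}
\tau\, u, & u \ge 0,\\
(\tau - 1)\, u, & u < 0.
\end{cases}
$$

The negative gradient of the check function with respect to the predictor, which is what a gradient boosting algorithm works with, takes only two values:

$$
-\frac{\partial \rho_{\tau}\left(y_i - \eta_{\tau i}\right)}{\partial \eta_{\tau i}} \;=\;
\begin{cases}
\tau, & y_i - \eta_{\tau i} > 0,\\
\tau - 1, & y_i - \eta_{\tau i} < 0.
\end{cases}
$$

For τ = 0.975, for example, an observation above the current fit contributes 0.975 and an observation below it contributes -0.025, which pushes the fitted curve upwards until only about 2.5% of the observations lie above it.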
The minimization of (6) is achieved by stepwise updating the predictor function η _{ τ }. For this purpose, base-learners are used, i.e. simple univariate regression models that are fitted to the negative gradient of the empirical loss (7). The base-learners play a key role in the algorithm, since they define the type of effect that each covariate has on the response. In our approach, we use simple linear models to represent linear covariate effects and penalized regression splines to represent non-linear effects. The advantage of quantile boosting is that the resulting predictor η _{ τ } is strictly additive and interpretable, following the additive quantile regression model in (5).
In detail, the boosting procedure works as follows: for each covariate, one specific base-learner is defined, and in every boosting step the algorithm updates only the effect of the covariate with the best-performing base-learner. In this way, the algorithm descends the loss by searching in the function space represented by the base-learners. If the algorithm is stopped before every base-learner has been updated at least once ("early stopping"), less important covariates will never have been updated during the boosting process and are effectively excluded from the final model. Thus, boosting comes with an inherent variable selection property and produces sparse models in potentially high-dimensional settings. It even allows for candidate models that contain more covariates than observations.
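The procedure described above can be sketched in a few lines. The following Python code is a minimal illustration, not the mboost implementation: it assumes linear base-learners with intercept only (no penalized splines), a fixed step length `nu = 0.1`, the marginal τ-quantile as starting value, and simulated homoscedastic data with one informative covariate. The function name `quantile_boost` and all parameter values are hypothetical.

```python
import numpy as np

def quantile_boost(X, y, tau, mstop=2000, nu=0.1):
    """Component-wise gradient boosting for a linear tau-quantile model.

    One simple linear base-learner (with intercept) per covariate; in each
    iteration only the base-learner that fits the negative gradient best
    is added to the predictor, shrunk by the step length nu.
    Returns (intercept, coefficients) on the original covariate scale.
    """
    n, p = X.shape
    x_mean = X.mean(axis=0)
    Xc = X - x_mean                      # centered covariates
    ss = (Xc ** 2).sum(axis=0)           # sum of squares per covariate
    b0 = np.quantile(y, tau)             # offset: marginal tau-quantile
    coef = np.zeros(p)
    eta = np.full(n, b0)
    for _ in range(mstop):
        # Negative gradient of the check loss (ties are treated as "below").
        u = np.where(y > eta, tau, tau - 1.0)
        slopes = Xc.T @ u / ss           # least-squares slope of each base-learner
        # Smallest residual sum of squares <=> largest explained sum of squares.
        j = int(np.argmax(slopes ** 2 * ss))
        b0 += nu * u.mean()              # intercept part of the base-learner
        coef[j] += nu * slopes[j]
        eta += nu * (u.mean() + slopes[j] * Xc[:, j])
    return b0 - coef @ x_mean, coef

# Simulated example (an assumption): only the first covariate is informative.
rng = np.random.default_rng(0)
n, p = 2000, 10
X = rng.uniform(0.0, 1.0, size=(n, p))
y = 2.0 * X[:, 0] + 0.5 * rng.normal(size=n)

lo_int, lo_coef = quantile_boost(X, y, tau=0.025)
hi_int, hi_coef = quantile_boost(X, y, tau=0.975)

# Borders of the 95% PI for new covariate vectors.
X_new = rng.uniform(0.0, 1.0, size=(5000, p))
y_new = 2.0 * X_new[:, 0] + 0.5 * rng.normal(size=5000)
lower = lo_int + X_new @ lo_coef
upper = hi_int + X_new @ hi_coef
coverage = np.mean((y_new >= lower) & (y_new <= upper))
print("coefficients, upper border:", np.round(hi_coef, 2))
print("sample coverage on new data:", round(float(coverage), 3))
```

Running the function with a much smaller `mstop` leaves the coefficients of covariates that were never selected at exactly zero, which is the variable selection mechanism described in the text. Note that the coverage printed here is a sample coverage; it is informative in this example only because the simulated errors are homoscedastic.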
Regarding prediction, early stopping is a desirable property, since it yields shrunk effect estimates. Shrinkage of effect estimates is a widely established method in statistical modelling [18, 19] and tends to produce a more stable solution leading to an improved prediction accuracy of the model [20–22], even though an increase of the model bias (towards underlying data) has to be accepted. The primary aim is not to minimize the loss in the underlying training sample best - resulting in a small model bias - but to get accurate predictions with a small variance for new data. Since our work focuses on predictions for future BMI values, the shrinkage effect is of high relevance in our approach and is promising in order to provide accurate PIs.
A crucial parameter that has to be tuned with care is the stopping iteration, i.e. the number of boosting iterations. It should be chosen with respect to the empirical loss in (6) on a test data sample or, if no additional data are available, by applying cross-validation techniques or bootstrapping on the training data [19, 23]. Quantile boosting is implemented within the R [24] add-on package mboost [25, 26].
Simulation study
We have already mentioned that the correct empirical validation of PIs should be based on conditional coverage. Since it is almost impossible to evaluate the conditional coverage in practical data analyses, we carried out a simulation study to provide empirical evidence that PIs fitted by quantile boosting have correct conditional coverage. As a benchmark, we used quantile regression forest [12], for which an implementation is available in the R add-on package quantregForest [12, 27].
1. Are the proposed PIs able to cover future observations with a predefined conditional coverage probability?
2. Is quantile boosting able to identify the relevant informative covariates, also in high-dimensional settings, i.e. data sets with a potentially large number of covariates?
The first lines of the model formulas represent the contribution of the covariates x _{1}, ..., x _{ p } to the expected mean of the response y, whereas the bottom line specifies their contribution to heteroscedasticity. Both settings include only four informative covariates x _{1}, ..., x _{4}. The error terms ε _{ i } were drawn independently and identically from a standard normal distribution, i.e. ε _{ i } ~ N(0,1), whereas the covariates were sampled independently and identically from a continuous uniform distribution, i.e. x _{ i1}, ..., x _{ ip } ~ U(0,1) for the linear setup and x _{ i1}, ..., x _{ ip } ~ U(0, 3) for the non-linear setup. To evaluate the ability of quantile boosting to select relevant covariates, we generated data for both settings once in a low-dimensional scenario with p = 10 and once in a high-dimensional scenario with p = 500, which therefore included 496 non-informative covariates.
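The structure described here (a mean part plus a covariate-dependent scaling of standard normal errors) corresponds to a location-scale model. The following generic sketch uses placeholder functions μ(·) and σ(·) > 0, which are assumptions for illustration and do not reproduce the specific coefficients of the two simulation setups. It shows why the true conditional quantiles, and hence the true PIs, are known in such a simulation.

$$
y_i \;=\; \mu(x_i) + \sigma(x_i)\,\varepsilon_i,
\quad \varepsilon_i \sim N(0,1)
\qquad\Longrightarrow\qquad
q_{\tau}(x_i) \;=\; \mu(x_i) + \sigma(x_i)\,\Phi^{-1}(\tau),
$$

$$
\mathrm{PI}_{0.95}(x_i) \;=\; \left[\,\mu(x_i) - 1.96\,\sigma(x_i),\;\; \mu(x_i) + 1.96\,\sigma(x_i)\,\right],
\qquad
\text{length} \;=\; 3.92\,\sigma(x_i).
$$

Here Φ^{-1} denotes the quantile function of the standard normal distribution. Because σ depends on the covariates, the length of the true PI varies with x _{ i }, which is exactly the feature a method with correct conditional coverage has to reproduce.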
in the following way: We generated in each simulation run a training sample (y _{1}, x _{1}), ..., (y _{ n }, x _{ n }), with n = 2000 observations and an additional data set with 5000 observations to select the optimal number of stopping iterations for quantile boosting. Then, we fitted additive quantile regression models and quantile regression forest for τ _{1} = 0.025 and τ _{2} = 0.975, including all p covariates.
By designing our simulation in this way, we were able to evaluate the conditional coverage of the constructed PIs and avoided the pitfall of averaging over a new sample, corresponding to the sample coverage.
Predicting childhood BMI
Data
The data contain observations from a prospective longitudinal birth cohort study (the "LISA study" [5]), including newborns between 11/1997 and 01/1999 from four German cities. Our aim is to predict future BMI values for children relying on the data available when they were two years old. Originally, the study included 3097 healthy children, of whom 2007 are complete cases in the sense that the necessary covariates at the age of two are all available for our analysis and at least one future BMI value until the age of ten is recorded. Continuous covariates from early childhood are the BMI of the child at birth (cBMI0) and as a two-year-old (cBMI2), the exact age of the child at the future measurement (cAge), the BMI of the mother at the beginning of pregnancy (mBMI) and the subsequent BMI gain during pregnancy (mDiffBMI). The categorical covariates considered are the binary variables sex of the child (cSex), the area the child is living in (cArea - rural or urban), exclusive breastfeeding until the age of four months (cBreast) and maternal smoking during pregnancy (mSmoke), as well as the maternal level of education with four levels (mEdu - increasing by level). As potential response variables, the data comprise BMI values at approximately the age of four (cBMI4), six (cBMI6) and ten (cBMI10). See [9] for further description of the LISA study.
Cross-sectional analysis
Here, q _{ τ }(x _{ i }) denotes the τ-quantile of the response cBMI4 for child i with covariate combination x _{ i }. It will represent the borders of child-specific PIs for τ _{1} = 0.025 and τ _{2} = 0.975. We included a nonlinear effect for cBMI2 and linear effects for all other covariates in our candidate model.
As a benchmark, we compared PIs resulting from our approach to black box estimates for ${\hat{\text{PI}}}_{0.95}\left({x}_{\text{new}}\right)$ from quantile regression forest. Yet, it was impossible to evaluate the conditional coverage of the PIs in our analysis as already discussed above. As a consequence, we focused on the empirical loss (6) for model comparison, which can be seen as a reliable measure not to validate but to compare algorithms fitting PIs by quantile regression. Thus, we determined the empirical loss for the two quantiles and both models in a 10-fold cross-validation analysis. The optimal stopping iteration for quantile boosting was selected by 25-fold bootstrapping on each of the 10 training data sets separately. Goodness-of-fit of the chosen models was assessed by a recent approach presented in Wei and He [28], originally developed for conditional growth charts. We generated test samples from the conditional model distribution and compared them to the observed empirical distribution of the response, see [28] for details.
Longitudinal analysis
This model contains child- and quantile-specific intercepts b _{ τ1i } and slopes b _{ τ2i } to account for the correlation between repeated measurements of the same child, which typically occurs in longitudinal data. These individual-specific "random" effects are estimated by a specially designed base-learner employing L _{1} regularization methods [30]. In connection with L _{1} regularization, quantile regression for longitudinal data was first proposed by Koenker [31]. Here, we also include individual-specific slopes and smooth non-linear effects in the flexible predictor.
Contrary to the cross-sectional analysis, cAge is included and differs for different time points. The non-linear fixed effect f _{1τ }describes an overall BMI pattern depending on age which is valid for all children, whereas the random effects b _{ τ2i }express child-specific linear deviations from this overall BMI pattern. All other covariates are time-constant. Again, we used the method presented in [28] to assess goodness-of-fit, in this case separately for the three different time points.
The optimal stopping iteration for the boosting algorithm was selected by applying subject-wise bootstrap. For this setting, it was impossible to compare quantile boosting to the benchmark algorithm, since quantile regression forest cannot account for a longitudinal data structure. Thus, we only calculated the PIs for BMI patterns of "new" children by ten-fold cross-validation. To determine child-specific PIs for those children, the child-specific intercepts and slopes were set to zero, which corresponds to their expected mean.
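A subject-wise bootstrap resamples children rather than single measurements, so that all repeated measurements of a child end up on the same side of the split. The sketch below illustrates the idea in Python with a hypothetical helper `subject_wise_bootstrap`; evaluating the loss on the children that were not drawn (the out-of-bag children) is an assumption about how such a resampling scheme is typically used, not a description of the authors' code.

```python
import numpy as np

def subject_wise_bootstrap(subject_ids, rng):
    """Draw subjects with replacement and return row indices.

    in_bag: rows of all drawn subjects (repeated as often as drawn)
    oob:    rows of subjects that were not drawn at all (out-of-bag)
    """
    subject_ids = np.asarray(subject_ids)
    subjects = np.unique(subject_ids)
    drawn = rng.choice(subjects, size=subjects.size, replace=True)
    in_bag = np.concatenate([np.flatnonzero(subject_ids == s) for s in drawn])
    oob = np.flatnonzero(~np.isin(subject_ids, drawn))
    return in_bag, oob

# Toy longitudinal layout (an assumption): 50 children, 3 measurements each.
rng = np.random.default_rng(42)
child_id = np.repeat(np.arange(50), 3)
in_bag, oob = subject_wise_bootstrap(child_id, rng)

print("rows in bootstrap sample:", in_bag.size)
print("children out-of-bag:", np.unique(child_id[oob]).size)
```

Because no child contributes rows to both sets, the out-of-bag loss mimics prediction for a "new" child, which matches the prediction task of the longitudinal analysis.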
Results
Simulation study
Table 1: Results of the simulation study
95% PIs | mboost, p = 10 | quantregForest, p = 10 | mboost, p = 500 | quantregForest, p = 500 |
---|---|---|---|---|
Linear setup | | | | |
$\widehat{\pi} \mid x_{1}$ | 0.9454 | 0.9948 | 0.9361 | 0.9997 |
$\widehat{\pi} \mid x_{2}$ | 0.9489 | 0.9689 | 0.9425 | 0.9889 |
$\widehat{\pi} \mid x_{3}$ | 0.9466 | 0.9561 | 0.9418 | 0.9609 |
$\widehat{\pi} \mid x_{4}$ | 0.9437 | 0.9307 | 0.9400 | 0.9471 |
$\widehat{\pi} \mid x_{5}$ | 0.9405 | 0.9310 | 0.9373 | 0.9534 |
Non-linear setup | | | | |
$\widehat{\pi} \mid x_{1}$ | 0.9486 | 0.9721 | 0.9662 | 0.9832 |
$\widehat{\pi} \mid x_{2}$ | 0.9494 | 0.9925 | 0.9623 | 0.9961 |
$\widehat{\pi} \mid x_{3}$ | 0.9490 | 0.9940 | 0.9521 | 0.9954 |
$\widehat{\pi} \mid x_{4}$ | 0.9460 | 0.9785 | 0.9407 | 0.9792 |
$\widehat{\pi} \mid x_{5}$ | 0.9314 | 0.8743 | 0.9171 | 0.8942 |
In conclusion, PIs fitted by quantile boosting seem to cover future observations with the predefined coverage probability, conditional on the test points. The best results can be observed in the center of the x-grid. Quantile boosting outperforms the benchmark in both setups (linear and non-linear) and in both scenarios (low-dimensional and high-dimensional). However, the evaluated simulation setups did not include interaction terms, which could have favored quantile regression forest. For our data analysis, we can rely on the result that PIs constructed by quantile regression lead to correct conditional coverage probabilities. Furthermore, we can benefit from quantile boosting since the algorithm is able to select relevant covariates and yields sparse models in high-dimensional scenarios.
Predicting childhood BMI
Data
Cross-sectional analysis
Table 2: Linear effect estimates for the LISA study: quantile boosting
Variable | Cross-sectional analysis, τ = 0.025 | Cross-sectional analysis, τ = 0.975 | Longitudinal analysis, τ = 0.025 | Longitudinal analysis, τ = 0.975 |
---|---|---|---|---|
Intercept | 14.208 | 14.867 | 14.627 | 12.723 |
cAge | -- | -- | f(·) | f(·) |
cBMI2 | f(·) | f(·) | f(·) | f(·) |
cBMI0 | 0.008 | |||
mBMI | 0.028 | 0.034 | 0.029 | 0.132 |
mDiffBMI | 0.026 | |||
cSex = male | 0.068 | |||
cArea = urban | -0.029 | -0.075 | -0.043 | |
cBreast = yes | ||||
mSmoke = yes | -0.228 | 0.296 | 0.158 | |
mEdu = 1 (low) | 0.162 | 0.162 | ||
mEdu = 2 | 0.406 | 0.176 | ||
mEdu = 3 | 0.130 | -0.107 | ||
mEdu = 4 (high) | 0.070 | -0.092 |
Longitudinal analysis
Again, level and length of the PIs are child-specific, but the lengths of PIs at the age of ten are larger than the lengths at earlier time points. This seems to be realistic as we try to predict BMI values of children at the age of ten, only relying on information available for them as two-year-olds. The mean length of the PIs of all children is 4.78 kg/m^{2}, ranging from 2.52 kg/m^{2} to 11.28 kg/m^{2}. The increased length of the intervals again results from the children getting older. This result is further emphasized by the estimated non-linear effects of cAge (presented as Figure S3 in the Additional file 1). The estimated effect for the 97.5% BMI quantile, i.e. the upper border of the PIs, is strongly increasing after the age of six, whereas the effect for the lower border remains constant. This result also corresponds to the empirical age-specific BMI distribution observed in Figure 4. Apparently, the resulting PIs reflect the risk of childhood obesity setting in somewhere after the age of six.
Effect estimates for other covariates are included in Table 2. The pattern of selected covariates roughly corresponds to the cross-sectional analysis. Even though the effect signs and sizes show minor differences for some covariates, such as mEdu, the other effects on the PI borders remain stable across analyses, including the non-linear effect of cBMI2 (Additional file 1, Figure S3), confirming the presence of these effects. Diagnostic plots (Additional file 1, Figure S4) show a satisfying goodness-of-fit of the underlying models for the ages of four and six. Poorer results are obtained for the age of ten, which reflects the limited information available for this long-term prediction.
Discussion
The aim of the present work was to construct prediction intervals for future BMI values of individual children. We pursued this aim by applying quantile boosting - a boosting approach estimating additive quantile regression models - to directly model the borders of the PIs. As a result, we do not rely on any distributional assumptions.
A main advantage of PIs fitted by quantile boosting is that we can directly interpret the estimated effects with regard to the interval borders. From the results of the cross-sectional analysis, for example, it follows that children whose mothers smoked during pregnancy have larger estimated PIs than other children. These conclusions could not have been drawn from quantile regression forest, an alternative approach to fitting non-parametric PIs, which leads to black box estimates.
The results of our simulation study suggest that quantile boosting outperforms quantile regression forest with respect to conditional coverage - which in our view is the key performance measure to evaluate PIs correctly. However, it is generally not possible to check conditional coverage in practical applications. In our data analyses, we thus had to rely on the findings from the simulation study. These findings were supported by the results of a formal comparison of empirical risks in the cross-sectional analysis, suggesting that quantile boosting provided more accurate predictions than quantile regression forest.
We could also benefit from the inherent shrinkage and variable selection properties of boosting in our analysis. Only a limited number of covariates was selected by the boosting algorithm, leading to sparse models. Note that it would even be possible to apply quantile boosting to data sets with more covariates than observations, i.e., in high-dimensional data settings. A limitation that comes with the shrinkage property is the absence of standard error estimates for the effect estimates. As a result, we cannot compute statistical tests regarding the effects of covariates, e.g. report information about their significance. Although researchers in practice often feel uncomfortable in the absence of p-values, we think that this limitation is acceptable here, as the focus is directed towards getting reliable predictions.
The resulting PIs of the longitudinal analysis emphasize further strengths of quantile boosting for fitting PIs. Relying on data available for the children as two-year-olds, we can fit accurate and child-specific PIs not only for BMI values around the age of four, but also for BMI patterns until the age of ten. Quantile boosting makes it possible to explore longitudinal data structures by including individual-specific "random" effects, emphasizing the child-specific character of the resulting PIs. Here, we could observe that the lengths of the intervals strongly increase with the age of the children. From a methodological view, this reflects what we should expect from a valid method to fit PIs: the intervals report the increasing uncertainty in the prediction of BMI values until the age of ten based only on very limited information from the children in early childhood.
The lack of covariates explaining physical activity, nutrition and lifestyle habits of the children is of course a further limitation of the presented work. It would be interesting to see if this information could help to obtain more precise predictions than those presented in this paper.
Conclusion
In conclusion, we think that quantile boosting is a promising approach to construct prediction intervals with correct conditional coverage in a non-parametric way. It can be applied to longitudinal settings and is therefore in particular suitable for the prediction of BMI patterns or similar data, where assumptions of standard parametric approaches are not fulfilled.
Declarations
Acknowledgements
The authors thank the two referees, Elaine Borghi and Xuming He, for their fair comments and suggestions on how to improve the manuscript during the review process. The authors are grateful to Joachim Heinrich, Peter Rzehak and Heinz-Erich Wichmann from the Institute of Epidemiology, Helmholtz Zentrum München (German Research Center for Environmental Health) for providing the data. In this connection, they also thank the LISA-plus Study Group [5] for their work. The LISA-plus study was funded by grants of the German Federal Ministry for Education, Science, Research and Technology (Grant No. 01 EG 9705/2 and 01 EG 9732) and the 6-years follow-up of the LISA-plus study was funded by the German Federal Ministry of Environment (IUF, FKS 20462296). Furthermore, the authors thank Benjamin Hofner for his support and advice regarding the fitting algorithms of mboost [25, 26]. The work of author AM was supported by the Interdisciplinary Center for Clinical Research (IZKF) at the University Hospital of the Friedrich-Alexander-Universität Erlangen-Nürnberg (Project J11). The authors NF and TH received support from the Munich Center of Health Sciences (MC-Health), Ludwig-Maximilians-Universität München, Germany.
Authors’ Affiliations
References
- Sassi F, Devaux M, Cecchini M, Rusticelli E: The Obesity Epidemic: Analysis of Past and Projected Future Trends in Selected OECD Countries. OECD Health Working Papers. 2009, 45:Google Scholar
- Dehghan M, Akhtar-Danesh N, Merchant A: Childhood Obesity, Prevalence and Prevention. Nutrition Journal. 2005, 4: 24-10.1186/1475-2891-4-24.View ArticlePubMedPubMed CentralGoogle Scholar
- Janssen I, Katzmarzyk P, Srinivasan S, Chen W, Malina R, Bouchard C, Berenson G: Utility of Childhood BMI in the Prediction of Adulthood Disease: Comparison of National and International References. Obesity Research. 2005, 13: 1106-1115. 10.1038/oby.2005.129.
- Whitaker R, Wright J, Pepe M, Seidel K, Dietz W: Predicting Obesity in Young Adulthood from Childhood and Parental Obesity. New England Journal of Medicine. 1997, 337 (13): 869-873. 10.1056/NEJM199709253371301.
- LISA-plus Study Group: 1998, Information about the study is available at http://www.helmholtz-muenchen.de/epi/arbeitsgruppen/umweltepidemiologie/projects-projekte/lisa-plus/index.html
- Reilly JJ, Armstrong J, Dorosty AR, Emmett PM, Ness A, Rogers I, Steer C, Sherriff A: Early Life Risk Factors for Obesity in Childhood: Cohort Study. British Medical Journal. 2005, 330: 1357-1364. 10.1136/bmj.38470.670903.E0.
- Beyerlein A, Toschke AM, von Kries R: Risk Factors for Childhood Overweight: Shift of the Mean Body Mass Index and Shift of the Upper Percentiles: Results From a Cross-Sectional Study. International Journal of Obesity. 2010, 34 (4): 642-648. 10.1038/ijo.2009.301.
- Beyerlein A, Fahrmeir L, Mansmann U, Toschke A: Alternative Regression Models to Assess Increase in Childhood BMI. BMC Medical Research Methodology. 2008, 8: 59.
- Fenske N, Fahrmeir L, Rzehak P, Höhle M: Detection of Risk Factors for Obesity in Early Childhood with Quantile Regression Methods for Longitudinal Data. Technical Report, Department of Statistics, University of Munich. 2008, 038.
- Rigby RA, Stasinopoulos DM: Generalized Additive Models for Location, Scale and Shape (with Discussion). Applied Statistics. 2005, 54: 507-554. 10.1111/j.1467-9876.2005.00510.x.
- Mayr A, Fenske N, Hofner B, Kneib T, Schmid M: GAMLSS for High-Dimensional Data - a Flexible Approach Based on Boosting. Journal of the Royal Statistical Society, Series C (Applied Statistics). 2012, [To appear].
- Meinshausen N: Quantile Regression Forests. Journal of Machine Learning Research. 2006, 7: 983-999.
- Fenske N, Kneib T, Hothorn T: Identifying Risk Factors for Severe Childhood Malnutrition by Boosting Additive Quantile Regression. Journal of the American Statistical Association. 2011, 106 (494): 494-510. 10.1198/jasa.2011.ap09272.
- Koenker R: Quantile Regression. 2005, New York: Cambridge University Press.
- Koenker R, Ng P, Portnoy S: Quantile Smoothing Splines. Biometrika. 1994, 81 (4): 673-680. 10.1093/biomet/81.4.673.
- Breiman L: Random Forests. Machine Learning. 2001, 45: 5-32. 10.1023/A:1010933404324.
- Friedman JH: Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics. 2001, 29: 1189-1232.
- Tibshirani R: Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B. 1996, 58: 267-288.
- Bühlmann P, Hothorn T: Boosting Algorithms: Regularization, Prediction and Model Fitting. Statistical Science. 2007, 22 (4): 477-505. 10.1214/07-STS242.
- Efron B: Biased Versus Unbiased Estimation. Advances in Mathematics. 1975, 16: 259-277. 10.1016/0001-8708(75)90114-0.
- Copas JB: Regression, Prediction and Shrinkage. Journal of the Royal Statistical Society, Series B. 1983, 45: 311-354.
- Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2009, Springer, 2nd edition.
- Hastie T: Comment: Boosting Algorithms: Regularization, Prediction and Model Fitting. Statistical Science. 2007, 22 (4): 513-515. 10.1214/07-STS242A.
- R Development Core Team: R: A Language and Environment for Statistical Computing. 2009, R Foundation for Statistical Computing, Vienna, Austria, http://www.R-project.org. [ISBN 3-900051-07-0]
- Hothorn T, Bühlmann P, Kneib T, Schmid M, Hofner B: mboost: Model-Based Boosting. 2010, http://R-forge.R-project.org/projects/mboost. [R package version 2.1-0]
- Hothorn T, Bühlmann P, Kneib T, Schmid M, Hofner B: Model-based Boosting 2.0. Journal of Machine Learning Research. 2010, 11: 2109-2113.
- Meinshausen N: quantregForest: Quantile Regression Forests. 2007, [R package version 0.2-2]
- Wei Y, He X: Conditional Growth Charts. Annals of Statistics. 2006, 34: 2069. 10.1214/009053606000000623.
- Wei Y, Pere A, Koenker R, He X: Quantile Regression Methods for Reference Growth Charts. Statistics in Medicine. 2006, 25 (8): 1369-1382. 10.1002/sim.2271.
- Kneib T, Hothorn T, Tutz G: Variable Selection and Model Choice in Geoadditive Regression Models. Biometrics. 2009, 65 (2): 626-634. 10.1111/j.1541-0420.2008.01112.x. [Including the web-based supplementary]
- Koenker R: Quantile Regression for Longitudinal Data. Journal of Multivariate Analysis. 2004, 91: 74-89. 10.1016/j.jmva.2004.05.006.
Pre-publication history
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/12/6/prepub
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.