Waist circumference prediction for epidemiological research using gradient boosted trees

Background: Waist circumference is increasingly recognized as a useful predictor of health risks in clinical research. However, clinical datasets tend to lack this measurement, and self-reported values tend to be inaccurate. Predicting waist circumference from standard physical features could be a viable way to generate this information when it is missing, or to mitigate the impact of inaccurate self-reports. This study determined the degree to which the XGBoost advanced machine learning algorithm could build models that predict waist circumference from height, weight, calculated Body Mass Index, age, race/ethnicity and sex; whether such models perform better than current models based on linear regression; and the relative importance of each feature in this prediction.

Methods: We trained tree-based models (via XGBoost gradient boosting) and linear models (via regression) to predict waist circumference from height, weight, Body Mass Index, age, race/ethnicity and sex (n = 60,740 participants). We created 10 iterations of each model, each using 90% of the dataset for training and the remaining 10% for testing performance (with a different test group for each iteration). We calculated model performance and feature importance as averages across the 10 iterations. We then externally validated the ensembled version of the top model.

Results: The XGBoost model predicted waist circumference with a mean bias ± standard deviation of 0.0 ± 0.04 cm and a root mean squared error of 4.7 ± 0.05 cm, with performance varying slightly by sex and race/ethnicity. The XGBoost model showed varying degrees of improvement over the linear regression models. The top 3 predictors were Body Mass Index, weight and race (Asian). External validation found that on average this model overestimated waist circumference by 4.65 cm in the United Kingdom population (mainly due to overprediction in females) and underestimated waist circumference by 1.7 cm in the Chinese population.
The respective root mean squared errors were 7.7 cm and 7.1 cm.

Conclusions: XGBoost-based models accurately predict waist circumference from standard physical features. Waist circumference prediction using this approach would be valuable for epidemiological research and beyond.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12874-021-01242-9.


Keywords: Waist circumference, Machine learning, Gradient boosted trees, Multilayer perceptron

Background

Epidemiological research studies often assess adiposity and its associated health risks from a standard set of physical features. Body Mass Index (BMI) is a measure of general adiposity; it is calculated from height and weight and assessed against age- and sex-based disease risk thresholds. It predicts generalized mortality [1] and illnesses such as colon cancer [2] and incident diabetes [3]. Newer indices like Body Shape Index and Waist-to-Height ratio have respectively improved predictions of mortality [4] and cardiometabolic risk [5] in recent years. Both indices require waist circumference. However, because the value of waist circumference has become apparent only recently, it has not been routinely collected in clinical practice [6] and is less likely to appear in clinical databases. Even when waist circumference information is present, it is frequently collected through self-report, which, despite being highly accurate for weight and height, is notoriously inaccurate for waist circumference [7]. This limits the use of these new and improved indices.
Existing methods of dealing with missing waist circumference values have limitations. Excluding subjects without waist circumference information limits sample size and in many cases affects sample representativeness; often epidemiological research depends on representative population samples. Mean replacement is undesirable because it provides no information about individual subjects. Multiple imputation is a common way of estimating missing values within a dataset based on a subject's covariate information and uses Bayesian regression techniques. Finally, some studies have used standard linear regression to deliberately model the relation between readily available covariates and waist circumference to estimate missing waist circumference values or evaluate the accuracy of self-reported waist circumference values. Bozeman et al. [8] have formalized such an approach and derived a dedicated equation to predict waist circumference based on Body Mass Index, age, sex, and race/ethnicity. While this approach was highly accurate, its accuracy may nevertheless be limited by not accounting for the possibility of non-linear relations and interactions between variables (the Bozeman approach accounts for just one non-linearity: it derives separate coefficients for female ages above and below 35). It does not consider the non-linearity of Body Mass Index with waist circumference (opting instead to exclude subjects with Body Mass Indexes over 40), and it does not consider potential interactions among predictors (e.g., with respect to race) other than for sex, which it addresses by deriving separate equations for males and females. Such relations occur frequently among variables within biological systems, but they are not frequently investigated.
This issue could be addressed with a machine learning approach. Advanced machine learning algorithms make it possible to automatically and efficiently model complex non-linear relations and interactions involving multiple predictors [9]. This automaticity is beneficial compared to traditional statistical regression, which requires each relation to be modeled explicitly, because it increases the likelihood of identifying novel complex relations (particularly subtle ones) that are not feasible to model manually. This increases the comprehensiveness of the model and thus improves prediction accuracy. The large datasets characteristic of epidemiological studies provide the statistical power necessary to identify these novel and subtle relations in clinically diverse populations [10]. Compared to traditional statistical modeling techniques, machine learning models allow more flexibility than deterministic functions in mapping inputs to outputs, and they are not subject to potentially invalid assumptions about the data, such as homoscedasticity [11]. Thus, they can potentially be more precise and generate more accurate predictions, while still providing interpretability through calculations of predictor importance. It has yet to be determined how well predictive models for waist circumference perform when built with advanced machine learning techniques.
This study investigated the degree to which the combination of height, weight, Body Mass Index (calculated from height and weight), age, race/ethnicity, and sex predicts waist circumference using machine learning. We trained a predictive model using the XGBoost machine learning algorithm and then validated its accuracy on an independent portion of the dataset that was not used in training. To determine whether this model adequately accounted for sex differences, we also trained and tested separate models for males and females and compared their performance to the sex-aggregated model. To determine whether our machine learning approach was beneficial compared to traditional statistical regression, we trained and tested 3 regression models (a linear regression using a semi-Bayesian ridge regression technique, a linear regression using the Bozeman technique, and a linear regression incorporating all possible predictor interactions) and compared their performance to our sex-aggregated XGBoost model. We evaluated model performance overall, by sex, and within sex-race groups. We anticipated that all models would yield highly accurate predictions (based on good past performance with linear models using similar predictors [8]), and that our sex-aggregated XGBoost model would perform similarly to sex-specific XGBoost models and better than any of the linear regression approaches. Finally, we externally validated an ensemble of our sex-aggregated XGBoost models on two external datasets to determine how they perform on disparate populations.

Methods

Participants
For model building and internal validation we obtained height (centimeters), weight (kilograms), age (years), race/ethnicity (mutually exclusive categories were Asian, Black, white and Hispanic (of any race), each coded as a binary variable; other/mixed race served as the baseline), sex (male/female), and waist circumference (centimeters) for 60,740 subjects from the physical examination records of Chinese patients at Nanjing Drum Tower Hospital (affiliated with Nanjing University Medical School) and from publicly available American data collected in the 1999-2017 survey years of the National Health and Nutrition Examination Survey (NHANES) [12]. The Chinese data were collected in conjunction with routine annual physical examinations, so these patients were considered generally healthy. All subjects met the following inclusion criteria: (1) only one record per participant was used (if there were multiple visits on file); (2) complete information for all variables; (3) an age of 18 or older; and (4) no outlier values (defined as more than 3 standard deviations above or below the mean). Participants provided written informed consent to have their data used in the study, and the study protocol was approved by the Nanjing University Research Ethics Board and the National Center for Health Statistics Research Ethics Review Board, respectively.
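As a concrete illustration, the inclusion criteria above can be applied programmatically. The sketch below uses pandas with hypothetical column names (`participant_id`, `age`, `weight`); the study's actual data-preparation code is not shown, so this is an assumed implementation of the stated rules, not the authors' pipeline.

```python
import numpy as np
import pandas as pd

def apply_inclusion_criteria(df, outlier_cols):
    """Apply the study's stated rules: one record per participant,
    complete cases, adults only, and a 3-standard-deviation outlier cut."""
    df = df.drop_duplicates(subset="participant_id", keep="first")  # criterion 1
    df = df.dropna()                  # criterion 2: complete information
    df = df[df["age"] >= 18]          # criterion 3: adults only
    for col in outlier_cols:          # criterion 4: drop values beyond 3 SD
        mu, sd = df[col].mean(), df[col].std()
        df = df[(df[col] - mu).abs() <= 3 * sd]
    return df

# Tiny synthetic example (column names and values are illustrative)
demo = pd.DataFrame({
    "participant_id": [1, 1, 2, 3, 4],
    "age": [30, 30, 17, 45, 50],
    "weight": [70.0, 70.0, 60.0, np.nan, 80.0],
})
clean = apply_inclusion_criteria(demo, ["weight"])  # keeps participants 1 and 4
```

Here the duplicate visit for participant 1, the under-18 participant 2, and the incomplete record for participant 3 are all excluded.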
For external validation we obtained nationally representative data for the United Kingdom from the 2000-2001 survey year of the National Diet and Nutrition Survey (NDNS) [13], and nationally representative data for China from the 2015 survey year of the China Health and Nutrition Survey (CHNS) [14]. Participants provided written informed consent to have their data used in the study. NDNS received ethics approval from the Multi-centre Research Ethics Committee (MREC) and National Health Service Local Research Ethics Committees (LRECs). CHNS received ethics approval from the Institutional Review Board of the University of North Carolina at Chapel Hill, the National Institute for Nutrition and Food Safety at the China Center for Disease Control and Prevention, and the Human and Clinical Research Ethics Committee of the China-Japan Friendship Hospital. We recalibrated sample weights to ensure representativeness by sex-race/ethnicity group after removing samples with missing data (78% of data remained in NDNS; 85% in CHNS). Waist circumference in these datasets was reliably measured by a trained professional.

Waist circumference prediction models

Model types
We trained and tested several types of models to predict waist circumference from a set of predictors that included height, weight, Body Mass Index (calculated from weight and height as kg/m²), age, race/ethnicity, and sex. For the machine learning model, we initially considered neural network-based models via the multilayer perceptron algorithm and decision tree-based models via the Random Forest and XGBoost algorithms. All are well-established supervised learning algorithms that allow modeling of non-linear relations and interactions among predictors. We ultimately selected gradient-boosted trees via the XGBoost algorithm because it performed best empirically in our preliminary investigation. We used XGBoost to train a sex-aggregated model using all predictors, as well as separate models for each sex, so that we could compare their performance and confirm that the aggregated model adequately accounted for sex differences.
To assess the potential advantages of machine learning over traditional approaches, we further trained and evaluated three regression models: semi-Bayesian ridge regression (commonly employed in multiple imputation of missing values), linear regression using the current gold-standard Bozeman technique, and linear regression incorporating all potential interactions among predictors. All three were sex-specific models so they could account for known sex differences. We used the semi-Bayesian ridge regression method (BayesianRidge function, scikit-learn 0.24.1 package; https://scikit-learn.org/) to generate a distribution of predictions and then took the mean of that distribution (it was semi-Bayesian in that an expected distribution was not supplied a priori). The Bozeman technique was a linear regression that did not include height and weight, but included separate age terms (coefficients) for females under 35 years of age and females 35 years of age or older. Our third linear regression considered all predictors and all interactions among predictors to maximize information gain from predictor interactions. Within each sex, these interactions included 2-way up to 5-way predictor-predictor interaction terms. We selected terms with coefficient p-values < .05 for inclusion in the final model; such terms had robust (statistically significant) correlations with waist circumference that are likely to generalize beyond any one sample and positively impact model performance.
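For the semi-Bayesian ridge component, scikit-learn's `BayesianRidge` fits a posterior over the coefficients, and `.predict()` returns the mean of the resulting predictive distribution, matching the "mean of the distribution of predictions" step described above. The sketch below uses synthetic data with a hypothetical linear relation between two stand-in predictors and waist circumference; it is not the study's code.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)

# Synthetic stand-ins for two of the study's predictors (illustrative only)
n = 500
bmi = rng.normal(27, 5, n)
age = rng.uniform(18, 80, n)
X = np.column_stack([bmi, age])
wc = 2.3 * bmi + 0.1 * age + 20 + rng.normal(0, 3, n)  # hypothetical relation

# BayesianRidge yields a predictive distribution; predict() returns its
# mean, and return_std=True also exposes its spread per observation.
model = BayesianRidge().fit(X, wc)
pred_mean, pred_std = model.predict(X, return_std=True)
```

In the study the same estimator would be fit per sex on the real predictors; here the point is only that the reported "mean of the prediction" falls out of the API directly.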

Training and testing paradigm
We randomly allocated participants into ten equally sized folds. We then used a 10-fold cross-validation technique to train and test 10 iterations of each model. For each iteration, 9 of the 10 folds were used to train the model and the remaining fold was used to assess model performance. Calculating accuracy on this held-out portion of the data ensured that estimates of model performance for each iteration were not influenced by overfitting to the training data. We then calculated performance as the mean across all 10 model iterations. This provided an estimate of overall performance that was generalizable to the entire dataset (rather than just one data partition) and ultimately to the populations represented by the dataset.
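The paradigm above is standard 10-fold cross-validation. A minimal sketch, with synthetic data and a plain linear model standing in for the actual learners:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1.0, 200)

kf = KFold(n_splits=10, shuffle=True, random_state=42)
fold_rmse = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])  # 9 folds to train
    resid = model.predict(X[test_idx]) - y[test_idx]            # 1 held-out fold
    fold_rmse.append(float(np.sqrt(np.mean(resid ** 2))))

mean_rmse = float(np.mean(fold_rmse))  # performance averaged over 10 iterations
sd_rmse = float(np.std(fold_rmse))
```

Reporting the mean and standard deviation of `fold_rmse` mirrors how performance is summarized across the 10 model iterations throughout the paper.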

XGBoost model and training parameters
XGBoost (Extreme Gradient Boosting; https://github.com/dmlc/xgboost) is a machine learning algorithm for building gradient-boosted decision trees and is well-suited for regression [15]. Briefly, the algorithm builds an initial decision tree. It then builds subsequent decision trees to predict the residuals (errors) remaining after applying the first tree. It scales each tree's residual prediction by a pre-defined learning rate (lower learning rates favor less variance/overfitting) and adds the scaled prediction to the current prediction to produce a slightly improved prediction. It goes on building additional trees in this way, incrementally increasing prediction accuracy until it reaches a pre-defined number of trees (determined as the number at which additional trees no longer improve performance). Extreme gradient-boosted trees (in XGBoost) are unique in that they cluster residuals into leaves (i.e., determine branch splits) based on similarity scores. Groups of leaves are pruned if the gain in the group's similarity versus the parent leaf does not exceed a user-defined value (lambda); this is a form of regularization intended to reduce overfitting. We empirically determined the best parameters to be a maximum tree depth of 5, 1000 trees, 80% sub-sampling and a learning rate of .02. The number of trees was selected empirically by increasing it by a factor of 10 until there was no additional gain in performance, and then reducing it until performance began to decline.
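The residual-fitting loop described above can be illustrated from scratch. The sketch below boosts depth-1 trees ("stumps") on a single predictor: each round fits a stump to the current residuals, scales its output by the learning rate, and adds it to the running prediction. It deliberately omits XGBoost's similarity-score splitting and lambda-based pruning, so it is plain gradient boosting rather than XGBoost itself; the study's reported XGBoost settings appear in a comment for reference.

```python
import numpy as np

def fit_stump(x, r, thresholds):
    """Depth-1 tree: pick the split minimizing squared error of residuals r."""
    best_sse, best = np.inf, None
    for t in thresholds:
        left, right = r[x <= t], r[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best = sse, (t, left.mean(), right.mean())
    return best

def boost(x, y, n_trees=300, lr=0.1):
    thresholds = np.quantile(x, np.linspace(0.05, 0.95, 19))
    pred = np.full_like(y, y.mean())                        # initial prediction
    for _ in range(n_trees):
        t, lval, rval = fit_stump(x, y - pred, thresholds)  # tree fits residuals
        pred = pred + lr * np.where(x <= t, lval, rval)     # scale, then add
    return pred

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = 3 * np.sin(x) + rng.normal(0, 0.3, 200)  # clearly non-linear target
pred = boost(x, y)
rmse = np.sqrt(np.mean((pred - y) ** 2))     # far below np.std(y)

# The study's reported XGBoost settings, for reference:
# XGBRegressor(max_depth=5, n_estimators=1000, subsample=0.8, learning_rate=0.02)
```

A single linear regression on `x` could not fit this sinusoidal target at all, which is the kind of non-linearity the additive-tree scheme captures automatically.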

Assessment of model performance
We assessed the performance of each model by calculating the mean and standard deviation of various performance metrics across the 10 iterations of each model. Performance metrics calculated for all models were root mean squared error (RMSE) and mean bias (prediction minus the reference value). They were calculated overall, for males, for females, and for each sex-race/ethnicity combination. We used a t-test to determine whether the performance of a given model statistically differed from the performance of our sex-aggregated XGBoost model (comparator model).
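Both headline metrics are straightforward to compute; a minimal sketch with made-up numbers:

```python
import numpy as np

def rmse(pred, ref):
    """Root mean squared error."""
    d = np.asarray(pred) - np.asarray(ref)
    return float(np.sqrt(np.mean(d ** 2)))

def mean_bias(pred, ref):
    """Mean bias: prediction minus the reference value, as defined above."""
    return float(np.mean(np.asarray(pred) - np.asarray(ref)))

pred = np.array([90.0, 101.0, 85.0])   # predicted waist circumferences (cm)
ref = np.array([92.0, 100.0, 84.0])    # measured reference values (cm)
# rmse(pred, ref) → about 1.41 cm; mean_bias(pred, ref) → 0.0 cm
```

Note that the two metrics are complementary: here the errors cancel to a zero mean bias even though the RMSE shows each prediction is off by 1-2 cm.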
For the sex-aggregated XGBoost model we determined additional performance metrics and model details. We calculated the mean ± 95% confidence interval for mean bias, error standard deviation and Pearson correlation. The Pearson correlation and its standard deviation were calculated by taking the square root of the mean explained variance and its standard deviation across all 10 model iterations. We visualized performance over the full range of waist circumferences using a scatter plot and a Bland-Altman plot to assess correlation and agreement, respectively. Finally, we calculated the proportional importance of each model predictor using a function built into the XGBoost analytical package and reported its mean and 95% confidence interval.

External validation
We further assessed the performance of the sex-aggregated XGBoost model on two nationally representative external datasets: the National Diet and Nutrition Survey (United Kingdom) [13] and the China Health and Nutrition Survey (The People's Republic of China) [14]. We took the mean of the predictions from each of the 10 versions of the model (each derived on a different fold of the training dataset) to generate an ensembled prediction. We then calculated the RMSE and mean bias overall, by sex, and by sex-race group for each dataset. These calculations incorporated sample weights and thus were nationally representative.
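The ensembling step is simply the mean of the 10 fold-specific models' predictions. A self-contained sketch with stub models (in the study these would be the 10 trained XGBoost regressors):

```python
import numpy as np

class ShiftModel:
    """Stand-in for one trained fold model; predicts input plus a fixed offset."""
    def __init__(self, offset):
        self.offset = offset
    def predict(self, X):
        return np.asarray(X, dtype=float) + self.offset

def ensemble_predict(models, X):
    # Average the per-fold predictions, mirroring the ensembling step above
    return np.mean([m.predict(X) for m in models], axis=0)

models = [ShiftModel(o) for o in (-1.0, 0.0, 1.0)]
out = ensemble_predict(models, [10.0, 20.0])
# out → [10.0, 20.0] (the offsets cancel on average)
```

Averaging across fold models in this way reduces the variance contributed by any single training split.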

Results

XGBoost outperforms regression models
We assessed whether the waist circumference model trained with XGBoost would outperform linear regression models in terms of error (RMSE) or mean error (bias). We found that the XGBoost model trained on both males and females (sex-aggregated) achieved an RMSE of 4.70 ± 0.05 cm and a bias of 0 ± 0.04 cm overall (a further breakdown of performance by sex and race is shown in Table 1). The two models trained on each sex separately did not perform statistically better than the sex-aggregated model, so we used the sex-aggregated model moving forward.
The root mean squared error (RMSE) of all models was either equivalent to or higher than that of the sex-aggregated XGBoost model overall, by sex, and across all sex-race/ethnicity categories. The semi-Bayesian ridge regression and Bozeman regression models had higher RMSEs overall (4.89 ± 0.05 cm; p < .001 and 5.01 ± 0.06 cm; p < .001, respectively). RMSEs in the Bozeman regression models were as much as 12% higher than in the XGBoost model for some sex-race/ethnicity categories. Our linear regression model with all predictor interactions performed only marginally worse than our sex-aggregated XGBoost model. Its overall RMSE (4.72 ± 0.05 cm) was statistically equivalent. It was statistically higher for the Asian female group, although the difference was small (only 3%). The mean bias of all models was minimal (< 1 cm), except for the Bozeman regression, which overpredicted waist circumference by 1.2 cm in males of mixed/other race. Overall, the relative superiority of XGBoost in terms of error justifies its use in place of linear regression where the minimization of error is concerned.

Model performance details
Overall, the sex-aggregated model performed similarly from one validation split to the next (see Fig. 1 and Supplemental Table 1). The mean Pearson correlation (95% confidence interval) between observed and predicted values was 0.945 (0.944-0.946) and the mean bias (95% confidence interval) ± error standard deviation (95% confidence interval) was −0.003 (−0.032 to 0.027) ± 4.70 (4.67-4.73) cm over all 10 iterations of the model. Predictions tracked observed values well in all validation splits across the full range of waist circumferences, as assessed visually by correlation and agreement (Fig. 1). For females specifically, the mean Pearson correlation (95% confidence interval) was 0.942 (0.940-0.944) and the mean bias (95% confidence interval) ± standard deviation (95% confidence interval) was −0.018 (−0.058 to 0.095) ± 5.41 (5.35-5.48) cm over all 10 iterations of the model. Predictions tended to track observed values well across the range of waist circumferences, as assessed by correlation and agreement (Fig. 2). Asian females tended to have smaller waist circumferences (Fig. 2) and proportionally smaller errors (Table 1). The largest errors tended to occur in Black females (Table 1).
For males specifically, the mean Pearson correlation (95% confidence interval) between observed and predicted values was 0.947 (0.947-0.949) and the mean bias (95% confidence interval) ± standard deviation (95% confidence interval) was −0.014 (−0.061 to 0.034) ± 4.05 (4.01-4.08) cm over all 10 iterations of the model. Predictions tracked observed values well across the full range of waist circumferences, as assessed by correlation and agreement (Fig. 3). As with females, Asian males tended to have smaller waist circumferences (Fig. 3) and proportionally smaller errors (Table 1). Hispanic males also had proportionally smaller errors (Table 1).

Feature importance and model details
We next calculated the mean proportional importance (± 95% confidence interval) of all predictors in the sex-aggregated XGBoost model across all 10 iterations of the model. Body Mass Index (65.6% ± 0.3%), weight (16.8% ± 0.2%) and Asian race (5.9% ± 0.1%) were the most important features for predicting waist circumference (Fig. 4). The remaining features (white race, age, sex, Hispanic ethnicity, Black race and height) each had less than 5% feature importance. For a sample decision tree diagram, see Supplementary Fig. 1.

External validation of aggregated XGBoost model
We took the mean of the predictions from the 10 versions of the sex-aggregated XGBoost model to produce a single (ensembled) waist circumference prediction, and then evaluated the accuracy of this overall model in representative datasets of the United Kingdom and China (Table 2). We found that waist circumference was overestimated by 4.63 cm on average in the United Kingdom multi-racial sample; this positive bias arose mainly from a tendency to overpredict in females. There was a small negative bias of −1.7 cm overall in the Chinese sample (Table 2).

Discussion
Our XGBoost model predicted waist circumference with high accuracy from height, weight, Body Mass Index, age, race/ethnicity and sex information. It performed well across the various sex-race/ethnicity groups, and very consistently across all 10 model iterations. Together, these features were highly predictive of waist circumference, and the XGBoost algorithm is an effective way to model these relations. This was expected given that similar predictors already produced good performance in prior work using linear models [8]. We found that XGBoost performed significantly better than semi-Bayesian ridge regression and the Bozeman linear regression approach [8], likely because it identified non-linear relations or interactions that accounted for additional variance; such relations (particularly non-linearities) are often subtle and can be difficult to deliberately identify and input as features. There was also improvement, to a lesser degree, over the linear regression model that considered all possible predictor interactions; the improvement in this case was likely smaller because that model already accounted for additional variance through its interaction terms. Sex-aggregated model performance for males and females did not significantly differ from sex-specific model performance, as expected. This suggests that sex-based differences in waist circumference were adequately accounted for in the sex-aggregated model and that it can be used going forward.
Finally, we found that an ensemble of the sex-aggregated XGBoost models generalized reasonably well overall to a representative multi-racial sample of the United Kingdom; the model performed worse in females (particularly Hispanic females). The ensemble performed quite well in a representative sample of the Chinese population. The tendency to overestimate waist circumference in United Kingdom females (particularly of white, Hispanic and mixed/other backgrounds) did not extend to the Chinese sample, where the overall bias was small and negative.

Limitations
Even greater accuracy might be achieved by further optimizing the machine learning algorithm via its hyperparameters. Additional features could also be considered to see whether they add information about waist circumference and thus boost accuracy. The models in the current study are based on multiple races/ethnicities in participants with American and Chinese nationalities; further work along these lines could investigate performance in additional races/ethnicities and nationalities to determine how well this model generalizes beyond the groups studied. Nevertheless, the technique we have utilized in this study is likely to be effective for training generalizable predictive models using a diverse set of participants. It further remains to be determined whether the relative performance of the methods in the current study will generalize to still other training datasets. We anticipate it would generalize well given the high degree of heterogeneity in the current training data, which encompasses significant biological diversity by being multi-racial and significant sociocultural diversity by drawing on both the United States and China (two highly disparate populations). Further, the model performed fairly well when externally validated on datasets from the United Kingdom and China. Further evaluation in additional populations will be necessary to determine the accuracy of this approach elsewhere and to potentially identify additional relevant factors that could affect waist circumference prediction.

Application
The high accuracy of the sex-aggregated XGBoost model makes it suitable for various purposes. First, it can be used to fill in missing waist circumferences in medical datasets. Care should be taken with such an approach, since individuals who are missing waist circumference or its predictor information may be statistically different in some way from those who are not (e.g., if their data are missing not at random). Such a hypothesis could be tested by using logistic regression to try to separate subjects with missing waist circumferences from subjects without missing waist circumferences using auxiliary variables, and then recalibrating sample weights on non-missing observations accordingly to maintain sample representativeness. Second, rather than filling in missing data, the current model could be used to screen existing waist circumference measurements for errors, given that self-reported waist circumference measurements tend to be inaccurate [7]. In this case, it is beneficial to build a well-defined, externally validated model using datasets with reliable waist circumference information and apply it to the target dataset. If predicted waist circumference is used as a covariate in a regression model or mathematical equation, consideration should be given to how the error in the waist circumference prediction carries forward into the accuracy of the overall model or equation.
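The missingness check suggested above could be sketched as follows: fit a logistic regression that tries to separate rows with missing waist circumference from complete rows using auxiliary covariates, and inspect how well it discriminates. This is an assumed illustration on synthetic data (with missingness random by construction), not an analysis from the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)

# Hypothetical auxiliary covariates and a missingness flag for each row
n = 1000
X_aux = rng.normal(size=(n, 3))
is_missing = rng.binomial(1, 0.2, n)   # random by construction in this sketch

clf = LogisticRegression().fit(X_aux, is_missing)
auc = roc_auc_score(is_missing, clf.predict_proba(X_aux)[:, 1])
# An AUC near 0.5 suggests the covariates cannot separate missing from
# non-missing rows; a clearly higher AUC flags a representativeness
# concern worth addressing (e.g., by recalibrating sample weights).
```

Because the flag here is independent of the covariates, the fitted classifier should discriminate only marginally better than chance.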
Finally, we acknowledge that since the accuracy of the XGBoost model is only marginally better than that of the linear regression with all interaction terms, it may remain preferable in some applications to sacrifice this small degree of added accuracy in favor of the simpler linear regression model.
Waist circumference can ultimately be used to help predict general mortality [1] and specific illness such as cancers [2], cardiovascular diseases [5] and incident diabetes [3]. Future studies could assess the impact of using predicted waist circumference in evaluating health risks.
Accurate waist circumference prediction achieved using machine learning could greatly aid future epidemiological research, particularly when attempting to predict long-term health outcomes.

Conclusions
This study demonstrates that the XGBoost machine learning algorithm produces accurate waist circumference prediction models with minimal input data: height, weight, Body Mass Index (calculated from height and weight), age, race/ethnicity and sex. This approach could be immensely useful in epidemiological research, in health-evaluative tools, and potentially even for non-health-related uses.