Variable selection in social-environmental data: sparse regression and tree ensemble machine learning approaches

Background Social-environmental data obtained from the US Census is an important resource for understanding health disparities, but rarely is the full dataset utilized for analysis. A barrier to incorporating the full data is a lack of solid recommendations for variable selection, with researchers often hand-selecting a few variables. Thus, we evaluated the ability of empirical machine learning approaches to identify social-environmental factors having a true association with a health outcome. Methods We compared several popular machine learning methods, including penalized regressions (e.g. lasso, elastic net) and tree ensemble methods. Via simulation, we assessed the methods’ ability to identify census variables truly associated with binary and continuous outcomes while minimizing false positive results (10 true associations, 1000 total variables). We applied the most promising method to the full census data (p = 14,663 variables) linked to prostate cancer registry data (n = 76,186 cases) to identify social-environmental factors associated with advanced prostate cancer. Results In simulations, we found that elastic net identified many true-positive variables, while lasso provided good control of false positives. Using a combined measure of accuracy, hierarchical clustering based on Spearman’s correlation with sparse group lasso regression performed the best overall. Bayesian Additive Regression Trees outperformed other tree ensemble methods, but not the sparse group lasso. In the full dataset, the sparse group lasso successfully identified a subset of variables, three of which replicated earlier findings. Conclusions This analysis demonstrated the potential of empirical machine learning approaches to identify a small subset of census variables having a true association with the outcome, and that replicate across empirical methods.
Sparse clustered regression models performed best, as they identified many true positive variables while controlling false positive discoveries. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-020-01183-9.


BACKGROUND AND SIGNIFICANCE
The Precision Medicine Initiative suggests that environment, along with genes and lifestyle behaviors, should be considered for cancer treatment and prevention. Nevertheless, the impact of social environment, or the neighborhood in which a person lives, remains understudied.(1) Compared to the biological level, where empirical, high-dimensional computing approaches like genome-wide association studies (GWAS) are often used for hypothesis-generation, risk prediction, and variable selection, empirical methods are only beginning to be employed at the environmental level.(2,3) Social environment, as defined by a patient's neighborhood of residence, is particularly relevant to the study of cancer health disparities. Neighborhood boundaries can be defined by U.S. Census tracts (smaller geographic areas than a county). These neighborhoods can be described by variables measuring economic (e.g., employment, income); physical (e.g., housing/transportation structure); and social (e.g., poverty, education) characteristics.(4,5) Studies linking U.S. Census data with state and national cancer registry data show that neighborhood can help explain differential cancer incidence and mortality rates beyond race/ethnicity or genetic ancestry, and that neighborhood environment often exerts independent effects on cancer outcomes.(6,7) Methodological challenges have limited the incorporation of neighborhood data into Precision Medicine.
Most studies use a priori variable selection approaches, but there are no standard variables to represent particular domains (e.g. poverty, education, employment, etc.), which has limited translation of social environmental variables into clinical use. In the few studies using empirical selection approaches, variable selection and replication of findings were complicated by the high degree of correlation among U.S. Census variables. For instance, similar to a GWAS, we previously conducted a neighborhood-wide association study (NWAS) in both black and white men in Pennsylvania and agnostically identified 22 U.S. Census variables (out of over 14,000) significantly associated with advanced prostate cancer.(3) In the first NWAS, social support was identified as an important neighborhood domain, but 2 very similar variables were identified to represent this domain (% male householders living alone vs. % male householders over 65 living alone in a non-family household). Thus, multicollinearity (the presence of many highly interrelated variables) is a challenge for variable selection and replication.
The systematic assessments offered by machine learning algorithms, which allow for high dimensionality and collinearity, may be useful for analyses of neighborhood data. In this manuscript, we broadly use the term "machine learning" to refer to any computational method which selects variables automatically, without direct input from a human analyst. While the main objective of machine learners is often predictive accuracy, an additional objective is variable selection and determining which features are truly important. This is analogous to the goals of variant discovery vs. risk prediction in genetic studies.(8,9) Motivated by the previous NWAS of prostate cancer cases in Pennsylvania, we sought to understand which machine learning algorithms are most effective for identification of neighborhood factors which have a true association with a health outcome. Machine learning algorithms are often judged by comparing predicted vs. observed outcomes in an independent test set. We cannot use this paradigm to evaluate variable selection, however, as the true underlying variables associated with a given outcome are unknown. This motivated use of a simulation study, where we generated outcomes that are dependent on a small subset of the potential predictor variables. We then applied several popular machine learning approaches, including lasso, elastic net, hierarchical clustering, and random forests, and assessed how well each method identified true positive variables while minimizing false positives. We compared the results to traditional regression with correction for multiple testing.
Finally, we applied the top performing machine learning approaches to our original NWAS dataset, and compared findings from these analyses to our first NWAS in white men.

Candidate Methods for discovery of important variables
Below, we describe methods for variable selection where both p, the number of potential predictor variables, and N, the number of observations, are large, and discuss how these methods can be applied to analysis of neighborhood-level covariates. We identified methods which provide objective and automatic variable (feature) selection for both continuous and binary outcomes. We also limited our evaluation to methods with largely automated tuning, which are readily implemented using standard R packages, and which one can run within reasonable timeframes. Ultimately, the methods we identified fell within two broad categories: penalized models and ensemble tree-based methods.

Standard regression models
The simplest approach to variable selection is similar to the GWAS and NWAS approach.(10) A series of univariable tests are conducted to determine the relationship between each possible predictor and the outcome.
Variables which are statistically significant after correction for multiple testing (11) (i.e. "top hits") are then replicated in a separate set of samples.(12) Although this approach is simple and easy to implement, the separate regression models ignore any correlation structure between candidate predictors. This may lead to selection of a large number of highly correlated variables, necessitating further variable selection steps, as described in the NWAS manuscript.(10) We included this approach as a baseline for comparison, to demonstrate the degree of improvement more advanced methods can provide.
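As a point of reference, the univariable screen with multiple-testing correction can be sketched in a few lines. The paper's analyses were run in R; the snippet below is a Python illustration on synthetic data (the sample sizes, effect size, and use of a Pearson correlation test are our assumptions, not the study's):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 500, 200
X = rng.normal(size=(n, p))
y = 0.5 * X[:, 0] + rng.normal(size=n)  # only variable 0 is truly associated

# One univariable test per candidate predictor (the Pearson correlation test
# is equivalent to the slope t-test in simple linear regression).
pvals = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(p)])

# Bonferroni correction: declare "top hits" at alpha / p.
alpha = 0.05
hits = np.flatnonzero(pvals < alpha / p)
```

In a real two-stage design, the hits would then be re-tested in a withheld validation sample, as described above.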

Sparse regression models
Penalized regression addresses some of the limitations of standard regression for high-dimensional data. A useful class of these models provides shrinkage which enforces sparsity; that is, many of the parameter estimates are shrunk to exactly zero.(13) Sparse models have several advantages over traditional regression, including reduced overfitting (which improves prediction), accommodating multicollinearity, and the ability to fit models where p > n. They can also be used for variable reduction, where a zero parameter estimate indicates that the variable is not an important predictor.

Lasso penalized regression
The Least Absolute Shrinkage and Selection Operator (lasso) includes an L1-norm (absolute value) penalty that shrinks many parameter estimates to exactly zero.(13,14) Thus, variables with non-zero coefficients can be considered the important predictors for the outcome of interest.
For a linear regression, the lasso solution is found as

$$\hat{\beta} = \underset{\beta}{\arg\min}\left\{\frac{1}{2N}\|Y - X\beta\|_2^2 + \lambda\sum_{j=1}^{p}|\beta_j|\right\} \quad (1)$$

where N is the number of observations, p is the number of parameters, Y is a vector of outcomes, X is a N x p matrix of covariates, and β is a vector of effects. The size of the penalty is determined by λ, which can be found via cross-validation to minimize prediction error. Alternatively, one can choose a stricter threshold for λ at 1 standard error above the minimum prediction error (to conservatively allow for error in the estimate of the optimal λ).(13) Although the lasso can find a solution under multicollinearity, if a group of highly correlated variables is present the lasso tends to arbitrarily select one variable and set the others to zero.(15)
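A minimal sketch of lasso variable selection follows, using scikit-learn's LassoCV in place of the R tooling used in the paper; the simulated design (five true predictors out of 100) is an illustrative assumption:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, p = 300, 100
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 1.0                       # five truly associated variables
y = X @ beta + rng.normal(size=n)

# Cross-validation picks the penalty (the lambda of the text, `alpha_` here)
# to minimize prediction error.
fit = LassoCV(cv=5, random_state=0).fit(X, y)

# Variables with non-zero coefficients are the selected predictors.
selected = np.flatnonzero(fit.coef_)
```

Note that scikit-learn's LassoCV has no built-in 1-standard-error rule; that stricter choice of λ would need to be computed manually from `fit.mse_path_`.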

Elastic net
The elastic net was proposed to overcome some of the limitations of the lasso method. It uses a combination of the L1 lasso penalty and the L2 ridge penalty:

$$\hat{\beta} = \underset{\beta}{\arg\min}\left\{\frac{1}{2N}\|Y - X\beta\|_2^2 + \lambda\left[\alpha\sum_{j=1}^{p}|\beta_j| + \frac{1-\alpha}{2}\sum_{j=1}^{p}\beta_j^2\right]\right\} \quad (2)$$

where 0 ≤ α ≤ 1, and other parameters are defined as above in (1). The choice of α determines whether the penalty is closer to a ridge penalty (α = 0) or a lasso penalty (α = 1). The choice of both α and λ can be determined via cross-validation.(15) Unlike the lasso, when predictors are collinear, the elastic net tends to classify groups of highly correlated variables as all either zero or nonzero. In many cases the elastic net provides better performance than the lasso.(15)
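The grouping behavior can be seen in a small sketch (again a scikit-learn illustration on assumed data; the paper's own analysis used R). Here five nearly collinear copies of a latent signal compete with independent noise variables:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(2)
n = 300
z = rng.normal(size=n)
# Five highly correlated true predictors plus 95 independent noise variables.
X_true = z[:, None] + 0.1 * rng.normal(size=(n, 5))
X = np.hstack([X_true, rng.normal(size=(n, 95))])
y = z + rng.normal(size=n)

# Cross-validate over both the mixing parameter (l1_ratio, the paper's alpha)
# and the penalty strength lambda.
fit = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(fit.coef_)
```

With a ridge-leaning l1_ratio, the correlated group tends to enter or leave the model together rather than being represented by a single arbitrary member.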

Sparse models with clustering
Hierarchical clustering is a way of grouping variables with similar behavior across observations.
Agglomerative clusters are built from the bottom up by joining the "closest" clusters at each step according to defined distance and linkage functions, and the distance becomes the "height" at which clusters are joined. For the census data we propose to define distance as one minus the absolute value of the Spearman correlation coefficient. Complete linkage is a useful choice here as it maintains the original scale of the distance measures (in this case, from 0 to 1), and the height is therefore interpretable. The resulting clusters are represented via a dendrogram (see Additional file 1).(16) Cluster membership can be defined by cutting the dendrogram at a specific height, so that any observations that are joined at a height lower than that value are members of a cluster. A more objective method is to identify statistically significant clusters via a bootstrap.(17) This method resamples participants to identify which clusters of variables often appear, measuring stability. Note that with either method, many clusters may contain only one variable. If the number of clusters is small relative to n, clustered variables can be summarized into a single measure, and models can then be fit using multiple regression models. However, if a large number of clusters are present, a better choice is to use cluster membership to fit group lasso or sparse group lasso models. The sparse group lasso is particularly useful, as it has penalties at both the group and individual level, allowing for sparsity across and within groups.(18-20)
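The clustering step can be sketched as follows (a Python/SciPy illustration on assumed data; the paper's analyses used R, and the sparse group lasso fit itself, which has no standard scikit-learn implementation, is not shown). Distance is one minus the absolute Spearman correlation, with complete linkage and the dendrogram cut at height 0.2, i.e. within-cluster |correlation| > 0.8:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(3)
n = 200
z = rng.normal(size=n)
# Variables 0-2 form one tight cluster; variables 3-9 are unrelated noise.
X = np.column_stack([z + 0.05 * rng.normal(size=n) for _ in range(3)]
                    + [rng.normal(size=n) for _ in range(7)])

# Distance = 1 - |Spearman correlation| between variables (columns).
rho, _ = spearmanr(X)
dist = 1.0 - np.abs(rho)
np.fill_diagonal(dist, 0.0)

# Complete linkage keeps heights on the original 0-1 distance scale.
Z = linkage(squareform(dist, checks=False), method="complete")

# Cut the dendrogram at height 0.2; singleton clusters are allowed.
labels = fcluster(Z, t=0.2, criterion="distance")
```

The resulting cluster labels would then define the groups passed to a sparse group lasso, e.g. via the SGL package in R.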

Tree ensemble methods
Another group of popular machine learning methods is based on tree ensembles, where many decision trees are fit to the data. Decision trees rely on recursive binary partitioning; that is, at each step (node) in the tree, the observations are split into two daughter nodes depending on some function of the predictor variables. Often, methods aggregating many trees (ensembles) outperform single-tree-based methods.(21) The random forests method is useful for high-dimensional data. Underlying the method are many Classification and Regression Trees run on bootstrap samples of the dataset.(16) The relative impact of each variable on prediction accuracy is characterized using the variable importance score (VIMP), calculated by permuting each variable and re-fitting the random forest, and assessing how this impacts accuracy. VIMP scores provide a way of ranking predictors relative to each other, but choosing a threshold for the VIMP is often done post-hoc.
Recently, Ishwaran and Lu (2019) (22) proposed a resampling based calculation of the standard errors for the VIMP.We propose to use this standard error to inform variable selection via the confidence interval.Based on the estimated VIMP scores and their standard errors, we can create 100*(1-α/p)% confidence intervals for each variable.If the confidence interval excludes zero, we can conclude that there is evidence that the variable improves prediction, and therefore infer that there is a relationship between that variable and the outcome of interest.
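The idea of interval-based selection can be illustrated in Python. Note this sketch uses scikit-learn's repeated permutation importance on a held-out test set to obtain a mean and standard error per variable; it is not the Ishwaran-Lu subsampling estimator itself (available in the R package randomForestSRC), and the data-generating model is an assumption for illustration:

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n, p = 400, 20
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(size=n)  # only variable 0 matters

Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(Xtr, ytr)

# Repeated permutations on held-out data give a mean importance and a
# standard error for each variable.
n_repeats = 30
imp = permutation_importance(rf, Xte, yte, n_repeats=n_repeats, random_state=0)
se = imp.importances_std / np.sqrt(n_repeats)

# Bonferroni-style 100*(1 - alpha/p)% intervals; keep variables whose lower
# confidence bound excludes zero.
alpha = 0.05
zcrit = stats.norm.ppf(1 - alpha / (2 * p))
selected = np.flatnonzero(imp.importances_mean - zcrit * se > 0)
```

Noise variables have importance intervals that straddle zero, so only the truly predictive variable tends to survive the threshold.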

Bagging
Bagging was a predecessor to random forests, and can be thought of as a special case. In standard random forests, at each node a random subset of variables (often chosen to be p/3) is evaluated as candidates for splitting. In bagging, all p variables are considered for possible splitting. This tends to yield a smaller subset of variables with high VIMP scores, which may be better for our purposes of identifying the best variables.(21) We note that, as this is a special case of random forests, we can use the same resampling-based approach to define confidence intervals for the VIMP scores and accomplish variable selection.

Bayesian Additive Regression Trees (BART)
Like Random Forests and Bagging, BART is a tree ensemble method; however, it builds a set of trees using repeated draws from a Bayesian probability model. Similar to the VIMP of random forests, the relative importance of a given variable can be characterized using the variable inclusion proportion, defined as the number of splitting rules based on the variable out of the total number of splits. We can obtain an empirical estimate of the null distribution for the variable inclusion proportions by permuting the outcomes and re-fitting the BART algorithm. After these are obtained, three thresholds for variable selection have been proposed. The first, the local threshold, uses the null distribution of each individual variable: if the fitted inclusion proportion is greater than its 1-α quantile under the null, that variable is selected. A more restrictive criterion (Global SE) increases the threshold using the local mean and standard deviation with a global multiplier determined from the permutation distribution of all variables. The most restrictive criterion (Global Max) requires that the inclusion proportion be greater than the 1-α quantile of the maximum inclusion proportions (across all variables) from each permutation.

Simulation study
Machine learning methods are typically evaluated by their ability to predict outcomes. In this study, we are interested in a different question: how well does each method correctly identify the subset of census variables which are truly associated with the outcome of interest? Therefore, we conducted a simulation-based experiment, generating outcomes which have known associations with a small subset of census variables.
The data structure of the census variables is complex; measures may exhibit non-normal distributions, contain excess numbers of zeros, and some variables are highly collinear (see Additional File 2).To fully reflect this complexity, we used observed census variables for PA prostate cancer cases (10) as the basis of our simulation.
For computational tractability, we randomly selected 1,000 variables and 2,000 individuals. Missing values were imputed using median substitution, and all variables were standardized (mean = 0, standard deviation = 1). Let X be the data matrix corresponding to the full set of 1,000 neighborhood variables. We define XT as the matrix with columns representing variables truly associated with the outcome of interest, Y. The full set of p predictors also contains many variables not directly related to Y; we define X0 to be the matrix of those columns. The variables in X0 which are highly correlated with variables in XT are considered "surrogate" variables.
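The preprocessing described above (median substitution, then standardization) is straightforward; a NumPy sketch on assumed toy data:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(loc=3.0, scale=2.0, size=(100, 8))
X[rng.random(X.shape) < 0.1] = np.nan    # inject ~10% missingness

# Median substitution, column by column.
medians = np.nanmedian(X, axis=0)
rows, cols = np.where(np.isnan(X))
X[rows, cols] = medians[cols]

# Standardize every variable to mean 0, standard deviation 1.
X = (X - X.mean(axis=0)) / X.std(axis=0)
```

Standardization matters here because penalized regressions shrink all coefficients by the same λ, so predictors must be on a common scale.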
We selected 10 variables to be members of XT: 5 exhibiting marked collinearity (X1-X5, correlation > 0.95 with at least 1 other variable) and 5 exhibiting modest or low collinearity (X6-X10, correlation < 0.6 with all other variables). Outcomes were simulated according to the structure shown in Fig. 1. We simulated outcomes (Y) five hundred times. In practice, variable selections are often internally validated by withholding a portion of the dataset. Therefore, we randomly assigned 2/3 of the data to be the discovery set and the remaining 1/3 to be the validation set. The algorithms discussed above were applied to each set of simulated outcomes. Candidate variables selected in the discovery set were then validated in the withheld 1/3 sample, using a series of univariable regression models, considering any variable with a p-value < 0.05 to be validated, similar to the approach taken in some GWAS studies.(28) For the random forest method, which often identified a large number of variables in the discovery set, we also explored using a multivariable lasso model in the validation in an attempt to reduce potential confounding in the validation step. Table 1 lists each method tested, along with the selection rule in the discovery set. All models were fit using R software (version 3.5).(23) We judged each method by how well it identified the true positive variables (XT) and minimized false positive variables (X0). We also considered more flexible success metrics, where true positives were considered the identification of either a target variable or a good surrogate (correlation > 0.8 with a target), and false positives were those that were neither a target variable nor a surrogate. The threshold of 0.8 was chosen as it is a commonly used value for determining suitability of surrogate endpoints in clinical studies.(29,30) We also considered a composite metric, the F2 score. This is a measure of accuracy combining the Positive Predictive Value (PPV, sometimes termed precision) and the sensitivity (sometimes termed recall).
The F2 score is a specific case of the general F score, which gives greater weight to the sensitivity.(31)

$$F_2 = \frac{(2^2 + 1)\,TP}{(2^2 + 1)\,TP + 2^2\,FN + FP}$$
where TP is the number of true positives, FN is the number of false negatives, and FP is the number of false positives. When comparing models, an F2 score closer to 1 denotes the preferred model. We chose the weighted F2 as we believe that in this application, priority should be given to the ability to detect more true positive variables. Finally, we determined the average detection rate for the individual variables, and evaluated how effect size estimates from univariable models were related to the likelihood of detection.
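For reference, the F2 computation is a one-liner; the counts below are made-up numbers for illustration:

```python
def f_beta(tp, fn, fp, beta=2.0):
    """F-beta score from true positive, false negative, and false positive counts."""
    b2 = beta ** 2
    return (b2 + 1) * tp / ((b2 + 1) * tp + b2 * fn + fp)

# Example: a method that finds 8 of 10 true variables at the cost of 4 false
# positives scores 5*8 / (5*8 + 4*2 + 4) = 40/52, roughly 0.77.
score = f_beta(tp=8, fn=2, fp=4)
```

Because β = 2 weights FN four times as heavily as FP, missed true variables hurt the score more than spurious selections, matching the priority stated above.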

Application to PA prostate cancer cases
To illustrate these methods in practice, we applied the most promising algorithms (those with the most true positives and fewest false positives) from the simulation study to the full dataset used in the prior NWAS study, which linked prostate cancer cases from a PA Department of Health registry to social-environmental variables obtained from the U.S. Census.(10) The binary outcome of interest was aggressive (high stage AND grade) prostate cancer.(10) This cohort of white prostate cancer patients diagnosed between 1995 and 2005 contained 76,186 individuals (8% with aggressive disease). There were 14,663 census variables evaluated for association with the outcome. We included census variables as predictors, along with age and year of diagnosis. The data were split into discovery (2/3) and validation (1/3) samples. As above, variables selected in the discovery set were tested using univariable regression in the validation set, using a p-value cutoff of 0.05. We then compared which variables were selected by the most promising methods (from the simulation study) in the full study population to those found by the original NWAS method.

Comparison of methods
Table 2 shows the mean number of variables detected by each method, broken down into true positives and false positives. The strict definition considers true positives to be identification of variables in XT, while the relaxed definition allows for surrogate variables. For the false positives, the strict definition shows the number of members of X0 which were identified, while the relaxed definition shows the number of selected members of X0 which did not have a substantial correlation with an element of XT. The number of false positives was substantially reduced under the relaxed definition, especially for methods which identify groups of correlated predictors (e.g. elastic net, sparse group lasso), demonstrating that many of the "false positive" results were identified due to their relationship with a "true positive" variable.
For binary outcomes, the sparse regression method identifying the fewest false positive results was the lasso with the restrictive 1SE penalty (LASSO-1SE), while elastic net with the less restrictive penalty (ELNET-MIN) identified the largest number of true positives. When considering the sparse group lasso, the simpler correlation-based threshold to define clusters worked somewhat better than the more complex bootstrap-based cluster detection (although results were largely similar). The correlation-based clustering generated more clusters on average than bootstrap-based cluster selection (837 vs. 772), so more restrictive cluster definitions may have led to these differences. Of the tree ensemble methods, BART-LOCAL performed the best, substantially outperforming both RF and BAGGING. BART-GLOBALSE and BART-GLOBALMAX were too restrictive, identifying very few true positive variables.
Comparing the methods, there was generally a trade-off between the number of true positives and false positives. However, certain strategies were clearly inferior to others (dominated), with higher false positive rates and lower true positive rates than other methods. Univariable models and the random forests based models can be eliminated from consideration in future studies based on this criterion. When assessing the combined F2 measure of performance, many of the penalized models performed particularly well, with HCLST-CORR-SGL doing the best overall. BART-LOCAL also did well, particularly under the relaxed definition. The F2 measure indicates that these methods have particularly good sensitivity, while also considering their PPV. Under the relaxed definition, if a true variable or its surrogate was selected, that variable was considered to be identified; surrogates are therefore no longer in the pool of potential false positives, but the maximum number of true positive variables remains 10.
The results were largely similar for both continuous (Normal) and binary outcomes. We did find that Random Forests (RF) identified more variables (both true positives and false positives) for the continuous outcome, while elastic net identified more variables with the binary outcome; however, the same method (HCLST-CORR-SGL) performed best in both settings. No method provided substantial improvements in detection rates of rarely selected variables, nor did using a multivariable (lasso) model in the validation step in place of the univariable regressions.

Application to PA prostate cancer cases
We assessed associations between the census variables and the outcome of aggressive prostate cancer (PCa) using HCLST-CORR-SGL, the best-performing method for binary outcomes. After applying the hierarchical clustering algorithm with a threshold of 0.8, 10,888 of the 14,663 variables were grouped with at least one other variable. Of the 6,865 clusters identified, 3,090 had two or more variables (max 244), and 3,765 clusters contained only one variable. HCLST-CORR-SGL identified nineteen census variables in thirteen clusters as predictors of aggressive disease (Additional file 5), as well as year of diagnosis and patient age.
One variable found in the NWAS study was identified by this approach, and two variables/clusters in the NWAS were highly related to variables identified by HCLST-CORR-SGL. These overlapping variables are described in Table 4. We note that the results from HCLST-CORR-SGL were sparse within clusters; nevertheless, the algorithm does not force selection of a single representative variable from each cluster. Therefore, some highly related variables were chosen (e.g. PCT_SF3_PCT065I007 and PCT_SF3_PCT065A007).

DISCUSSION AND CONCLUSIONS
In simulation studies, we found that the method using hierarchical clustering combined with sparse group lasso (HCLST-CORR-SGL) performed the best at identifying variables with true associations (or their surrogates), while providing control of false positive results. This conclusion is based on the method's F2 scores in simulated data, a combined measure of accuracy which gives greater weight to the method's sensitivity. HCLST-CORR-SGL used clustering to directly address the complex correlation structure of the data, which may have led to improvements in the ability of penalized regression to detect true positive variables. We showed that the simpler threshold-based approach was sufficient to define meaningful clusters. However, this approach, like the others we assessed, did not solve the low detection rates of variables subject to confounding.
We applied HCLST-CORR-SGL to our full dataset and compared findings from this machine learning method to our previously published NWAS, under the assumption that variables which replicated across methods were more likely to represent true findings. We note that in our simulation study, outcomes were generated completely independently, while the observed outcomes in the full dataset likely had spatial effects which were not accounted for in the machine learning approaches applied here. Nevertheless, HCLST-CORR-SGL independently replicated three out of seventeen NWAS variables (i.e. identified the same or a highly correlated variable), even though the NWAS did take into account potential spatial effects.
Previous studies of social-environmental factors often selected census variables a priori. These studies showed that single variables representing single domains (e.g. % living below poverty) were associated with advanced prostate cancer and cancer more broadly.(32) Interestingly, the NWAS and machine learners consistently identified more complex variables that combined domains related to race, age, and poverty with household or renter status. Thus, findings from these empirical methods could serve to be hypothesis-generating, suggesting interactions among domains that are often considered individually in current social-environmental studies. For example, the top hit from the previous NWAS (PCT_SF3_P030007) had a correlation of 0.93 with two variables identified by HCLST-CORR-SGL (PCT_SF3_PCT065I007 and PCT_SF3_PCT065A007). All three are markers of employment and transportation, a combination of two different domains.
This study has several limitations. First, although we assessed several popular machine-learning algorithms for variable selection, there are many other approaches. We considered principal components regression, as it is commonly used with highly collinear data, but ultimately did not include it because the results were difficult to interpret and required arbitrary thresholds. Other popular machine learning approaches that use a "black box" algorithm for prediction (e.g. neural networks) were not readily useable for variable selection and therefore were not included. Most Support Vector Machine (SVM) based methods are not readily used for variable selection; we considered a sparse SVM method, but found that it was computationally infeasible to implement.(33) We also did not evaluate predictive accuracy of the various methods, as it was not of primary interest. Further, we intentionally designed a situation where variables had small effects compared to the random error,(10) so even a perfect method would have relatively low predictive ability. In real-world cases, the social-environmental variables would be combined with patient-level variables, giving the models much better predictive accuracy; we did not do so here to isolate the effect of method choice on selection of neighborhood predictors. Finally, for computational tractability, the size of the dataset used in simulations was limited to 1,000 variables and 2,000 subjects, much smaller than the full dataset upon which this study is based. In future studies, we will assess the impact of spatial relationships and the rate of true associations on the methods' performance. We will also consider cases where the data-generating model is non-linear, includes interactions, and uses patient-level predictors.
In this era of Big Data and Precision Medicine,(34,35) the importance of neighborhood and other environmental data will continue to grow. Given the complex structure and high dimensionality of environmental data, researchers should continue to develop machine learning approaches for this area. For complex diseases like cancer, the analysis of multilevel, mixed-feature datasets (including environmental, biological, and behavioral features) will likely be needed to inform health disparities, disease prevention, and clinical care, motivating the development of new analytical approaches.

Fig. 1. Simulation model, showing the relationship between the ten elements of XT, the 990 elements of X0, and the simulated outcomes Y. We considered both binary and continuous outcomes as they are commonly seen in medical research.

Additional file 1. TIFF (.tiff) file. Title: Dendrogram for correlation between variables. Description: Dendrogram showing the relationships between the 1,000 elements of the covariate matrix X. The horizontal red line represents a correlation of 0.8.
Additional file 2. TIFF (.tiff) file. Title: Distribution of 10 variables associated with simulated outcomes. Description: Histograms showing the distributions of X1-X10, the elements of X used to simulate the outcomes Y.
Additional file 3. Excel (.xlsx) file. Title: Detection rates for each variable. Description: Table showing the proportion of simulations where each variable was selected (by each method).
Additional file 4. TIFF (.tiff) file. Title: Correlation structure of 10 variables. Description: Correlation structure of X1-X10, the elements of X used to simulate the outcomes Y. Blue represents a positive correlation and red a negative correlation, with darker colors indicating a stronger relationship.
Additional file 5. Excel (.xlsx) file. Title: Full data results. Description: Table containing all findings when the HCLST-CORR-SGL method was applied to the full PA PCa dataset (binary outcome: diagnosis with aggressive PCa).

Table 1.
Candidate methods

Table 4.
Results from full data: variables identified as associated with PCa aggressiveness by both the NWAS and HCLST-CORR-SGL methods