Proposed methods
Our proposed method uses BART, a tree-based Bayesian probability model [21], which offers an advantage over algorithm-based machine learning methods by providing proper representations of uncertainty intervals via the posterior [22–25]. A regression or classification tree approximates the covariate-outcome relationship by recursive binary partitioning of the predictor space. A tree consists of its structure and the decision rules that send an observation either left or right until it reaches a bottom node; each bottom node represents the mean response of the observations falling in that node [14]. For a binary outcome, BART uses probit regression and a sum-of-trees model,
$$\text{Pr}(Y = 1 \mid \boldsymbol{X} = \boldsymbol{x}) = \Phi\left(f(\boldsymbol{x})\right) = \Phi\left(\sum_{j=1}^{m} g(\boldsymbol{x}; \mathcal{T}_{j}, \mathcal{M}_{j})\right), \tag{1}$$
where Φ(·) is the standard normal cumulative distribution function, \(\sum_{j=1}^{m} g(\boldsymbol{x}; \mathcal{T}_{j}, \mathcal{M}_{j})\) is a sum-of-trees model, and \((\mathcal{T}_{j}, \mathcal{M}_{j})\) are the tree parameters. To avoid overfitting and limit the contribution of each \((\mathcal{T}_{j}, \mathcal{M}_{j})\), a regularizing prior is put on the parameters, and the posterior is computed using Markov chain Monte Carlo (MCMC). Details of the BART model can be found in Chipman et al. [21].
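For illustration, a minimal sketch of how Eq. (1) maps a covariate vector to an event probability is given below; the `trees` argument stands in for fitted trees with a hypothetical `predict` method returning \(g(\boldsymbol{x}; \mathcal{T}_{j}, \mathcal{M}_{j})\), and is not part of any particular BART implementation.

```python
# Minimal sketch of evaluating the sum-of-trees probit model in Eq. (1).
# `trees` is a hypothetical collection of fitted trees, each exposing a
# predict() that returns g(x; T_j, M_j) for a covariate vector x.
import numpy as np
from scipy.stats import norm

def bart_probit_probability(x, trees):
    """Pr(Y = 1 | X = x) = Phi( sum_j g(x; T_j, M_j) )."""
    f_x = sum(tree.predict(x) for tree in trees)  # sum-of-trees fit f(x)
    return norm.cdf(f_x)                          # standard normal CDF Phi(.)
```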
Here we leverage the underlying Bayesian probability model of the BART technique. We impute the missing data with random forest and use Rubin’s rule [26, 27] to combine, across the multiply imputed datasets, the estimates and variances of the variable inclusion proportion (VIP) of each predictor provided by the BART model (the number of times the predictor is used in a splitting rule divided by the total number of splitting rules appearing in the model), and base variable selection on the pooled results. We refer to our proposed method as RR-BART, where “RR” stands for Rubin’s rule. The key steps of RR-BART are:
1. Impute the data M times using the missForest technique [28]; fit a BART model to each of the M imputed datasets and draw P MCMC posterior samples of the VIP for each predictor \(X_k\), \(k=1,\ldots,K\). We then have a distribution of \(\text{VIP}_{kmp}\), \(m=1,\ldots,M\), \(p=1,\ldots,P\).
2. Calculate the average VIP across the P posterior draws and M imputed datasets for each predictor, \(\overline{\text{VIP}}_{k\cdot\cdot}\), and identify the variable \(X_{k'}\) with the minimum value, \(\overline{\text{VIP}}_{k'\cdot\cdot} = \min_{k=1,\ldots,K}(\overline{\text{VIP}}_{k\cdot\cdot})\). If \(\overline{\text{VIP}}_{k'\cdot\cdot} > \frac{1}{2K}\), stop the algorithm and perform no variable selection. Otherwise, calculate the distance between each VIP score and the minimum of the average posterior mean VIPs, \(\Delta_{kmp} = \text{VIP}_{kmp} - \overline{\text{VIP}}_{k'\cdot\cdot}\) for \(k=1,\ldots,K\), \(m=1,\ldots,M\), \(p=1,\ldots,P\).
3. Apply Rubin’s rule [26, 27] to the distribution of \(\Delta_{kmp}\). The overall mean and total variance of the distance score for predictor \(X_k\) are calculated as follows:
$$\begin{aligned} \bar{Q}_{k} &= \frac{1}{MP}\sum_{m,p} \Delta_{kmp}, \\ \text{Within-imputation variance: } W_{k} &= \frac{1}{M} \sum_{m} \text{Var}(\bar{\Delta}_{km\cdot}), \\ \text{Between-imputation variance: } B_{k} &= \frac{1}{M-1} \sum_{m} (\bar{\Delta}_{km\cdot} - \bar{\Delta}_{k\cdot\cdot})^{2}, \\ \text{Total variance: } T_{k} &= W_{k} + \left(1+\frac{1}{M}\right)B_{k}, \end{aligned}$$
where \(\text{Var}(\bar{\Delta}_{km\cdot})\) is the variance among \(\{\Delta_{kmp}, p=1,\ldots,P\}\) divided by the sample size n, \(\bar{\Delta}_{km\cdot} = \frac{1}{P}\sum_{p} \Delta_{kmp}\), and \(\bar{\Delta}_{k\cdot\cdot} = \frac{1}{MP} \sum_{m}\sum_{p} \Delta_{kmp}\). The 1−α confidence interval for variable k’s average distance score is \(\bar{Q}_{k} \pm t_{df, 1-\alpha} \sqrt{T_{k}}\), where the degrees of freedom of the t-distribution are \(df = (M-1)\left[\left(B_{k}+B_{k}/M\right)/T_{k}\right]^{-2}\) [26]. A code sketch of this pooling and selection step is given after the list.
4. Select the variables whose 1−α confidence intervals do not contain zero.
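The following is a minimal sketch (not the authors' code) of how steps 2-4 could be implemented, assuming `vip` is an array of MCMC draws of the VIPs with shape (K, M, P) and `n` is the sample size used in the within-imputation variance.

```python
# Sketch of RR-BART steps 2-4: pool VIP distance scores across imputations
# with Rubin's rule and select predictors whose 1 - alpha interval excludes 0.
import numpy as np
from scipy.stats import t as t_dist

def rr_bart_select(vip, n, alpha=0.05):
    K, M, P = vip.shape
    vip_bar = vip.mean(axis=(1, 2))                  # average VIP per predictor
    k_min = vip_bar.argmin()
    if vip_bar[k_min] > 1.0 / (2 * K):               # step 2 stopping rule
        return np.arange(K)                          # no selection performed: keep all predictors
    delta = vip - vip_bar[k_min]                     # distance scores Delta_kmp
    Q = delta.mean(axis=(1, 2))                      # overall mean per predictor
    delta_km = delta.mean(axis=2)                    # per-imputation means
    W = (delta.var(axis=2, ddof=1) / n).mean(axis=1) # within-imputation variance
    B = delta_km.var(axis=1, ddof=1)                 # between-imputation variance
    T = W + (1 + 1.0 / M) * B                        # total variance
    df = (M - 1) / ((B + B / M) / T) ** 2            # Rubin's degrees of freedom
    half = t_dist.ppf(1 - alpha, df) * np.sqrt(T)
    lower, upper = Q - half, Q + half
    return np.where((lower > 0) | (upper < 0))[0]    # step 4: interval excludes zero
```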
The key idea of this approach is to properly characterize the probability distribution of the VIP measure for each predictor across the multiply imputed datasets, and then identify the “important” predictors, namely those with significantly larger VIPs. As pointed out by anonymous reviewers, this approach assumes that at least one irrelevant predictor exists, so that variable selection is needed. In the extreme scenario where all candidate predictors are useful, the minimum of the average posterior mean VIPs, \(\overline{\text{VIP}}_{k'\cdot\cdot}\), will be considerably larger than zero; variable selection is then deemed unnecessary and all predictors are selected (step 2). In step 4, larger values of α lead to more variables being selected and smaller values of α to fewer. We used α=0.05 as recommended in previous work [29]. In step 1, we use the flexible imputation method missForest, which proceeds as follows. Initial guesses are made for the missing values (e.g., mean or mode), and the variables are sorted in ascending order of their missingness proportions. The variables with missing data are then, in this order (lowest missingness proportion first), regressed in turn via random forest on the other variables, and the missing values are imputed/updated using the fitted random forest model as the prediction model. This process is repeated until a stopping criterion is met or the maximum number of iterations is reached. A detailed description of missForest is provided in Supplementary Section 1 and in Stekhoven and Bühlmann [28].
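A minimal sketch of a missForest-style iterative imputation loop for numeric data is shown below; the real algorithm [28] also handles categorical variables and uses a stopping criterion based on the change in imputed values between iterations, both of which are omitted here.

```python
# Rough sketch of missForest-style iterative imputation for numeric data:
# mean starting values, then random-forest regressions in ascending order
# of missingness, repeated for a fixed number of iterations.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def missforest_like(df, max_iter=10):
    data = df.copy()
    miss = df.isna()
    data = data.fillna(data.mean())                 # initial guesses (numeric data)
    order = miss.sum().sort_values().index          # ascending missingness proportions
    for _ in range(max_iter):
        for col in order:
            if not miss[col].any():
                continue
            obs, mis = ~miss[col], miss[col]
            rf = RandomForestRegressor(n_estimators=100)
            rf.fit(data.loc[obs, data.columns != col], data.loc[obs, col])
            data.loc[mis, col] = rf.predict(data.loc[mis, data.columns != col])
    return data
```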
Comparison methods
BI-BART Hu et al. [19] proposed BI-BART, which combines bootstrap imputation (“BI”) with the permutation-based variable selection method using BART developed by Bleich et al. [11]. The permutation-based BART variable selection method uses the VIP. The response vector is permuted a number of times (e.g., 100), and the BART model is fitted to each permuted response vector and the original predictors; the VIPs computed from these BART model fits constitute the null distributions. A variable is selected if its VIP from the BART model fitted to the unpermuted response is larger than the 1−α quantile of its own null distribution. The BI-BART method proceeds in three steps: (1) generate B bootstrap datasets and impute missing data using missForest, (2) perform permutation-based BART variable selection on each bootstrap dataset, and (3) include a variable in the final set if it is selected in at least πB datasets, where π is a fraction threshold between 0 and 1 for selecting a predictor.
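As an illustration, the permutation-based selection in step (2) might be sketched as follows; `fit_bart_vip` is a placeholder for any routine that fits a BART model and returns one VIP per predictor, not a real API.

```python
# Hedged sketch of permutation-based VIP selection [11]: permute the response,
# refit BART, and keep predictors whose observed VIP exceeds the 1 - alpha
# quantile of their own permutation null distribution.
import numpy as np

def permutation_vip_select(X, y, fit_bart_vip, n_perm=100, alpha=0.05, seed=None):
    rng = np.random.default_rng(seed)
    vip_obs = fit_bart_vip(X, y)                             # VIPs, unpermuted response
    null = np.stack([fit_bart_vip(X, rng.permutation(y))     # null VIP distributions
                     for _ in range(n_perm)])
    threshold = np.quantile(null, 1 - alpha, axis=0)         # per-predictor cutoff
    return np.where(vip_obs > threshold)[0]
```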
BI-XGB BI-XGB combines bootstrap imputation with XGBoost for variable selection among incomplete data. At the core of the XGBoost method is gradient boosting [20]. Boosting is a process in which a weak learner is boosted into a strong learner. The idea of gradient boosting is to build a model via additive functions that minimize a loss function (exponential loss for classification) measuring the difference between the predicted and observed outcomes [30]. XGBoost uses a gradient tree boosting model in which each additive function is a decision tree mapping an observation to a terminal node. In addition, XGBoost uses shrinkage and column subsampling to further prevent overfitting [20]: shrinkage scales newly added weights by a factor after each step of tree boosting, and column subsampling selects a random subset of predictors for splitting at each step of tree boosting. Variable selection via XGBoost on fully observed data is carried out through a recursive feature elimination procedure [19], using the variable importance score provided by the XGBoost model. Among incomplete data, BI-XGB carries out variable selection in the three steps described above for BI-BART, with the permutation-based BART method in step (2) replaced by the XGBoost-based variable selection method.
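A rough sketch of recursive feature elimination driven by XGBoost importance scores is given below; the one-at-a-time elimination and the stopping rule (`n_keep`) are illustrative assumptions rather than the exact settings used in [19].

```python
# Sketch of recursive feature elimination with XGBoost variable importance:
# repeatedly fit the model and drop the least important remaining predictor.
import numpy as np
import xgboost as xgb

def xgb_rfe(X, y, n_keep=10):
    features = list(range(X.shape[1]))
    while len(features) > n_keep:
        model = xgb.XGBClassifier(n_estimators=200, max_depth=3)
        model.fit(X[:, features], y)
        scores = model.feature_importances_          # variable importance scores
        features.pop(int(np.argmin(scores)))         # eliminate the least important
    return features
```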
MIA-BART and MIA-XGB As tree-based methods, both BART and XGBoost offer a technique, missingness incorporated in attributes (MIA) [20, 29], to handle missing covariate data by treating missingness as a value in its own right in the splitting rules. For each splitting variable \(X\) and splitting value \(x_c\), the MIA algorithm chooses one of the following three rules: (1) if \(X\) is observed and \(X \leq x_c\), or if \(X\) is missing, send the observation left; otherwise, send it right; (2) if \(X\) is observed and \(X \leq x_c\), send the observation left; if \(X\) is observed and \(X > x_c\), or if \(X\) is missing, send it right; (3) if \(X\) is missing, send the observation left; if it is observed, regardless of its value, send it right. We refer to work by Kapelner and Bleich [29] for a detailed description of the MIA procedure. As comparison methods, we implement MIA within the BART model and within the XGBoost model while performing variable selection: MIA-BART and MIA-XGB. Because MIA cannot handle missing outcome data, we implement this technique on two versions of the data: (i) only cases with complete outcome data; (ii) data with imputed outcomes.
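The three MIA routing rules can be illustrated with a toy function in which `NaN` encodes a missing value:

```python
# Toy illustration of the three MIA splitting rules for a candidate variable x
# and cutpoint x_c. Returns a boolean mask: True = send observation left.
import numpy as np

def mia_split(x, x_c, rule):
    missing = np.isnan(x)
    if rule == 1:
        return (x <= x_c) | missing        # missing values sent left
    if rule == 2:
        return (x <= x_c) & ~missing       # missing values sent right
    return missing                         # rule 3: split on missingness itself
```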
Performance metrics
To stay consistent with the literature [9, 11, 19], so that our methods and results can be compared with previous work in similar contexts, we assess the performance of variable selection using four metrics: (i) precision, the proportion of truly useful predictors among all selected predictors; (ii) recall, the proportion of truly useful predictors selected among all useful predictors; (iii) \(F_{1} = \frac{2\,\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}\), the harmonic mean of precision and recall, which balances avoiding the selection of irrelevant predictors (precision) against identifying the full set of useful predictors (recall); and (iv) Type I error, the mean of the probabilities that a method incorrectly selects each of the noise predictors. Note that “precision” and “positive predictive value (PPV)” are sometimes used interchangeably, as are “recall” and “sensitivity”; Type I error is sometimes referred to as the “false positive” rate in the machine learning literature. These performance metrics are calculated over 250 replications for the larger sample sizes n=1000, 5000 and over 1000 replications for the smaller sample sizes n=300, 650.
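For a single replication, the four metrics can be computed as sketched below, where `selected` and `useful` are index sets of the chosen and truly useful predictors and `n_noise` is the number of noise predictors; the Type I error reported in the results is then averaged over replications.

```python
# Sketch of the four variable-selection metrics for one simulation replication.
def selection_metrics(selected, useful, n_noise):
    selected, useful = set(selected), set(useful)
    tp = len(selected & useful)                                    # truly useful predictors selected
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(useful)
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    type1 = len(selected - useful) / n_noise                       # share of noise predictors selected
    return precision, recall, f1, type1
```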
In addition to the four metrics, we propose using the cross-validated AUC to evaluate the prediction performance of each model with the selected variables for incomplete data. This additional metric helps distinguish between methods when the four metrics described above leave it unclear which method selects the most relevant predictors. The AUC evaluation consists of the following three steps (a schematic sketch follows the list):
(1) randomly split the data into two halves, and then implement each of the methods using one half of the data to select variables,
(2) impute the other half of the data (single imputation) and record the AUC of each model with the selected variables; for BART-based methods, calculate the AUC using BART, and for XGB-based methods, calculate the AUC using XGBoost, and
(3) repeat steps (1) and (2) 100 times to obtain the distribution of the AUCs.
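A schematic sketch of this procedure is shown below; `select_variables`, `impute`, and `fit_and_predict` are placeholders for a variable selection method, a single imputation, and the matching BART or XGBoost prediction model, respectively, and are not real functions.

```python
# Schematic of the cross-validated AUC evaluation: select on one half,
# impute the other half, fit the matching model on the selected variables,
# and record the AUC; repeat to obtain a distribution of AUCs.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def cv_auc(X, y, select_variables, impute, fit_and_predict, n_splits=100):
    aucs = []
    for _ in range(n_splits):
        X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.5)
        keep = select_variables(X_a, y_a)                   # step (1): select on one half
        X_b_imp, y_b_imp = impute(X_b, y_b)                 # step (2): single imputation
        pred = fit_and_predict(X_b_imp[:, keep], y_b_imp)   # BART- or XGBoost-based model
        aucs.append(roc_auc_score(y_b_imp, pred))
    return np.array(aucs)                                   # step (3): distribution of AUCs
```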
Simulation of incomplete data
We adopt the simulation settings used in Hu et al. [19] to accurately mimic realistic missingness problems and to impartially compare methods in similar contexts. We use the multivariate amputation approach [31] to generate missing data scenarios with the desired missingness percentages and missing data mechanisms. The multivariate amputation procedure first randomly divides the complete data into a number (possibly one) of subsets, each allowing the specification of any missing data pattern. Weighted sum scores are then calculated for the individuals in each subset to amputate the data. For example, the weighted sum score for \(X_3\) and individual i can take the form \(wss_{x_{3},i} = x_{1i} w_{1} + x_{2i} w_{2}\), which attaches non-zero weights \(w_1\) to \(x_{1i}\) and \(w_2\) to \(x_{2i}\) to indicate that the missingness in \(X_3\) depends on \(X_1\) and \(X_2\). A logistic distribution function [32] is then applied to the weighted sum scores to compute the missingness probabilities, which determine whether each data point becomes missing.
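For illustration, a possible computation of a MAR missingness probability for \(X_3\) is sketched below; the weights, the standardization, and the standard logistic link are assumptions made for the example rather than the exact specification of the amputation routine in [31].

```python
# Illustrative weighted-sum-score computation for MAR missingness in X3:
# the probability of X3 being missing depends on X1 and X2 only.
import numpy as np

def mar_missingness_prob(x1, x2, w1=1.0, w2=1.0):
    wss = w1 * x1 + w2 * x2                          # weighted sum score for X3
    wss = (wss - wss.mean()) / wss.std()             # standardize before the link (assumption)
    return 1.0 / (1.0 + np.exp(-wss))                # logistic missingness probability

rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=1000), rng.normal(size=1000)
p_miss = mar_missingness_prob(x1, x2)
x3_missing = rng.uniform(size=1000) < p_miss         # which X3 values are amputed
```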
We consider simulation scenarios that represent data structures commonly observed in health studies: (i) sample sizes ranging from small to large, n=300, 650, 1000, 5000; (ii) 10 useful predictors truly related to the response, \(X_1,\ldots,X_{10}\), and 10, 20, or 40 noise predictors, where \(X_1\) and \(X_2 \overset{\text{i.i.d.}}{\sim} \text{Bern}(0.5)\); \(X_3\), \(X_4\), and \(X_5 \overset{\text{i.i.d.}}{\sim} N(0,1)\); \(X_6 \sim \text{Gamma}(4,6)\); and \(X_7, X_8, X_9, X_{10}\) were designed to have missing values under the MAR mechanism; half of the noise predictors are simulated from \(N(0,1)\) and the other half from \(\text{Bern}(0.5)\); and (iii) 20% missingness in Y with 40% overall missingness, and 40% missingness in Y with 60% overall missingness. A total of 24 scenarios are considered: 4 sample sizes × 3 ratios of useful to noise predictors × 2 missingness proportions. We additionally investigate how each method performs when there are no noise predictors, for n=1000 and the two missingness proportions. The outcomes are generated from a model with arbitrary data complexity: (i) discrete predictors with strong (\(x_1\)) and moderate (\(x_2\)) associations; (ii) both linear and nonlinear forms of continuous predictors; and (iii) nonadditive effects (e.g., \(x_4 x_9\)). The outcome model is specified as:
$$\begin{aligned} \text{Pr}(y = 1 \mid x_{1},\ldots,x_{10}) = \text{logit}^{-1}\big(&-2.7 + 1.8x_{1} + 0.5x_{2} + 1.1x_{3} - 0.4e^{x_{5}} - 0.4(x_{6}-3.5)^{2} \\ &+ 0.3(x_{7}-1)^{3} + 1.1x_{8} - 1.1x_{10} + 5\sin(0.1\pi x_{4}x_{9}) \\ &- 0.4x_{5}x_{10}^{2} + 0.4x_{3}^{2}x_{8}\big). \end{aligned}$$
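The outcome model above can be transcribed directly for simulation, as sketched below; the sketch assumes the predictors \(x_1,\ldots,x_{10}\) have already been generated as described (before amputation) and does not reproduce the full design of Supplementary Section 2.

```python
# Direct transcription of the stated outcome model for simulating y.
import numpy as np

def simulate_outcome(x, rng):
    # x is an (n, 10) array holding x1..x10 in columns 0..9
    x1, x2, x3, x4, x5, x6, x7, x8, x9, x10 = x.T
    lp = (-2.7 + 1.8 * x1 + 0.5 * x2 + 1.1 * x3 - 0.4 * np.exp(x5)
          - 0.4 * (x6 - 3.5) ** 2 + 0.3 * (x7 - 1) ** 3 + 1.1 * x8
          - 1.1 * x10 + 5 * np.sin(0.1 * np.pi * x4 * x9)
          - 0.4 * x5 * x10 ** 2 + 0.4 * x3 ** 2 * x8)
    p = 1.0 / (1.0 + np.exp(-lp))                    # inverse-logit
    return rng.binomial(1, p)                        # binary outcome y
```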
After generating the full data, the multivariate amputation approach was used to amputate \(X_7, X_8, X_9, X_{10}\), and Y under the MAR mechanism to create the desired missingness characteristics. The detailed simulation set-up appears in Supplementary Section 2.
Because the four predictors \(X_7, X_8, X_9\), and \(X_{10}\) were generated from normal distributions with means depending on other predictors, correlations among the covariates \(X_3,\ldots,X_{10}\) were induced; the correlations range from −0.034 to 0.413. Supplementary Table 1 shows the Pearson correlations among the predictors \(X_3,\ldots,X_{10}\). An anonymous reviewer pointed out that the strength of MAR can be measured by taking a missing indicator for each variable with missing values, regressing it on the other variables related to its missingness, and estimating the AUC of the regression model. By this approach, the missingness in \(X_7, X_8, X_9, X_{10}\), and Y corresponds to strong MAR, with AUCs ranging from 0.72 to 0.92; see Supplementary Table 2.
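A sketch of this reviewer-suggested check is given below, using logistic regression for illustration; `Z` holds the covariates related to the missingness of a given variable and `missing_indicator` is its missing indicator.

```python
# Sketch of measuring MAR strength: regress the missing indicator on the
# covariates driving the missingness and report the in-sample AUC.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def mar_strength_auc(Z, missing_indicator):
    model = LogisticRegression(max_iter=1000).fit(Z, missing_indicator)
    return roc_auc_score(missing_indicator, model.predict_proba(Z)[:, 1])
```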
Case study
We demonstrate and compare the methods by addressing the emerging variable selection problem studied in Hu et al. [19] using the SWAN data. SWAN was a multicenter, longitudinal study in the U.S. designed to understand women’s health across the menopause transition. The study enrolled 3305 women aged 42 to 52 years in 1996-1997 and followed them annually through 2018. The emerging research question is the identification of important risk factors for the 3-year incidence of metabolic syndrome. Metabolic syndrome is a cluster of conditions that occur together and has been shown to increase the risk of heart disease, stroke, and type 2 diabetes [33]. Identifying important predictors of metabolic syndrome among middle-aged women would be valuable for designing preventive interventions and reducing risks to women’s health, yet this research question has received relatively little attention in the literature for women during their middle years.
We use the analysis dataset described in Hu et al. [19], which included 2313 women who did not have metabolic syndrome at enrollment. Among these 2313 women, 251 (10.9%) developed metabolic syndrome within three years of enrollment, 1240 (53.6%) did not, and the remaining 822 (35.5%) had missing outcomes. Sixty candidate predictors (29 continuous and 31 discrete variables) were selected based on the previous literature [33–36] on risk factors for metabolic syndrome, covering demographics, daily life behaviours, dietary habits, sleep habits, medications, menopausal status and related factors, and mental status, among others. A detailed list of the names and definitions of the 60 candidate variables is provided in Supplementary Table 12. Among the 60 candidate predictors, 49 had missing values, with the amount of missing data ranging from 0.1% to 27.1%; only 763 (33.0%) participants had fully observed data.