Exploring the variable importance in random forests under correlations: a general concept applied to donor organ quality in post-transplant survival

Random Forests are a powerful and frequently applied Machine Learning tool. The permutation variable importance (VIMP) has been proposed to improve the explainability of such a pure prediction model. It describes the expected increase in prediction error after randomly permuting a variable and disturbing its association with the outcome. However, VIMPs measure a variable’s marginal influence only, that can make its interpretation difficult or even misleading. In the present work we address the general need for improving the explainability of prediction models by exploring VIMPs in the presence of correlated variables. In particular, we propose to use a variable’s residual information for investigating if its permutation importance partially or totally originates from correlated predictors. Hypotheses tests are derived by a resampling algorithm that can further support results by providing test decisions and p-values. In simulation studies we show that the proposed test controls type I error rates. When applying the methods to a Random Forest analysis of post-transplant survival after kidney transplantation, the importance of kidney donor quality for predicting post-transplant survival is shown to be high. However, the transplant allocation policy introduces correlations with other well-known predictors, which raises the concern that the importance of kidney donor quality may simply originate from these predictors. By using the proposed method, this concern is addressed and it is demonstrated that kidney donor quality plays an important role in post-transplant survival, regardless of correlations with other predictors. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-023-02023-2.


Appendix
A) Variable and its residual importance in additive models For regression forests and squared error loss the permutation variable importance is given by: (5) When further assuming the underlying model to be additive and w.l.o.g. the intercept to be 0: this formula further simplifies: In the derivation it was used that f i (X i ) and f i (π(X i )) are independent but still have the same distribution and thus the same expected value.In particular, it can be seen that the VIMP can also be expressed as With renaming the variable of interest as Z = X p the model from equation ( 6) which we call model A is described as: We now define the model where shared information is removed from Z, which we call model B .For that, we separate the part of Z that can be explained by X 1 , . . ., X p−1 from the part that is independent from X 1 , . . ., X p−1 by defining g : R p−1 → R as described in equation ( 3) and apply mean squared error as loss function: The two parts of model ( 9), namely g(X −p ) and ϵ Z , are independent and describe the explained and unexplained part of Z, respectively.We now assume that f p is additive in g(X −p ) and ϵ Z : where h 1 (ϵ Z ) and h 2 (g(X −p )) are functions of independent random variables and are therefore also independent.The assumption on additivity of f p is fairly reasonable considering the flexibility of additive model components.
Inserting equation ( 10) into equation (model A ) leads to an alternative model description: Now, VIMP A (•) describes the VIMPs of model A as defined in equation ( 5) and is defined for X 1 , . . ., X p−1 and Z.In contrast, VIMP B (•) refers to model B and is defined for X 1 , . . ., X p−1 and ϵ Z : For the last equation we used the general result on VIMPs given in (8).
We now show that VIMP B (ϵ Z ) can never exceed VIMP A (Z) and identify the situation where equality holds.
The last term in equation ( 11) is 0 because h 1 (ϵ Z ) and h 2 (g(X −p )) are uncorrelated and (11) can be considered as a hypothetical VIMP of g(X −p ), that would result if this explained part of Z would have been included as a separate variable into the model.Finally, we get: Relation (13) describes that VIMP A (Z) and VIMP B (ϵ Z ) are of same size if and only if the explained part of Z, i.e. g(X −p ), would not improve the loss of the Random Forest.In other words, the part of Z that can be explained by X 1 , . . ., X p−1 (within the model class G) does not explain Y .

B) Asymptotics of finite sample VIMPs
In this section we show, that the finite sample VIMP estimated from a Random Forest converges with sample size to the VIMP as defined in ( 5).This convergence is the motivation why the density d VIMPB(ϵ Z ) estimated from resampled VIMP B (ϵ Z ) has to be adjusted: VIMP B (ϵ Z ) as derived from a sample of size 0.632n might show stronger deviations from its asymptotic limit as VIMP B (ϵ Z ) is derived from the full sample and thus might systematically differ from the full sample VIMP B (ϵ Z ).
The proof follows ideas of Ishwaran and Lu [27] and relies on some assumptions as described in the following.For a discussion, why these assumption can be considered as fairly reasonable, we refer to [27].
Let VIMP (n) (X i ) be the VIMP of X i estimated from a learning sample of n independent observations with same distribution as (X, Y ).We again assume Y = f (X) + ϵ and define f (n) (X) as the predictor of Y derived from a Random Forest analysis of the learning sample.Then where . Now define the residuals of the forest when applied to X and its permutated variant as r n (X) and r n (π i (X)), respectively: With an exchangeability assumption on (r n (X), r n (π i (X)) it follows that and thus the second expectation in ( 14) is 0. In summary, with σ 2 n := Var (r n (π i (X)) = Var (r n (X)) and ρ n := C(r n (π i (X)), r n (X)).This relies on a further assumption that f (n) is L 2 -consistent for f .With this assumption it can also be shown, that A n converges to 0: ] can be shown to be bounded [27] and n) it follows that σ 2 n also converges to 0 and finally For any given values of C(Y, Z) and spC(Y, Z) and with c = C(Y, Z) the values σ 2 1 and σ 2 2 can be found by grid search.

D.b) Additional simulation design C)
Here we introduce another simulation design, which has more covariates and is more complex compared to Simulation Design A. This model contains strong correlations as well as weak correlations between Y and the different X i and between the X i themselves respectively.The exact correlations between all variables are shown in Figure A1, where no connection means that an X i is independent to all other values.The (X 1 , ..., X 20 , Y ) vector was simulated as multivariate normally distributed with the given Covariances and E(X i ) = E(Y ) = 0 and V ar(X i ) = V ar(Y ) = 1.
We then applied our test to all variables.The exact rejection rates of this design are shown in Table A3.(calculated as difference between (AGE) and (AGE DIAB) that are contained within the raw dataset).CMV Recipient's CMV status.(CMV STATUS) GENDER Recipient's gender.(GENDER) SERUM ALB Recipient's total albumin serum.(TOT SERUM ALBUM) [b] https://unos.org/news/in-focus/what-is-epts/ [c] https://unos.org/news/in-focus/what-is-kdpi-a-new-animation-for-patientsexplains-the-kidney-donor-profile-index/F) VIMPS of all variables based on both model descriptions

Figure A1
Figure A1 Correlation graph that showws all correlations between Y and the different predictors X i .

Figure A2
Figure A2 The figure shows the VIMPs of all variables in the Random Survival Forest analysis of UNOS data under model description model A as well as the corresponding VIMPS under model description model B .Thereby, in model B KDPI was replaced by its residuals whereas all other variables remained unchanged.Thus, VIMP B (KDPI) refers to residual information and all other VIMPs to original variables' information within the two models.

Table A1
Simulation designs defined by C(Y, Z) and spC(Y, Z) and the corresponding values for σ 2

Table A2
Simulation results (1 000 simulated datasets, each with n patients) showing empirical rejection probabilities of the proposed resampling tests for a regression Random Forest.Correlations within the simulation model are given as the correlation between Y and Z (C(Y, Z)) and semipartial correlation between Y and Z (spC(Y, Z) := spC(Y, Z|X 1 , . . ., X 5 )).The test is here applied to the variable X 4 with C(Y, X 4 ) = spC(Y, X 4 ) = 0 for all 7 simulation designs.

Table A3
Simulation results (1 000 simulated datasets, each with n observations) showing empirical rejection rates for both hypotheses.The true correlations (C(Y, Z)) and semi-partial correlations (spC(Y, Z)) between Y and the different variables X i and therefore implicitly the true situation of H Description of all variables within the application The variables used in section 5 are given together with the variables' description.The variable names correspond to the names given in figures 2 and A2.The variable names used in the original UNOS dataset are given in brackets.DIAG Primary diagnosis of recipient's kidney disease.(DIAG KI) COLD ISCH Cold ischemic time of the graft.(COLD ISCH KI) PERIP VASC Recipient's peripheral vascular disease status.(PERIP VASC) BMI Recipient's body mass index.(BMI CALC) DIAB Recipient's diabetes status.(DIAB) AGE Recipient's age.(AGE) DIAB DUR Duration of the recipient's diabetes disease.