Comparative analysis of explainable machine learning prediction models for hospital mortality

Stenwig, Eline; Salvi, Giampiero; Rossi, Pierluigi Salvo; Skjærvold, Nils Kristian

doi:10.1186/s12874-022-01540-w

Research article
Open access
Published: 27 February 2022

Eline Stenwig¹ (ORCID: orcid.org/0000-0002-3803-8106), Giampiero Salvi², Pierluigi Salvo Rossi² & Nils Kristian Skjærvold¹˒³

BMC Medical Research Methodology volume 22, Article number: 53 (2022)

Abstract

Background

Machine learning (ML) holds the promise of becoming an essential tool for utilising the increasing amount of clinical data available for analysis and clinical decision support. However, the lack of trust in the models has limited the acceptance of this technology in healthcare. This mistrust is often credited to the shortage of model explainability and interpretability, where the relationship between the input and output of the models is unclear. Improving trust requires the development of more transparent ML methods.

Methods

In this paper, we use the publicly available eICU database to construct a number of ML models before examining their internal behaviour with SHapley Additive exPlanations (SHAP) values. Our four models predicted hospital mortality in ICU patients using a selection of the same features used to calculate the APACHE IV score and were based on random forest, logistic regression, naive Bayes, and adaptive boosting algorithms.

Results

The results showed that the models had similar discriminative abilities and mostly agreed on feature importance, while calibration and the impact of individual features differed considerably and in multiple cases did not correspond to common medical theory.

Conclusions

We already know that ML models treat data differently depending on the underlying algorithm. Our comparative analysis visualises implications of these differences and their importance in a healthcare setting. SHAP value analysis is a promising method for incorporating explainability in model development and usage and might yield better and more trustworthy ML models in the future.

Background

With the increasing availability and use of digital aid in health care, such as sensors and electronic health records, patients generate large amounts of data that can be used in treatment and analysis. Some of this information is not necessarily informative on its own but can give insight into complex medical problems when combined. Statistical modelling has long been one of the main approaches in medical research when studying relationships and their significance regarding different variables. However, due to data availability and better hardware, the use of artificial intelligence (AI) and machine learning (ML) has increased rapidly within this field over the last few years, supplementing, and to a certain degree replacing, the traditional statistical models [1].

Statistical modelling and ML can both be used for inference and prediction but have somewhat different approaches. Traditional statistical modelling should utilise pre-analytical clinical assumptions regarding the underlying structure of the data, while ML models often are purely ‘data-driven’ and are developed by generalising patterns within certain constraints specific to the algorithms [2].

Predictions in standard statistical models are often ‘human-readable’ to a certain degree. The opposite is the case with ML models, which are often compared to ‘black boxes’ where the mapping between the input features and the prediction is not clear, not only for the end-user but also for the developer. The importance of different features on a prediction can for some ML models be explained using coefficients, or by tracing or visualising the steps taken by the algorithm. Still, this is not always sufficient for obtaining a model that is easily understood by humans for development and use. The complexity of a model increases rapidly with an increasing number of features, and explaining the impact of individual features is not straightforward. Without understanding, the models cannot be trusted to perform according to our expectations. There is an implementation gap for ML in healthcare, where a lack of trust in the model plays a vital part [3]. Models that are not trusted will not be used. The need for explainable ML and models that can be easily understood by humans is becoming increasingly apparent [4, 5].

A recently developed tool for making ML models more intuitive is SHapley Additive exPlanations (SHAP) [6], which is based on Shapley values [7]. Shapley values are a solution concept from game theory that weighs the contribution of each player and distributes the ‘payout’ accordingly. Applied to ML, the input features take the role of the players and the model output is the payout, so the method provides weights describing how big a role the different features play in determining the output of the model. Some studies published over the last few years have incorporated SHAP or similar tools as part of the model development and performance evaluation [8–10]. This applies only to a small portion of published studies, and there is still work to be done before explainability becomes state-of-the-art.
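The payout-sharing idea can be made concrete with a small sketch. The function below computes exact Shapley values for a toy cooperative game by averaging each player's marginal contribution over all coalitions. The "players" (age, gcs, bun) and the payout function are hypothetical and chosen only for illustration; they are not taken from the study.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a cooperative game.

    players: list of hashable player names.
    value:   function mapping a frozenset of players to a payout.
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Probability that exactly this coalition precedes p
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_value(coalition):
    """Hypothetical payout: two additive terms plus one interaction bonus."""
    payout = 0.0
    if "age" in coalition:
        payout += 0.10
    if "gcs" in coalition:
        payout += 0.20
    if "age" in coalition and "gcs" in coalition:
        payout += 0.06  # shared equally between the two interacting players
    return payout


phi = shapley_values(["age", "gcs", "bun"], toy_value)
```

In this toy game the interaction bonus is split evenly between the two players that create it, the unused player receives zero, and the values sum to the payout of the full coalition. The number of coalitions grows exponentially with the number of players, which is why practical tools rely on model-specific shortcuts or approximations.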

ML models need to be evaluated thoroughly to find their true performance regarding the intended purpose. Many ML models are usually evaluated using only a few performance metrics. This, in combination with the lack of transparency, often leads to poor evaluation of certain aspects of the model and model performance. A recent systematic review of studies where ML is used for predicting mortality based on ICU data [11] showed that papers generally only focus on the discriminative capabilities of models. Additionally, the papers rarely reported metrics related to other evaluation methods, such as calibration, i.e. how well the distribution of predicted probabilities matches the expected distribution. Models should be evaluated based on the use-case they were developed for, and the use of solely one metric would be highly insufficient in most cases [12, 13]. Domain knowledge is an integral part of ML model explainability and trustworthiness, and should be applied in all stages of the model development and implementation. This is particularly important in healthcare, where models should reflect the human physiology. The models should be correct for the right reasons, and medical experts are an essential part of this.

There are many potential uses for ML in healthcare, with tasks ranging from cancer detection [14] to predicting hospital readmission [15] or mortality [16]. Mortality prediction for patients in the Intensive Care Unit (ICU) can be regarded as a simple classification problem with two possible outcomes: dead or alive at ICU, or hospital, discharge. The model utilises variables such as age, height and weight, vital values, lab values, and diagnoses, with features ranging from highly granulated temporal data to single values such as the mean of the first 6 hours of the hospital stay or discrete variables like age or sex. There are many potential uses of mortality predictions, both on individual and group level, including stratifying and identifying patients, comparing and improving ICUs, helping with clinical decision making, knowledge derivation, and resource allocation [17, 18].

With the advent of personalised medicine, predictions on the individual patient level are warranted. However, models developed for groups are not directly applicable to individuals, as the mortality prediction reflects the probability of survival or death in a group or cohort [19]. Similarly, it is not possible to take a model developed for a specific patient group and use it in a different group [19]. Hence, the intended use of the model is crucial.

In this study, we want to investigate how individual features impact predictions from different ML models to learn how they compare to common medical theory, and to each other. This is done by developing four hospital mortality prediction models from a publicly available dataset using the same input features to highlight similarities and differences between the models from an end-user point of view using SHAP-values. Dataset-level performance metrics are calculated for the different ML models to assess the overall performance and compare it to the well-known APACHE IV score as a baseline. The purpose of the study is not to find the best model regarding explainability, but to clarify aspects to be aware of when developing and using ML models, and to explore why the ability to explain ML models, and not just the model output, is needed.

Methods

This section is divided into three parts: Input, Model and Output, representing the principal components of an ML model. The analysis is done with Python, and the code is available online.

Input

The dataset

The dataset used in this study is the freely available multi-centre eICU Collaborative Research Database [20], which contains information about patients admitted to critical care units in the US between 2014 and 2015. The dataset includes over 200 000 ICU stays from more than 139 000 patients and holds information such as patient demographics, lab values, information about diseases and treatment, and vital values with a resolution of 5 minutes. A large part of the dataset is dedicated to the Acute Physiology and Chronic Health Evaluation (APACHE) IV severity-of-disease classification system [21], and the dataset includes designated tables for the parameters used to calculate this score.

Patient selection

The patients’ inclusion criteria are shown in Fig. 1, which also includes the training and test set selection described later. Patients with multiple ICU stays are excluded, as well as patients younger than 18 years. Patients with a stay of fewer than 24 hours are excluded to weed out patients that are in the ICU for a short stay before death or transferral. Patients without valid age, sex, patient id, discharge status, body mass index (BMI), admission diagnosis, or predicted hospital mortality (based on APACHE IV) are also excluded. Outliers for height and weight are removed manually, while values outside of five standard deviations are removed for vital and lab values.
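A minimal pandas sketch of how such criteria might be applied is given below. All column names (patient_id, age, icu_hours and so on) are hypothetical and do not correspond to the actual eICU table schema. The sketch also assumes one row per ICU stay, and that out-of-range vital and lab values are set to missing rather than the whole patient being dropped, which the text does not specify.

```python
import pandas as pd

# Hypothetical column names, one row per ICU stay
REQUIRED = ["age", "sex", "patient_id", "discharge_status",
            "bmi", "admission_dx", "apache_iv_pred"]


def apply_inclusion_criteria(stays: pd.DataFrame) -> pd.DataFrame:
    """Keep adult patients with a single ICU stay of at least 24 hours."""
    # Number of ICU stays recorded for each row's patient
    stays_per_patient = stays["patient_id"].map(stays["patient_id"].value_counts())
    keep = (
        (stays_per_patient == 1)
        & (stays["age"] >= 18)
        & (stays["icu_hours"] >= 24)
        & stays[REQUIRED].notna().all(axis=1)
    )
    return stays[keep].copy()


def mask_outliers(values: pd.Series, n_std: float = 5.0) -> pd.Series:
    """Replace values more than n_std standard deviations from the mean with NaN."""
    z = (values - values.mean()) / values.std()
    return values.where(z.abs() <= n_std)
```

Masking rather than dropping keeps the patient in the cohort, so the masked value would later be treated like any other missing numerical value.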

Feature selection

The selected features are the same as for calculating the APACHE IV score, with some exceptions. Height and weight are combined in one variable: BMI. Glasgow Coma Scale (GCS) is used as a combined score for eyes, motor and verbal. The features PaO2 and FiO2 are combined into one feature: pfratio.

Feature extraction is simplified using the tables containing the features used for calculating the APACHE IV score.

The vital and lab values used are the ‘worst’ value for each feature, i.e. the value furthest away from a reference value in the first 24 hours in the ICU. Many of the patients have stayed in the hospital prior to the ICU, and treatment will also affect the values. The result will reflect treatment and care given during the entire stay, and events can occur after the first 24 hours that are not considered by the models.
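The ‘worst value’ rule can be sketched as follows. The reference value of 37.0 used in the usage note is a hypothetical illustration, not an APACHE IV constant, and the function name is invented for this example.

```python
import numpy as np


def worst_value(measurements, reference):
    """Return the measurement furthest from the reference value.

    Missing measurements (NaN) are ignored; NaN is returned if none remain.
    """
    values = np.asarray(measurements, dtype=float)
    values = values[~np.isnan(values)]
    if values.size == 0:
        return float("nan")
    return float(values[np.argmax(np.abs(values - reference))])
```

For example, with temperatures 36.5, 38.9 and 37.0 recorded in the first 24 hours and a reference of 37.0, the function returns 38.9. Ties are resolved in favour of the earliest measurement, which is a choice of this sketch.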

Train/test set

Most ML models are developed with the help of a training set and then validated on a test set to evaluate how the model performs on previously unseen data. The train/test split is also shown in Fig. 1. The training set comprises 75% of the patients, and the test set comprises the remaining 25%. The share of deceased patients is the same in both sets (10.2%). The same training set and test set are used for all models.

The ML methods considered cannot handle missing inputs. Missing numerical values are therefore filled with the corresponding mean from the training set. Patients with missing categorical variables are not included in the study.
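The split and imputation described above could look roughly like the sketch below, which uses scikit-learn on synthetic data. The data, the 10% event rate and the random seeds are assumptions for illustration. The important detail is that the imputer is fitted on the training set only and then applied to both sets.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: 400 patients, 3 numerical features, 10% deceased
X = rng.normal(size=(400, 3))
X[rng.random(X.shape) < 0.1] = np.nan  # inject missing values
y = np.zeros(400, dtype=int)
y[:40] = 1

# 75/25 split that keeps the share of deceased patients equal in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Learn the feature means from the training set only, then fill both sets
imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)
X_test_filled = imputer.transform(X_test)
```

Fitting the imputer on the training data alone avoids leaking information from the test set into the model inputs.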

Models

We construct four different ML models: Random Forest (RF), Logistic Regression (LR), Adaptive Boost Classifier (ADA), and Naive Bayes (NB). All models are from the scikit-learn Python package [22]. RF and ADA are both tree ensemble models. These models comprise multiple decision trees that, when combined, give better results than individual trees. Decision trees determine the output by using a flowchart-like structure to impose a series of conditions on the input. The RF model comprises multiple trees trained in parallel on different subsets of data before the final result is found by majority vote. The ADA model uses the same dataset for each tree, but the trees are trained sequentially instead of in parallel, with trees updated based on the previous tree’s mistakes. The result is decided by a weighted majority vote. LR models resemble linear regression, but the output variable is binary. The NB classifier is based on Bayes’ theorem and assumes conditional independence between input features.
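As a rough sketch, the four model families can be instantiated from scikit-learn as shown below. The hyperparameters, the synthetic data and the choice of the Gaussian variant of naive Bayes are assumptions made for this example; the text does not state which settings or NB variant were used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in for the patient data (about 10% positives)
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)

models = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "LR": LogisticRegression(max_iter=1000),
    "ADA": AdaBoostClassifier(random_state=0),
    "NB": GaussianNB(),  # assumption: Gaussian likelihoods for numerical features
}

# Fit each model and store its predicted probability of the positive class
risk = {}
for name, model in models.items():
    model.fit(X, y)
    risk[name] = model.predict_proba(X)[:, 1]
```

All four estimators expose the same fit and predict_proba interface, which makes it straightforward to feed identical inputs to each and compare the outputs. Note that scikit-learn's random forest combines trees by averaging their predicted probabilities rather than by counting hard votes.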

Several different pre-processing techniques are tested to see if they affect the results significantly. This includes scaling of the input features, removal of patients with more than X number of missing values, and filling the missing values with (APACHE IV) reference values instead of mean values. Different ratios between deceased and alive patients in the training set are also tested. The models are trained by minimising the error with respect to the area under the receiver operating characteristic curve (AUC ROC/AUC/c-statistic).

Output

The hospital mortality prediction can be presented as a probability, or solely as a binary outcome based on a risk threshold or operation point. Probabilities facilitate risk stratification of patients and allow a more nuanced understanding than simple ‘alive’/‘deceased’ predictions. However, a probability still lacks information useful for clinical decision support.

The AUC is a popular metric for evaluating a model’s discriminative abilities, i.e. how well the model separates the classes. A perfect classifier will have an AUC of 1, while a random classifier yields an AUC of 0.5. AUCs for different models tested on the same dataset are often directly compared to determine which one performs better in terms of discrimination.

The AUC confidence intervals are found by bootstrapping with 10 000 bootstrap samples, each 70% of the size of the test set.
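A bootstrap of this kind might be sketched as follows. Sampling with replacement, the percentile interval and the 95% level are assumptions of this example, since the text does not specify them, and the function name is hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_auc_ci(y_true, y_score, n_boot=10_000, frac=0.7, alpha=0.05, seed=0):
    """Percentile confidence interval for the AUC from bootstrap resamples."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    rng = np.random.default_rng(seed)
    size = int(round(frac * len(y_true)))
    aucs = []
    for _ in range(n_boot):
        # Draw patient indices with replacement (assumption)
        idx = rng.integers(0, len(y_true), size=size)
        if np.unique(y_true[idx]).size < 2:
            continue  # AUC is undefined when only one class is present
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    lower, upper = np.percentile(aucs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lower), float(upper)
```

Calling bootstrap_auc_ci(y_test, predicted_risk) for each model on the same test set gives intervals that can be checked for overlap. Resamples containing a single class are skipped here, which matters mainly for small or highly imbalanced test sets.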

While AUC is used to evaluate the models’ discriminative abilities, calibration curves are plotted to evaluate the calibration. Calibration is the agreement between the observed and predicted risk [23] and can be visualised with calibration curves where the predicted probability (x-axis) is plotted against the observed frequency (y-axis). A model capturing the accurate risk estimation would have the calibration curve y=x.
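The points of a calibration curve can be computed with scikit-learn's calibration_curve, as in the sketch below. The two patient groups are made-up numbers chosen to show a model that over-predicts risk; they are not results from the study.

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Made-up example: two groups of 100 patients each.
# Group 1: predicted risk 0.1, 5 deaths. Group 2: predicted risk 0.6, 30 deaths.
y_prob = np.array([0.1] * 100 + [0.6] * 100)
y_true = np.array([1] * 5 + [0] * 95 + [1] * 30 + [0] * 70)

# Observed frequency and mean predicted probability per non-empty bin
observed, predicted = calibration_curve(y_true, y_prob, n_bins=10)

# Points below the diagonal y = x indicate over-predicted risk
over_predicted = observed < predicted
```

Plotting predicted on the x-axis against observed on the y-axis gives the calibration curve. Here both points fall below the diagonal, because the observed frequencies (0.05 and 0.30) are lower than the predicted probabilities (0.1 and 0.6).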

SHAP

The method for finding SHAP values differs depending on the type of model. It is possible to find exact SHAP values for tree models and linear models, while estimations are found for other types of models using a weighted local linear regression. This model-agnostic method for finding SHAP values does not make any assumptions about the model and is, therefore, slower than other methods. Because of the time and resources needed for this model-agnostic method, SHAP values are often based on a small subset of the data. The NB model is the only model that requires the use of this model-agnostic method. We used 1000 samples for the evaluation in this case. For calculating the SHAP values for the LR model, a correlation between the features is assumed.
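To illustrate why linear models admit exact SHAP values, the sketch below uses the closed form for a linear model under the simplifying assumption of independent features: each feature's contribution is its coefficient times the deviation of the feature from its background mean. This is deliberately not the correlation-aware variant described above for the LR model, and it works in log-odds rather than probability space. The data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: 300 samples, 3 features, binary outcome
X = rng.normal(size=(300, 3))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=300) > 1.0).astype(int)

model = LogisticRegression().fit(X, y)
weights = model.coef_[0]
background_mean = X.mean(axis=0)

# Closed-form contributions (log-odds scale), assuming independent features
phi = weights * (X - background_mean)

# Base value: the model's log-odds output at the background mean
base_value = model.intercept_[0] + weights @ background_mean

# Additivity: base value plus contributions reproduces the model output
reconstructed = base_value + phi.sum(axis=1)
```

The reconstructed values match model.decision_function(X), which is the additivity property that SHAP explanations are built on. Because these contributions live on the log-odds scale while other explainers may work on probabilities, magnitudes from different models are not directly comparable.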

The SHAP values cannot be compared directly between models due to scaling differences. Still, it is possible to compare how different models weigh different input features by considering the shape of the different plots.

Results

Area under the receiver operating curve

The receiver operating characteristic curves are plotted in Fig. 2, and the AUCs are listed in Table 1. The RF model has the highest AUC, followed by the ADA model, APACHE IV, LR and NB. The ADA and APACHE IV models have almost completely overlapping confidence intervals. These two confidence intervals are also partly overlapping with both RF and LR. The NB model is the only model without any overlapping confidence intervals.

Table 1 Area under the receiver operating characteristics curves

Calibration curves

The calibration curves are shown in Fig. 3. All calibration curves lie below the line y=x apart from a small part of the NB model calibration curve, i.e., the predicted mortality is higher than the actual mortality. Inspecting the curve for the RF model, the mortality for the group of patients with a predicted mortality of 60% is in reality 30%. This means that the chances of survival are better than what the model predicts.

SHAP summary plot

Figure 4 shows the feature importance for the models with respect to the mortality-prediction task. The features are listed top-down with decreasing importance. Only the top 25 features are listed, and categorical variables are split into one bar per category. The sum of the contribution from each category gives the total contribution before the one hot encoding. The bar lengths show the average impact of the individual features on the models’ output. For the RF model, the GCS is the most important feature followed by vent (see Footnote 1) and blood urea nitrogen (bun). All models apart from ADA place GCS as the most important feature. Age is also listed as an important feature by all the models, as well as vent and GCS.
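The bar lengths described above correspond to the mean absolute SHAP value per feature, and the per-category bars of a one-hot encoded variable can be summed to recover the contribution of the original variable. The sketch below shows both steps. The feature names, the `=` naming convention for one-hot columns and the SHAP matrix are hypothetical.

```python
import numpy as np


def mean_abs_importance(shap_values, feature_names):
    """Average absolute SHAP value per feature (rows = patients)."""
    importance = np.abs(np.asarray(shap_values, dtype=float)).mean(axis=0)
    return {name: float(v) for name, v in zip(feature_names, importance)}


def group_one_hot(importance, sep="="):
    """Sum the importance of one-hot columns named '<variable><sep><category>'."""
    grouped = {}
    for name, value in importance.items():
        parent = name.split(sep, 1)[0]
        grouped[parent] = grouped.get(parent, 0.0) + value
    return grouped


# Hypothetical SHAP matrix: 2 patients x 3 encoded features
shap_matrix = [[0.2, -0.1, 0.0],
               [-0.4, 0.1, 0.2]]
names = ["gcs", "dx=sepsis", "dx=trauma"]

per_column = mean_abs_importance(shap_matrix, names)
per_variable = group_one_hot(per_column)
ranking = sorted(per_variable, key=per_variable.get, reverse=True)
```

Taking absolute values before averaging is what prevents positive and negative contributions from cancelling out in the ranking.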

A different presentation of the SHAP summary plots can be seen in Fig. 5. The order of the features listed is the same as in Fig. 4, and the x-axis shows the SHAP values for individual patients instead of the average absolute value. The further away from the vertical line at x=0, the larger the impact on the output prediction. Values to the left are contributing to increased chances of survival, while values to the right are pushing the prediction towards increased mortality. The colourful vertical lines are made of dots, with one dot for each patient. The colour of a dot signifies the feature value for that patient. A pink dot represents a high value, while a blue dot represents a low value. The gradients represent the values in between. These plots visualise how different feature values contribute to either survival or death, but the total contribution from each feature is less prominent compared to the bar plots.

SHAP force plot

SHAP force plots show the contribution of a single feature for one or several patients. Force plots for the patients in the test set for the features temperature and white blood count (WBC) are depicted in Figs. 6 and 7, respectively. The feature value is listed along the x-axis and is equivalent to the dot colour in Fig. 5. The y-axis shows the average feature contribution from patients with similar feature values.

All features are examined for all models. The aforementioned features highlight differences between the models and aspects to be aware of when developing and using ML models.

SHAP individual force plots

Figures 8 and 9 show the individual force plots for two patients (A and B) for the four different models, i.e. the features the models consider relevant for the prediction for each individual patient. The bold-faced number is the probability prediction (model output value), while the base value is the value that would be predicted if no inputs were given to the model. The blue features to the right of the prediction are the features pushing the prediction towards survival, while the pink features to the left push the prediction towards increased mortality. The length of the coloured segments helps visualise the size of the impact on the prediction. The longer the segment, the larger the impact. The length of the segments should not be compared between models.
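The arithmetic behind an individual force plot can be sketched in a few lines: the prediction is the base value plus the sum of the per-feature contributions, with positive contributions pushing towards mortality and negative ones towards survival. The numbers below are invented for illustration and do not describe patient A or B. The sketch also assumes contributions expressed on the same scale as the model output.

```python
def explain_prediction(base_value, contributions):
    """Split one patient's SHAP contributions by direction, largest first."""
    prediction = base_value + sum(contributions.values())
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    towards_mortality = [name for name, v in ranked if v > 0]
    towards_survival = [name for name, v in ranked if v < 0]
    return prediction, towards_mortality, towards_survival


# Invented contributions for a single hypothetical patient
prediction, towards_mortality, towards_survival = explain_prediction(
    base_value=0.10,
    contributions={"gcs": 0.25, "vent": 0.12, "age": -0.08, "temperature": -0.03},
)
```

Here the prediction is approximately 0.36, with gcs and vent pushing it up and age and temperature pulling it down, in that order of magnitude.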

Figure 8 shows the force plots for a patient (A) being alive at hospital discharge, which is predicted by the ADA, LR and APACHE IV models (prediction 0.39) if the risk threshold is 0.5. The RF and NB models predict that the patient dies. The models consider a low GCS score as the most, or second-most, influential factor concerning mortality. The RF, ADA and LR models also consider the fact that the patient is ventilated at the time of the worst respiratory rate as an important factor for mortality. The patient’s young age is the factor pushing the predictions most towards survival. While the models agree on the impact of several of the factors, they disagree on the influence of temperature. This patient has a high temperature of 40.6 °C, which is considered the second most prominent factor for survival by the LR and RF models.

The SHAP values for another patient (B) are shown in Fig. 9. With a risk threshold of 0.5, all ML models, as well as the APACHE IV model (prediction 0.54), predict that this patient dies in the hospital, which is also the case. Even though the models agree on the outcome, the SHAP values vary between the models. The age and the fact that the patient was not ventilated at the time of the worst respiratory rate are factors that push the prediction towards survival. After age, the LR model also considers the low white blood count as a factor pushing the prediction towards survival, followed by vent and GCS. The opposite is the case with the ADA model, where the low white blood count pushes the prediction towards mortality. A high bilirubin level is considered quite important by all models and is the only factor deciding the NB model outcome. The predictions given by the models are all above the risk threshold of 0.5.