Predicting COVID-19 mortality risk in Toronto, Canada: a comparison of tree-based and regression-based machine learning methods
BMC Medical Research Methodology volume 21, Article number: 267 (2021)
Abstract
Background
Coronavirus disease (COVID-19) presents an unprecedented threat to global health. Accurately predicting the mortality risk among infected individuals is crucial for prioritizing medical care and mitigating the healthcare system’s burden. The present study aimed to assess the accuracy of machine learning methods for predicting COVID-19 mortality risk.
Methods
We compared the performance of classification tree, random forest (RF), extreme gradient boosting (XGBoost), logistic regression, generalized additive model (GAM) and linear discriminant analysis (LDA) to predict the mortality risk among 49,216 COVID-19 positive cases in Toronto, Canada, reported from March 1 to December 10, 2020. We used repeated split-sample validation and k-steps-ahead forecasting validation. Predictive models were estimated using training samples, and predictive accuracy of the methods for the testing samples was assessed using the area under the receiver operating characteristic curve, Brier’s score, calibration intercept and calibration slope.
Results
We found that XGBoost is highly discriminative, with an AUC of 0.9669, and has superior performance over conventional tree-based methods, i.e., classification tree or RF methods, for predicting COVID-19 mortality risk. Regression-based methods (logistic, GAM and LASSO) had comparable performance to XGBoost, with slightly lower AUCs and higher Brier’s scores.
Conclusions
XGBoost offers superior performance over conventional tree-based methods and minor improvement over regression-based methods for predicting COVID-19 mortality risk in the study population.
Background
Coronavirus disease (COVID-19), caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), presents an unprecedented threat to global health. Cases have put a great burden on medical resources, leading to a shortage of intensive care resources. Prediction of mortality risk at the individual level is crucial for helping healthcare professionals prioritize medical care for patients by facilitating resource planning, and for guiding public health policy-making to mitigate the burden on the healthcare system.
For predicting event probability, logistic regression is commonly used. In logistic regression, linear effects are often assumed for continuous covariates, which may be restrictive in many applications. In contrast, the generalized additive model (GAM) can model nonlinear covariate effects [1–3]. For regression-based approaches, correct model specification is needed to ensure consistent probability estimates, which is challenging in the case of collinearity or complex interactive effects between independent variables.
To overcome these challenges, tree-based machine learning methods, such as classification tree [4], random forest [5], and gradient boosting [6, 7], have gained popularity in the literature. One advantage of tree-based methods is that they do not require specifying the parametric nature of the relationship between continuous predictors and the outcome. Tree-based methods can also easily handle categorical predictors without the need to create dummy variables. Further, tree-based methods allow for identifying high-risk subpopulations, especially when predictors have complex interaction effects. Nevertheless, tree-based methods are prone to overfitting, resulting in low bias but high variance, which limits the generalizability and robustness of the models.
Research has been conducted comparing the performance of regression-based and tree-based methods in terms of predictive accuracy, but the results are inconclusive. Some studies concluded that classification tree and logistic regression had comparable performance [8, 9]; some concluded that logistic regression had superior performance over the tree-based methods [10, 11]; while others showed that tree-based methods outperform logistic regression [12–15]. One reason for this inconsistency is that comparative performance likely differs depending on the application and dimensionality of the data. Machine learning methods may perform better than regression-based methods when there are complex, contingent relationships between predictors and the data have high dimensionality. Thus, it is important to assess model performance for specific applications and data sources. Few studies have examined the use of machine learning methods for predicting COVID-19 mortality risk in Canada using available data sources.
If predictive models are going to be used for pandemic planning, validation to assess model robustness and performance is critical. In research on the performance of regression and machine learning methods, there is inconsistency in validation methods and how performance is assessed. Most studies used k-fold cross-validation (CV). Only a few employed repeated split-sample validation with a larger number of folds for CV to examine the robustness of the findings [8, 10]. Performance of the models at different levels of predicted probabilities is also important, as good performance overall may obscure predictive errors affecting those at different levels of risk.
The objective of this study was therefore to compare the performance of regression models and tree-based methods for predicting COVID-19 mortality in Toronto, Canada using data available in many settings. A range of individual and neighborhood-level predictors were considered. The predictive accuracy was assessed with repeated split-sample validation and forecast validation using the area under the receiver operating characteristic (ROC) curve and the Brier’s score. Predictive accuracy was also assessed at different levels of predicted probabilities.
Methods
Data description
Data on COVID-19 confirmed cases from March 1, 2020, through December 10, 2020, in the city of Toronto, Canada, were retrieved from the Ontario Ministry of Health. The outcome variable was an individual’s mortality status due to COVID-19. A range of predictors were considered. The COVID-19 epidemic is dynamic, increasing or decreasing over time, sometimes on a daily basis. It is therefore expected that time is an important predictor for COVID-19 mortality. In the dataset, the episode date, a derived/combined variable, was provided as the best estimate of when the disease was acquired and refers to the earliest available date from symptom onset (the first day that COVID-19 symptoms occurred), laboratory specimen collection date, or reported date. The time variable included in the predictive model is the number of days elapsed between the start date of the study (March 1, 2020) and the episode date. The demographic characteristics of the subject include age groups: ≤19, 20–29, 30–39, 40–49, 50–59, 60–69, 70–79, 80–89, 90+ years old and self-reported gender: males, females and others, where others represent unknown or other sexual identifications such as transgender. Toronto is divided into 140 geographically distinct neighborhoods that were established to help government and community agencies with local planning by providing meaningful social and economic ecological data from census and other sources. Neighborhood-level predictors over the 140 neighborhoods in Toronto were obtained from the 2016 Canadian Census data, including population density and average household income, which were linked to the COVID-19 data by neighborhood. Research shows that temperature is negatively associated with COVID-19 transmission [16, 17]. Therefore, the daily temperature in Toronto from March 1, 2020, until December 10, 2020, downloaded from the Government of Canada Daily Weather Data Report, was included as a predictor.
Variables describing the history of hospitalization for COVID-19 (ever hospitalized, ever in the intensive care unit (ICU), or ever intubated) were also used as predictors. These variables may be intermediate outcomes between infection and death, and may interact with other variables as predictors of mortality. For example, individual and neighborhood variables may be proxies for health status and chronic disease variables associated with serious COVID-19 outcomes. They may also be associated with differences in health care access and quality, and thus modify the relationship between intermediate hospital outcomes and death.
Predictive models for COVID-19 mortality risk
Regression methods
Logistic Regression
The logistic regression (LR) model with the logit link function can be expressed as logit(π_{i}) = X_{i}β, where π_{i} denotes the probability of mortality for the ith individual, X_{i} is the ith row of the design matrix for all the covariates, and β = (β_{0}, β_{1}, ..., β_{p})^{T} is a (p+1)×1 vector of regression coefficients. We considered two types of logistic regression models. The first model consisted of all the variables, with no variable reduction performed. In the second model, all the variables and their two-way interactions were included in the initial model. Then, the Least Absolute Shrinkage and Selection Operator (LASSO) [18] was used to exclude “unnecessary” predictors by shrinking their coefficients to exactly zero, yielding a more parsimonious model. The regularization parameter controlling the amount of shrinkage in the LASSO regression was chosen by maximizing the cross-validated Area Under the ROC curve (AUC) based on 10-fold cross-validation. The function cv.glmnet in the glmnet package [19] in R was used for implementing the LASSO method.
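As a minimal sketch of how the logit link maps covariates to a mortality probability, the following Python snippet inverts the link for a single individual. The study itself used R; the function names `inv_logit` and `predict_prob` are hypothetical, and the coefficient list is assumed to start with an intercept term.

```python
import math

def inv_logit(eta):
    """Map a linear predictor eta to a probability in (0, 1).

    The two branches avoid overflow in exp() for large |eta|.
    """
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def predict_prob(x, beta):
    """Predicted probability for one individual.

    x    : covariate values, without the intercept column (hypothetical input)
    beta : [beta_0, beta_1, ..., beta_p], intercept first (assumed layout)
    """
    eta = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return inv_logit(eta)
```

For example, `predict_prob([1.0], [-2.0, 2.0])` has a linear predictor of zero and therefore returns 0.5.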
Generalized Additive Models
Generalized additive models (GAMs) are a nonparametric regression technique providing greater flexibility in modeling nonlinear covariate effects with smoothed splines [1, 20], which can be described as \(\text{logit}(\pi_{i})=\boldsymbol{X}_{i}\boldsymbol{\beta}+\sum_{j=1}^{J}f_{j}(z_{ij})\), where X_{i} is a row of the design matrix for any parametric model component, such as age groups, gender and critical care use; β is the corresponding parameter vector; and f_{j}(z_{ij}) denotes the nonparametric spline function for the jth continuous predictor, j=1,⋯,J. A penalized log-likelihood is maximized to estimate all the parameters [1]. The smoothing parameters are estimated by the generalized cross-validation method [20]. The above model is fitted using the R package mgcv [20].
Linear Discriminant Analysis
Linear discriminant analysis (LDA) [21] models the distribution of the predictors separately in each of the response classes and then uses Bayes’ theorem to convert these back into estimates for the probability of an event. When the response variable classes are well-separated, logistic regression may be unstable, but LDA does not suffer from this problem. However, LDA assumes that the distribution of the predictors X is approximately normal in each of the classes with a common variance, which may fail to hold in some cases. LDA has closed-form solutions, so it has no hyperparameters to tune. LDA was implemented using the lda function in the R package MASS [22].
Tree-based methods
Classification Tree
The classification tree has become a popular alternative to logistic regression [4]. Unlike logistic and linear regression, a classification tree does not develop a prediction equation. The method first partitions the sample into two distinct samples according to all possible dichotomizations of all continuous variables given a threshold, and all the categorical variables. Then, the partition that yields the greatest reduction in impurity is selected. The procedure is then repeated iteratively until a pre-specified stopping rule is met. After the entire feature space is split into a certain number of simple regions recursively, the predicted probability of the event for a given subject can be calculated using the proportion of subjects who have the condition of interest among all the subjects in the subset to which the given subject belongs [23].
In this study, the classification tree model was implemented using the R package rpart [24]. At each node, the partition was chosen that maximized the reduction in misclassification error. The minimum number of observations that must exist in a node in order for a split to be attempted was 30. The maximum depth of any node of the final tree was 100. The value of the complexity parameter (cp) was set as cp = 0.001. Any split that did not decrease the overall lack of fit by a factor of cp was not attempted. To reduce the variance of the resulting models and prevent overfitting the data, the trees were then pruned by removing any split which did not improve the fit. The optimal size of each tree was determined by cross-validation using the cptable of the fitted tree, selecting the cp with the lowest cross-validation error. Pruning the tree was done using the prune function of the rpart R package.
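The split search described above can be sketched for a single continuous predictor as follows. This is an illustrative Python sketch, not the rpart implementation: it measures impurity by misclassification error (Gini impurity is a common alternative), uses midpoints between sorted distinct values as candidate thresholds, and the names `misclass_error` and `best_split` are hypothetical.

```python
def misclass_error(labels):
    """Misclassification error of a node that predicts its majority class.

    labels is a list of 0/1 outcomes; an empty node contributes zero error.
    """
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)  # proportion of events in the node
    return min(p, 1.0 - p)

def best_split(x, y):
    """Find the threshold t for the split x <= t with the largest error reduction.

    Returns (threshold, reduction); threshold is None if no split helps.
    """
    n = len(y)
    parent = misclass_error(y)
    best_t, best_gain = None, 0.0
    values = sorted(set(x))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        # Child error is the size-weighted average of the two node errors.
        child = (len(left) * misclass_error(left)
                 + len(right) * misclass_error(right)) / n
        gain = parent - child
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```

A full tree would apply `best_split` to every predictor at every node and recurse on the two child samples until the stopping rule is met; the predicted probability in each terminal node is then the proportion of events among its subjects.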
Random Forest
Classification trees tend to overfit the training dataset, which may lead to low bias, but high variance [25]. To remedy the issue of high variation in classification trees, the results from multiple trees based on bootstrap samples from the original data can be aggregated, which are referred to as ensemble methods. A common ensemble method with trees is the random forest (RF) approach [26], which is a bagging procedure to combine multiple trees based on bootstrap samples from the original data. One tree is built from each bootstrap sample by introducing recursive binary splits to the data. At a given node, rather than considering all possible binary splits on all candidate predictors, it only considers a random sample of the candidate predictor variables to lower the correlation between trees.
For this study, 1000 regression trees were grown, and the size of the set of randomly selected predictor variables used for determining each binary split was the square root of the number of predictor variables (rounded down), which is the default parameter value in the R package randomForest [5]. In contrast to the classification tree, trees of an RF are not pruned back.
Extreme Gradient Boosting
The gradient boosting tree is an ensemble method of classification trees that iteratively refits a weak classifier to the residuals of previous models, meaning that the current weak classifier is generated based on the previous one to optimize the predictive efficiency [6, 7]. Extreme gradient boosting (XGBoost) is an efficient implementation of the gradient boosting method [27], which can capture nonlinear and complex relations among input variables and outcomes in a boosting ensemble manner. Extreme gradient boosting can improve the accuracy of a classification tree [12–15].
In this study, XGBoost was implemented using the xgboost package in R, which automatically does parallel computation on a single machine, and is thus more computationally efficient than other gradient boosting packages. Hyperparameter optimization was performed to prevent overfitting of the model on the training data. Due to computational and time constraints, hyperparameter optimization was performed across a sparse parameter grid to determine the optimal combination of candidate hyperparameters, i.e., depth of the tree = 1, 2, 3, 4, 5, 6; shrinkage factor = 0.01, 0.02, 0.03, 0.04, 0.05; and maximum number of iterations = 500, 1000, 1500, 2000.
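The candidate grid listed above contains 6 × 5 × 4 = 120 combinations. A minimal Python sketch of enumerating it is shown below. The keys `max_depth`, `eta` and `nrounds` are assumed to correspond to the tree depth, shrinkage factor and number of iterations, and `score_fn` is a hypothetical user-supplied function returning a validation score (for example, a cross-validated AUC) for one combination; how the study scored each combination is not stated here.

```python
from itertools import product

# Candidate values as listed in the text.
depths = [1, 2, 3, 4, 5, 6]
shrinkage = [0.01, 0.02, 0.03, 0.04, 0.05]
n_rounds = [500, 1000, 1500, 2000]

# One dictionary per hyperparameter combination.
grid = [
    {"max_depth": d, "eta": e, "nrounds": n}
    for d, e, n in product(depths, shrinkage, n_rounds)
]

def select_best(grid, score_fn):
    """Return the combination with the highest score.

    score_fn is a hypothetical callable: it should fit a model with the
    given settings and return a validation score where larger is better.
    """
    return max(grid, key=score_fn)
```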
Predictive model assessment
Cross validation
Repeated Split-Sample Validation
Repeated split-sample validation [10] was used to compare the predictive accuracy of each statistical method. The data were randomly divided into 80% training and 20% testing datasets. Each model was fit on the training dataset. Predictions were then obtained in the testing dataset using the model derived from the training dataset. This process was repeated 200 times: in each repetition, every predictive model was refit on the new training dataset and then used to predict the mortality risk in the corresponding testing dataset. Results were then summarized over the 200 testing datasets. Repeated split-sample validation assesses the robustness of the results and is less likely to be impacted by influential observations in only a few testing samples.
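The resampling scheme can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the study's code: the function name `repeated_split_indices` and the seed are hypothetical, and model fitting and scoring are left to the caller.

```python
import random

def repeated_split_indices(n, n_repeats=200, train_frac=0.8, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated random splits.

    Each repetition reshuffles all n row indices and assigns the first
    train_frac share to training and the remainder to testing.
    """
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    cut = int(round(train_frac * n))
    for _ in range(n_repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        yield idx[:cut], idx[cut:]
```

In use, each model would be fit on the rows in `train_idx`, scored on the rows in `test_idx`, and the performance measures averaged over all repetitions.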
Forecasting Validation
We also validated the models based on the k-step-ahead predictions of the last k days of the observation period, where k=7,8,⋯,30. For each of the k-step-ahead predictions, the training dataset was all the data prior to the k days to be predicted. Each model was fit to the training dataset, and predictions were obtained for the last k days, which formed the testing dataset.
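A minimal Python sketch of this time-based split is given below. It assumes each case carries the elapsed-days time variable described in the data section; the name `forecast_split` is hypothetical, and the last observed day is taken as the end of the observation period.

```python
def forecast_split(episode_days, k):
    """Split row indices into training and testing sets for a k-day forecast.

    episode_days : elapsed days from study start to episode date, one per case
    k            : number of final days to hold out for testing

    Cases from the last k days form the testing set; all earlier cases
    form the training set.
    """
    last_day = max(episode_days)
    cutoff = last_day - k  # days up to and including cutoff are training
    train = [i for i, d in enumerate(episode_days) if d <= cutoff]
    test = [i for i, d in enumerate(episode_days) if d > cutoff]
    return train, test

# Evaluate every horizon considered in the text, k = 7, ..., 30.
horizons = list(range(7, 31))
```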
Performance measures
Discrimination of the prediction method can be measured by the area under the ROC curve (AUC) [28]. Higher values of the AUC indicate better model discrimination. AUC examines the ability of the method to distinguish whether the patients who have the outcome have higher risk predictions than those who do not, but does not account for calibration, i.e., the magnitude of the disagreement between the observed and predicted responses [28]. To quantify how close the predictions are to the actual outcome, Brier’s score [28, 29] was used, which is defined as \(\frac{1}{n}\sum_{i=1}^{n}(\hat{\pi}_{i}-Y_{i})^{2}\), where \(\hat{\pi}_{i}\) is the predicted probability in the testing set, and Y_{i} is the observed response for the ith subject in the testing set. Lower Brier’s scores indicate greater model accuracy. Performance was further quantified using a calibration model, which fits a logistic regression of the outcome variable on the logit of the predicted probabilities in the testing dataset. For a well-calibrated prediction model, the intercept of the calibration model should be zero and the slope should be one. We also assessed the models by graphically comparing the agreement of the predicted versus observed probabilities over the range of the predicted probabilities.
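The three numerical measures can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the code used in the study: the function names are hypothetical, the AUC is computed by direct pairwise comparison (simple but quadratic in sample size), and the calibration model is fit by Newton-Raphson with predictions clipped away from exactly 0 and 1 so that the logit stays finite.

```python
import math

def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)

def auc(probs, outcomes):
    """Share of (event, non-event) pairs ranked correctly; ties count as 1/2."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one event and one non-event")
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

def _sigmoid(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def calibration_intercept_slope(probs, outcomes, max_iter=50, tol=1e-10):
    """Fit logit(P(Y=1)) = a + b * logit(p_hat) and return (a, b)."""
    eps = 1e-12  # assumption: clip predictions so the logit is finite
    z = []
    for p in probs:
        p = min(max(p, eps), 1.0 - eps)
        z.append(math.log(p / (1.0 - p)))
    a, b = 0.0, 1.0  # start from perfect calibration
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for zi, yi in zip(z, outcomes):
            mu = _sigmoid(a + b * zi)
            w = mu * (1.0 - mu)
            g0 += yi - mu            # score for the intercept
            g1 += (yi - mu) * zi     # score for the slope
            h00 += w                 # information matrix entries
            h01 += w * zi
            h11 += w * zi * zi
        det = h00 * h11 - h01 * h01
        if det <= 0.0:
            raise ValueError("calibration model is not identifiable")
        # Newton step: solve the 2x2 system (information) * delta = score.
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    return a, b
```

A slope below one from `calibration_intercept_slope` suggests predictions that are too extreme, while a nonzero intercept suggests systematic over- or under-prediction.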
Results
Description of the study sample
The study sample includes n=49,216 COVID-19 positive cases, of whom 1938 (3.9%) died from COVID-19. Comparison of the sample characteristics by patients’ mortality status due to COVID-19 is reported in Table 1. The neighborhood-level variables (population density and average household income) and the daily mean temperatures are continuous predictors, which may have a nonlinear relationship with the COVID-19 mortality risk. In Table 1, these variables were categorized into four categories to describe their distributions in relation to COVID-19 mortality status; however, in the predictive models, all these variables are modeled as continuous predictors. The results presented in Table 1 show statistically significant differences in all the predictors between COVID-19-infected individuals who died and those who did not.
We included two different sets of predictors in the models. The first set included all individual and neighborhood variables, and also variables describing hospital use for COVID-19 conditions (ever hospitalized, ever in ICU and ever intubated). Hospitalization, ICU use and intubation are often intermediate outcomes between infection and mortality, and may interact with individual and neighborhood variables in predicting mortality as a result of differences in risk (e.g. due to health status and chronic disease) and quality of care. The second set of predictors included only individual and neighborhood variables, thus omitting intermediate hospital outcomes as predictors of mortality.
Comparison of predictive ability of predictive methods
Repeated split-sample validation
The predictive accuracy of the methods averaged over 200 repeated split samples is reported in Table 2. The results indicate that XGBoost yields the highest AUC at 0.9669 and the lowest Brier’s score at 0.0251. The regression-based methods (logistic, LASSO, and GAM) perform almost as well as XGBoost, with only slightly lower AUCs (0.9610 to 0.9622) and higher Brier’s scores (0.0261 to 0.0265). LDA results in a lower predictive accuracy with AUC at 0.9559 and the highest Brier’s score at 0.0471. Among the tree-based methods, the classification tree yields the lowest AUC at 0.9450 and the highest Brier’s score at 0.0271. RF provides an improvement over the classification tree with a higher AUC value at 0.9552 and a lower Brier’s score at 0.0270. However, neither the classification tree nor RF performs as well as XGBoost. Excluding history of hospital use for COVID-19 conditions as predictors results in worse predictive accuracy for all types of models. Nevertheless, the relative performance of the methods is consistent with the results when including hospital use as predictors. For ease of comparison, the distributions of the AUC and Brier’s test scores over the 200 repeated samples for all the methods are displayed in Fig. 1.
In the calibration assessment (Table 2), XGBoost and LASSO have a calibration intercept closest to zero and calibration slope closest to one as compared to the other methods. Logistic and GAM result in slightly worse calibration compared to XGBoost and LASSO. Of the tree-based methods, RF has much worse calibration than the classification tree, and neither is comparable with the XGBoost method. LDA has the worst performance in terms of calibration.
A graphical assessment of calibration presents predictions on the x-axis, and the outcome on the y-axis [30]. Perfect predictions are on the 45-degree line. Further, examining calibration at various levels of predictive probability provides additional insights into the agreement between predicted and observed mortality risk. Ideally, a calibration measure would compare the predicted probability with the true probability for each individual, but the measurement of actual probability for a single individual is challenging. Forming groups of individuals and calculating the proportion of positive outcomes is an approach to calculating the observed or true probability of an event or outcome, which is the central idea of the Hosmer-Lemeshow (HL) test [31]. There are two popular ways of grouping individuals: (1) group using deciles of predicted probability, and (2) group using equal intervals according to the predicted probability. We adopted the latter grouping method to graphically demonstrate the calibration of the predictive methods at various levels of predictive probability. This is achieved by splitting the individuals into 10 equally spaced groups between 0 and 1 according to their predicted probabilities of COVID-19 mortality. Model calibration can then be assessed graphically by plotting the mean predicted versus observed event rates for the 10 groups, thus providing information on the direction or magnitude of miscalibration [30]. The results are presented in Figs. 2 and 3 for the case with and without history of hospital use for COVID-19, respectively. The graphs reveal that the points in the lower risk intervals are closer to the 45-degree diagonal line. By contrast, the points in the higher risk intervals are more dispersed, which can be explained by the fact that very few patients had predicted risk above 0.8 and the prediction above this threshold appears to be less well-calibrated. Most of the methods suffer from the overprediction of risk in the high-risk groups.
XGBoost appears to provide better calibration with points more closely distributed around the 45-degree diagonal line across the groups. When hospital use for COVID-19 variables are not included as predictors, the predicted probabilities are mostly below 0.8, as shown in Fig. 3. This indicates that the distributions of predicted risk of mortality are less spread out compared to models that include hospital use variables as predictors.
A better discriminating model has more dispersed predictive probabilities than a poorly discriminating model. Therefore, the distributions of the predicted mortality probability based on a random sample of the 200 repeated split samples for all the methods are displayed in Fig. 4. The distribution of the predicted mortality probability based on all the repeated split samples yielded very similar results, so only one random sample is presented for simplicity of illustration. The distributions of the predictive mortality probability are highly right-skewed, so the predicted probabilities below 0.2 are suppressed for better visualization of the higher predictive risk. As shown in Fig. 4, all of the predictive methods with hospital use variables as predictors, except for the LDA method, had longer right tails in the predicted mortality probability compared to the counterpart models that omit hospital use variables as predictors. Therefore, it is expected that the methods including hospital use variables as predictors have better discrimination and calibration performance compared to the methods omitting hospital use variables as predictors.
Evaluating differences in the importance of predictors provides additional insight into model differences. The importance of predictors in order of significance with and without history of hospital use for COVID-19 variables is presented in Fig. 5. Predictors with the largest influence varied considerably between the different methods. For the XGBoost method, age is the strongest predictor, followed by reporting time. Of the history of hospital use variables, ever in hospital is the strongest predictor. The neighborhood-level factors (population density and average income) and temperature also contribute to the prediction. Gender contributes least to the prediction.
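The equal-interval grouping behind the calibration plots can be sketched as follows. This Python snippet is an illustration only: the function name `calibration_bins` is hypothetical, a predicted probability of exactly 1 is assigned to the top group, and empty groups are omitted from the output.

```python
def calibration_bins(probs, outcomes, n_bins=10):
    """Group cases into equal-width probability intervals on [0, 1].

    Returns one tuple per non-empty group:
    (group index, mean predicted probability, observed event rate, group size).
    """
    pred_sum = [0.0] * n_bins
    events = [0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, outcomes):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last group
        pred_sum[b] += p
        events[b] += y
        counts[b] += 1
    rows = []
    for b in range(n_bins):
        if counts[b] > 0:
            rows.append((b, pred_sum[b] / counts[b],
                         events[b] / counts[b], counts[b]))
    return rows
```

Plotting the second element of each tuple against the third gives the calibration curve; points below the 45-degree line correspond to groups where risk is over-predicted.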
Forecasting validation
The predictive accuracy of all methods for predicting daily COVID-19 mortality risk over the last 7 to 30 days of the observational period is reported in Table 3. Notably, compared to repeated split-sample validation, the predictive accuracy of all the methods for forecasting, as measured by AUC, tends to be higher. XGBoost yields the highest AUC of 0.9866 and the lowest Brier’s score of 0.0091. The regression-based methods (logistic, LASSO and GAM) again perform nearly as well as the XGBoost method, with AUC ranging from 0.9819 to 0.9842 and Brier’s score ranging from 0.0094 to 0.0096, with LASSO being the method most comparable to XGBoost. Among the tree-based methods, the classification tree results in the lowest AUC value at 0.9781 and highest Brier’s score at 0.0098. RF improved over the classification tree with a higher AUC at 0.9808 and a lower Brier’s score at 0.0096. Despite the higher AUC values for forecasting CV compared to repeated split-sample CV, the calibration for forecasting CV tends to be poorer.
The predictive ability of the methods for forecasting mortality risk for the last 7 to 30 days at the end of the observational period is displayed graphically in Fig. 6. The results indicate that the accuracy of all the methods tends to decrease as the number of forecast days increases. XGBoost consistently outperforms the other methods over the forecasting time window. Interestingly, the superior performance of XGBoost over the regression-based methods in terms of AUC is more substantial in the scenario when the history of hospital use predictors are included, compared to the scenario when they are omitted. This indicates that the hospital use predictors may have complex interactive effects with the rest of the predictors for predicting the mortality risk. By contrast, in the scenario omitting hospital use predictors, logistic regression performs equivalently to XGBoost. In this case, with only a few predictors being considered, the advantage of XGBoost in identifying complex relationships between input variables and the outcomes is less pronounced.
Discussion
This article compared regression and tree-based machine learning methods for predicting COVID-19 mortality risk in Toronto, Canada. This investigation demonstrates that predictive models based on machine learning methods, applied to available data, can provide important insights to inform resource planning for health care services to address the burden of the COVID-19 pandemic.
Our findings revealed that by applying machine learning methods to data with a few easily accessible predictor variables, including age, hospital use for COVID-19, episode date, gender, and neighborhood demographic and economic characteristics, it is possible to predict the risk of COVID-19 mortality with a high degree of predictive power. Our findings also provide insight into the best choice of machine learning methods to use. We found that XGBoost outperforms the conventional regression tree methods, probably because it is a regularized model formalization to control overfitting. We fit three separate logistic regression models: main effect only, GAM and LASSO. The LASSO’s predictive performance is slightly better than the main effect only method, which indicates that interactions among some predictors may exist. Compared to the logistic regression, GAM yielded an almost identical model fit, which implies that assuming linear relationships between input variables and the outcomes might be adequate in this study. However, note that we did not include two-way interactions in the GAM method due to model fitting complexity. For this reason, drawing conclusions about the appropriateness of the linearity assumption may be premature. In this study, we only considered a few predictors. As the number of correlated and interactive predictors increases, LASSO would likely outperform the other regression-based methods. When nonlinear covariate effects are pronounced, GAM is expected to outperform the conventional logistic regression methods. LDA resulted in the worst predictive accuracy in this study, which indicates that the assumptions of LDA do not hold (i.e., predictors in this study are likely not drawn from a Gaussian distribution with a common covariance matrix in each class).
There are limitations to this study that merit discussion. One major limitation of this study is the unavailability of data on clinical characteristics of patients, such as comorbidities. Recent research has identified certain chronic health conditions and risk factors (e.g. obesity) as strong predictors of prognosis and severity of progression for COVID-19 [32]. These crucial pieces of information are not readily available in publicly accessible data, but could be obtained from administrative health databases. Another potential limitation is the inclusion of hospitalization, ICU use, and intubation for COVID-19 as predictors. While they are clearly important predictors, the interpretation of these predictors and the policy implications of including them in models need to be considered. They may be proxies for patients’ underlying health status, or proxies for access to and quality of care. They are also intermediate health outcomes prior to most COVID-19 deaths. Another limitation is that we did not consider support vector machine techniques or neural networks [25], which could be alternative approaches for predicting COVID-19 mortality risk.
Despite the limitations, our findings revealed that by focusing on a few easily accessible variables, including age, past hospital use for COVID-19, episode date, gender, and neighborhood demographic and economic characteristics, it is possible to predict the risk of mortality with high predictive power in the studied population.
Conclusion
The study demonstrates that high predictive accuracy for COVID-19 mortality risk can be achieved based on publicly available data in the studied population. This study provided a careful assessment of the predictive accuracy of the regression and tree-based machine learning methods for predicting COVID-19 mortality risk among confirmed cases in the study region. Although the prediction model established in our study included only a few easily accessible variables, XGBoost and LR-based methods have high predictive power, with XGBoost resulting in slightly better performance. This type of data-driven risk prediction may assist health resource planning for COVID-19.
Availability of data and materials
The datasets generated and/or analysed during the current study are available in the City of Toronto’s Open Data Portal, https://open.toronto.ca/
Abbreviations
COVID19: coronavirus disease
XGBoost: extreme gradient boosting
GAM: generalized additive model
LDA: linear discriminant analysis
LASSO: least absolute shrinkage and selection operator
Tree: classification tree
RF: random forest
AUC: area under the receiver operating characteristic (ROC) curve
ICU: intensive care unit
CV: cross validation
References
Hastie T, Tibshirani R. Generalized Additive Models. New York: Chapman and Hall; 1990.
Wood S. Stable and efficient multiple smoothing parameter estimation for generalized additive models. J Am Stat Assoc. 2004; 99(467):673–86.
Wood S. J Royal Stat Soc Series B (Stat Methodol). 2011; 73(1):3–36.
Breiman L, Friedman J, Olshen R, Stone C. Classification and Regression Trees (The Wadsworth Statistics/probability Series). Belmont, California: Wadsworth International Group; 1984.
Liaw A, Wiener M. Classification and Regression by randomForest. R News. 2002; 2(3):18–22.
Friedman J. Greedy function approximation: a gradient boosting machine. Annals Stat. 2001; 29(5):1189–232.
Friedman J. Stochastic gradient boosting. Comput Stat Data Anal. 2002; 38(4):367–78.
James K, White R, Kraemer H. Repeated split sample validation to assess logistic regression and recursive partitioning: an application to the prediction of cognitive impairment. Stat Med. 2005; 24(19):3019–35.
Garzotto M, Beer T, Hudson R, Peters L, Hsieh Y, Barrera E, Klein T, Mori M. Improved detection of prostate cancer using classification and regression tree analysis. J Clin Oncol. 2005; 23(19):4322–9.
Austin P. A comparison of regression trees, logistic regression, generalized additive models, and multivariate adaptive regression splines for predicting AMI mortality. Stat Med. 2007; 26(15):2937–57.
Das A, Mishra S, Gopalan S. Predicting CoVID19 community mortality risk using machine learning and development of an online prognostic tool. PeerJ. 2020; 8:e10083.
Hu C, Chen C, Fang Y, Liang S, Wang H, Fang W, Sheu C, Perng W, Yang K, Kao K, Wu C, Tsai C, Lin M, Chao W. Using a machine learning approach to predict mortality in critically ill influenza patients: a crosssectional retrospective multicentre study in Taiwan. BMJ Open. 2020; 10(2):e033898.
Liu J, Wu J, Liu S, Li M, Hu K, Li K. Predicting mortality of patients with acute kidney injury in the ICU using XGBoost model. PLOS ONE. 2021; 16(2):1–11.
Yao R, Jin X, Wang G, Yu Y, Wu G, Zhu Y, Li L, Li Y, Zhao P, Zhu S, Xia Z, Ren C, Yao Y. A machine learningbased prediction of hospital mortality in patients with postoperative sepsis. Front Med. 2020; 7:445.
Heldt F, Vizcaychipi M, Peacock S. Early risk assessment for COVID19 patients from emergency department data using machine learning. Sci Rep. 2021; 11(4200).
Wang J, Tang K, Feng K, Lin X, Lv W, Chen K, Wang F. Impact of Temperature and Relative Humidity on the Transmission of COVID19: A Modeling Study in China and the United States. BMJ Open. 2021; 11(2).
Sajadi M, Habibzadeh P, Vintzileos A, Shokouhi S, MirallesWilhelm F, Amoroso A. Temperature, Humidity, and Latitude Analysis to Estimate Potential Spread and Seasonality of Coronavirus Disease 2019 (COVID19). JAMA Network Open. 2020; 3(6):2011834.
Tibshirani R. Regression shrinkage and selection via the lasso. J Royal Stat Soc Ser B (Methodol). 1996; 58(1):267–88.
Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010; 33(1):1–22.
Wood S. Generalized Additive Models: an Introduction with R. Boca Raton: CRC Press; 2017.
McLachlan G. Discriminant Analysis and Statistical Pattern Recognition. New Jersey, United States: Wiley; 2004.
Venables W, Ripley B. Modern Applied Statistics with S, 4th edn. New York: Springer; 2002.
James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning: With Applications in R. New York: Springer; 2017.
Therneau T, Atkinson B. Rpart: Recursive Partitioning and Regression Trees. R package version 4.1-15. 2019. https://CRAN.R-project.org/package=rpart.
Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning, (2nd Ed.) New York: Springer; 2008.
Breiman L. Random forests. Mach Learn. 2001; 45(1):5–32.
Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16. New York, NY, USA: Association for Computing Machinery; 2016. p. 785–94.
Harrell F. Regression Modeling Strategies: with Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. New York: Springer; 2015.
Rufibach K. Use of Brier score to assess binary predictions. J Clin Epidemiol. 2010; 63(8):938–9.
Steyerberg E, Vickers A, Cook N, Gerds T, Gonen M, Obuchowski N, Pencina M, Kattan M. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology. 2010; 21(1):128–38.
Hosmer D, Lemeshow S. Goodness of fit tests for the multiple logistic regression model. Commun Stat Theory Meth. 1980; 9(10):1043–69.
Guan W, Liang W, Zhao Y, Liang H, Chen Z, Li Y, Liu X, Chen R, Tang C, Wang T, Ou C, Li L, Chen P, Sang L, Wang W, Li J, Li C, Ou L, Cheng B, Xiong S, Ni Z, Xiang J, Hu Y, Liu L, Shan H, Lei C, Peng Y, Wei L, Liu Y, Hu Y, Peng P, Wang J, Liu J, Chen Z, Li G, Zheng Z, Qiu S, Luo J, Ye C, Zhu S, Cheng L, Ye F, Li S, Zheng J, Zhang N, Zhong N, He J. Comorbidity and its impact on 1590 patients with Covid19 in China: A Nationwide Analysis. Eur Respir J. 2020; 55(5):2000547.
Acknowledgements
The authors would like to thank the Editor and reviewers for their suggestions and comments, which significantly helped to improve the quality of this manuscript. The authors would also like to acknowledge the support from the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants. This research was enabled in part by support provided by WestGrid (www.westgrid.ca) and Compute Canada/Calcul Canada (www.computecanada.ca).
Funding
This research was supported by a Discovery Grant from the Natural Sciences and Engineering Research Council (NSERC) of Canada. This research was enabled in part by support provided by Compute Canada.
Author information
Contributions
C.F. and G.K. conceived and planned the study. C.F. conducted all the analyses and wrote the paper with input from G.K. All authors (C.F., G.K. and E.J.C.) contributed to discussions and the writing of the manuscript. All authors reviewed and approved the final version of the manuscript.
Ethics declarations
Ethics approval and consent to participate
The study was carried out in accordance with the ethical guidelines of Dalhousie University.
Consent for publication
Not applicable
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Feng, C., Kephart, G. & JuarezColunga, E. Predicting COVID19 mortality risk in Toronto, Canada: a comparison of treebased and regressionbased machine learning methods. BMC Med Res Methodol 21, 267 (2021). https://doi.org/10.1186/s12874-021-01441-4
DOI: https://doi.org/10.1186/s12874-021-01441-4