- Open Access
Predicting postoperative surgical site infection with administrative data: a random forests algorithm
BMC Medical Research Methodology volume 21, Article number: 179 (2021)
Since primary data collection can be time-consuming and expensive, surgical site infections (SSIs) could ideally be monitored using routinely collected administrative data. We derived and internally validated efficient algorithms to identify SSIs within 30 days after surgery with health administrative data, using Machine Learning algorithms.
All patients enrolled in the National Surgical Quality Improvement Program from the Ottawa Hospital were linked to administrative datasets in Ontario, Canada. Machine Learning approaches, including a Random Forests algorithm and the high-performance logistic regression, were used to derive parsimonious models to predict SSI status. Finally, a risk score methodology was used to transform the final models into the risk score system. The SSI risk models were validated in the validation datasets.
Of 14,351 patients, 795 (5.5%) had an SSI. First, separate predictive models were built for three distinct administrative datasets. The final model, including hospitalization diagnostic, physician diagnostic and procedure codes, demonstrated excellent discrimination (C statistics, 0.91, 95% CI, 0.90–0.92) and calibration (Hosmer-Lemeshow χ2 statistics, 4.531, p = 0.402).
We demonstrated that health administrative data can be effectively used to identify SSIs. Machine learning algorithms have shown a high degree of accuracy in predicting postoperative SSIs and can integrate and utilize a large amount of administrative data. External validation of this model is required before it can be routinely used to identify SSIs.
Surgical site infection (SSI) is common and considered one of the most common types of postoperative complications . SSIs are associated with substantial morbidity and mortality, prolonged hospital duration of stay, increased hospital readmission rate, and financial burden to health care systems [1,2,3,4,5]. Previous research has shown the importance of effective prevention strategies targeting both short- and long-term consequences of SSI, which requires an ability to track SSIs . Since the primary data collection can be time-consuming and expensive, routinely collected health administrative data offer ample opportunities to identify and monitor SSIs, and assess the impact of prevention strategies, given a wide population coverage and minimal costs and efforts. Several studies have developed some accurate administrative algorithms to identify SSIs [6,7,8,9,10], while other studies have found that SSI identification using administrative data is imprecise . However, previous studies were often based on small sample sizes and/or a limited set of pre-selected variables to predict SSIs.
Machine learning approaches have been successfully applied to create predictive models in several fields of study, including automatic medical diagnostics [12, 13]. With interpretability of model parameters and ease of use, logistic regression can generate excellent models and serve as a commonly accepted statistical tool. Random Forests approach is used in situations where regression assumptions may be violated by situations in which many predictors are associated with a small number of outcomes . It can cope with inter-correlation between multiple explanatory variables, since each predictor is selected randomly for each stage of the learning process , unlike standard regression approaches. Previous studies have indicated that the Random Forests approach may have better prediction accuracy than other machine learning methods [16, 17]. We hypothesized that the use of machine learning approaches and a large data set with many features will improve the accuracy of SSI prediction. This study aimed to develop efficient algorithms to identify SSIs within 30 days after surgery using health administrative data.
Material and methods
This study was divided into three stages. In the first stage, a Random Forests algorithm was used to perform a preliminary screening of variables and to rank the importance of candidate variables. In the second stage, the 30 most important variables from the first stage were input into the high-performance logistic regression to build interpretable and parsimonious models for all three administrative datasets used in this study. Finally, we used risk score modeling methodology to transform the final logistic models form the second stage into the risk score system.
Selection and description of participants
This study was performed at The Ottawa hospital (TOH), Canada, a 1200-bed academic health sciences center providing approximately 90% of the major surgical operations in a catchment area of 1.2 million people. We identified all patients at TOH aged 18 years and older who underwent surgery and were included in the American College of Surgeons National Surgical Quality Improvement Program (NSQIP) data collection, between April 1, 2010, and March 31, 2015. The NSQIP uses trained Surgical Clinical Reviewers to collect data using a combination of chart review and follow up from the preoperative period through 30 days postoperatively. Patients were excluded if: 1) they were not eligible for the Ontario Health Insurance Program (OHIP) or had an invalid OHIP number, because this was required for linkage to health administrative datasets; or 2) they had missing admission, discharge, or surgery dates.
Population-based health administrative datasets
We linked the NSQIP dataset to three distinct population-based, health administrative datasets housed at the Institute for Clinical and Evaluative Sciences (ICES). ICES is an independent, non-profit research institute whose legal status under Ontario’s health information privacy law allows it to collect and analyze health care and demographic data, without informed consent, for health system evaluation and improvement. The use of data in this project was authorized under section 45 of Ontario’s Personal Health Information Protection Act, which does not require review by a Research Ethics Board. The datasets included: 1) the Discharged Abstract Database and Same Day Surgery Database to identify the records of the hospitalization (ICD-10 code), including admission and discharge dates, diagnoses, 2) the Physician Services Database to retrieve all claims for services provided by all eligible health care providers, and 3) the Ontario Health Insurance Plan (OHIP) database that contains physician diagnostic codes (ICD-9 codes) and diagnosis descriptions. All patients were followed for 30 days from the time of their surgery. All databases were linked using anonymized unique identifiers and analyzed at the ICES at the University of Ottawa, Ontario. This study was approved by the Ottawa Health Science Network Research Ethics Board.
All individuals who had any type of SSIs (i.e. superficial, deep, or organ space) (Additional file 1) within 30 days after surgery, according to the definition of the NSQIP protocol, were defined as having experienced an SSI.
This study utilized a 3-stage predictive modeling based on the hybrid modeling approaches developed in previous studies [14, 15, 18]. All stages described below were applied to each administrative dataset used in this study to generate three sub-models that contributed to the omnibus SSI model.
Stage 1 – model development using random forests algorithm
Details of Random Forests method have been described elsewhere [19,20,21]. In short, each of the classification trees is built using a bootstrap sample of the data, and a random subset of variables was selected at each split, thereby constructing a large collection of decision trees with controlled variation [22, 23] (Additional file 2). The Random Forests trees are not pruned, so as to obtain low-bias trees. Every tree in the forest casts a “vote” for the best classification for a given observation, and the class receiving most votes results in the prediction for that observation. The study cohort was first divided randomly into derivation (70%) and validation (30%) samples (Additional file 3). Then, the derivation data was sampled to create an in-bag partition – (2/3) to construct the decision tree, and a smaller out-of-bag partition (1/3) to test the constructed tree to evaluate its performance by computing: 1) misclassification error, 2) C-statistics, and 3) model performance (sensitivity, specificity, etc.). The optimal number of trees and a subset of variables at each node were selected using the “tuneRF” function in R to minimize the misclassification error. Random Forests calculates estimates of variable importance for classification using permutation variable importance measure (VIM) , which is based on the decrease of a classification accuracy when values of a variable in a node of a tree are permuted randomly. Finally, K-fold cross validation was used to evaluate the Random Forests model with 10 folds. We identified subsets of top 30 important diagnostic or procedure codes to predict SSIs, using a mean decrease accuracy value of 0.02 as a cut-off point. The Random Forests analyses were performed in R statistical software (3.3.2.) using “randomForest” package .
Stage 2 – stepwise model selection using high-performance logistic regression approach
Random forests algorithm was used to perform a preliminary screening of variables and to gain importance ranks. Then, the selected top-30 important predictors were input into the high-performance logistic model with stepwise variable selection to find the best parsimonious model to predict SSIs [14, 24, 25]. High-performance logistic regression (proc hplogisitc) belongs to the high-performance analytics procedures that can be used to reduce the dimension or identify important variables to obtain parsimonious predictive models . It permits several link functions and can handle ordinal and nominal data with more than two response categories . The Schwarz Bayesian Criterion (SBC) was used as a penalized measure of fit for logistic regression model to help avoid the model over-fitting.
Stage 3 – point system or risk scores
We used the methods suggested by Sullivan et al.  to summarize each logistic model from stage 2 as a point system. The point system or risk scores provide statistical information in a more clinically useful form than logistic regression models, as generalizability of the models developed from data from a single or a small group of hospitals to other patient populations is questionable [28, 29]. Clinical prediction models and associated risk-scoring systems are popular statistical methods as they permit a rapid assessment of patient risk without the use of computers or other electronic devices . The use of such points-based systems facilitates evidence-based clinical decision making . The point system developed in this study was designed to predict the risk of postoperative SSIs, based on a patient’s pre-procedural risk factors or predictors. The point score assigned to each predictor was derived from a well-fit logistic regression model.
The point scores were developed for hospitalization (ICD-10) and physician (ICD-9) diagnostic codes, and physician procedure claims. All variables in the models were categorical, and the distance between a variable and its base category in regression coefficient units was equal to the size of the coefficient. For each variable, its distance from the base category in regression coefficient units was divided by this constant and rounded to the nearest integer to get its point value.
Then, the obtained point scores were input into logistic regression model and adjusted for other potential confounding factors suggested by the existing literature, including age, sex, surgical procedure, emergency case, concurrent surgical procedures, patient’s physical status (ASA-5), and duration of surgery. The full model discrimination (C statistics or AUC) and calibration (Hosmer-Lemeshow (H-L) statistics) were assessed in the validation dataset. All methods were performed in accordance with the guidelines for developing and reporting Machine Learning predictive models in biomedical research . The high-performance regression and point score assignment were performed in SAS 9.4 statistical software.
We identified 14,351 patients who underwent surgery from April 1, 2010 to March 31, 2015 and were enrolled into NSQIP at our hospital. An SSI was identified in 795 (5.5%) of these patients. Of these, 540 (68%) had superficial SSIs and 255 (32%) had deep or organ space SSIs. Descriptive statistics for patients in the study sample are reported in Additional file 4. The derivation and validation datasets were similar in terms of baseline covariates (Additional file 5).
Predictive modeling for hospitalization diagnostic codes (ICD-10)
We identified 3085 hospitalization diagnostic (ICD-10) codes recorded within 30 days following the surgery date. These codes then were clustered into 994 three-digit hospitalization diagnostic codes that were used for the further analyses.
Stage 1: Given a large number of diagnostic codes (possible predictors), the Random Forests approach was used to identify a subset of top important 30 hospitalization diagnostic codes that best predicts classification. We used 800 classification trees and 46 variables available for splitting at each tree node. The accuracy of the Random Forests model was 95.3%. The resulting SSI prediction model demonstrated positive predictive value (PPV) of 98%, negative predictive value (NPV) of 97%, and AUC (area under the receiver operating characteristic curve) of 0.78 (95% CI 0.77–0.79). The accuracy of the Random Forests model after a 10-fold cross-validation was 94.3%. Figure 1 presents the top 30 hospitalization diagnostic (ICD-10) codes for classification of SSIs that have been identified using the permutation VIM.
Stage 2: The identified top 30 hospitalization diagnostic codes (ICD-10) codes were input into the high-performance logistic regression with a stepwise selection to identify the best parsimonious model to predict SSIs. Table 1, model 1 presents the final model of six hospitalization diagnostic codes to identify SSIs (AUC 0.87, 95% CI 0.86–0.89).
Stage 3: Risk scores for the final model of hospitalization diagnostic (ICD-10) codes are presented in Table 1, Model 1 . Among the entire cohort, 80.3% of patients had a score of 0, 11.8% had a score of 1, and 7.9% had a score equal or greater than 2.
Predictive modeling for physician diagnostic (ICD-9) codes.
We identified 442 physician diagnostic 3-digit codes (using ICD-9-CA) recorded within 30 days following the surgery date.
Stage 1: Given a large number of diagnostic codes (possible predictors), the Random Forests approach was used to identify a subset of 30 physician diagnostic codes that best predicts SSIs. The best misclassification rate was achieved by using 800 classification trees and 31 variables available for splitting at each tree node. The accuracy of the Random Forests model was 94.7%. The resulted SSI prediction model demonstrated PPV of 98%, NPV of 96%, and AUC of 0.82 (95% CI 0.81–0.83). The accuracy of the model after a 10-fold cross-validation was 94.1%. Figure 2 presents the top 30 important physician diagnostic (ICD-9) codes for prediction of SSIs that have been identified using VIM.
Stage 2: The identified top 30 physician diagnostic codes were input into the high-performance logistic regression model to identify the best parsimonious model for prediction of SSIs, using a stepwise selection approach. Table 1, Model 2 presents the final models of nine physician diagnostic codes to identify SSIs (AUC 0.85, 95% CI 0.84–0.86).
Stage 3: Risk scores for the final model of physician diagnostic codes are presented in Table 1, Model 2 . Among the entire cohort, 77.8% of patients had a score of 0, 7.7% had a score of 1, and 14.5% had a score equal or greater than 2.
Predictive modeling for physician procedure claims
We identified 2543 physician procedure claims recorded within 30 days following the surgery date. These codes then were clustered into 610 three-digit codes that were used for the further analyses.
Stage 1: Given a large number of physician procedure codes (possible predictors), Random forests approach was used to identify a subset of 30 physician procedure claims that best predicts SSIs. The best misclassification rate was achieved by using 1000 classification trees and 37 variables available for splitting at each tree node. The accuracy of the Random Forests model was 94.8%. The resulted SSI prediction model demonstrated PPV of 99%, NPV of 97%, and AUC of 0.82 (95% CI 0.81–0.83). The accuracy of the model after a 10-fold cross-validation was 94.4%. Figure 3 presents the top 30 physician procedure claims that have been identified using the permutation VIM.
Stage 2: The identified top 30 physician procedure claims were input into the high-performance logistic regression model to identify the best parsimonious model for prediction of SSIs. We used a stepwise variable selection approach. Table 1, Model 3 presents the final models of 14 physician procedure claims to identify SSIs (AUC 0.84, 95% CI 0.83–0.85).
Stage 3: Risk scores for the final model of physician procedure claims are presented in Table 1, Model 3 . Among the entire cohort, 55.4% of patients had a score of 0, 11.9% had a score of 1, and 44.6% had a score equal or greater than 2.
Full model with total risk score of diagnostic and procedure codes
In the derivation cohort, the total scores of hospitalization diagnostic (ICD-10) codes, physician diagnostic (ICD-9) codes and physician procedure claims were included in the logistic regression model and adjusted for potential confounding factors, including surgical specialties, age, sex, duration of surgery, emergency case, ASA class and concurrent surgical procedures (Table 2).
The full model had excellent discrimination (AUC 0.91; 95% CI, 0.90–0.92) and calibration (H-L statistics, 4.53, p = 0.402). The predicted probability threshold with the optimal operating characteristics  (e.g., the square of distance between the point (0, 1) on the upper left hand corner of ROC space and any point on ROC curve) was a predicted risk of 4% (sensitivity, 83.4%; specificity, 89.2%; PPV, 34.2%; and NPV, 99.1%). In the internal validation cohort, the full model remained strongly discriminative (AUC 0.89, 95% CI 0.88–0.90) and well calibrated (H-L statistics, 6.47, p = 0.487) (Fig. 4).
We used a 3-stage predictive modeling approach to derive and internally validate models to predict SSIs within 30 days after surgical procedure. To the best of our knowledge, this is the first study that used Machine Learning approaches to develop efficient algorithms for identifying SSIs within 30 days after surgery by use of health administrative data. The key finding of our study is that the risk of SSIs can be reliably estimated using routinely collected administrative data, including physician procedure claims, hospital (ICD-10) and physician (ICD-9) diagnostic codes. Our study results demonstrate high performance of the Random Forests algorithm for prediction of SSIs without pre-selection of possible predictors given a small number of cases. We derived a relatively small set of variables to identify postoperative SSIs, including 6 hospital diagnostic codes, 9 physician diagnostic codes, and 14 physician procedure claims.
Several studies have examined the use of administrative data to identify postoperative SSIs [6,7,8,9,10]. Our study findings are consistent with these studies [6, 10]. van Walraven et al. , for example, found that administrative data, including hospital diagnostic, emergency department visit codes and physician procedure claims, can be effectively used to identify postoperative patients with a low risk of having SSIs within 30 days of their surgical procedure. In particular, the predictive probability threshold with the optimal characteristics was a predicted risk of 5% (sensitivity, 82.1%, specificity, 85.6%, PPV, 27.7%). Additionally, Sands et al. found that  automated medical and claim records together can be used to screen for post discharge SSIs, but the method they used identified only 10% of procedures as possible infections.
The approach used in our study added a new contribution to the existing literature by incorporating much larger set of features as compared with the previous studies. It was possible to include all available diagnostic or procedure codes to identify SSIs in this study, because Random Forests approach is generally unaffected by the addition of irrelevant features and is robust to collinearity due to the use of subsets of random variables for tree splits. All the features included in this study were obtained from routinely collected data, and given the complex etiology of SSIs, there might be variables that would be overlooked if we used a narrower search strategy guided by a priori clinical expectations. It would be inappropriate to interpret the identified diagnostic or procedure codes as either causes or consequences of SSIs. Random Forests allows us to select variables that are influencing prediction given a small sample sizes and the extremely small ratio of samples to variable (large “p” and small “n”). If the identified important variables are consistent with clinical knowledge, there will be more confidence in the derived model as a decision support tool.
Several aspects of our study should be carefully considered. First, our study contained no information about outpatient antibiotic treatments because the Ontario health administrative data used for the study captures medication use for people over the age of 65 and who are on social assistance. Also, we did not include information about laboratory tests, because the Ontario health administrative data captures information only on outpatient laboratory tests, while laboratory tests performed during hospitalization are most important in predicting SSIs. Thus, information about antibiotic use and laboratory test could substantially improve SSI identification. Second, our study and model captured SSIs that occurred within 30 days after surgical procedure, so any SSI that occurred outside of this timeframe would have been missed. Third, our study was conducted in a single teaching hospital, providing about 90% of the major surgical operations in a catchment area of 1.2 million people. Therefore, external validation is necessary to measure model’s utility in other hospitals and geographic regions. Finally, the coding systems used in the province of Ontario might not be available in other jurisdictions. Therefore, some modifications might be required before using our models in other health regions.
This study shows that health administrative data could be effectively used in identifying SSIs. Machine learning approaches have shown a high degree of accuracy in predicting postoperative SSIs and can integrate and utilize a large amount of administrative data. The results of our study are useful in advancing current and future efforts to use administrative data for patient safety surveillance and improvement. Further research should examine the use of machine learning approaches to identify SSIs, stratified by the specific types of surgical procedures.
Availability of data and materials
The data that support the findings of this study are available at the Institute for Clinical Evaluative Sciences (ICES) (www.ices.on.ca/DAS), but restrictions apply for the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of ICES.
Surgical site infections
National Surgical Quality Improvement Program
Ontario Health Insurance Plan
Classification of Diseases
Patient’s physical status
Positive predictive value
Negative predictive value
Variable importance measure
Adjusted odds ratio
Area under curve
Pittet D, Harbarth S, Ruef C, Francioli P, Sudre P, Petignat C, et al. Prevalence and risk factors for nosocomial infections in four university hospitals in Switzerland. Infect Control Hosp Epidemiol. 1999;20(1):37–42.
Petrosyan Y, Thavorn K, Maclure M, Smith G, McIsaac DI, Schramm D, et al. Long-term health outcomes and health system costs associated with surgical site infections: a retrospective cohort study. Ann Surg. 2019.
Jenks PJ, Laurent M, McQuarry S, Watkins R. Clinical and economic burden of surgical site infection (SSI) and predicted financial consequences of elimination of SSI from an English hospital. J Hosp Infect. 2014;86(1):24–33.
Whitehouse JD, Friedman ND, Kirkland KB, Richardson WJ, Sexton DJ. The impact of surgical-site infections following orthopedic surgery at a community hospital and a university hospital: adverse quality of life, excess length of stay, and extra cost. Infect Control Hosp Epidemiol. 2002;23(4):183–9.
Badia JM, Casey AL, Petrosillo N, Hudson PM, Mitchell SA, Crosby C. Impact of surgical site infection on healthcare costs and patient outcomes: a systematic review in six European countries. J Hosp Infect. 2017;96(1):1–15.
van Walraven C, Jackson TD, Daneman N. Derivation and validation of the surgical site infections risk model using health administrative data. Infect Control Hosp Epidemiol. 2016;37(4):455–65.
Grammatico-Guillon L, Baron S, Gaborit C, Rusch E, Astagneau P. Quality assessment of hospital discharge database for routine surveillance of hip and knee arthroplasty-related infections. Infect Control Hosp Epidemiol. 2014;35(6):646–51.
Rennert-May E, Manns B, Smith S, Puloski S, Henderson E, Au F, et al. Validity of administrative data in identifying complex surgical site infections from a population-based cohort after primary hip and knee arthroplasty in Alberta, Canada. Am J Infect Control. 2018;46(10):1123–6.
Sands K, Vineyard G, Livingston J, Christiansen C, Platt R. Efficient identification of postdischarge surgical site infections: use of automated pharmacy dispensing information, administrative data, and medical record information. J Infect Dis. 1999;179(2):434–41.
van Walraven C, Jackson TD, Daneman N. Administrative data measured surgical site infection probability within 30 days of surgery in elderly patients. J Clin Epidemiol. 2016;77:112–7.
Song X, Cosgrove S, Pass M, Perl T. Using hospital claim data to monitor surgical site infections for inpatient procedures. Am J Infect Control. 2008;36(3).
Cohen AM, Ambert K, McDonagh M. A prospective evaluation of an automated classification system to support evidence-based medicine and systematic review. AMIA Annu Symp Proc. 2010;2010:121–5.
Szlosek DA, Ferrett J. Using Machine Learning and Natural Language Processing Algorithms to Automate the Evaluation of Clinical Decision Support in Electronic Medical Record Systems. EGEMS (Wash DC). 2016;4(3):1222.
Maroco J, Silva D, Rodrigues A, Guerreiro M, Santana I, de Mendonca A. Data mining methods in the prediction of dementia: a real-data comparison of the accuracy, sensitivity and specificity of linear discriminant analysis, logistic regression, neural networks, support vector machines, classification trees and random forests. BMC Res Notes. 2011;4:299.
Douglas PK, Harris S, Yuille A, Cohen MS. Performance comparison of machine learning algorithms and number of independent components used in fMRI decoding of belief vs. disbelief. Neuroimage. 2011;56(2):544–53.
Li J, Alvarez B, Siwabessy J, Tran M, Huang Z, Przeslawski R, et al. Application of random forest, generalised linear model and their hybrid methods with geostatistical techniques to count data: predicting sponge species richness. Environ Model Softw. 2017;97:112–29.
Li J, Tran M, Siwabessy J. Selecting optimal random Forest predictive models: a case study on predicting the spatial distribution of seabed hardness. PLoS One. 2016;11(2):e0149089.
Bartz-Kurycki MA, Green C, Anderson KT, Alder AC, Bucher BT, Cina RA, et al. Enhanced neonatal surgical site infection prediction model utilizing statistically and clinically significant variables in combination with a machine learning algorithm. Am J Surg. 2018;216(4):764–77.
Breiman L. Random Forests Machine Learning. 2001;45:5–32.
Touw WG, Bayjanov JR, Overmars L, Backus L, Boekhorst J, Wels M, et al. Data mining in the life sciences with random Forest: a walk in the park or lost in the jungle? Brief Bioinform. 2013;14(3):315–26.
Liam A, Wiener M. Classification and regression by random forest. R News. 2002;2(3):18–22.
Ozcift A. Enhanced cancer recognition system based on random forests feature elimination algorithm. J Med Syst. 2012;36(4):2577–85.
Diaz-Uriarte R, Alvarez de Andres S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006;7:3.
Doerken S, Avalos M, Lagarde E, Schumacher M. Penalized logistic regression with low prevalence exposures beyond high dimensional settings. PLoS One. 2019;14(5):e0217057.
Yao D, Yang J, Zhan X. A novel method for disease prediction: hybrid of random Forest and multivariate adaptive regression splines. J Comput. 2013;8(1):170–7.
Cohen R. “SAS Meets Big Iron: High Performance Computing in SAS Analytical Procedures,” in Proceedings of the Twenty-Seventh Annual SAS Users Group International Conference. Cary, NC: SASInstitute Inc.; 2002.
Sullivan LM, Massaro JM, D'Agostino RB Sr. Presentation of multivariate data for clinical use: the Framingham study risk score functions. Stat Med. 2004;23(10):1631–60.
Wu C, Hannan EL, Walford G, Ambrose JA, Holmes DR Jr, King SB 3rd, et al. A risk score to predict in-hospital mortality for percutaneous coronary interventions. J Am Coll Cardiol. 2006;47(3):654–60.
Hong W, Lillemoe KD, Pan S, Zimmer V, Kontopantelis E, Stock S, et al. Development and validation of a risk prediction score for severe acute pancreatitis. J Transl Med. 2019;17(1):146.
Austin PC, Lee DS, D'Agostino RB, Fine JP. Developing points-based risk-scoring systems in the presence of competing risks. Stat Med. 2016;35(22):4056–72.
Luo W, Phung D, Tran T, Gupta S, Rana S, Karmakar C, et al. Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research: A Multidisciplinary View. J Med Internet Res. 2016;18(12):e323-e.
Streiner DL, Cairney J. What's under the ROC? An introduction to receiver operating characteristics curves. Can J Psychiatr. 2007;52(2):121–8.
This study was supported by ICES, which is funded by an annual grant from the Ontario Ministry of Health and Long-Term Care (MOHLTC). The opinions, results and conclusions reported in this paper are those of the authors and are independent from the funding sources. No endorsement by ICES or the Ontario MOHLTC is intended or should be inferred.
This study was supported by the Ontario Research Fund (RE05–070) and Canadian Institutes of Health Research (CIHR) grant. The study design, opinions, results and conclusions reported in this paper are those of the authors and are independent from the funding sources.
This study was approved by the Ottawa Health Science Network Research Ethics Board.
Consent to participate
Consent for publication
ICES is a prescribed entity under section 45 of Ontario’s Personal Health Information Protection Act. Section 45 authorizes ICES to collect personal health information, without informed consent, for the purpose of analysis or compiling statistical information with respect to the management, evaluation or monitoring of the allocation of resources to or planning for all or part of the health system. Projects conducted under section 45, by definition, do not require review by a Research Ethics Board. This project was conducted under section 45 and approved by ICES’ Privacy and Legal Office.
No researcher involved in this study had any declared or otherwise known conflicts of interest.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Includes American College of Surgery – National Surgical Quality I mprovement Program definition of different types of SSIs: superficial, deep and orga-space.
Provides information about the Random Forests algorithm: constructing a large collection of decision trees with controlled variation, as well as how the multiple models are normally combined by ‘voting’.
Provides information about the data partitioning into derivation (70%) and validation (30%) samples.
Provides descriptive statistics for patients in the study sample.
Provides information on the derivation and validation datasets.
Provides information on the title of the manuscript, author list and affiliations.
About this article
Cite this article
Petrosyan, Y., Thavorn, K., Smith, G. et al. Predicting postoperative surgical site infection with administrative data: a random forests algorithm. BMC Med Res Methodol 21, 179 (2021). https://doi.org/10.1186/s12874-021-01369-9
- Surgical site infection
- Administrative data
- Machine learning
- Random forests
- Data mining
- Predictive modeling