Predicting postoperative surgical site infection with administrative data: a random forests algorithm

Background Since primary data collection can be time-consuming and expensive, surgical site infections (SSIs) could ideally be monitored using routinely collected administrative data. We derived and internally validated efficient algorithms to identify SSIs within 30 days after surgery with health administrative data, using Machine Learning algorithms. Methods All patients enrolled in the National Surgical Quality Improvement Program from the Ottawa Hospital were linked to administrative datasets in Ontario, Canada. Machine Learning approaches, including a Random Forests algorithm and the high-performance logistic regression, were used to derive parsimonious models to predict SSI status. Finally, a risk score methodology was used to transform the final models into the risk score system. The SSI risk models were validated in the validation datasets. Results Of 14,351 patients, 795 (5.5%) had an SSI. First, separate predictive models were built for three distinct administrative datasets. The final model, including hospitalization diagnostic, physician diagnostic and procedure codes, demonstrated excellent discrimination (C statistics, 0.91, 95% CI, 0.90–0.92) and calibration (Hosmer-Lemeshow χ2 statistics, 4.531, p = 0.402). Conclusion We demonstrated that health administrative data can be effectively used to identify SSIs. Machine learning algorithms have shown a high degree of accuracy in predicting postoperative SSIs and can integrate and utilize a large amount of administrative data. External validation of this model is required before it can be routinely used to identify SSIs. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-021-01369-9.


Background
Surgical site infection (SSI) is common and considered one of the most common types of postoperative complications [1]. SSIs are associated with substantial morbidity and mortality, prolonged hospital duration of stay, increased hospital readmission rate, and financial burden to health care systems [1][2][3][4][5]. Previous research has shown the importance of effective prevention strategies targeting both short-and long-term consequences of SSI, which requires an ability to track SSIs [2]. Since the primary data collection can be time-consuming and expensive, routinely collected health administrative data offer ample opportunities to identify and monitor SSIs, and assess the impact of prevention strategies, given a wide population coverage and minimal costs and efforts. Several studies have developed some accurate administrative algorithms to identify SSIs [6][7][8][9][10], while other studies have found that SSI identification using administrative data is imprecise [11]. However, previous studies were often based on small sample sizes and/or a limited set of pre-selected variables to predict SSIs.
Machine learning approaches have been successfully applied to create predictive models in several fields of study, including automatic medical diagnostics [12,13]. With interpretability of model parameters and ease of use, logistic regression can generate excellent models and serve as a commonly accepted statistical tool. Random Forests approach is used in situations where regression assumptions may be violated by situations in which many predictors are associated with a small number of outcomes [14]. It can cope with inter-correlation between multiple explanatory variables, since each predictor is selected randomly for each stage of the learning process [15], unlike standard regression approaches. Previous studies have indicated that the Random Forests approach may have better prediction accuracy than other machine learning methods [16,17]. We hypothesized that the use of machine learning approaches and a large data set with many features will improve the accuracy of SSI prediction. This study aimed to develop efficient algorithms to identify SSIs within 30 days after surgery using health administrative data.

Material and methods
This study was divided into three stages. In the first stage, a Random Forests algorithm was used to perform a preliminary screening of variables and to rank the importance of candidate variables. In the second stage, the 30 most important variables from the first stage were input into the high-performance logistic regression to build interpretable and parsimonious models for all three administrative datasets used in this study. Finally, we used risk score modeling methodology to transform the final logistic models form the second stage into the risk score system.

Selection and description of participants
This study was performed at The Ottawa hospital (TOH), Canada, a 1200-bed academic health sciences center providing approximately 90% of the major surgical operations in a catchment area of 1.2 million people. We identified all patients at TOH aged 18 years and older who underwent surgery and were included in the American College of Surgeons National Surgical Quality Improvement Program (NSQIP) data collection, between April 1, 2010, and March 31, 2015. The NSQIP uses trained Surgical Clinical Reviewers to collect data using a combination of chart review and follow up from the preoperative period through 30 days postoperatively. Patients were excluded if: 1) they were not eligible for the Ontario Health Insurance Program (OHIP) or had an invalid OHIP number, because this was required for linkage to health administrative datasets; or 2) they had missing admission, discharge, or surgery dates.

Population-based health administrative datasets
We linked the NSQIP dataset to three distinct population-based, health administrative datasets housed at the Institute for Clinical and Evaluative Sciences (ICES). ICES is an independent, non-profit research institute whose legal status under Ontario's health information privacy law allows it to collect and analyze health care and demographic data, without informed consent, for health system evaluation and improvement. The use of data in this project was authorized under section 45 of Ontario's Personal Health Information Protection Act, which does not require review by a Research Ethics Board. The datasets included: 1) the Discharged Abstract Database and Same Day Surgery Database to identify the records of the hospitalization (ICD-10 code), including admission and discharge dates, diagnoses, 2) the Physician Services Database to retrieve all claims for services provided by all eligible health care providers, and 3) the Ontario Health Insurance Plan (OHIP) database that contains physician diagnostic codes (ICD-9 codes) and diagnosis descriptions. All patients were followed for 30 days from the time of their surgery. All databases were linked using anonymized unique identifiers and analyzed at the ICES at the University of Ottawa, Ontario. This study was approved by the Ottawa Health Science Network Research Ethics Board.

Study outcome
All individuals who had any type of SSIs (i.e. superficial, deep, or organ space) (Additional file 1) within 30 days after surgery, according to the definition of the NSQIP protocol, were defined as having experienced an SSI.

Statistical analysis
This study utilized a 3-stage predictive modeling based on the hybrid modeling approaches developed in previous studies [14,15,18]. All stages described below were applied to each administrative dataset used in this study to generate three sub-models that contributed to the omnibus SSI model. Stage 1model development using random forests algorithm Details of Random Forests method have been described elsewhere [19][20][21]. In short, each of the classification trees is built using a bootstrap sample of the data, and a random subset of variables was selected at each split, thereby constructing a large collection of decision trees with controlled variation [22,23] (Additional file 2). The Random Forests trees are not pruned, so as to obtain low-bias trees. Every tree in the forest casts a "vote" for the best classification for a given observation, and the class receiving most votes results in the prediction for that observation. The study cohort was first divided randomly into derivation (70%) and validation (30%) samples (Additional file 3). Then, the derivation data was sampled to create an in-bag partition -(2/3) to construct the decision tree, and a smaller out-of-bag partition (1/3) to test the constructed tree to evaluate its performance by computing: 1) misclassification error, 2) C-statistics, and 3) model performance (sensitivity, specificity, etc.). The optimal number of trees and a subset of variables at each node were selected using the "tuneRF" function in R to minimize the misclassification error. Random Forests calculates estimates of variable importance for classification using permutation variable importance measure (VIM) [19], which is based on the decrease of a classification accuracy when values of a variable in a node of a tree are permuted randomly. Finally, K-fold cross validation was used to evaluate the Random Forests model with 10 folds. We identified subsets of top 30 important diagnostic or procedure codes to predict SSIs, using a mean decrease accuracy value of 0.02 as a cut-off point. The Random Forests analyses were performed in R statistical software (3.3.2.) using "randomForest" package [21].
Stage 2stepwise model selection using highperformance logistic regression approach Random forests algorithm was used to perform a preliminary screening of variables and to gain importance ranks. Then, the selected top-30 important predictors were input into the high-performance logistic model with stepwise variable selection to find the best parsimonious model to predict SSIs [14,24,25]. High-performance logistic regression (proc hplogisitc) belongs to the highperformance analytics procedures that can be used to reduce the dimension or identify important variables to obtain parsimonious predictive models [26]. It permits several link functions and can handle ordinal and nominal data with more than two response categories [26]. The Schwarz Bayesian Criterion (SBC) was used as a penalized measure of fit for logistic regression model to help avoid the model over-fitting.

Stage 3point system or risk scores
We used the methods suggested by Sullivan et al. [27] to summarize each logistic model from stage 2 as a point system. The point system or risk scores provide statistical information in a more clinically useful form than logistic regression models, as generalizability of the models developed from data from a single or a small group of hospitals to other patient populations is questionable [28,29]. Clinical prediction models and associated risk-scoring systems are popular statistical methods as they permit a rapid assessment of patient risk without the use of computers or other electronic devices [30]. The use of such points-based systems facilitates evidence-based clinical decision making [30]. The point system developed in this study was designed to predict the risk of postoperative SSIs, based on a patient's preprocedural risk factors or predictors. The point score assigned to each predictor was derived from a well-fit logistic regression model.
The point scores were developed for hospitalization (ICD-10) and physician (ICD-9) diagnostic codes, and physician procedure claims. All variables in the models were categorical, and the distance between a variable and its base category in regression coefficient units was equal to the size of the coefficient. For each variable, its distance from the base category in regression coefficient units was divided by this constant and rounded to the nearest integer to get its point value.
Then, the obtained point scores were input into logistic regression model and adjusted for other potential confounding factors suggested by the existing literature, including age, sex, surgical procedure, emergency case, concurrent surgical procedures, patient's physical status (ASA-5), and duration of surgery. The full model discrimination (C statistics or AUC) and calibration (Hosmer-Lemeshow (H-L) statistics) were assessed in the validation dataset. All methods were performed in accordance with the guidelines for developing and reporting Machine Learning predictive models in biomedical research [31]. The high-performance regression and point score assignment were performed in SAS 9.4 statistical software.

Results
We identified 14,351 patients who underwent surgery from April 1, 2010 to March 31, 2015 and were enrolled into NSQIP at our hospital. An SSI was identified in 795 (5.5%) of these patients. Of these, 540 (68%) had superficial SSIs and 255 (32%) had deep or organ space SSIs. Descriptive statistics for patients in the study sample are reported in Additional file 4. The derivation and validation datasets were similar in terms of baseline covariates (Additional file 5).
Predictive modeling for hospitalization diagnostic codes (ICD-10) We identified 3085 hospitalization diagnostic (ICD-10) codes recorded within 30 days following the surgery date. These codes then were clustered into 994 threedigit hospitalization diagnostic codes that were used for the further analyses.
Stage 1: Given a large number of diagnostic codes (possible predictors), the Random Forests approach was used to identify a subset of top important 30 hospitalization diagnostic codes that best predicts classification. We used 800 classification trees and 46 variables available for splitting at each tree node. The accuracy of the Random Forests model was 95.3%. The resulting SSI prediction model demonstrated positive predictive value (PPV) of 98%, negative predictive value (NPV) of 97%, and AUC (area under the receiver operating characteristic curve) of 0.78 (95% CI 0.77-0.79). The accuracy of the Random Forests model after a 10-fold cross-validation was 94.3%. Figure 1 presents the top 30 hospitalization diagnostic (ICD-10) codes for classification of SSIs that have been identified using the permutation VIM.
Stage 2: The identified top 30 hospitalization diagnostic codes (ICD-10) codes were input into the highperformance logistic regression with a stepwise selection to identify the best parsimonious model to predict SSIs. Table 1, model 1 presents the final model of six hospitalization diagnostic codes to identify SSIs (AUC 0.87, 95% CI 0.86-0.89).
Stage 3: Risk scores for the final model of hospitalization diagnostic (ICD-10) codes are presented in Table 1, Model 1 [27]. Among the entire cohort, 80.3% of patients had a score of 0, 11.8% had a score of 1, and 7.9% had a score equal or greater than 2.
We identified 442 physician diagnostic 3-digit codes (using ICD-9-CA) recorded within 30 days following the surgery date.
Stage 1: Given a large number of diagnostic codes (possible predictors), the Random Forests approach was used to identify a subset of 30 physician diagnostic codes that best predicts SSIs. The best misclassification rate was achieved by using 800 classification trees and 31 variables available for splitting at each tree node. The accuracy of the Random Forests model was 94.7%. The resulted SSI prediction model demonstrated PPV of 98%, NPV of 96%, and AUC of 0.82 (95% CI 0.81-0.83). The accuracy of the model after a 10-fold crossvalidation was 94.1%. Figure 2 presents the top 30 important physician diagnostic (ICD-9) codes for prediction of SSIs that have been identified using VIM.
Stage 2: The identified top 30 physician diagnostic codes were input into the high-performance logistic regression model to identify the best parsimonious model for prediction of SSIs, using a stepwise selection approach.  [27]. Among the entire cohort, 77.8% of patients had a score of 0, 7.7% had a score of 1, and 14.5% had a score equal or greater than 2.

Predictive modeling for physician procedure claims
We identified 2543 physician procedure claims recorded within 30 days following the surgery date. These codes then were clustered into 610 three-digit codes that were used for the further analyses.
Stage 1: Given a large number of physician procedure codes (possible predictors), Random forests approach was used to identify a subset of 30 physician procedure claims that best predicts SSIs. The best misclassification rate was achieved by using 1000 classification trees and 37 variables available for splitting at each tree node. The accuracy of the Random Forests model was 94.8%. The resulted SSI prediction model demonstrated PPV of 99%, NPV of 97%, and AUC of 0.82 (95% CI 0.81-0.83). The accuracy of the model after a 10-fold crossvalidation was 94.4%. Figure 3 presents the top 30 physician procedure claims that have been identified using the permutation VIM.
Stage 2: The identified top 30 physician procedure claims were input into the high-performance logistic regression model to identify the best parsimonious model for prediction of SSIs. We used a stepwise variable selection approach.  [27]. Among the entire cohort, 55.4% of patients had a score of 0, 11.9% had a score of 1, and 44.6% had a score equal or greater than 2.
Full model with total risk score of diagnostic and procedure codes In the derivation cohort, the total scores of hospitalization diagnostic (ICD-10) codes, physician diagnostic (ICD-9) codes and physician procedure claims were included in the logistic regression model and adjusted for potential confounding factors, including surgical specialties, age, sex, duration of surgery, emergency case, ASA class and concurrent surgical procedures ( Table 2).

Discussion
We used a 3-stage predictive modeling approach to derive and internally validate models to predict SSIs within 30 days after surgical procedure. To the best of our knowledge, this is the first study that used Machine Learning approaches to develop efficient algorithms for identifying SSIs within 30 days after surgery by use of health administrative data. The key finding of our study is that the risk of SSIs can be reliably estimated using routinely collected administrative data, including physician procedure claims, hospital (ICD-10) and physician (ICD-9) diagnostic codes. Our study results demonstrate high performance of the Random Forests algorithm for prediction of SSIs without pre-selection of possible predictors given a small number of cases. We derived a relatively small set of variables to identify postoperative SSIs, including 6 hospital diagnostic codes, 9 physician diagnostic codes, and 14 physician procedure claims. Several studies have examined the use of administrative data to identify postoperative SSIs [6][7][8][9][10]. Our study findings are consistent with these studies [6,10]. van Walraven et al. [6], for example, found that administrative data, including hospital diagnostic, emergency department visit codes and physician procedure claims, can be effectively used to identify postoperative patients with a low risk of having SSIs within 30 days of their surgical procedure. In particular, the predictive probability threshold with the optimal characteristics was a predicted risk of 5% (sensitivity, 82.1%, specificity, 85.6%, PPV, 27.7%). Additionally, Sands et al. found that [9] automated medical and claim records together can be used to screen for post discharge SSIs, but the method they used identified only 10% of procedures as possible infections. The approach used in our study added a new contribution to the existing literature by incorporating much larger set of features as compared with the previous studies. It was possible to include all available diagnostic or procedure codes to identify SSIs in this study, because Random Forests approach is generally unaffected by the addition of irrelevant features and is robust to collinearity due to the use of subsets of random variables for tree splits. All the features included in this study were obtained from routinely collected data, and given the Fig. 3 Description of the top 30 physician procedure claims to identify SSIs. Z59 -Digestive system surgical procedure; C46 -Infectious diseasenon-emergency hospital in-patient services: assessment/ consultation; Z10 -Integumentary system surgical procedures: incision of abscess/ haematoma; K07 -Family practice/geriatrics acute and chronic home care supervision; K99 -Emergency departmentspecial visit premium; C03 -General surgery, non-emergency hospital in-patient services-assessment, visits, consultations; A35 -Urology -consultations/ assessment; S16 -Digestive system surgical procedures; H15 -Family practice & practice in general -weekend and holidays: assessment/care; C64 -General thoracic surgery -non-emergency hospital in-patient services: consultation assessment; H12 -Family practice & practice in general -nights assessment and car; C12-Non-emergency hospital in-patient services: Subsequent visits by the MRP; R11-Integumentary system surgical procedures: operations of the breast; E08 -Hospital and institutional consultations/assessments by MRP; C20 -Obstetrics and gynecology -non-emergency hospital in-patient services; Z08 -Debridement of wound(s) and/or ulcer(s) extending into subcutaneous tissue, tendon, ligament, bursa and/or bone; G55-Diagnostic and therapeutic procedures, critical care; S21-Digestive system surgical procedures: rectum; S65 -Male genital surgical procedures; Z74 -Respiratory surgical procedures; R62-Musculoskeletal system surgical proceduresamputation; A20 -Obstetrics and gynecology -assessment or consultation; Z22 -Musculoskeletal system surgical procedures; R06 -Myocutaneous, myogenous or fascia-cutaneous flaps, neurovascular island transfer, transplantation of free island skin and subcutaneous flap; A24 -Otolaryngologyassessment/ consultation; C13 -Internal and occupational medicine: non-emergency hospital in-patient services; C01 -Non-emergency hospital in-patient services, subsequent visits by the MRP; H13 -Family practice & practice in general -weekdays, evenings: assessment/care; C21 -Consultations/visits anaesthesia -non-emergency hospital in-patient services  Fig. 4 Receiver Operator Characteristics Curve (ROC curve) and * calibration plot for the full model with risk scores for hospitalization diagnostic (ICD-10) codes, physician diagnostic (ICD-9) codes, and physician procedure claims, adjusted for the ** study covariates, in the validation cohort. * In the calibration plot, the observed percentage of patients having an SSI within 30 days of surgery is plotted against the predicted SSI risk from the SSI risk model (horizontal axis). ** Study covariates: surgical specialties, age, sex, duration of surgery, emergency case, ASA class, concurrent surgical procedures complex etiology of SSIs, there might be variables that would be overlooked if we used a narrower search strategy guided by a priori clinical expectations. It would be inappropriate to interpret the identified diagnostic or procedure codes as either causes or consequences of SSIs. Random Forests allows us to select variables that are influencing prediction given a small sample sizes and the extremely small ratio of samples to variable (large "p" and small "n"). If the identified important variables are consistent with clinical knowledge, there will be more confidence in the derived model as a decision support tool.
Several aspects of our study should be carefully considered. First, our study contained no information about outpatient antibiotic treatments because the Ontario health administrative data used for the study captures medication use for people over the age of 65 and who are on social assistance. Also, we did not include information about laboratory tests, because the Ontario health administrative data captures information only on outpatient laboratory tests, while laboratory tests performed during hospitalization are most important in predicting SSIs. Thus, information about antibiotic use and laboratory test could substantially improve SSI identification. Second, our study and model captured SSIs that occurred within 30 days after surgical procedure, so any SSI that occurred outside of this timeframe would have been missed. Third, our study was conducted in a single teaching hospital, providing about 90% of the major surgical operations in a catchment area of 1.2 million people. Therefore, external validation is necessary to measure model's utility in other hospitals and geographic regions. Finally, the coding systems used in the province of Ontario might not be available in other jurisdictions. Therefore, some modifications might be required before using our models in other health regions.

Conclusion
This study shows that health administrative data could be effectively used in identifying SSIs. Machine learning approaches have shown a high degree of accuracy in predicting postoperative SSIs and can integrate and utilize a large amount of administrative data. The results of our study are useful in advancing current and future efforts to use administrative data for patient safety surveillance and improvement. Further research should examine the use of machine learning approaches to identify SSIs, stratified by the specific types of surgical procedures.