Investigating factors affecting the interval between a burn and the start of treatment using data mining methods and logistic regression

Background Burns are tragic events for individuals, families, and communities. They can cause irreparable physical, mental, economic, and social harm. Research has documented that a prompt visit to a healthcare center can greatly reduce the harm caused by burns. Therefore, the aim of this study was to identify the factors affecting the interval between a burn and the start of treatment in burn patients by comparing three data mining classification methods with logistic regression. Methods This cross-sectional study was conducted on 389 patients hospitalized in Imam Khomeini Hospital of Kermanshah city from 2012 to 2015. The data collection instrument was a three-part questionnaire covering demographic information, geographical information, and burn information. Four classification methods (decision tree (DT), random forest (RF), support vector machine (SVM) and logistic regression (LR)) were used to identify the factors affecting the interval between a burn and the start of treatment (less than two hours versus two hours or more). Results The mean total accuracy of all models was higher than 0.8. The DT model had the highest mean total accuracy (0.87), sensitivity (0.44), positive likelihood ratio (14.58), negative predictive value (0.89) and positive predictive value (0.71). However, the specificity of the SVM and RF models (0.99) was higher than that of the other models, and the mean negative likelihood ratio of the SVM model (0.98) was higher than that of the other models. Conclusions The results of this study show that the DT model performed better than the other models in terms of total accuracy, sensitivity, positive likelihood ratio, negative predictive value and positive predictive value. Therefore, this method is a promising classifier for investigating the factors affecting the interval between a burn and the start of treatment in burn patients. In addition, the key factors based on the DT model were location of transfer to hospital, place of occurrence, time of accident, religion, history of burn, degree of burn, income, province of residence, burnt limbs and education.


Background
Burns are tragic events for individuals, families, and communities. They can cause irreparable physical, psychological, economic, and social harm. Burns are among the most important health problems worldwide [1]. They are the fourth most common type of injury, after traffic accidents, falls and interpersonal violence [2]. The World Health Organization has estimated that 265,000 deaths occur each year due to fire, scalds from boiling water, electrical burns and other burn injuries. More than 96 % of burn deaths occur in low- and middle-income countries [3]. Based on age-standardized rates in 2017, the regions with the highest rates of burns were Eastern Europe with 303 per 100,000, Central Asia with 298 per 100,000 and Southern Latin America with 226 per 100,000 [4]. Burns are the third leading cause of death in the United States after accidents and drowning, and the sixth leading cause of death in Iran [5]. It is well established that the most important factors in the mortality of burn patients are age, inhalation burns and the percentage of total body surface area (TBSA) burned [6,7].
Epidemiological studies conducted in emergency centers in Iran and other countries indicate that burns are one of the most important public health problems, leading to death, disability, pain, and physical, mental and economic problems [8][9][10]. Studies have reported that burn injuries cause significant limitations that go far beyond physical issues and affect people's emotional, social, and family relationships [11][12][13]. Therefore, the shorter the interval between a burn and the start of treatment, and the sooner the patient reaches a medical center, the fewer these complications will be [14,15]. Hence, the present study was conducted to identify the factors affecting the interval between a burn and the start of treatment in patients referred to the Imam Khomeini Hospital of Kermanshah city during the years 2012-2015.
To achieve this purpose, we compared the performance of three well-known data mining models (RF, DT, and SVM) with LR as a classical technique. Recently, a growing number of studies, especially in public health and medicine, have compared the accuracy of traditional classifiers with that of data mining methods. Some studies found that data mining techniques have higher accuracy and lower error rates than the traditional models [16][17][18], while others found better performance for traditional methods [19][20][21]. To the best of our knowledge, no study has compared traditional classifiers with data mining classifiers such as RF, SVM, and DT for identifying the factors affecting the interval between a burn and the start of treatment.

Dataset
The present study was a cross-sectional descriptive-analytical study investigating the factors affecting the interval between a burn and the start of treatment in burn patients in Imam Khomeini Hospital of Kermanshah city from 2012 to 2015. The data gathering instrument was a three-part questionnaire. We used information on 18 risk factors that appeared likely to affect the interval between the burn and the start of treatment. These risk factors were: age, gender, marital status, occupational status, place of residence, education, religion, income, burn percentage, history of burns, time of the accident, place of occurrence, burnt limbs, province of residence, location of transfer to hospital, cause of burn, type of burn, and degree of burn. All information was collected and recorded by a trained investigator from the patient file and from interviews with patients and relatives. Following the opinion of burn specialists, 120 min was taken as the cut-off point for the interval between a burn and the start of treatment, and this variable was defined as binary (less than 120 min versus 120 min or more) [14,22,23,24].
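As a minimal sketch of how the binary response described above could be derived, the following Python snippet dichotomises an interval at the 120 min cut-off. The function name, the 0/1 coding and the example values are assumptions made for illustration; the study itself was analysed in R, and the original coding of the two classes is not stated.

```python
# Illustrative sketch only: the study used R; names and 0/1 coding are assumptions.
CUTOFF_MINUTES = 120  # cut-off point stated in the text


def delay_class(interval_minutes: float) -> int:
    """Return 0 for an interval under 120 min, 1 for 120 min or longer."""
    return 0 if interval_minutes < CUTOFF_MINUTES else 1


# Made-up intervals in minutes, not study data.
example_intervals = [30, 119, 120, 360]
labels = [delay_class(m) for m in example_intervals]
print(labels)
```

Note that an interval of exactly 120 min falls in the second class, matching the "equal or more than 120 min" wording.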

Data pre-processing and dealing with missing values
Before the models were fitted, the data were checked for missing values and outliers. The proportion of missing data per variable ranged from 0 to 15.4 %. The variable with the most missing data was time of accident (15.4 %). Variables with missing values were imputed using CART regression trees and their mode [25]. We used anomaly detection to identify outliers. Anomaly detection provides important information for outlier detection in a variety of applications [26]. With a threshold value of 2 for anomaly detection, no outlying records were found [27]. For easier interpretation of the results, associate, bachelor's and master's degrees were combined into a single "college education" group for the analysis of the education variable. The demographic statistics and summaries of the variables used in the analysis are shown separately for the two groups of the response variable in Table 1.
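The two simplest steps in this paragraph, measuring the share of missing values and filling gaps with the mode, can be sketched as follows. This is a hedged Python illustration with hypothetical helper names and made-up values; it does not reproduce the CART-based imputation or the anomaly detection used by the authors.

```python
# Illustrative sketch only: mode imputation for one categorical variable.
# Helper names and the example values are assumptions, not the authors' code.
from collections import Counter


def missing_rate(values):
    """Fraction of entries that are missing (coded here as None)."""
    return sum(v is None for v in values) / len(values)


def impute_mode(values):
    """Replace None entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # no observed value to take the mode from
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


# Made-up example, not study data.
time_of_accident = ["night", "day", None, "day"]
print(missing_rate(time_of_accident))
print(impute_mode(time_of_accident))
```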

Classification models
In this paper, DT, RF, SVM and logistic regression models were used to identify the factors affecting the interval between a burn and the start of treatment in burn patients referred to Imam Khomeini Hospital from 2012 to 2015. Each of these methods is described briefly below.
One of the simplest and most common classification techniques is the DT. The main goal of the DT (as of other classification techniques) is to build a model that can predict the values of the response variable. The DT is made up of nodes and partitions. Construction of the tree begins with all of the training data in the first node. The first partition then divides the data into two or more daughter nodes based on a predictor variable. The DT has three types of nodes:
Root node: It has no input branch, and the number of its output branches can be zero or more.
Intermediate node: It has one input branch and two or more output branches.
Final node or leaf: It has one input branch and no output branch.
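The three node types can be illustrated with a small data structure. The following Python sketch is hypothetical: the class, the function and in particular the splits and leaf labels are invented to show the structure and are not the tree fitted in this study.

```python
# Illustrative sketch only: the splits and labels below are invented,
# they are NOT the decision tree fitted in the study.
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    split_on: Optional[str] = None  # predictor used to partition (None for a leaf)
    children: Dict[str, "Node"] = field(default_factory=dict)  # one branch per category
    label: Optional[str] = None  # class allocated to a final node

    def is_leaf(self) -> bool:
        return not self.children


def predict(node: Node, record: Dict[str, str]) -> str:
    """Follow the branches from the root until a final node is reached."""
    while not node.is_leaf():
        node = node.children[record[node.split_on]]
    return node.label


# Root node -> one leaf and one intermediate node -> two leaves.
tree = Node(split_on="place_of_occurrence", children={
    "home": Node(label="< 120 min"),
    "outside": Node(split_on="time_of_accident", children={
        "day": Node(label="< 120 min"),
        "night": Node(label=">= 120 min"),
    }),
})

print(predict(tree, {"place_of_occurrence": "outside", "time_of_accident": "night"}))
```

Here the root has no input branch, the "outside" node is an intermediate node with one input and two outputs, and each leaf carries the category allocated to it.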
In the DT, a category is allocated to each final node [28].
The RF method is a non-parametric (model-free) statistical method for classification and regression analysis based on a recursive partitioning algorithm. The RF algorithm uses a set of classification trees [29]. This method is very effective in selecting the set of predictor variables that best explains the phenotypes of a disease. The RF method is also useful when predictor variables are nonlinearly related to disease, because it imposes no constraints on the relationship between the predictor and response variables. Tree-based methods also accommodate genetic heterogeneity, because separate models are automatically fitted to the subsets of data defined by early partitions in the tree. The simplicity and interpretability of the RF method, its flexibility in handling a large number of predictor variables with a limited sample size, and its ability to account for genetic heterogeneity have increased its use in genetic studies. In addition to prediction, it is used to identify the most important variables [30].
SVMs are commonly used for problems with two classes. In this algorithm, two parallel hyperplanes are placed at the boundary between the two classes of data, and the problem is to find the maximum margin between these two hyperplanes and, as a result, between the two classes. Accordingly, the two hyperplanes are moved apart until they touch the data points. The SVM is a classifier that belongs to the kernel methods of machine learning, and it has high generalization accuracy. The main idea of the SVM is that, if the classes are linearly separable, hyperplanes can be found that separate them. When the data are not linearly separable, nonlinear kernels are used to map the data into a higher-dimensional space in which they can be separated linearly. Different kernels can be used for the SVM, such as the radial basis function (RBF) and linear kernels. SVMs are among the best-known methods for classifying data by means of a statistical model. One difficulty in building a nonlinear support vector classifier is choosing the kernel and its parameters. Well-known classes of kernel functions, such as polynomial, Gaussian and sigmoid, have been introduced; these require parameter tuning for optimal performance. Sequential minimal optimization is one of the well-known methods for training this classifier in a reasonable time.
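For concreteness, the kernel functions named above (linear, polynomial, Gaussian or RBF, and sigmoid) can be written in their common textbook forms. The Python sketch below is illustrative; the parameter names (gamma, coef0, degree) and default values are assumptions and are not the settings used in this study.

```python
# Illustrative sketch only: common textbook forms of SVM kernel functions.
# Parameter names and defaults are assumptions, not the study's settings.
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    return (gamma * dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (radial basis function) kernel."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, y, gamma=0.5, coef0=0.0):
    return math.tanh(gamma * dot(x, y) + coef0)


u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v), rbf_kernel(u, u))
```

Each function returns a similarity between two records; the parameters (here gamma, coef0 and degree) are the quantities that need tuning for good performance.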
Logistic regression is a standard statistical model for binary responses [31]. In this method, the log-odds of the response variable (interval between the burn and the start of treatment) are modeled as a linear function of the independent variables. The exponentiated slope parameters of a logistic model can be interpreted as odds ratios. Its linear structure, simple interpretation and the wide availability of suitable software are the most important advantages of the LR model.
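In the usual formulation of logistic regression, the linear combination of predictors gives the log-odds of the outcome, and exponentiating a slope gives the corresponding odds ratio. The Python sketch below illustrates this with made-up coefficients; the function names and values are assumptions and do not come from the fitted model.

```python
# Illustrative sketch only: coefficients are made up, not estimates from the study.
import math


def logistic_probability(intercept, coefs, x):
    """Probability of the outcome given a linear predictor on the log-odds scale."""
    log_odds = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-log_odds))


def odds_ratio(coef):
    """Odds ratio for a one-unit increase in a predictor."""
    return math.exp(coef)


def odds(p):
    return p / (1.0 - p)


beta0, beta = -1.0, [math.log(2.0)]  # hypothetical values
p0 = logistic_probability(beta0, beta, [0.0])
p1 = logistic_probability(beta0, beta, [1.0])
print(odds(p1) / odds(p0), odds_ratio(beta[0]))
```

With a slope of log 2, moving the predictor from 0 to 1 doubles the odds of the outcome, which is what the odds ratio expresses.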
All models were fitted with the variables listed in Table 1. Of the data, 70 % were used for training and the remaining 30 % for testing. Statistical analysis was performed using R software.

Implementation and Performance Criteria
In this study, we randomly split the data into two sets: training data and test data. Of the data, 70 % were used for training and the remaining 30 % for testing. This process was repeated 100 times. The means of the sensitivity, specificity, overall accuracy, positive and negative predictive values, and positive and negative likelihood ratios obtained from these 100 repetitions were then used to compare the models. Classification models indicate the importance of a variable by the percentage increase in prediction error. A variable is considered the most important if removing it causes the largest increase in error. After the importance of each variable has been scored, the variables are ranked accordingly.
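The performance criteria listed above follow from the four cells of a confusion matrix, and the 70/30 split can be drawn at random for each repetition. The Python sketch below shows the standard definitions under stated assumptions: labels are coded 1 for the positive class and 0 for the negative class, the helper names are hypothetical, and the toy labels are not study data. The authors' own analysis was done in R.

```python
# Illustrative sketch only: standard metric definitions on toy data.
# Helper names, label coding and rounding of the split are assumptions.
import random


def safe_div(num, den):
    """Return num / den, or NaN when the denominator is zero."""
    return num / den if den else float("nan")


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for labels coded 1 (positive) and 0 (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    sens = safe_div(tp, tp + fn)
    spec = safe_div(tn, tn + fp)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "sensitivity": sens,
        "specificity": spec,
        "ppv": safe_div(tp, tp + fp),
        "npv": safe_div(tn, tn + fn),
        "lr_pos": safe_div(sens, 1 - spec),
        "lr_neg": safe_div(1 - sens, spec),
    }


def split_indices(n, train_fraction=0.7, rng=None):
    """Randomly split record indices into training and test sets."""
    rng = rng or random.Random()
    idx = list(range(n))
    rng.shuffle(idx)
    cut = round(train_fraction * n)
    return idx[:cut], idx[cut:]


def mean_metrics(runs):
    """Average each metric over a list of per-repetition metric dicts."""
    return {k: sum(r[k] for r in runs) / len(runs) for k in runs[0]}


# Toy labels, not study data: 10 positives and 10 negatives.
y_true = [1] * 10 + [0] * 10
y_pred = [1] * 4 + [0] * 6 + [1] * 1 + [0] * 9
m = metrics(*confusion_counts(y_true, y_pred))
train, test = split_indices(389, rng=random.Random(0))
print(m)
print(len(train), len(test))
```

In a full comparison, one would fit each model on the training indices, compute these metrics on the test indices, repeat the split 100 times and pass the resulting list of dictionaries to `mean_metrics`.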

Results
In this study, 47.8 % of patients were under 14 years of age, 54.2 % were men, 52.7 % were married, and 78.9 % were unemployed. Also, half of the participants (50.6 %) had an income < $200. Further information for the two groups of the response variable is provided in Table 1. Table 2 shows the most important factors associated with the interval between a burn and the start of treatment, based on the DT, RF and SVM models. The variables common to these models are place of occurrence, time of the accident, location of transfer to hospital and income. Table 3 presents the performance of the different models. In the DT model, total accuracy, sensitivity, positive likelihood ratio, negative predictive value and positive predictive value were higher than in the other models. However, the specificity of the SVM and RF models was higher than that of the other models, and the mean negative likelihood ratio of the SVM model was higher than that of the other models.

Discussion
In this study, data mining and LR techniques were used to investigate the factors affecting the interval between a burn and the start of treatment in burn patients in Imam Khomeini Hospital of Kermanshah city. The models yielded different results, but on most indicators the DT model performed better. In the DT model, total accuracy, sensitivity, positive likelihood ratio, negative predictive value and positive predictive value were higher than in the other models. However, the specificity of the SVM and RF models was higher than that of the other models, and the mean negative likelihood ratio of the SVM model was higher than that of the other models. According to the results of the DT model, the most important factors affecting the interval between a burn and the start of treatment were location of transfer to hospital, degree of burn, time of accident, religion, place of occurrence, income, province of residence, burnt limbs, education, and history of burn. The interval between a burn and the start of treatment is one of the most important factors in the treatment of burn patients, yet no comprehensive study of it has been conducted so far [14,15,32]. Machine learning methods, especially the DT model, have performed well in studies [33][34][35][36][37][38] in the field of burns and in other areas of health science. Because no similar study has been done on this topic, these variables are reported here as the important ones without comparison to earlier findings. This study had two limitations. The first is its cross-sectional design, which means that the observed associations do not establish causality. The second is that, owing to the lack of sufficient literature on the interval between a burn and the start of treatment in burn patients, it was not possible to compare the clinical results of this study with those of other studies.

Conclusions
The results of the current study indicate that the DT model performed better than the other models in determining the factors affecting the interval between a burn and the start of treatment in burn patients. The findings showed that the most important factors affecting this interval were location of transfer to hospital, degree of burn, time of accident, religion, place of occurrence, income, province of residence, burnt limbs, education, and history of burn.