- Research article
- Open Access
Investigating factors affecting the interval between a burn and the start of treatment using data mining methods and logistic regression
BMC Medical Research Methodology volume 21, Article number: 71 (2021)
A burn is a tragic event for an individual, the family, and the community. It can cause irreparable physical, mental, economic, and social injury. Research has well documented that a prompt visit to a healthcare center can greatly reduce burn injuries. Therefore, the aim of this study was to identify the factors affecting the interval between a burn and the start of treatment in burn patients by comparing three classification data mining methods and logistic regression.
This cross-sectional study was conducted on 389 patients hospitalized in Imam Khomeini Hospital of Kermanshah city from 2012 to 2015. The data collection instrument was a three-part questionnaire covering demographic information, geographical information, and burn information. Four classification methods (decision tree (DT), random forest (RF), support vector machine (SVM) and logistic regression (LR)) were used to identify the factors affecting the interval between a burn and the start of treatment (less than two hours versus two hours or more).
The mean total accuracy of all models was higher than 0.8. The DT model had the highest mean total accuracy (0.87), sensitivity (0.44), positive likelihood ratio (14.58), negative predictive value (0.89) and positive predictive value (0.71). However, the specificity of the SVM and RF models (0.99) was higher than that of the other models, and the mean negative likelihood ratio of the SVM model (0.98) was higher than that of the other models.
The results of this study show that the DT model performed better than the other models in terms of total accuracy, sensitivity, positive likelihood ratio, negative predictive value and positive predictive value. Therefore, this method is a promising classifier for investigating the factors affecting the interval between a burn and the start of treatment in burn patients. The key factors based on the DT model were location of transfer to hospital, place of occurrence, time of accident, religion, history and degree of burn, income, province of residence, burnt limbs and education.
A burn is a tragic event for individuals, families, and communities. It can cause irreparable physical, psychological, economic, and social injury. Burns are among the most important health problems worldwide. After traffic accidents, falls and interpersonal violence, burns are the fourth most common injury. The World Health Organization has estimated that 265,000 deaths occur each year due to fire, scalds from boiling water, electrical burns and other burn injuries. More than 96 % of burn deaths occur in low- and middle-income countries. Based on age-standardized rates in 2017, the regions with the highest burn rates were Eastern Europe with 303 per 100,000, Central Asia with 298 per 100,000 and Southern Latin America with 226 per 100,000. Burns are the third leading cause of death in the United States after accidents and drowning, and the sixth leading cause of death in Iran. It is well known that the most important factors in the mortality of burn patients are age, inhalation injury and the percentage of total body surface area (TBSA) burned [6, 7].
Epidemiological studies conducted in emergency centers in Iran and other countries indicate that burns are one of the most important public health problems, leading to death, disability, pain, and physical, mental and economic problems [8,9,10]. Studies have reported that burn injuries cause significant limitations that go far beyond physical issues and affect people’s emotional, social, and family relationships [11,12,13]. The shorter the interval between a burn and the start of treatment, and the sooner the patient reaches a medical center, the fewer these complications will be [14, 15]. Hence, the present study was conducted to identify the factors affecting the interval between a burn and the start of treatment in patients referred to the Imam Khomeini Hospital of Kermanshah city during the years 2012–2015.
To achieve this purpose, we compared the performance of three well-known data mining models (RF, DT, and SVM) with LR as a classical technique. Recently, a growing number of studies, especially in public health and medicine, have compared the accuracy of traditional classifiers with data mining methods. Some studies found that data mining techniques have higher accuracy and lower error rates than the traditional models [16,17,18], while others found better performance for traditional methods [19,20,21]. To the best of our knowledge, no study has compared traditional classifiers with data mining classifiers such as RF, SVM, and DT for predicting the factors affecting the interval between a burn and the start of treatment.
The present study was a cross-sectional descriptive-analytical study investigating the factors affecting the interval between a burn and the start of treatment in burn patients in Imam Khomeini Hospital of Kermanshah city from 2012 to 2015. The data gathering instrument was a three-part questionnaire. We used information on 18 risk factors that appear to affect the interval between the burn and the start of treatment. These risk factors were: age, gender, marital status, occupational status, place of residence, education, religion, income, burn percentage, history of burns, time of the accident, place of occurrence, burnt limbs, province of residence, location of transfer to hospital, cause of burn, type of burn, and degree of burn. All information was collected and recorded by a trained investigator from the patient file and from interviews with patients and relatives. Following the opinion of a burn specialist, 120 min was taken as the cut-off point for the interval between a burn and the start of treatment, and this variable was defined as binary (less than 120 min versus 120 min or more) [14, 22,23,24].
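The dichotomization described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the 0/1 coding, and the example intervals are assumptions made here; only the 120 min cut-off comes from the text.

```python
CUTOFF_MINUTES = 120  # cut-off stated in the text


def delay_group(interval_minutes: float) -> int:
    """Return 0 if treatment started in under 120 min, 1 otherwise.

    The 0/1 coding is an assumption of this sketch; the article does not
    state which group was treated as the "positive" class.
    """
    return 0 if interval_minutes < CUTOFF_MINUTES else 1


# Hypothetical intervals in minutes (not study data).
example_intervals = [35, 119, 120, 400]
example_labels = [delay_group(m) for m in example_intervals]
```

Note that an interval of exactly 120 min falls in the "120 min or more" group, matching the definition in the text.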
Data pre‐processing and dealing with missing values
Before fitting the models, the data were checked for missing values and outliers. The proportion of missing data per variable ranged from 0 to 15.4 %; the variable with the most missing data was time of accident (15.4 %). Variables with missing values were imputed using CART regression trees and their mode. We used anomaly detection to identify outliers. Anomaly detection provides important information for outlier detection in various applications. With a threshold of 2 for the anomaly detection, no records were flagged as outliers. For easier interpretation of the results, associate, bachelor's and master's degrees were combined into a single “college education” group for the education variable. The demographic statistics and summaries of the variables in the data analysis are shown separately for the two groups of the response variable in Table 1.
In this paper, DT, RF, SVM and logistic regression models were used to identify the factors involved in the interval between a burn and the start of treatment in burn patients referred to Imam Khomeini Hospital from 2012 to 2015. Each of the methods is described briefly below.
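As a rough illustration of the simpler half of the imputation step, the sketch below fills missing categorical values with the mode and reports the fraction missing. It does not reproduce the CART-based imputation mentioned in the text; the function names and the toy values are assumptions made here, not study data.

```python
from collections import Counter


def missing_fraction(values):
    """Share of entries that are missing (represented here as None)."""
    return sum(v is None for v in values) / len(values)


def impute_mode(values):
    """Replace None entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to learn a mode from
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


# Hypothetical "time of accident" column with two missing entries.
time_of_accident = ["night", None, "morning", "night", None, "night"]
filled = impute_mode(time_of_accident)
```

Mode imputation is easy to reason about but ignores relationships between variables, which is one reason a tree-based imputation may be preferred for variables with many missing values.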
One of the simplest and most common classification techniques is the DT. The main goal of the DT (like other classification techniques) is to build a model that can predict the values of the response variable. The DT is made up of nodes and partitions. The construction of the tree begins with all training data in the first node. Then, the first partition divides the data into two or more daughter nodes based on a predictor variable. The DT has three types of nodes:
Root node: It has no input branch and the number of its output branches can be zero or more.
Intermediate node: It consists of one input branch and two or more output branches.
Final node or leaf: It consists of one input branch and has no output branch.
In the DT, a category is allocated to each final node.
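The node types listed above can be sketched as a small data structure. This is a hand-built toy tree, assuming categorical predictors; the class name, the feature names, the split values, and the labels are hypothetical and are not the tree fitted in the study.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    feature: Optional[str] = None   # predictor used to split; None at a leaf
    children: Dict[str, "Node"] = field(default_factory=dict)  # output branches
    label: Optional[str] = None     # category allocated to a final node

    def is_leaf(self) -> bool:
        # A node with no output branches is a final node (leaf).
        return not self.children


def predict(node: Node, record: Dict[str, str]) -> Optional[str]:
    """Follow the branches matching the record until a leaf is reached."""
    while not node.is_leaf():
        node = node.children[record[node.feature]]
    return node.label


# Root node -> one intermediate node -> leaves (all values hypothetical).
toy_tree = Node(
    feature="place_of_occurrence",
    children={
        "home": Node(
            feature="time_of_accident",
            children={
                "day": Node(label="<120 min"),
                "night": Node(label=">=120 min"),
            },
        ),
        "outside": Node(label=">=120 min"),
    },
)

toy_prediction = predict(
    toy_tree, {"place_of_occurrence": "home", "time_of_accident": "night"}
)
```

A fitted tree would choose each split from the training data (for example by an impurity criterion); here the splits are fixed by hand only to show how a record travels from the root to a leaf.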
The RF method is a non-parametric (model-free) statistical method for classification and regression analysis based on a recursive partitioning algorithm. The RF algorithm uses a set of classification trees. This method is very effective in selecting a set of predictor variables that best express the phenotypes of a disease. The RF method is also useful when predictor variables are nonlinearly related to the disease, because it places no constraints on the relationship between predictor and response variables. Such methods are often compatible with genetic heterogeneity, in that individual models are automatically fitted to subsets of the data defined by early partitions in the tree. The simplicity and interpretability of the RF method, its flexibility in handling a large number of predictor variables with a limited sample size, and its ability to account for genetic heterogeneity have increased its application in genetic studies. In addition to prediction, it is used to identify the most important variables.
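The idea of combining many trees can be sketched with a deliberately simplified ensemble: each "tree" here is a single split (a stump) fitted on a bootstrap sample using a random subset of features, and the ensemble predicts by majority vote. This is an assumption-laden toy, not the RF implementation used in the study; all names and the data are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels, features):
    """One-split tree: keep the categorical feature that misclassifies least."""
    best = None
    for f in features:
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[f], []).append(y)
        rule = {value: majority(ys) for value, ys in groups.items()}
        errors = sum(rule[row[f]] != y for row, y in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, f, rule)
    _, feature, rule = best
    # The overall majority class is a fallback for values unseen in the sample.
    return feature, rule, majority(labels)


def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    names = sorted(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # bootstrap sample
        feats = rng.sample(names, n_features)            # random feature subset
        forest.append(
            fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats)
        )
    return forest


def predict_forest(forest, row):
    """Majority vote over the individual trees."""
    votes = [rule.get(row[f], default) for f, rule, default in forest]
    return majority(votes)


# Hypothetical, perfectly separable toy data (not study data).
toy_rows = [{"transfer": "home", "time": "night"}] * 10 + [
    {"transfer": "clinic", "time": "day"}
] * 10
toy_labels = ["late"] * 10 + ["early"] * 10
toy_forest = fit_forest(toy_rows, toy_labels)
```

Real random forests grow deep trees and draw a fresh feature subset at every split; the stump version keeps the two sources of randomness (bootstrap rows, random features) visible in a few lines.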
SVMs are commonly used for problems with two classes. In this algorithm, two parallel hyperplanes are placed on the boundary of the two data classes, and the problem is to find the maximum margin between these hyperplanes and, as a result, between the two classes of data. Accordingly, the two hyperplanes are moved apart until they touch the data points. The SVM is a classifier that belongs to the kernel methods of machine learning, and it generalizes well. The main idea is that, assuming the classes are linearly separable, hyperplanes can be found that separate them. When the data are not linearly separable, nonlinear kernels are used to map the data to a higher-dimensional space in which they can be separated linearly. Different kernels can be used for SVM, such as the RBF and linear kernels. SVMs are among the best-known methods for classifying data by means of a statistical model. One of the difficulties in building a nonlinear support vector classifier is the choice of the kernel and its parameters. Well-known kernel functions, such as the polynomial, Gaussian and sigmoid kernels, require parameter tuning for optimal performance. Sequential minimal optimization is one of the well-known methods for training this classifier in a reasonable time.
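The kernel functions named above can be written out directly. The sketch below uses common textbook forms; the parameter names (gamma, degree, coef0) and their default values are assumptions chosen for illustration, not settings reported in the study.

```python
import math


def linear_kernel(x, z):
    """Plain inner product."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, degree=2, coef0=1.0):
    return (linear_kernel(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(x, z, gamma=0.5, coef0=0.0):
    return math.tanh(gamma * linear_kernel(x, z) + coef0)


def decision_value(x, support_vectors, dual_coefs, bias, kernel):
    """f(x) = sum_i c_i * K(s_i, x) + b, where c_i = alpha_i * y_i.

    The sign of f(x) gives the predicted class.
    """
    return sum(c * kernel(s, x) for s, c in zip(support_vectors, dual_coefs)) + bias


# Hypothetical one-dimensional classifier with two support vectors.
toy_value = decision_value(
    (2.0,), [(1.0,), (-1.0,)], [1.0, -1.0], 0.0, linear_kernel
)
```

Swapping `linear_kernel` for `rbf_kernel` in `decision_value` changes the decision boundary from a hyperplane in the input space to a nonlinear one, without changing the rest of the computation.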
Logistic regression is a standard statistical model for binary responses. In this method, the log-odds of the response variable (the interval from burn to the start of treatment) is modeled as a linear function of the independent variables. The exponentiated slope parameters of a logistic model can be interpreted as odds ratios. A linear structure, simple interpretation, and the wide availability of suitable software are the most important advantages of the LR model.
All models were fitted with the variables introduced in Table 1. 70 % of the data were used as training data and the remaining 30 % were used as test data. Statistical analysis was performed using R software.
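In the usual formulation, the log-odds of the outcome are linear in the predictors, so exponentiating a coefficient gives an odds ratio. The sketch below shows that relationship numerically; the function names and the coefficient values are hypothetical and are not estimates from the study.

```python
import math


def predicted_probability(intercept, coefs, x):
    """P(y = 1 | x) = 1 / (1 + exp(-(b0 + sum_j b_j * x_j)))."""
    eta = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-eta))


def odds(p):
    return p / (1.0 - p)


def odds_ratio(coef):
    """Odds ratio for a one-unit increase in the corresponding predictor."""
    return math.exp(coef)


# Hypothetical model with one binary predictor.
b0, b1 = -1.0, 0.7
p_unexposed = predicted_probability(b0, [b1], [0])
p_exposed = predicted_probability(b0, [b1], [1])
ratio_from_probabilities = odds(p_exposed) / odds(p_unexposed)
```

Here `ratio_from_probabilities` agrees with `odds_ratio(b1)` up to floating-point error, which is why a slope is reported as an odds ratio only after exponentiation.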
Implementation and Performance Criteria
In this study, the data were randomly split into training data (70 %) and test data (30 %), and this process was repeated 100 times. The means of the sensitivity, specificity, overall accuracy, positive and negative predictive values, and positive and negative likelihood ratios over these 100 repetitions were then used to compare the models. The classification models measure the importance of a variable by the percentage increase in prediction error: a variable is considered the most important if removing it produces the largest increase in error. After their importance is scored, the variables are ranked accordingly.
In this study, 47.8 % of patients were under 14 years of age, 54.2 % were men, 52.7 % were married, and 78.9 % were unemployed. About half of the participants (50.6 %) had an income below $200. Further information, broken down by the two groups of the response variable, is provided in Table 1.
Table 2 shows the most important factors associated with the interval between a burn and the start of treatment, based on the DT, RF and SVM models. The variables common to these models are place of occurrence, time of the accident, location of transfer to hospital and income.
Table 3 presents the performance of the different models. In the DT model, total accuracy, sensitivity, positive likelihood ratio, negative predictive value and positive predictive value were higher than in the other models. However, the specificity of the SVM and RF models was higher than that of the other models, and the mean negative likelihood ratio of the SVM model was higher than that of the other models.
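The performance criteria above follow from the four cells of a confusion matrix. The sketch below computes them and shows one way to draw repeated 70/30 splits; it is an illustration under stated assumptions (function names, the rounding rule for the split, and the example counts are invented here), and it does not guard against zero denominators, which occur for example when specificity equals 1.

```python
import random


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for a binary outcome."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def performance(tp, tn, fp, fn):
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sens,
        "specificity": spec,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "lr_pos": sens / (1 - spec),   # positive likelihood ratio
        "lr_neg": (1 - sens) / spec,   # negative likelihood ratio
    }


def split_70_30(n, rng):
    """Random index split: about 70 % training, the rest test."""
    idx = list(range(n))
    rng.shuffle(idx)
    cut = round(0.7 * n)
    return idx[:cut], idx[cut:]


def mean_metrics(runs):
    """Average each metric over repeated splits."""
    return {k: sum(r[k] for r in runs) / len(runs) for k in runs[0]}


# 100 repeated splits of 389 records (the sample size given in the text).
rng = random.Random(0)
splits = [split_70_30(389, rng) for _ in range(100)]

# Hypothetical confusion-matrix counts, not results from the study.
example = performance(tp=40, tn=45, fp=5, fn=10)
```

In a full analysis each of the 100 splits would be used to fit every model on the training indices and to compute `performance` on the test indices, with `mean_metrics` giving the averages that are compared across models.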
In this study, data mining and LR techniques were used to investigate the factors affecting the interval from burn to the start of treatment in burn patients in Imam Khomeini Hospital of Kermanshah city. Different models yielded different results, but on most indicators the DT model performed better. In the DT model, total accuracy, sensitivity, positive likelihood ratio, negative predictive value and positive predictive value were higher than in the other models. However, the specificity of the SVM and RF models was higher than that of the other models, and the mean negative likelihood ratio of the SVM model was higher than that of the other models. According to the results of the DT model, the most important factors affecting the interval from burn to the start of treatment were location of transfer to hospital, degree of burn, time of accident, religion, place of occurrence, income, province of residence, burnt limbs, education, and history of burn. The time interval between a burn and the start of treatment is one of the most important factors in the treatment process of burn patients, yet no comprehensive study of it has been conducted so far [14, 15, 32]. Machine learning methods, especially the DT model, have performed well in studies [33,34,35,36,37,38] in the field of burns and in other areas of health science. Because no similar study has been done on this question, the variables identified here as important (location of transfer to hospital, degree of burn, time of accident, religion, place of occurrence, income, province of residence, burnt limbs, education, and history of burn) could not be checked against earlier findings. This study had two limitations. The first is related to the cross-sectional design: the observed associations do not establish causality.
The second limitation of our study was that due to the lack of sufficient literature on the interval between burns and initiation of treatment in burn patients, it was not possible to compare the clinical results of this study with other studies.
The results of the current study indicated that the DT model is better than the other models at determining the factors affecting the interval between a burn and the start of treatment in burn patients. The findings showed that the most important factors affecting this interval were location of transfer to hospital, degree of burn, time of accident, religion, place of occurrence, income, province of residence, burnt limbs, education, and history of burn.
Availability of data and materials
To access the data, readers should coordinate with Imam Khomeini Hospital, which is under the supervision of Kermanshah University of Medical Sciences. Readers can also contact the first author and the corresponding author via the following emails: email@example.com, firstname.lastname@example.org.
CART: Classification and Regression Trees
SVM: Support Vector Machine
TBSA: Total Body Surface Area
Nabovati E, Azizi A, Abbasi E, Vakili-Arki H, Zarei J, Razavi A. Using data mining to predict outcome in burn patients: a comparison between several algorithms. Health Inf Manage. 2014;10(6):799.
Kumar S, Ali W, Verma AK, Pandey A, Rathore S. Epidemiology and mortality of burns in the Lucknow Region, India—a 5 year study. Burns. 2013;39(8):1599–605.
WHO. Available from: https://www.who.int/violence_injury_prevention/other_injury/burns/en/.
James SL, Lucchesi LR, Bisignano C, Castle CD, Dingels ZV, Fox JT, et al. Epidemiology of injuries from fire, heat and hot substances: global, regional and national morbidity and mortality estimates from the Global Burden of Disease 2017 study. Injury prevention. 2019.
Vaghardoost R, Kazemzadeh J, Dahmardehei M, Rabiepoor S, Farzan R, Kheiri AA, et al. Epidemiology of acid-burns in a major referral hospital in Tehran, Iran. World journal of plastic surgery. 2017;6(2):170.
Eljaiek R, Dubois M-J. Hypoalbuminemia in the first 24 h of admission is associated with organ dysfunction in burned patients. Burns. 2013;39(1):113–8.
Sheppard N, Hemington-Gorse S, Shelley O, Philp B, Dziewulski P. Prognostic scoring systems in burns: a review. Burns. 2011;37(8):1288–95.
Alaghehbandan R, Rossignol AM, Lari AR. Pediatric burn injuries in Tehran, Iran. Burns. 2001;27(2):115–8.
Anlatıcı R, Özerdem ÖR, Dalay C, Kesiktaş E, Acartürk S, Seydaoğlu G. A retrospective analysis of 1083 Turkish patients with serious burns. Burns. 2002;28(3):231–7.
Ansari-Lari M, Askarian M. Epidemiology of burns presenting to an emergency department in Shiraz, South Iran. Burns. 2003;29(6):579–81.
Novelli B, Melandri D, Bertolotti G, Vidotto G. Quality of life impact as outcome in burns patients. G Ital Med Lav Ergon. 2009;31(1 Suppl A):A58-63.
Kazemzadeh J, Rabiepoor S, Alizadeh S. The Quality of Life in Women with Burns in Iran. World journal of plastic surgery. 2019;8(1):33.
Echevarría-Guanilo ME, Gonçalves N, Farina JA, Rossi LA. Assessment of health-related quality of life in the first year after burn. Escola Anna Nery. 2016;20(1):155–66.
Branski LK, Herndon DN, Barrow RE. A brief history of acute burn care management. Total burn care: Elsevier; 2018. p. 1–7. e2.
Wolf SE, Rose JK, Desai MH, Mileski JP, Barrow RE, Herndon DN. Mortality determinants in massive pediatric burns. An analysis of 103 children with > or = 80 % TBSA burns (> or = 70 % full-thickness). Annals of surgery. 1997;225(5):554.
Aghaei A, Soori H, Ramezankhani A, Mehrabi Y. Factors related to pediatric unintentional burns: the comparison of logistic regression and data mining algorithms. Journal of Burn Care & Research. 2019;40(5):606–12.
Graham B, Bond R, Quinn M, Mulvenna M. Using data mining to predict hospital admissions from the emergency department. IEEE Access. 2018;6:10458–69.
Najafi-Ghobadi S, Najafi-Ghobadi K, Tapak L, Aghaei A. Application of data mining techniques and logistic regression to model drug use transition to injection: a case study in drug use treatment centers in Kermanshah Province, Iran. Substance Abuse Treatment, Prevention, and Policy. 2019;14(1):55.
Faisal M, Scally A, Howes R, Beatson K, Richardson D, Mohammed MA. A comparison of logistic regression models with alternative machine learning methods to predict the risk of in-hospital mortality in emergency medical admissions via external validation. Health informatics journal. 2018:1460458218813600.
Mandal SK. Performance analysis of data mining algorithms for breast cancer cell detection using Naïve Bayes, logistic regression and decision tree. International Journal Of Engineering And Computer Science. 2017;6(2).
van der Ploeg T, Nieboer D, Steyerberg EW. Modern modeling techniques had limited external validity in predicting mortality from traumatic brain injury. Journal of clinical epidemiology. 2016;78:83–9.
Barrow RE, Jeschke MG, Herndon DN. Early fluid resuscitation improves outcomes in severely burned children. Resuscitation. 2000;45(2):91–6.
Tanaka H, Hanumadass M, Matsuda H, Shimazaki S, Walter RJ, Matsuda T. Hemodynamic effects of delayed initiation of antioxidant therapy (beginning two hours after burn) in extensive third-degree burns. The Journal of burn care & rehabilitation. 1995;16(6):610–5.
Tanaka H, Matsuda T, Miyagantani Y, Yukioka T, Matsuda H, Shimazaki S. Reduction of resuscitation fluid volumes in severely burned patients using ascorbic acid administration: a randomized, prospective study. Archives of Surgery. 2000;135(3):326–31.
Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and regression trees: CRC press; 1984.
Chandola V, Banerjee A, Kumar V. Anomaly detection: A survey. ACM computing surveys (CSUR). 2009;41(3):1–58.
IBM. IBM Knowledge Center. Available from: https://www.ibm.com/support/knowledgecenter/SS3RA7_15.0.0/com.ibm.spss.modeler.help/anomalydetectionnode_general.htm.
Buntine W, Niblett T. A further comparison of splitting rules for decision-tree induction. Machine Learning. 1992;8(1):75–85.
Breiman L. Random forests. Machine learning. 2001;45(1):5–32.
Winham SJ, Colby CL, Freimuth RR, Wang X, De Andrade M, Huebner M, et al. SNP interaction detection with random forests in high-dimensional genetic data. BMC bioinformatics. 2012;13(1):164.
Hosmer Jr DW, Lemeshow S, Sturdivant RX. Applied logistic regression: John Wiley & Sons; 2013.
Gupta M, Gupta O, Goil P. Paediatric burns in Jaipur, India: an epidemiological study. Burns. 1992;18(1):63–7.
Abdar M, Kalhori SRN, Sutikno T, Subroto IMI, Arji G. Comparing Performance of Data Mining Algorithms in Prediction Heart Diseases. International Journal of Electrical & Computer Engineering (2088–8708). 2015;5(6).
BIRNBAUM EBD. Application of data mining techniques to healthcare data. Infection control and hospital epidemiology. 2004.
Jimenez F, Sanchez G, Juárez JM. Multi-objective evolutionary algorithms for fuzzy classification in survival prediction. Artificial intelligence in medicine. 2014;60(3):197–219.
Meng X-H, Huang Y-X, Rao D-P, Zhang Q, Liu Q. Comparison of three data mining models for predicting diabetes or prediabetes by risk factors. The Kaohsiung journal of medical sciences. 2013;29(2):93–9.
Patil BM, Joshi RC, Toshniwal D, Biradar S. A new approach: role of data mining in prediction of survival of burn patients. Journal of medical systems. 2011;35(6):1531–42.
Zhang S, Tjortjis C, Zeng X, Qiao H, Buchan I, Keane J. Comparing data mining methods with logistic regression in childhood obesity prediction. Information Systems Frontiers. 2009;11(4):449–60.
The researchers appreciate the experts and staff of the Clinical Research Development Center of Imam Khomeini and Dr Mohammad Kermanshahi Hospitals.
This study was partially funded by Kermanshah University of Medical Sciences, which also provided technical support for the present study.
Ethics approval and consent to participate
We obtained written informed consent from all participants; for illiterate participants and those under the age of 16, consent was obtained from parents or legally authorized representatives. The study was approved by the ethics committee of Kermanshah University of Medical Sciences with the code “Kuma. rec.1395.122”. The study adhered to relevant guidelines and regulations.
Consent for publication
The authors declare no conflict of interest.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Ahmadi-Jouybari, T., Najafi-Ghobadi, S., Karami-Matin, R. et al. Investigating factors affecting the interval between a burn and the start of treatment using data mining methods and logistic regression. BMC Med Res Methodol 21, 71 (2021). https://doi.org/10.1186/s12874-021-01270-5
- Start of treatment
- Random Forest
- Decision tree
- Support Vector Machine
- Logistic regression