BMC Medical Research Methodology

Background: Development of three classification trees (CT) based on the CART (Classification and Regression Trees), CHAID (Chi-Square Automatic Interaction Detection) and C4.5 methodologies for the calculation of probability of hospital mortality; the comparison of the results with the APACHE II, SAPS II and MPM II-24 scores, and with a model based on multiple logistic regression (LR).


Background
Stratifying the patients into risk groups, according to their severity, is essential for the comparison of treatments and the establishment of differences between different units or hospital centres. As a result, working in an intensive care unit (ICU) necessitates making prognoses for patients within the first 24 hours of their admission. Establishing a prognosis consists of assigning a probability of death by using variables commonly used for the diagnosis and treatment of critically ill patients [1].
Severity scores are classic tools used in establishing this probability. The most commonly used scores are the APACHE II (Acute Physiology and Chronic Health Evaluation II), the SAPS II (Simplified Acute Physiology Score II) and the MPM II-24 (Mortality Probability Models II-24) scores [2][3][4].
Other systems of severity classification based on different mathematical strategies have also been used [5].
In the last decade, classification trees (CT), which were developed more than 20 years ago, have gained importance because the decision rules that they generate can be interpreted immediately and are readily accepted by professionals in clinical practice [6].
A CT is a graphic representation of a series of decision rules. Beginning with a root node that includes all cases, the tree branches are divided into different child nodes that contain a subgroup of cases. The criterion for branching (or partitioning) is selected after examining all possible values of all available predictive variables. In the terminal nodes (the "leaves" of the tree), a grouping of cases is obtained, such that the cases are as homogeneous as possible with respect to the value of the dependent variable [7].
The different CT types are distinguished by the manner of node partitioning. In the specific case of CARTs (Classification And Regression Trees), possibly the most widely used CT in medicine, an impurity function (the so-called Gini index) is calculated, and for each division of the tree, the variable and its cut-off value are defined such that the decrease in the impurity function is the greatest [8]. There are many types of CTs (or improved versions) such as CHAID (Chi-square Automatic Interaction Detection) and C4.5 (developed from the so-called Concept Learning Systems). Table 1 illustrates, in a schematic fashion, the particularities of these CTs. A CT has a growth phase, a pruning phase (removal of branches that do not provide general information to the system) and a selection of the optimal tree [8].
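As an illustration of the partitioning criterion described above, the following minimal sketch computes the Gini impurity of a node and searches one continuous variable for the cut-off that gives the largest decrease in impurity. It is not the implementation used by any of the programs in this study; the function names and the toy data are hypothetical, and candidate cut-offs are assumed to be midpoints between consecutive observed values.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: 1 - sum over classes of p_k squared."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Return (cutoff, impurity_decrease) for the best binary split x <= cutoff."""
    parent = gini(labels)
    n = len(labels)
    best_cut, best_gain = None, 0.0
    distinct = sorted(set(values))
    # Assumption: candidate cut-offs are midpoints between distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        cut = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= cut]
        right = [y for x, y in zip(values, labels) if x > cut]
        # Impurity of the children, weighted by the share of cases in each.
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        gain = parent - weighted
        if gain > best_gain:
            best_cut, best_gain = cut, gain
    return best_cut, best_gain

# Toy data, invented for illustration: a score and a 0/1 outcome (1 = died).
score = [3, 4, 5, 6, 12, 13, 14, 15]
died = [1, 1, 1, 0, 0, 0, 0, 0]
cutoff, gain = best_split(score, died)
print(cutoff, gain)
```

In a full tree, the same search would be repeated over all available predictive variables at every node, and the variable with the largest decrease would define the partition.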
The aim of the present study was to develop (with a population of critically ill patients) three classification trees (based on CART, CHAID and C4.5 methodologies) to calculate the probability of hospital mortality and to compare these trees with each other, with the classic scores (APACHE II, SAPS II and MPM II-24) and with a model based on multiple logistic regression.

Methods
This is a retrospective study carried out using the database of a mixed ICU (with medical and surgical services) of 14 beds located at the University Hospital Arnau de Vilanova of Lleida. The ethical committee of the hospital was informed that the study was being carried out, and informed consent was not deemed necessary, since all the variables were collected for the diagnosis and treatment of the patients and their anonymity was assured at all times.

Database
Data collected over ten years (from January 1997 to December 2006) were used. In this study, all patients were over the age of 16 years and remained in the ICU for more than 24 hours. Patient records with incomplete data were not used.
A random partition, in a 70:30 ratio, was made to establish the development and the validation sets, respectively.
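A random 70:30 partition of this kind can be sketched as follows. This is only an illustration under stated assumptions: the function name, the fixed seed and the toy record identifiers are hypothetical, and the text does not specify how the original partition was generated.

```python
import random

def split_development_validation(record_ids, dev_fraction=0.7, seed=0):
    """Shuffle the records and cut them into development and validation sets."""
    rng = random.Random(seed)      # fixed seed so the partition is reproducible
    shuffled = list(record_ids)    # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    n_dev = round(dev_fraction * len(shuffled))
    return shuffled[:n_dev], shuffled[n_dev:]

# Toy identifiers, invented for illustration.
ids = list(range(1000))
dev, val = split_development_validation(ids)
print(len(dev), len(val))
```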
Data concerning age, sex, length of stay in the ICU and procedures specific to the ICU were used. The outcome variable of interest was the probability of hospital mortality. The patients were divided according to their diagnostic groups following the Knaus classification [9]. Six diagnostic groups were established according to the case mix and level of severity of the ICU, including two trauma categories of TBI (traumatic brain injury) and Multiple trauma (multiple trauma without brain injury), Respiratory (chronic respiratory problems with decompensation), Neurological (ischemic or hemorrhagic strokes), Surgery (surgical problems not included in other categories) and Medicine (medical pathology not included in other categories).
Each patient's medical records and laboratory database files were used to obtain information pertaining to baseline (at ICU admission) demographics, pre-existing comorbidities and scores (APACHE II, SAPS II and MPM II-24). The data were then compiled (by manual recording) into a single data set using a relational database management system (Microsoft Access©). APACHE II, SAPS II and MPM II-24 scores were determined by the worst value found during the first 24 hours after ICU admission [2][3][4].
The presence of acute renal failure was defined (according to the MPM II-24 model) by serum creatinine levels above 2 mg/dL [4]. The antecedents of chronic organ insufficiency (defined according to the APACHE II model) were included in the variable COI [2].

Logistic models and classification trees
Models were created with the development set and were subsequently checked in the validation set.
Working with the development set, first, a univariate analysis was performed for all the variables included in the three scores to select those that predicted survival. Those that were statistically significant predictors were included in the development of the multivariable models. We used a model of multiple logistic regression (LR) with forward stepwise selection of variables [10].
The computer programs used for creating the CTs are presented in Table 1. The program WEKA (a project of Waikato University) is freely accessible and includes a CT module, named J48, that includes CART and C4.5 [11].
Answer-Tree©, a module of SPSS (Statistical Package for the Social Sciences), includes options for CART and CHAID, and the program DTREG© (version 3.5) is based on a CART-type methodology.
To create the three types of CTs, a cross-validation system with ten partitions was used, and the only common restriction for terminating the growth of the tree was the minimum number of subjects in the terminal nodes (which was fixed at 50 patients).
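The two settings mentioned above can be sketched in a few lines. The example below builds ten cross-validation partitions and expresses the minimum terminal node size as a simple check. It is a sketch under assumptions: the function names are hypothetical, and the text does not describe how each program uses the partitions internally.

```python
import random

def k_fold_indices(n_cases, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = random.Random(seed)
    order = list(range(n_cases))
    rng.shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k folds of near-equal size
    for i in range(k):
        test_idx = folds[i]
        # Training cases are all cases that are not in the held-out fold.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def may_split(n_left, n_right, min_leaf=50):
    """Stopping rule: accept a split only if both child nodes keep min_leaf cases."""
    return n_left >= min_leaf and n_right >= min_leaf

splits = list(k_fold_indices(205, k=10))
print(len(splits), may_split(60, 49), may_split(50, 50))
```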

Statistical analysis
The variables are presented as the mean (standard deviation), the median (interquartile range) or as a percentage. For a comparison of the variables, the chi-squared (χ²) test was used for categorical variables, and the ANOVA test or non-parametric Mann-Whitney test was used for continuous variables, depending on the characteristics of the distribution.
To compare the different models, we measured their precision (discrimination and calibration) with the Brier score. The discrimination was measured by calculating the percentage of correctly classified patients (PCC), using a probability of 0.5 as the cut-off point, and by the area under the ROC curve (AUC) [12]. For calibration, the Hosmer-Lemeshow C test (HL-C) was used [13], together with the calibration curve and the standardized mortality ratio (SMR) [14]. These calculations were made both in the development set and in the validation set. We used a correlation matrix (Spearman correlation coefficients) and the Bland-Altman test to analyse the individual probabilities generated by the CT models [15].
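For readers less familiar with these measures, the sketch below shows how the Brier score, the PCC, the AUC and the SMR can be computed from observed outcomes and predicted probabilities. It is an illustration, not the SPSS procedure used in the study. The function names and the toy data are hypothetical; patients with a probability equal to the cut-off are assumed to be classified as non-survivors, and the AUC is computed by direct pairwise comparison, which is only practical for small samples.

```python
def brier_score(y, p):
    """Mean squared difference between predicted probability and outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)

def pcc(y, p, cutoff=0.5):
    """Percentage of patients correctly classified at the given cut-off."""
    correct = sum((pi >= cutoff) == bool(yi) for yi, pi in zip(y, p))
    return 100.0 * correct / len(y)

def auc(y, p):
    """Probability that a non-survivor has a higher predicted risk than a
    survivor, counting ties as one half."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def smr(y, p):
    """Observed deaths divided by expected deaths (sum of probabilities)."""
    return sum(y) / sum(p)

# Toy data, invented for illustration (1 = died in hospital).
y = [1, 1, 0, 0, 1, 0]
p = [0.9, 0.6, 0.4, 0.2, 0.3, 0.7]
print(brier_score(y, p), pcc(y, p), auc(y, p), smr(y, p))
```

An SMR above 1 indicates that more deaths were observed than the model predicted.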
The statistical analysis was carried out with the program SPSS (version 14.0).

Demographic characteristics
Among 2823 patients, 139 were excluded due to incomplete or erroneous data (4.9%), leaving 2684 eligible patients. The development group consisted of 1880 patients (70%) and the validation group consisted of 804 (30%).
The demographic characteristics of the patients are shown in Table 2; there were no major differences between the development and the validation groups. Some characteristics are particular to the ICU, such as the low proportion of scheduled patients (6.5%), the prolonged length of stay (median of 7 days) and the high mortality rate (31.4%). Table 3 shows the evolution (during the 10 years observed) of hospital mortality, the severity scores and the participation percentage in the development set. There were no significant differences over the period, apart from a steady increase in the number of admissions.

Variable selection: univariate analysis
A total of 24 variables showed significant differences between the survivors and non-survivors (Table 4). The table also shows the scores in which the different variables were included. No significant differences were found for respiratory frequency (APACHE II), serum potassium (APACHE II and SAPS II), hematocrit (APACHE II), leukocyte count (APACHE II and SAPS II), bilirubin (SAPS II), PaO2 (MPM II-24) or antecedents of cirrhosis and neoplasia (MPM II-24).
Only the COI variable reflected the chronic illnesses of the patient. For variables related to diagnoses, the surgery group was associated with a greater possibility of hospital mortality, while the trauma group was associated with a lower likelihood of mortality. Table 5 shows the LR model including 9 variables (Continuous: Age, HR, Glasgow and (A-a)O2 gradient. Discrete: Inotropic therapy, MV, Acute renal failure, COI and Trauma) selected from the 24 variables.

Classification Tree Models
The variables common to the three CTs and the LR model are inotropic therapy (INOT), Glasgow value, (A-a)O2 gradient ((A-a)O2), age and COI. [...] decision rules with assigned probabilities ranging from 5.9% to a maximum of 71.3%.
It is noted that a CT can use the same variables in various decision rules and that, for continuous variables, different cut-off points can be selected. Figure 2 illustrates the CT based on the CHAID methodology. It used seven variables, and it also began with the variable INOT. It generated fifteen decision rules with assigned probabilities ranging from 0.7% to a maximum of 86.4%. In this type of CT, the Glasgow value, age and (A-a)O2 variables were divided into intervals with more than two possibilities. Figure 3 depicts the C4.5 model, which used six variables (the five common variables and the MAP, which is not included in the LR model) and generated ten decision rules.

[Figure: Classification tree by CART algorithm]

Comparison of model properties
The three CT models and the LR model were also compared with those generated using the APACHE II, SAPS II and MPM II-24 scores.
The severity scores were applied to the whole population (development and validation sets) without recalibration. Table 6 shows the values for the properties evaluated. All models achieved acceptable discrimination (an AUC greater than 0.70) in both the development and the validation sets. Figure 4 presents the calibration curves of the models. Notably, some curves were displaced towards the observed mortality; this coincided with an SMR greater than 1 (with a 95% CI that does not include 1) (Table 6). The models based on the CTs were better calibrated, as seen both in the calibration curves and in the SMR values (Table 6).
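The points of a calibration curve and the HL-C statistic can be obtained from the same grouping of patients by predicted risk. The sketch below is an illustration under assumptions, not the procedure used to produce Figure 4 or Table 6: the function names and toy data are hypothetical, the groups are formed by sorting on predicted probability and cutting into near-equal parts, and no p-value is computed.

```python
def calibration_groups(y, p, groups=10):
    """Sort by predicted risk, cut into near-equal groups and return
    (n, observed_deaths, expected_deaths) for each group."""
    pairs = sorted(zip(p, y))
    n = len(pairs)
    out = []
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if chunk:
            observed = sum(yi for _, yi in chunk)
            expected = sum(pi for pi, _ in chunk)
            out.append((len(chunk), observed, expected))
    return out

def hosmer_lemeshow_c(y, p, groups=10):
    """HL-C statistic: squared observed-minus-expected differences for deaths
    and survivors, each divided by its expected count, summed over groups.
    Assumes every group has expected deaths strictly between 0 and n."""
    stat = 0.0
    for n_g, obs, exp in calibration_groups(y, p, groups):
        stat += (obs - exp) ** 2 / exp
        stat += ((n_g - obs) - (n_g - exp)) ** 2 / (n_g - exp)
    return stat

# Toy data, invented for illustration: two risk strata of ten patients each.
p_toy = [0.1] * 10 + [0.5] * 10
y_good = [1] + [0] * 9 + [1] * 5 + [0] * 5      # observed matches expected
y_bad = [1] * 3 + [0] * 7 + [1] * 5 + [0] * 5   # excess deaths in low-risk group
c_good = hosmer_lemeshow_c(y_good, p_toy, groups=2)
c_bad = hosmer_lemeshow_c(y_bad, p_toy, groups=2)
print(c_good, c_bad)
```

Plotting observed against expected mortality per group gives the calibration curve; a larger statistic indicates poorer agreement between the two.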
All the models correctly classified approximately 75% of the cases evaluated. Table 7 shows the correlations between the probabilities calculated with the 3 CTs and the LR model (all of them statistically significant). Figure 5 shows the Bland-Altman results obtained in the validation set by comparing the probabilities determined by the CART CT with those of the LR, CHAID and C4.5 CTs.

Comparison of individual probabilities generated by the CT models
We observed that there were patients for whom the difference in the probabilities exceeded the acceptable limit of the test. There were 116 patients included in the comparison of the CART and CHAID CTs, and 245 in the comparison of the CART and C4.5 CTs. The differences can be partly attributed to the behaviour of the Glasgow variable (different cut-off points or partitions) and to the influence of the COI variable in the different divisions of the tree branches.
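A comparison of this kind can be sketched as follows. The example assumes that the acceptable limits are the usual Bland-Altman limits of agreement (mean difference ± 1.96 standard deviations of the differences); the text does not state the limits used. The function name and the toy probabilities are hypothetical.

```python
from statistics import mean, stdev

def bland_altman(p_a, p_b):
    """Return the mean difference, the limits of agreement and the indices
    of patients whose difference falls outside those limits."""
    diffs = [a - b for a, b in zip(p_a, p_b)]
    bias = mean(diffs)
    sd = stdev(diffs)
    # Assumption: limits of agreement are bias +/- 1.96 SD of the differences.
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    outside = [i for i, d in enumerate(diffs) if d < lower or d > upper]
    return bias, (lower, upper), outside

# Toy probabilities from two models, invented for illustration:
# the models agree on nine patients and disagree strongly on the last one.
p_model_a = [0.2] * 9 + [0.8]
p_model_b = [0.2] * 9 + [0.3]
bias, limits, outside = bland_altman(p_model_a, p_model_b)
print(bias, limits, outside)
```

Inspecting the decision rules that apply to the patients listed in `outside` is one way to trace a disagreement back to specific variables or cut-off points.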
In some patients, the different models assign different probabilities of death. In a validation with records not used in the development phase, these differences in assigned probability left discrimination similar across the models but produced differences in calibration (which was better for the CTs).

Discussion
The results illustrate that the ICU had particular demographic characteristics due to its case mix, with a low percentage of scheduled patients, a long length of stay and high mortality. These data are important when it comes to appraising and generalising the results obtained with our database [16].
The results yielded mortality rates that were higher than expected (according to the classic APACHE II, SAPS II and MPM II-24 scores), which can be partly attributed to these individual characteristics [17]. However, this finding also necessitates a recalibration of these models in order to achieve a correct stratification of the patients' risk of hospital mortality [18].
Previously, CTs have been used with critical patients, e.g., for the calculation of the probability of death from coronary pathology [19], intracerebral haemorrhages [20] or traumatic brain injuries [21], for the prediction of persistent vegetative states [22] or (as in our study) for stratifying the probability of death in a general population of ICU patients [23,24].
These five variables are capable of stratifying the examined population of critical patients (for example, as in the CART CT), using eight simple decision rules, with acceptable properties of discrimination and calibration.
We also observed that the three CT types exhibited differences. Even when incorporating the five common variables mentioned earlier, these CTs differed in the first variable to be selected, in the details of "branching", in the cut-off points (and subgroups), in the order of variable selection and in the incorporation of other variables.

[Figure: Calibration curves for the classification models]

The CT software allows the user to adjust the levels and the number of partitions for each branch in order to obtain more complex models [7]. In our case, the only restriction (in the 3 CT models) was that the minimum number of subjects in the terminal nodes should be 50 patients.
We cannot state which CT was optimal (since they had similar general properties). The CART and CHAID CTs were similar in their order of partitioning, although the CHAID CT (due to its inherent characteristics) separated the continuous variables into more than two possibilities and generated more decision rules. The CART CT was simpler, while the CHAID CT showed greater complexity (and also selected more variables). Different CTs can select different first variables, and in the C4.5 CT, the first variable, the Glasgow point value, was different from that of the other CTs; the C4.5 CT also incorporated different variables. The analysis of the individual probabilities generated by the different CTs (in spite of a good correlation) assisted in the identification of possible "problem" variables, e.g. the Glasgow point value and the COI variable, in their order of appearance in the decision rules generated.
The CTs most widely used for medical applications have been based on the CART methodology, but studies that use other CT types have started to appear [26][27][28].
When there is a classification problem, there is no model that can be chosen a priori to be the best [29]. Even with the same information, different CTs develop models with different interpretations [30]. Based on our data, the CTs do not compete with the classic scores in their function of calculating individual probabilities. In the case of a large database, the CTs generated would be too complex to interpret and use with regularity (many branches and decision rules). The immediate interpretative advantage of CTs is only obtained with simple trees [31,32].
Our study had several limitations. In the first place, it was carried out in only one ICU and with a database spanning ten years (although no variation was observed during the period of study). It would also have been possible to employ more methodologies for comparison or to improve those that were used, by incorporating relations and/or ranks of a priori variables, as do the classic scores.
As pointed out by one of the reviewers, we found a great difference between the observed and expected mortality in the validation group in the LR model. The LR-based model could have been built using the variables as categorical, thus minimizing the possible effect that outlier values (when the variables are used as continuous) have on the predicted outcome. One of the advantages of CT-based models is that they automatically convert continuous variables into categorical ones, and their cut-off points could also be used to create an LR model with discrete variables.
We must mention the effort at Waikato University (New Zealand) regarding the free-access program WEKA, which strives to collect (in a single tool) the majority of the methodologies that are used to classify, select and group variables [11].
There are models with different methodologies that could improve the individual properties and achieve greater precision in classification [33,34].
The principal advantage of CTs is that they are easy to interpret. However, this advantage could turn into an obstacle, since we tend to choose as the optimal CT the one that most closely matches the clinical reasoning of the program user [35]. An understanding of the clinical problem is necessary in order to adequately interpret CTs.
One contribution of our effort was the demonstration that the CT methodology is not unique and that different CTs could be generated according to various methodologies. The CTs assisted in both selecting variables of greater importance in the problem of classification and determining the best cut-off points for the continuous variables.
We believe that CTs (e.g., the model based on CART) are mainly useful in obtaining homogeneous groups for the assignment of the probability of hospital mortality. These groups with different characteristics (defined by rules of classification that can be interpreted) can serve, for example, as a basis for the creation of new scores.
We intend to do further research including a multi-centre study, with the incorporation of more methodologies and the possible use of hybrid models. In order to generalise our results, external validation will be required [36].

Conclusion
The main benefit of CT analysis is the identification of a relatively small number of groups that are reasonably homogeneous with regard to the outcome. The CTs can be used in intensive care medicine for assisting in diagnosis and prognosis [37,38]. Those less familiar with CTs should realise that this is a class of methods including many different approaches, and that these different approaches may result in considerable differences in classifications.