CART and ODT-models
Tree-based models such as CART (Classification And Regression Trees) [22] are non-linear and non-parametric alternatives to linear models for regression and classification problems (such as linear regression, logistic regression, linear discriminant analysis, linear proportional hazard models). CART models are fitted by binary recursive partitioning of a multidimensional covariate space, in which the dataset is successively split into increasingly homogeneous subsets until a specified criterion is satisfied. For the first partition, CART searches the best possible place to split a continuous variable into two classes and defines two subspaces which maximize overall class separation (i.e. interclass variance of the dependent variable). Each of these subspaces subsequently serves as the basis for further partitioning independently of the others and so on. At each step the variable used for each split is selected from all the explicative variables so as to provide an optimal partition given the previous actions. Partitions' sequence is summarized by a binary tree. The root node of tree corresponds to the entire data space. Partitions of the space are associated with descendants of the root node. The leaves of the tree, or terminal nodes, correspond to subspaces which are not further partitioned. The stability of the procedure can be improved using Data resampling.
While CART-models are widely used as exploratory techniques they are less-commonly used for prediction. Trees generally rely on fewer assumptions than classical methods and handle a wide variety of data structures. Furthermore they are easy to use and to interpretate, and thus provide a wide range of application fields. The use of CART procedure has been considered by others in a variety of medical problems [22, 23] such as, for example, survival analysis [24–26], longitudinal analysis, diagnostic and prognostic studies or clinical trials [27–30].
One particular application is signal processing [31], in which the problem concerns the detection of multiple change points in the mean. The CART procedure can be used to estimate simultaneously the change-points and the means by recovering an underlying piecewise constant function f(t).
If m
k
are the means for each piecewise k, then t
k
are the change-points:
If we extend this point of view to the covariate space defined by geographic coordinates, CART estimates the "change-lines" (instead of change-points) of a piecewise constant function on R2. In other words, tree-based procedure can easily determine spatial patterns.
However, one limitation is that CART provides axis-parallel splits i.e. rectangular spatial patterns. Oblique decision trees (ODT) deal with this problem. Those algorithms produce oblique (and then polygonal) partitioning of the covariate space. However, oblique trees are less popular than axis-parallel trees because the splits are less straightforward to interpret and oblique procedures require greater computational complexity than axis-parallel algorithms. Finding the best oblique tree in the covariate space is a NP-Hard problem [32]. Therefore, existing ODT algorithms use deterministic heuristics or evolutionary algorithms (like the OC1 system [33]) to find appropriate hyperplanes for partitioning the covariate space [22, 34, 32, 33]. Comparisons of the different procedures are provided, for example, by Murthy [33] Cantu-Paz [34] and Bradley [35].
Despite this difficulty in RN, it is easier to find an oblique partition in the particular case of a space determined by the geographic coordinates, i.e. in R2. Evolutionnary or heuristic algorithms are not robust. They provide occasionally local minima [33] and therefore are not optimal procedures in R2. The ODT-algorithm we have developped is an optimal procedure to reach an optimal solution without using heuristics or evolutionary procedures.
ODT algorithm
The general purpose of the entire procedure consists on finding several partitions of the plane. We present the first step which allows finding the best oblique split of the plane. Going recursively, this algorithm will split the plane into several partitions, until reaching a specific criterion.
This subsection is organized as follow:
i. First, we will introduce how the plane is splitted into two adjacent partitions according to the interclass variance.
ii. Second, we will present how the finite set of oblique lines is determined, still within the first step of the entire procedure.
iii. Third, we will propose an optimization of this algorithm.
i. The splitting method proceeds as follows.
Consider, in the geographical space represented by the plane with an orthogonal basis {x, y} and a fixed origin O, npoints M
i
with coordinates {x
i
, y
i
}. These coordinates can represente the geographic coordinates of a point location provided by GPS. To each point M
i
a numeric random variable Z
i
(called explained or dependant variable) is associated with the observation z
i
Whereas the CART procedure partitions the plane according to a line parallel to the axis maximizing the interclass variance of z
i
, our procedure partitions the plane according to an oblique line
maximizing in the same way the interclass variance of z
i
.
To find this oblique line according to the direction
we have to define the perpendicular direction uand the angle
.
From a general viewpoint, for a fixed direction
the procedure has to:
Orthogonally projects the points M
i
on the (O, u) direction, defining the coordinate u
i
;
Considers all the u
i
as potential threshold in the way to split the plane with the direction
perpendicular to the direction uand going through u
i
;
Finds the optimal split between two adjacent classes, maximizing the interclass variance of z
i
according to theses projections.
ii. The splitting method provides a finite set of cluster proceeding as follows.
Before detailing the algorithm, we have to study the different splitting directions
i.e. to specify wich angles θ have to be analyzed. For a global solution the algorithm can scan all the oblique directions (i.e. all the θ) between zero and π. In a heuristic way one can also discretize this interval providing a finite number of angles θ. But these two methods are not optimal.
The optimal algorithm for an optimal solution is easy to implement. Obviously, two points M
i
(x
i
, y
i
) and M
j
(x
j
, y
j
) have the same projected coordinates on the (O, u) direction if and only if M
i
M
j
is perpendicular to (O, u) (Figure 1). Then the number of critical directions, defined by the θ
ij
angles, exists and is finite.
For each direction
passing through two points M
i
(x
i
, y
i
) and M
j
(x
j
, y
j
), we define φ
ij
the angle between the line M
i
M
j
and the x-axis.
Then:
As previously defined, θ is the angle between the x-axis and the direction (O, u) perpendicular to M
i
M
j
.
Then for each couple (M
i
, M
j
), we have
Each critical angle θ
ij
defines an angular sector. Within each sector, the order of the coordinates projected on the (O, u) direction does not depend on this direction. For points M
i
and M
j
the difference (u
j
- u
i
) of their coordinates projected on (O, u) verifies:
(u
j
- u
i
) cos(φ
ij
) = (x
j
- x
i
) sin(θ - θ
ij
) (1)
with (u
j
- u
i
) = (y
j
- y
i
) sin(θ) for x
i
= x
j
⇔ φ
ij
=
Thus (u
j
- u
i
) depends continuously on θ. The sign of this difference cannot change within the angular sector since (u
j
- u
i
) = 0 only if θ = θ
ij
.
It follows that the interclass variances (and then the ODT procedure) is not modified within each sector. As a direct consequence of (1) the transition from a sector to another via the critical angle θ
ij
(Figure 2) induces the same order except the permutation of the two adjacent elements (u
i
, u
j
).
Note that for aligned points M
i
, M
j
and M
k
the algorithm has to permute the adjacent element group (u
i
, u
j
, u
k
). Similarly for parallel directions M
i
M
j
and M
k
M
l
, the algorithm has to permute at the same time the couples of adjacent elements (u
i
, u
j
) and (u
k
, u
l
).
Note again that all these angular sectors define as much covariates. Thus the procedure comes to the usual CART procedure. But the number of different critical angles is
and using CART this way over-consumes time and space. For example, in our application the number of point locations is n= 164, hence the number of different angular sectors is N= 13270.
iii. We present now an optimization of our algorithm. A less time consuming and more efficient algorithm is a stepwise analysis of the angular sector, ordered according to the observed θ
ij
. At each step the algorithm uses the previous calculus.
Because only two elements between two adjacent sectors are permuted only one interclass variance has to be reloaded, related to the single different split (or some interclass variances for the group of permuted couples, related to a few different splits). The procedure inherits the calculus of the other interclass variances from the previous sector with the exception of the interclass variance related to the single permutation.
Thus, the algorithm complexity is
(n
2 log n) in time and
(n) in space for one split. Finally, our algorithm splits the plane into two adjacent partitions as follows:
Arrange the x
i
;
Calculate and arrange the θ
ij
via the a
ij
;
Calculate
;
For each potential split of the first angular sector (corresponding to the x-axis), i.e. for each value of x
i
:
Calculate the ∑ z
i
for each class (on both sides of the threshold x
i
) and then the interclass variance, using the previous results;
If the calculated interclass variance is greater than the previous one, store the results;
For the next angular sector
Permute the corresponding x
i
x
j
(or the group of elements);
Calculate the ∑ z
i
only for the two classes generated by the split between x
j
and x
i
(or some splits for the group of permuted elements);
If the new interclass variance is greater than the previous optimum, store the results;
Until all sectors are scanned.
This algorithm goes on recursively until a specific criterion is reached and the Oblique Decision Tree is completed.
For simplicity we will not herein discuss special procedures of CART such as stopping rules, pruning algorithms or resampling methods; these are examined elsewhere [22, 36].
Dataset
Malaria is the major parasitic disease in the world affecting approximately 300–500 million individuals annually. About two percents of the individuals infected with Plasmodium falciparum die. Most of the deaths occur in children. In the last decade, the incidence of malaria has been increasing at an alarming rate in Africa representing over 90% of the reported cases in the world [37].
The study area was the whole village of Bancoumana located in the administrative circle of Kati (Mali, Western Africa). This village is located in the high Niger's valley, a Sudanese savannah area, about 60 km south-west from the capital city Bamako. The main activities are rice cultures and truck farming along the Niger river. This village is 2.5 km2 wide, with 8 000 inhabitants (MRTC census, 1998) and about 1 600 children under 9 years. The transmission of malaria is high during the rain season (usually from June to October, with temperatures varying between 25°C and 40°C). It decreases then, reaching a low level of transmission one or two months thereafter.
The project investigated at a village-level approach (using a 1–3 m resolution scale) the risk of malaria infection. The presence of P. falciparum, the main infectious agent of malaria in this area, in blood smears was investigated in 1 461 children living in 164 households during the dry season in March 2000. Among them, 474 children had a positive blood smear (32.44%, CI95% [30.09–34.89]). Localization was performed through GPS receivers. Thus, all children were geocoded at a point location (corresponding to their house). Geo-database and cartographic displays were provided with the ArcGIS 8.3 software (ESRI, Redlands, CA).
Human subjects' research conducted in these studies was approved by the Institutional Committee on Ethics of the Mali Faculty of Medicine and Pharmacy, University of Mali. To obtain informed consents a stepwise consent process was applied as described by Doumbo [38]. First, the community informed consent was obtained before the beginning of the study. Second, the informed consent of the parents or guardians of the children were orally obtained before each clinical or biological investigation.
Data analysis
The ODT-algorithm was implemented with the Matlab Software 7.0.1 (The Mathworks Inc. 2004). We applied the ODT procedure to the dataset using the GPS coordinates of each location as independent covariates and the parasitic positivity rate (rate of positive blood smears per houses) as dependant variable. Thus ODT provided an optimal partition of the geographical area, i.e. a spatial pattern of the disease risk. We chose to use two classical stopping rules [22]. First, the ODT stopped if a class was made up of less than 15 locations. Second, we prunned a node if, after partition, one of the two resulting classes was made up of less than 3 locations.
For inference we considered the constant risk hypothesis as a model of "no pattern". Under this null hypothesis each child is at the same disease risk within the observation period regardless of his location. Thus the classes issued from the ODT displayed similar disease risk. However in keeping with many spatial health applications [12], we can not rely on asymptotic arguments to derive theoretically the associated distributions under the null hypothesis. Monte Carlo (MC) simulations were flexible tools for such assessment.
Similarly to many statistical models, we used for inference the explained variability rate Rv, defined as the ratio of the interclass sum of squared errors (SCE) (outcome of the ODT model) and the total SCE. We considered a Monte Carlo inference conditional on the set of all locations and on the local number of subjects. The total number of cases varied from simulation to simulation with an expected value (the total number of cases on the observed dataset). In this way, the simulations assessed spatial variations in the local proportion of cases conditional on the set of all locations. Monte Carlo simulations reflected a constant risk hypothesis similarly to the Rushton and Lolonis [39] approach. We ran 999 simulations under the constant risk hypothesis i.e. homogeneous Poisson distribution. Under this null hypothesis we applied the ODT-algorithm for each of the random dataset and calculated the empirical distribution of Rv. Thus the MC inference provided p-values for testing whether or not the observed explained variability rate is a realization of the theoretical (simulated) distribution under the constant risk hypothesis. In other words, MC inference tested the ODT-model and provided the significance of the spatial pattern issued from the oblique decision tree.
We compared the ODT-model outputs with those of the scan statistic method. For the latter, we used the software program SaTScan™ [18] in order to test for the presence of spatial clusters of malaria infection and to estimate their locations. The identification of high risk clusters with the SaTScan™ was performed under the Poisson probability model assumption using a maximal cluster size of 50% of the total population. For statistical inference, 999 Monte Carlo replications were performed. The null hypothesis of no clustering was rejected when the simulated p-value was lower than or equal to 0.05.
During the data analysis we calculated all confidence intervals of rates according to the Wilson method [40].