Population and area
Incident cancer cases diagnosed between 1998 and 2007 were all extracted from the Cancer Registry of Isère (for more details on this registry, see Ref. [5]). The study considered only prostate, lung, colon-rectum, and bladder cancers in men. The confounding factor was the Townsend index of deprivation, an indicator of socioeconomic disadvantage. This index is a synthesis of the following variables: proportion of unemployed people in the workforce, proportion of households without car, proportion of households renting, and the proportion of overcrowded households. This index increases with the increase in the level of deprivation [14, 21]. The spatial unit was the Commune; i.e., the smallest administrative unit in France (mean number of inhabitants: 1700 over France, 2300 over Département Isère in 2006) [22].
Statistical analysis strategy
First, we searched for the presence of spatial heterogeneity using the Potthoff-Whittinghill method [23–25]. Briefly, if there is no clustering, the observed number of cases in a geographical area should follow a Poisson distribution (mean = variance = expected number of cases in the area). The Potthoff-Whittinghill test checks whether the ratio of the variance to the expected number of cases is >1, in which case the data are said over-dispersed relative to the Poisson distribution (See further details below).
The “naive” global spatial autocorrelation was checked using Moran’s I statistic and the “true” global spatial autocorrelation was confirmed using the Empirical Bayes Index. Another way to measure the spatial effect or the spatial correlation in each cancer type was to compare the goodness of fit between spatial and non-spatial regression models. For this, we used two non-spatial models, the Poisson regression and the constant risk model. Thereafter, we used spatial heterogeneity, autocorrelation, and Besag-York-Mollié (BYM) models in a Bayesian approach [26]. We also used SaTScan and SpODT to explore their abilities to detect spatial clusters in different scenarios (homogeneity, heterogeneity, spatial autocorrelation and the latter two) also detected by global detection methods. The Townsend socioeconomic index was then introduced into the modeling of all local detection methods in the univariate setting to assess its impact on the spatial distribution of each cancer type. Finally, we used a multivariate Bayesian modeling to assess the impact of underlying correlations between the four cancers on their incidences.
Statistical analysis
Poisson non-spatial regression model
Poisson regression model assumes that O
i
|θ
i
∼ Poisson(E
i
θ
i
); O
i
being the number of cases, θ
i
the relative risk, and E
i
the expected number of cases in a given area i. The SIR is the maximum likelihood estimator of θ
i
and is given by \( SI{R}_i=\raisebox{1ex}{${Y}_i$}\!\left/ \!\raisebox{-1ex}{${E}_i$}\right. \). The variance of this model, \( \raisebox{1ex}{$SI{R}_i$}\!\left/ \!\raisebox{-1ex}{${E}_i$}\right. \) is large when E
i
is small, which reflects an over-dispersion of cases. The Poisson model assumes also that \( \log \left({\theta}_i\right)= log\left({E}_i\right)+\beta {\mathcal{X}}_i;\;\mathcal{X} \) being a covariate and β its effect.
Incidence data regarding the four cancer types were first fitted with Poisson regression and using the additional information on the socioeconomic status.
Global detection method
The global spatial autocorrelation analysis was carried out by type of cancer using Moran’s I statistic and the Empirical Bayes Index (EBI), an adapted Moran’s I proposed to take into account a heterogeneity in population distribution.
\( EBI=\frac{N}{{\displaystyle \sum }{w}_{ij}}\frac{{\displaystyle \sum }{w}_{ij}{z}_i{z}_j}{{\displaystyle \sum }{\left({z}_i-\overline{z}\right)}^2} \) where N is the number of Communes, \( {z}_i=\frac{p_i-b}{\sqrt{v_i}} \). and \( {p}_i=\frac{O_i}{x_i} \), O
i
being the number of cases and x
i
the population at risk in Commune \( {C}_i,\kern0.5em {v}_i=a+\left(\frac{b}{x_i}\right),\;b=\frac{O}{x},\;a={s}^2-\frac{b}{\left(\frac{x}{N}\right)},\;{s}^2={\displaystyle \sum }{x}_i\frac{{\left({p}_i-b\right)}^2}{x} \).
Spatial heterogeneity was tested with Potthoff-Whittinghill method, the null hypothesis being that the numbe observed cases O
i
. in a Commune C
i
is Poisson distributed and the mean being the number of expected cases E
i
. The test statistic may be written as follows: \( PW={\displaystyle {\sum}_i^N\frac{O_i\left({O}_i-1\right)}{E_i}} \) which is asymptotically normally distributed, with mean \( \frac{O_{+}\left({O}_{+}-1\right)}{E_{+}} \) and variance \( 2\left(N-1\right)\frac{O_{+}\left({O}_{+}-1\right)}{{\left({E}_{+}\right)}^2} \) and where N is the number of Communes C
i
, O+ = ∑
Ni = 1
O
i
, and C+ = ∑
Ni = 1
C
i
.
Moran’s I and Potthoffhittinghill statistics were computed using Monte-Carlo simulations with 999 replications [14] under the assumption of multinomial distribution of cancer cases. The EBI value was tested using 999 random permutations. Considering the Commune as the spatial unit, three clustering methods were investigated and compared: SpODT, SaTScan, and HBSM.
SpODT method
This is a non-parametric regression model (similar to the Classification and Regression Tree (CART) algorithm of Breiman [12]) that allows local detection of clusters. However, whereas CART provides perpendicular area cuts, SpODT provides oblique area cuts which are more suitable for spatial epidemiology [12, 27]. The functional form of the SpODT model is z
i
= f(x
i
, y
i
) + ε
i
. In this formula, {x
i
, y
i
} correspond to the Commune centroids and ε
i
∈ ℝ represents the residuals. The functional form f(x
i
, y
i
) may be written:
$$ f\left({x}_i,{y}_i\right)={\displaystyle \sum_{j=1}^P}{\overline{z}}_j\ I\left\{{M}_i\left({x}_i,{y}_i\right)\in class\ j\right\} $$
where class j (j = 1, …, P) corresponds to one of the final P classes after splitting the area under study. For each point M
i
in the jth class, \( {\overline{z}}_j \) is the mean of the SIR values of all spatial units and \( {z}_i={\overline{z}}_j\pm {\varepsilon}_i \) is the predicted risk.
SpODT algorithm makes recursively oblique cuts of the study area according to the overall interclass variance until reaching a final number P of areas as per the stopping criteria of the algorithm. Once the classification is obtained, a test is performed using a Monte-Carlo approach to compare the distribution with the estimated R2 [27]. For application to our data, we conducted a sensitivity analysis by changing the values of the parameters that serve as stopping criteria for the SpODT algorithm without using additional information on the socioeconomic status. After the univariate analysis, a multivariate analysis was performed and the SIRs of the detected patterns were those with p-value ≤0.05. Afterwards, we conducted further analyses, first varying the graft level setting to gather the adjacent final classes according to their similarity in terms of risk level, then adding the Townsend index to the model.
SaTScan
This method derives from the Geographical Analysis Machine (GAM) [28]. It aims at grouping neighboring spatial units into potential clusters through circular or elliptical windowing that scans the study area [11]. The observed number of cases is compared with the expected number inside and outside each window by computing, for each scan, a likelihood ratio statistic. Assuming that the observed data follow a Poisson model, the likelihood ratio is:
$$ \frac{L(Z)}{L_0}=\frac{{\left(\frac{n_z}{u(z)}\right)}^{n_z}{\left(\frac{N-{n}_z}{N-u(z)}\right)}^{N-{n}_z}}{{\left(\frac{N}{u(A)}\right)}^N} $$
In this formula, nz and u(z) are the observed and expected numbers of cases in a circular frame Z, N and u(A) the observed and expected numbers of cases estimated under the null hypothesis of a homogeneous risk over the whole area, and nz>u(z) or equal to \( \raisebox{1ex}{$1$}\!\left/ \!\raisebox{-1ex}{${\mathrm{L}}_{0\ }\ $}\right. \). In the present study, we considered only circular windows because they detect smaller and more compact clusters than elliptical ones [29]. We conducted a sensitivity analysis with different maximum cluster sizes for the at-risk population; precisely, 1 to 50 % of the whole population. The analyses were performed with and without additional information on the socioeconomic status. The candidate clusters and their relative risks (RR) were those with a p-value ≤0.05 as obtained by Monte-Carlo simulation.
Univariate HBSM
This approach aims at modeling the spatial distribution and estimating the relative risk (θ
i
) at each spatial unit i. This approach allows taking into account spatial and non-spatial effects as structured information that can be graphically represented by four levels. The first level represents the local variability inside each spatial unit as provided by the observed data assumed having a Poisson distribution; this level gives the likelihood. The second level (or spatial process level) represents the variability between spatial units and depends on the presence of an autocorrelation \( \left(\mathcal{U}\right) \) or a spatial heterogeneity \( \left(\mathcal{V}\right) \). At this level, potential covariates \( \left(\mathcal{X}\right) \) are assumed to be normally distributed. The third level (or priors) represents the variability of the spatial process components and is approached with a particular distribution (Beta, Gamma, Dirichlet, or Wishart) [30]. Here, a gamma distribution was chosen for precision (inverse of the variance). The parameters (called hyperparameters) of this distribution are fixed a priori according to previous recommendations [31]; they represent the fourth level. These hyperparameters were estimated using MCMC sampling and approximation methods.
Three a priori assumptions on the spatial process level were tested: i) autocorrelation with conditionally auto-regressive model (CAR model), analytically written \( \log \left(\theta i\right) = \alpha +{\mathcal{U}}_i \); ii) heterogeneity with \( \log \left(\theta i\right) = \alpha +{\mathcal{V}}_i \); iii) both, with \( \log \left({\theta}_i\right) = \alpha +{\mathcal{U}}_i+{\mathcal{V}}_i \) according to the BYM model [26].
In the spatial process, the hypothesis of a homogeneous risk was tested with log(θ
i
) = α, which corresponds to the intercept [4, 5, 32]. Analyses were then performed using the additional information on the socioeconomic status \( \left({\mathcal{X}}_i\right) \) with each previous model. In the BYM, this additional variable was written: \( \log \left({\theta}_i\right)=\alpha +\beta *{\mathcal{X}}_i+{\mathcal{U}}_i+{\mathcal{V}}_i \).
To measure the influence of the socioeconomic status on the spatial distribution of cancer cases, we compared the models according to two criteria: i) the empirical variance of the autocorrelation components \( {\mathcal{S}}_{\mathcal{U}}^2 \); and, ii) heterogeneity components \( {\mathcal{S}}_{\mathcal{V}}^2 \) and their Deviance Information Criterion (DIC) [33]. A comparison was made between the DIC in each cancer type to identify the advantages of using a Bayesian approach. As in Colonna and Sauleau [4, 5, 32], we considered that a DIC difference of more than 10 points indicates the presence of an influence of the socioeconomic status on the differences in cancer incidence. The presence of either autocorrelation or heterogeneity was decided according to the model that had the highest empirical variance.
Multivariate HBSM
In the multivariate disease mapping, we can assume that, similarly to the univariate HBSM, the data are structured into spatial and non-spatial effects and can be graphically represented by the same four levels. The first level represents the local variability within each spatial unit as provided by the observed data assumed to have a Poisson distribution:
$$ \left.{O}_{ij}\right|{\theta}_{ij}\sim Poisson\left({E}_{ij}{\theta}_{ij}\right) $$
O
ij
being the number of cases, θ
ij
the relative risk, and E
ij
the expected number of cases, all in a given area i and for disease j.
The main difference with the univariate disease mapping concerns the second level that represents the variability between spatial units and assumes a plausible dependence between diseases [34]. The a priori assumption considers that \( \mathcal{U} \) or \( \mathcal{V} \) follow a multivariate normal distribution (MVN) [35], that can be identified by bb and analytically written: b ~ MVN(0, Σ
b
), where Σ
b
denotes the variance of parameter also known as the between-disease covariance matrix. To provide the log relative risks θ
ij
, Martinez-Beneito et al. [20] have recently proposed two M-based models that unified the multivariate disease mapping by considering Σ
b
= M⊺M. One M-based model considers M as a fixed effects; the other considers them as random effects of the correlations between the diseases. The model can be written: log(θ
ij
) = ΦM where Φ is the matrix that contains the distribution of the underlying spatial patterns. For disease j and area i, log(θ
j
) = Φ1M1j + … + Φ
i
M
ij
, where M
ij
is the entry in M. When Φ follows three independent proper CAR distributions with different parameters, the M-based model is equal to the MCAR model proposed by Jin et al. [36] and equal to that of Gelfand and Vounatsou [37] in the case of two diseases. For brevity, we have applied a BYM spatial structure on four underlying factors (four diseases). This multivariate BYM model [26] assumes the presence of eight underlying patterns, four of them with spatially heterogeneous distributions and four with CAR distributions.
Finally, we compared the DICs of the independent Bayesian modeling in each cancer type with those of the multivariate BYM models to check the usefulness of the multivariate approach.
Statistical software programs
Moran and Potthoff-Whittinghill statistics were computed using package DCluster of R 3.0.2. EBI was computed using package spdep of R [38]. The proximity-weighted matrix w
ij
= exp(−d
ij
), a negative exponential function for exponential decay, was used in all three methods [9, 39, 40]. For the SpODT, we used version 0.9 of SpODT package [12, 27]. For SaTScan, we used SaTScan version 9.3 [11]. For HBSM and Poisson regression, we used WinBUGS version 1.4.3 to estimate the Bayesian parameters. More specifically, we called WinBUGS from R using package R2WinBUGS 2.1-19 [41]. For each model, we ran three chains. A burn-in of 60 000 iterations was performed and the posterior distribution was obtained using a sample of 20 000 iterations. Convergence was monitored graphically using time series plots checked with Monte-Carlo standard error [4].