We simulated 432 scenarios (50 datasets for each scenario, 1,000 individuals for each dataset). These scenarios were analyzed first with a non-spatial Cox-PH model, and then with four models that accounted for spatial heterogeneity, including a Data-Generating model (DGM).
For each factor (age, sex, and treatment), model performance was assessed by quantifying parameter estimation bias, mean square error (MSE), confidence interval coverage rate (CR), and significance rate (SR).
For the main scenarios (32), 500 datasets of 1,000 individuals each were performed for validation (Additional file 1: Figure S1).
Simulation plan
Event time was simulated using a classic Cox-PH model that accounted for different risks factors (including environmental factors) according to a Weibull distribution. Censoring time was simulated using an exponential distribution.
More precisely, for a given vector of covariates X, the instantaneous risk function λi(t, X) for individual i at time t was defined as follows:
$$ {\lambda}_i\left(t,X\right)={\lambda}_i\left(t,{X}^{(i)}\right)+{\lambda}_i\left(t,{X}^{\left(-i\right)}\right) $$
where λi(t, X(i)) was the fixed effect dependent on individual i and λi(t, X(−i)) was the spatial effect dependent on the neighborhood of individual i.
We then simulated a randomized and controlled prevention trial of n = 1,000 individuals in a square area Ω of 400 km2. Before the treatment randomization was simulated, different risk situations (scenarios) were simulated according to location (spatial distribution of individuals), baseline risk, and heterogeneity of environmental risk (location and density of breeding sites) (see description below). In addition to the treatment effect, two non-spatial risk factors were simulated: one had a significant continuous effect (age factor), and the other had a null effect (sex factor).
The size of datasets and the number of simulations were selected with an accuracy of 0.01 and a variance of 0.09, following standard recommendations [24].
Location (spatial distribution of individuals)
To simulate the spatial distribution of individuals in the study zone Ω, we used an Inhomogeneous Point Process (IPP) in which population density depended on the location of individuals. We considered three clusters, L1, L2, and L3, where the population was at its densest, and we introduced a concentration parameter τ to control for population density in these clusters.
The location of these three clusters was randomized using a Homogeneous Point Process (HPP) so as to respect a minimum distance between them (see Additional file 2).
We studied the impact of population density in the three clusters by simulating three different density situations:
τ = 0.2 (low population density in the three clusters)
τ = 0.5 (average population density in the three clusters)
τ = 0.8 (high population density in the three clusters)
Baseline risk
Baseline risk λ0 was assumed to be constant for all individuals. It was simulated using a Weibull distribution with a shape parameter set to 1 to obtain a constant value over time. Three baseline risk levels corresponding to three observed epidemiological profiles were used for the scale parameter γ, taking as an example malaria prevalence in Mali [25]:
Low prevalence: γ = 6% (e.g., situation in Bamako),
Median prevalence: γ = 37% (e.g., situation in Segou),
High prevalence: γ = 60% (e.g., situation in Mopti).
Source of environmental risk (Breeding sites)
Spatial heterogeneity of environmental risk was simulated according to location of breeding site, variation in breeding site effect (defined as Relative Risk RRb), variation in breeding site density, and influence radius of breeding site considered as a constant.
An individual was considered exposed to a breeding site when the distance between this individual and the center of the site was less than the site’s influence radius.
The influence radius of a breeding site was interpreted as the average distance traveled by a mosquito to take a blood meal. The influence radius was set at r = 600 m [26], and was constant for all breeding sites.
Considering that breeding site density did not depend on location, we simulated the spatial distribution of breeding sites following a marked HPP where the mark was the breeding site’s influence radius r. Thus, an individual could be exposed to zero, one, or more breeding sites, with risk increasing accordingly.
Breeding site density was used to quantify the ratio of the environmental risk area (i.e., the area covered by the influence radius of at least one breeding site) to the total area of the study zone (400 km2).
This density was interpreted as the probability of being exposed to at least one breeding site (see Additional file 2). To simulate different scenarios, breeding site density was set at 0.25, 0.5, and 0.75.
Given that breeding site productivity and protection against mosquito bites can be highly variable, the breeding site effect (defined as Relative Risk RRb), was also simulated in four situations:
-
Very weak effect (RRb = 1.05),
-
Weak effect (RRb = 1.20),
-
Strong effect (RRb = 1.5),
-
Very strong effect (RRb = 3).
Treatment
Allocation to control and intervention groups was simulated according to a symmetric Bernoulli distribution \( \mathcal{B}\left(n,0.5\right) \) using individual randomization and without accounting for location or for other factors.
Subsequently, four levels of treatment effect (defined as Relative Risk RRt) were simulated:
-
Very weak effect (RRt = 0.95),
-
Weak effect (RRt = 0.8),
-
Strong effect (RRt = 0.6),
-
Very strong effect (RRt = 0.25).
Indeed, while the effectiveness of prevention trials (including malaria-prevention trials) is highly variable, in most cases it falls within the relative risk range of 0.25 to 0.95 [27,28,29,30,31,32].
Other risk factors
Although the aim of our study was to assess the impact of spatial heterogeneity of environmental risk, we included two additional factors in our analysis: one was related to the disease (age), and the other was not (sex). These two factors were independent of spatial location, and their effect was constant over time.
Thus, a binary variable identified as the sex factor was simulated using a symmetrical binomial distribution with a null effect (defined as Relative Risk RRs = 1).
$$ Sex\sim \mathcal{B}\ \left(n,p=0.5\ \right) $$
A continuous disease-related variable identified as the age factor was simulated following a piecewise uniform distribution, based on the population distribution by age group in Mali [33]. The protective effect of this variable (defined as Relative Risk RRa), which was taken from the literature [25], was fixed and constant over time (RRa = 0.84):
$$ \mathrm{Age}1\sim \mathcal{U}\ \left(47.3\%\mathrm{n},\left[0.25,14\right]\right) $$
$$ \mathrm{Age}2\sim \mathcal{U}\ \left(19.2\%\mathrm{n},\right]14,24\left]\right) $$
$$ \mathrm{Age}3\sim \mathcal{U}\ \left(26.8\%\mathrm{n},\Big]24,54\Big]\right) $$
$$ \mathrm{Age}4\sim \mathcal{U}\ \left(3.76\%\mathrm{n},\Big]54,64\Big]\right) $$
$$ \mathrm{Age}5\sim \mathcal{U}\ \left(2.94\%\mathrm{n},\right]64,75\left]\right) $$
Events and Censoring
Event time Tevent was simulated using a classic Cox-PH model that accounted for the different risk factors detailed above, namely baseline risk, age, sex, treatment, and environmental risk (breeding sites). The instantaneous risk function of individual i was calculated as follows:
λi(t| X) = λ0 exp (β1Agei + β2Sexi + β3Treatmenti + β4BSi) [equation 1]
λ0 was the baseline risk considered constant over time, BSi was the number of breeding sites to which individual i was exposed, and β1, β2, β3, β4 were the parameters associated with the covariates age, sex, treatment, and breeding site, respectively.
Censoring time Tcens was simulated following an exponential distribution. Thus, observation time T was defined as the minimum between censoring time and event time.
$$ T=\mathit{\min}\left({T}_{event},{T}_{cens}\right) $$
The combination of these different simulation parameters resulted in 432 scenarios, each containing 50 datasets (Figure 1). Datasets of 1,000 individuals each yielded a power greater than 85% when the treatment effect (RRt) was less than 0.8. For a treatment effect set at RRt = 0.95, power was around 65%.
An example of the simulated data structure is shown in Figure 2.
Model Evaluation and Comparison
For each scenario, we analyzed each dataset using five statistical models (see Additional file 1 for details):
a DGM, according to equation 1;
Model Evaluation and Comparisona classic non-spatial Cox-PH model, whose parameters were estimated with the maximum likelihood method, according to an equation similar to equation 1 but without taking into account the environmental risk factor [34, 35];
a GAM model, also with proportional hazards, which modeled the spatial effect with a bivariate spline function [36] (as detailed below);
two SPDE models (as detailed below), which modeled spatial heterogeneity using a Gaussian field with a Matèrn covariance function: one modeled survival time with a Weibull distribution (Cox-SPDE), and the other modeled the number of events with a Poisson distribution (P-SPDE).
Using a Bayesian approach, we estimated the SPDE models (Cox-SPDE and P-SPDE) with the INLA algorithm to optimize computation times [37, 38].
Generalized Additive Model (GAM)
The GAM model was originally designed to study non-linear links [39], usually with spline functions. In our particular case, a bivariate spline function of the geographic coordinates of individuals (latitude and longitude) was used to account for spatial heterogeneity of environmental risk [36, 40]:
$$ \lambda \left(t,X\right)={\lambda}_0(t)\exp \left({\beta}_1{X}_1+{\beta}_2{X}_2+\dots +{\beta}_n{X}_n+f\left( long, lat\right)\right) $$
for each point data, f was a bivariate spline function modeling the spatial effect, long was longitude, and lat was latitude.
SPDE model
The SPDE model was originally designed to model spatial structures using a Gaussian field [41]. This Bayesian model is generally written as follows:
$$ Y\mid \beta, X,Z\sim \mathbb{P}\left(Y|\mu, \phi \right) $$
$$ Z\sim GF\left(0,\varSigma \right) $$
ℙ is the distribution of Y (dependent variable), X is the vector of covariates (fixed effects), Z is a Gaussian field (GF) with a Matèrn covariance function Σ (random spatial effect), and μ = E(Y) and ϕ are the parameters of the distribution of Y, μ = E(Y) = h(βX + Z), where h is the canonical link function.
The Matèrn covariance function is used to control for spatial dependence and for the regularity of the Gaussian field (see Additional file 2). Other covariance functions, including exponential ones, are particular cases of the Matèrn covariance function [42]. In our particular case, we applied the SPDE model in two different ways to model outcome distribution: We applied a spatial Cox-PH model (Cox-SPDE) to model survival time, and a Poisson model (P-SPDE) to discretize survival time and to model the number of events within each time interval (Poisson distribution).
Cox-SPDE model
The Cox-SPDE model is the classic Cox-PH model to which a Gaussian field is added to model the spatial structure [38, 41]. In our particular case, the dependent variable (Y) was survival time:
$$ \lambda \left(t,X\right)={\lambda}_0(t)\exp \left({\beta}_1{X}_1+{\beta}_2{X}_2+\dots +{\beta}_n{X}_n+Z\right) $$
P-SPDE model
The P-SPDE model is a piecewise exponential model driven by a counting process [41, 43, 44]. Thus, for a number of events D (where δik is the indicator of the event in individual i = 1, 2, … , n in the time interval k = 1, 2, … , D, and where Tik is survival time in individual i in the kth interval), δ is assumed to follow a Poisson distribution of parameter π (average number of events per time interval).
In our particular case, the spatial structure was modeled with a Gaussian field, as earlier described. The dependent variable was the number of events per time interval instead of survival time.
$$ \delta \sim Poisson\ \left(\pi \right) $$
$$ \log \left(\pi \right)={\lambda}_0+{\beta}_1{X}_1+{\beta}_2{X}_2+\dots +{\beta}_n{X}_n+\log (T)+Z $$
Model performance measures
The parameter β represented the true effect of each covariate estimated by \( \hat{\beta} \), with a standard deviation \( \overline{Sd\left(\hat{\beta}\right)} \)estimated on K datasets for each scenario and with a 95% Confidence Interval CI(β).
The models were compared using four performance measures [24, 45]: estimation bias \( B\left(\hat{\beta}\right) \), mean square error (MSE), coverage rate (CR), and significance rate (SR), as defined in Additional file 2.
Application
The four models (Cox-PH, GAM, Cox-SPDE, P-SPDE) were applied, according to the above specifications, to data from a vaccine trial aimed at testing a malaria transmission blocking vaccine candidate: Pfs230 [46]. This randomized controlled trial was conducted in 2015-2016 in the town of Bancoumana, Mali (for more details, see [47]). In our study, the event of interest was a clinical malaria episode. The age, sex, and geographical coordinates (GPS) of participants were also collected. The preventive fraction of the vaccine was calculated based on an estimated prevalence of 77%.