Skip to main content

Geographically weighted accelerated failure time model for spatial survival data: application to ovarian cancer survival data in New Jersey

Abstract

Background

In large multiregional cohort studies, survival data is often collected at small geographical levels (such as counties) and aggregated at larger levels, leading to correlated patterns that are associated with location. Traditional studies typically analyze such data globally or locally by region, often neglecting the spatial information inherent in the data, which can introduce bias in effect estimates and potentially reduce statistical power.

Method

We propose a Geographically Weighted Accelerated Failure Time Model for spatial survival data to investigate spatial heterogeneity. We establish a weighting scheme and bandwidth selection based on quasi-likelihood information criteria. Theoretical properties of the proposed estimators are thoroughly examined. To demonstrate the efficacy of the model in various scenarios, we conduct a simulation study with different sample sizes and adherence to the proportional hazards assumption or not. Additionally, we apply the proposed method to analyze ovarian cancer survival data from the Surveillance, Epidemiology, and End Results cancer registry in the state of New Jersey.

Results

Our simulation results indicate that the proposed model exhibits superior performance in terms of four measurements compared to existing methods, including the geographically weighted Cox model, when the proportional hazards assumption is violated. Furthermore, in scenarios where the sample size per location is 20-25, the simulation data failed to fit the local model, while our proposed model still demonstrates satisfactory performance. In the empirical study, we identify clear spatial variations in the effects of all three covariates.

Conclusion

Our proposed model offers a novel approach to exploring spatial heterogeneity of survival data compared to global and local models, providing an alternative to geographically weighted Cox regression when the proportional hazards assumption is not met. It addresses the issue of certain counties' survival data being unable to fit the model due to limited samples, particularly in the context of rare diseases.

Peer Review reports

Background

In public health studies, such as survey studies, surveillance efforts, and longitudinal studies, survival data are often collected at small geographical levels (such as communities or counties) and aggregated at larger levels [1]. Taking the Surveillance, Epidemiology, and End Results (SEER) Program [2]. as an example, it encompasses about 20 states, each containing several counties. This spatially aggregated survival data inherently contains geographic information. Previous studies have demonstrated a close relationship between geographic information and mortality from advanced ovarian cancer, indicating that influencing factors are not constant across space [3]. Applying traditional ‘global’ models may result in misleading outcomes [4] Neglecting to account for spatial variation in the modeling framework could introduce biased effect estimates [5]. Establishing a “local model” in each specific location may be challenging due to geographically sparse data, such as rare diseases, where traditional methods struggle with limited sample sizes, leading to difficulties in fitting survival models in certain areas and reducing statistical power [6].

To address these challenges, a commonly used approach is geographically weighted regression (GWR), which allows coefficients to vary geographically and employs a distance weighting scheme to assign weights to each observation [7]. GWR is a statistical method that tackles heterogeneity in error variance caused by spatial correlation of error terms [8]. The main principle behind GWR is to leverage information from nearby observations, in line with the “first law of geography” positing that closer things are more strongly related than distant ones [9]. By applying the GWR method, regression parameters are estimated locally at each location, even with limited sample sizes.[10].

Combining the GWR approach with classical survival analysis can effectively model spatial survival data. For instance, Taufiq (2019) [11] employed a Bayesian approach to estimate the Cox survival model with GWR, allowing for the estimation of survival and hazard functions, as well as the determination of prior and posterior distributions. Similarly, Xue (2020) [12] introduced geographically weighted Cox regression for sparse spatial survival data, incorporating a stochastic neighborhood weighting scheme at the county level. While these models are built upon the Cox proportional hazards model, which is widely used for survival data analysis, they often assume the proportional hazards (PH) assumption, which may not hold in practice [13].

In survival analysis, the accelerated failure time (AFT) model is recognized as an alternative to Cox models when the PH assumption is not be statisfied [14]. However, existing literature on the application of the AFT model with spatially varying coefficients for geographic survival data is limited. To address this gap, we propose the geographically weighted AFT (GWAFT) regression to handle spatial survival data that do not adhere to the PH assumption. By estimating regression coefficients at each location and weighting observations based on their distance, the GWAFT model offers a solution for modeling sparse spatial survival data when the PH assumption is violated.

The contributions of this study are as follows:

  1. (1)

    The proposed GWAFT model resolves errors stemming from spatial heterogeneity, which can lead to biased effect estimates. It offers a novel approach to modeling geographic survival data without relying on the proportional hazards assumption.

  2. (2)

    The method effectively tackles the challenge of constructing models with limited local sample sizes, particularly in scenarios with geographically sparse data, such as rare diseases.

  3. (3)

    The methodology is applied to the survival analysis of ovarian cancer patients from the SEER cancer registry in New Jersey. The empirical study demonstrates the successful handling of geographical heterogeneity and sparse data issues, providing spatial parameter estimations.

Method

The geographically weighted AFT model

For \(i = \, 1, \, 2, \, \ldots ,n\), let \(\left( {Y_{i} , \, T_{i} , \, C_{i} , \, \delta_{i} , \, X_{i} , \, l_{i} } \right)\) denotes the i th records of right-censored survival data collected from different sites (locations), where \(T_{i}\) is log-transformed failure time (survival time), \(C_{i}\) denotes the censoring time under the same transformation.

Due to the right censoring, the observed survival time is denoted by \(Y_{i} = min\left( {T_{i} ,C_{i} } \right)\) and the censoring indicator \(\delta_{i} = \, I\left( {T_{i} \le \, C_{i} } \right)\), where \(I\left( \cdot \right)\) is an indicator function. \(X_{i} \in R_{p}\) is the corresponding vector of covariates, and \(l_{i} \in R^{2}\) is the spatial information of the corresponding location, i.e., \(l_{i} = \, \left( {latitude_{i} , \, longitude_{i} } \right)\).

The basic multivariate AFT model was presented in Eq. (1)

$${T}_{i}={X}_{i}\beta +{\varepsilon }_{i}$$
(1)

where β is a p × 1 vector of regression coefficients and \(\varepsilon_{{_{i} }}\) is a random error vector with an unspecified multivariate distribution.

The generalized least-squares estimator can be expressed as [15]

$$\sum\limits_{i = 1}^{n} {(X_{i} - \overline{X})^{T} } \Omega_{i}^{ - 1} (\alpha )(\hat{Y}_{i} (b) - X_{i} \beta ) = 0$$
(2)

where \(\Omega_{i}^{ - 1} (\alpha )\) is a K × K working covariance matrix with an additional correlation parameter α, b is an initial estimator of β, and \(\overline{X} = \sum\limits_{i = 1}^{n} {X_{i} } /n\). Due to right censoring, Ti can be replaced by the least-squares normal equations with its conditional expectation: \(\hat{Y}_{i} (b) = E_{\beta } (T_{i} |Y_{i} ,\delta_{i} ,X_{i} )\).

The estimator can be solved by Eq. (3)

$$L_{multi} (b) = \left\{ {\sum\limits_{i = 1}^{n} {(X_{i} - \overline{X})^{T} } \Omega_{i}^{ - 1} (\alpha )(X_{i} - \overline{X})} \right\}^{ - 1} \times \left\{ {\sum\limits_{i = 1}^{n} {(X_{i} - \overline{X})^{T} } \Omega_{i}^{ - 1} (\alpha )(\hat{Y}_{i} (b) - \overline{Y}_{i} (b))} \right\}$$
(3)

where \(\overline{Y} (b) = \sum\limits_{i = 1}^{n} {\hat{Y}_{i} } (b)/n\).

The approaches mentioned earlier assign equal weights to all observations. However, in the GWAFT model proposed in this study, the main concept is to assign different weights to each observation based on their distance from the center area. This allows us to incorporate the information from surrounding observations. Therefore, within the framework presented above, we incorporate the weighting factor \(w_{i} (l)\) introduced in section of weighting function to incorporate geographic information into the model presented in Eq. (4). The estimation of the point estimator is then performed.

$$\sum_{i=1}^{n}{w}_{i}\left(l\right){\left({X}_{i}-\overline{X }\right)}^{T}{\Omega }_{i}^{-1}\left(\alpha \right)\left({\widehat{Y}}_{i}\left(b\left(l\right)\right)-{X}_{i}\beta \right)=0$$
(4)

where \(b(l)\) means the coefficients are varying with location l. It can be solved by Eq. (5)

$$L_{multi} (b) = \left\{ {\sum\limits_{i = 1}^{n} {w_{i} (l)(X_{i} - \overline{X})^{T} } \Omega_{i}^{ - 1} (\alpha )(X_{i} - \overline{X})} \right\}^{ - 1} \times \left\{ {\sum\limits_{i = 1}^{n} {w_{i} (l)(X_{i} - \overline{X})^{T} } \Omega_{i}^{ - 1} (\alpha )(\hat{Y}_{i} (b) - \overline{Y}_{i} (b))} \right\}$$
(5)

Weighting function

In GWAFT regression, an initial issue that needs to be addressed is the determination of observation weights. According to the “First Law of Geography”, the general idea is that for the sample to be estimated, the closer other samples is to it, the greater the weight is assigned, otherwise the smaller the weight is assigned. As for the distance between observations, in health-related studies, the data is often collected from limited areal units (e.g., counties or cities) with defined boundaries. Therefore, we just need to calculate the distance between each areal center, then use certain kernel functions to assign weights for each observation.

Therefore, the specific assignment of weights is as follows, suppose that the observation i collect at K and j collect at P, there is a circle with radius r centered on K, and we assume that the observations included in the circle are close to K, while those are not far from K. Thus, a traditional weight function is shown as (6).

If the observation i being the sample to be estimate, the weight \(w_{ij}\) given to observation j is

$${w}_{i}\left(j\right)=\left\{\begin{array}{l}\begin{array}{cc}1& \text{if} \;0<{d}_{ij}={d}_{KP}\le {d}_{s}\end{array}\\ \begin{array}{cc}0& \text{otherwise}\end{array}\end{array}\right.,$$
(6)

where \(d_{ij}\) is a certain measure of distance between observation i and j, now is the distance between center K and P. It can be many kinds of distance, such as Graph distance [16], Great circle distance [17] and Euclidean distance [18], etc.

According to Eq. (6), observations with distance from K below r will be included in the model, while observations whose distance to K just exceeds r will be excluded. Such a definition is theoretically sound, however, in practical data analysis, it would be unnatural that the spatial association between observations ends so abruptly.

To solve this issue, a Gaussian distance-decay-based weighting could be introduced into our approach by Eq. (7),

$${w}_{ij}=\text{exp}\left(-{d}_{ij}^{2}/{h}^{2}\right),$$
(7)

This function can be referred to as kernel functions, which can be used to decide the weights of point-reference data where locations vary continuously over a spatial domain. The parameter h is the bandwidth parameter selected by the user and provides some weight controls of the geographical data. Therefore, we use the Eq. (8) weighting function to avoid the situation where the spatial association ends so abruptly.

$${w}_{i}\left(j\right)=\left\{\begin{array}{ll}1& \text{if}\; 0<{d}_{ij}={d}_{KP}\le {d}_{s}\\ \text{exp}\left(-{d}_{ij}^{2}/{h}^{2}\right)& \text{if}\; 0<{d}_{ij}={d}_{KP}\end{array}\right.,$$
(8)

The choice of distance

Deciding on a threshold \(d_{s}\) in (6) for the great circle distance or Euclidean distance, involves determining how close is “close enough” to be considered the same, which is usually a subjective matter. In practice, it will cost lots of calculation power to find the best hyperparameter \(d_{s}\). A more robust and natural distance function, as well as an associated rule, is desired [12].

Compared to the above distances, graph distance provides a new way to solve this issue, without \(d_{s}\) to be determined. Following Bhattacharyya and Bickel (2014) [16], suppose that we have a random graph \(G_{n}\) as the data collected. Let \(V\left( {G_{n} } \right) = \left\{ {v_{1} , \ldots ,v_{n} } \right\}\) denotes the vertices of \(G_{n}\) and \(E\left( {G_{n} } \right) = \left\{ {e_{1} , \ldots ,e_{m} } \right\}\) denotes the edges of \(G_{n}\). Each vertice denotes an area center \(v_{{}}\). The graph distance between each center can be defined as follows:

$${d}_{{v}_{i}{v}_{j}}=\left\{\begin{array}{cc}\left|V\left(e\right)\right|& \text{if} \;e \;\text{is the shortest path connecting} {\;v}_{1}\;\text{and}{\;v}_{j}\\ \infty & {v}_{1}\;\text{and}\; {\;v}_{j} \;\text{are not connected}\end{array},\right.$$
(9)

where \(|V(e)|\) represents the cardinality of edges in e.

According (8) and (9), the weight of each observation is defined as follows

$${w}_{i}\left(j\right)=\left\{\begin{array}{cc}1& \text{if }\;{d}_{{v}_{i}{v}_{j}}\le 1\\ \text{exp}\left(-{d}_{{v}_{i}{v}_{j}}^{2}/{h}^{2}\right)& \text{if }\;1<{d}_{{v}_{i}{v}_{j}}\end{array}\right.,$$
(10)

where \(d_{{v_{i} v_{j} }}\) is the graph distance between center vi and vj. This weighting scheme makes all observations in same area get equal weights and does not require precise latitude and longitude information. Alternatively, it just needs a topology graph, which is more convenient to apply to real study.

The choice of bandwidth h

For the choice of bandwidth h, Brunsdon et al. (1996) [19] proposed that the bandwidth is selected by minimizing the cross-validated out-of-sample sum of squared errors. While in our method, it may not be suitable because as for survival data, there is no response to the hazard. Thus, we choose to use information criterion to choose the bandwidth.

The AIC and the BIC are not suitable for the bandwidth selection in our study, the minimum bandwidth would always be preferred. This is because the component of AIC and BIC accounting for model complexity remains the same across all models being compared. Therefore, we turned to minimize the modified quasi-likelihood information criterion (QIC) [20] to select the optimal bandwidth. The QIC is a modification to AIC, where the likelihood is replaced by the quasi-likelihood and a proper adjustment is made for the penalty term. In this study, we define the QIC as follows:

Assuming the data is aggregated from G unique locations, respectively \(k_{1}^{*} ,k_{2}^{*} ,...,k_{G}^{*}\), and the estimated regression coefficients \(\hat{\beta }(k_{1}^{*} ),\hat{\beta }(k_{2}^{*} ),...,\hat{\beta }(k_{G}^{*} )\), when the bandwidth equals h, the QIC can be calculated as:

$$QIC=-2\sum_{g=1}^{G}Q\left(\widehat{\beta }\left({k}_{g}^{*}\right);I, {\mathcal{D}}_{{s}_{g}^{*}}\right)+2\sum_{g=1}^{G}\widehat{\Omega }\left(\widehat{\beta }\left(\left({k}_{g}^{*}\right)\right)\widehat{V}\left(\widehat{\beta }\left({k}_{g}^{*}\right)\right)\right),$$
(11)

where \(\hat{\Omega }_{I} (\hat{\beta }(k_{g}^{*} )) = - \partial^{2} Q((\hat{\beta }(k_{g}^{*} ));I,{\mathcal{D}})/\left. {\partial \beta \partial \beta^{\prime } } \right|_{{\beta = \hat{\beta }(k_{g}^{*} )}}\),\({\mathcal{D}}_{{s_{g}^{*} }}\) donates the observations data at \(s_{g}^{*}\), \(\hat{V}\) is a working covariance matrix.

The main Algorithm 1 of the GWAFT regression proposed in the current study is organized as follows:

figure a

Algorithm 1. Geographically weighted AFT regression

Simulation study

In this section, we generated simulated survival data similar to the surveillance data from the SEER program. We aimed to study the performances of the proposed method using this data. To ensure the representativeness of our results, we randomly generated five counties along with their graph distance matrix according to SEER.

We conducted five simulations, each with four covariates considered. Two of these covariates were found to be significant, while the other two were non-significant, and the censoring rates are 0.4. The key differences between the four simulations are summarized in Table 1.

Table 1 The key differences between the four simulations

In order to ensure the representativeness of our simulations, Simulation 1 and 2 were conducted to study the performances of the proposed method in situations with a large sample size and satisfaction of the Proportional Hazard assumption (PH assumption), respectively. The On the other hand, Simulation 3 and 4 were conducted in situations with sparse sample sizes. Each simulation introduced additional structures to the data. To more closely mimic real-world conditions, we randomly select the sample size within a specified range rather than using a fixed sample size in each simulation replication. To validate the applicability of our approach to extremely imbalanced datasets, we conducted Simulation 5.

Simulation design

Simulation 1 with large sample size and PH assumption not be satisfied

In this simulation, we generated a dataset which does not satisfy the PH hypothesis to study the performance of the proposed method. So the survival time follows a log-normal distribution LN(μ,σ), and the cumulative risk function expressed by

$${H}_{0}\left(t\right)=-\text{log}\left[1-\Phi \left(\frac{\text{log}\;t-\mu }{\sigma }\right)\right],$$
(12)

where \(\Phi ( \cdot )\) is the distribution function of the central and reduced normal distribution. Survival times can therefore be simulated from:

$$T=\frac1{\text{exp}\left(X_1\beta\right)}\text{exp}\left(\sigma\varnothing^{-1}\left(U\right)+\mu\right),$$
(13)

where U~ Uniform (0,1).

Censoring times were generated independently from Exp(0.3). Then, the sample size of each location was set between 80 to 85 randomly selected from a uniform distribution, which was relatively adequate.

We totally took four covariates into account: x1~Bernoulli(0.3), x2~N(0,1), x3~Bernoulli(0.7), x4~N(0,1). The vector of coefficients (β1, β2, β3, β4) was set to (0.4, 0, -0.6, 0), which means that β1 and β3 were significant. For each county (location) m, the spatially varying coefficients were generated from (0.4, 0, -0.6, 0) + 0.1* (the graph distance between county m and county l- mean of distance between all other counties and county l), with the county l being the center of the five counties.

Our weighting function was (10). In order to avoid the performance of the proposed model being fairly close to a global (unweighted) model, the weight of other locations was 0.8 at maximum refer to Xue [12]. Meanwhile, the largest graph distance was 9, so the maximum bandwidth was 20. Therefore, the grid of bandwidths was set to h {1, 2, …, 20}.

For each location, we fitted the GWAFT regression, Geographically Weighted Cox (GWCox) regression, local AFT regression, and local Cox regression. Besides, we also fitted the global (unweighted) AFT regression and the global (unweighted) Cox regression in the whole datasets. The simulation process described above was repeated 1000 times. The parameter estimates were evaluated using the following four measurements:

$$\text{mean absolute bias }\left(\text{MAB}\right)=\frac{1}{5}\sum_{a=1}^{5}\frac{1}{1000}\sum_{c=1}^{1000}\left|{\widehat{\beta }}_{a,b,c}-{\beta }_{a,c}\right|,$$
(14)
$$\text{mean standard deviation }\left(\text{MSD}\right)=\frac{1}{5}\sum_{a=1}^{5}\sqrt{\frac{1}{999}} \sum_{c=1}^{1000}{\left({\widehat{\beta }}_{a,b,c}-{\widehat{\beta }}_{a,c}\right)}^{2},$$
(15)
$$\text{mean of mean squared error }\left(\text{MMSE}\right)=\frac{1}{5}\sum_{a=1}^{5}\frac{1}{1000}\sum_{c=1}^{1000}{\left({\widehat{\beta }}_{a,b,c}-{\widehat{\beta }}_{a,c}\right)}^{2},$$
(16)
$$\text{mean coverage probability }\left(\text{MCP}\right)=\frac{1}{5}\sum_{a=1}^{5}\frac{1}{1000}\sum_{c=1}^{1000}1\left(\left|{\widehat{\beta }}_{a,b,c}-{\widehat{\beta }}_{a,c}\right|1.96*SE\left({\widehat{\beta }}_{a,b,c}\right)\right),$$
(17)

where \(\hat{\beta }_{a,b,c}\) is the estimate for the b th coefficient of county a in the c th replicate, \(\overline{\hat{\beta }}_{a,c}\) is the average of \(\hat{\beta }_{a,b,c}\) over the 1000 replicates, \(\beta_{a,c}\) is the true underlying parameter. and 1(·) is the indicator function.

As for the insignificant variables β2, β4, we use false positive rate (FPR) to measure the performance of methods [21]. It can be defined as the counts of being estimated significantly mistakenly while non-significant in 1000 simulation replications divided by 1000. Lower FPR values are preferable.

Simulation 2 with large sample size and with PH assumption

In this simulation, survival times were generated from a Cox model with a baseline hazard function \(\lambda_{0} (t) = 0.03\). Censoring times were generated independently using \(Uniform(0,20)\). The coefficient settings, sample size, and other parameters were kept the same as in Simulation 1. The primary objective of this section is to compare the performance of the models with and without the proportional hazards (PH) assumption in the large sample size situation

Simulation 3 with sparse sample size and without PH assumption

As observed in the aforementioned simulation, when each location has a sufficiently large sample size, all models can provide estimations, regardless of their efficiency or accuracy. However, in practical scenarios, the sample size is often limited, particularly for rare diseases. Therefore, the main objective of this section is to investigate the performance of the models in situations with sparse survival data. Consequently, the sample size for each location was randomly selected from a range of 20 to 25. The remaining settings remained the same as in Simulation 1.

Simulation 4 with sparse sample size and with PH assumption

In this section, the data generation process was similar to Simulation 2, and the sample size for each location was randomly selected from a range of 20 to 25. The main purpose of this section is to compare the model performances with and without the proportional hazards (PH) assumption in situations with sparse survival data.

Simulation 5 with extreme sample size and without PH assumption

In this section, we generated survival data with extreme sample size, being 20, 200, 300, 400, 500 in each location. This simulation was specifically designed to study the applicability of our approach in situations where there is a significant disparity in the distribution of data across different locations. By subjecting our methodology to such extreme imbalances, we aimed to assess its robustness and ability to provide accurate and reliable results in challenging scenarios without the proportional hazards (PH) assumption. Other settings were same as Simulation1 and 3.

Simulation result

Results in Simulation 1

The four measurements of the significant coefficients β1 and β3 varying with bandwidth are depicted in Fig. 1. As illustrated in Fig. 2, in this scenario, the MAB, MSD, and MMSE consistently remain at low levels. The GWAFT regression model consistently outperforms the GWCox regression model across all bandwidths. Furthermore, in terms of the FPR for the non-significant coefficients β2 and β4 the GWAFT model also exhibits superior performance across the most of bandwidths.

Fig. 1
figure 1

The QIC value of GWAFT varying with bandwidth in Simulation 1

Fig. 2
figure 2

Performance measures for the GWAFT model and GWCox model in Simulation 1; (a) MAB of the estimator of β1 along with bandwidth (b) MAB of the estimator of β3 along with bandwidth, (c) MSD of the estimator of β1 along with bandwidth, (d) MSD of the estimator of β3 along with bandwidth, (e) MMSE of the estimator of β1 along with bandwidth, (f) MMSE of the estimator of β3 along with bandwidth

When the bandwidth equals 8, the QIC value reaches its lowest point at 329.48. The corresponding results of all models are presented in Table 2. We can find that all AFT family models consistently exhibit significantly smaller values for MAB, MSD, and MMSE of β1 and β3 compared to the Cox family models, even differing by orders of magnitude. Among all models, the GWAFT regression and global AFT regression demonstrate the best performance, with the former outperforming the latter in terms of MSD and MMSE by more than half. The MCP of GWAFT are around 0.95.

Table 2 Performance of global AFT, global Cox, best selected GWAFT, best selected GWCox, local AFT, local Cox in Simulation 1 (without PH assumption)

Regarding the FPR, the GWAFT model consistently outperforms all other models, with values of 0.0152 and 0.0146, respectively. The GWAFT model and the Global model all perform better than corresponding Cox model, while the global AFT model and global Cox model performances are relative similar with each other as indicated in Table 3.

Table 3 The FPR of global AFT, global Cox, best selected GWAFT, best selected GWCox, local AFT, local Cox in Simulation 1 (without PH assumption)

Above all, as for the comparison between GWAFT and GWCox, when the PH assumption is not satisfied, we can find that the MAB, MSD, MMSE, MCP, and FPR of GWAFT are all outperform than GWCox, as shown in Tables 2 and 3.

Result in Simulation 2

As depicted in Fig. 3, we observe contrasting results compared to Simulation 1. The GWCox regression model outperforms in terms of MAB and MMSE, while slightly lagging behind in MSD. All subfigures demonstrate a pattern where model performances vary as the bandwidth increases, eventually stabilizing at a certain value.

Fig. 3
figure 3

Performance measures for the GWAFT model and GWCox model in Simulation 2; (a) MAB of the estimator of β1 along with bandwidth (b) MAB of the estimator of β3 along with bandwidth, (c) MSD of the estimator of β1 along with bandwidth, (d) MSD of the estimator of β3 along with bandwidth, (e) MMSE of the estimator of β1 along with bandwidth, (f) MMSE of the estimator of β3 along with bandwidth

When the bandwidth is set to 19, the lowest QIC value of 573.04 is attained. Likewise, Table 4 illustrates the four measurements for the six models. It is evident that the Cox family models consistently outperform their corresponding AFT models, except for the MSD of GWCox and GWAFT, which are approximately 0.14 versus 0.12. However, in Table 5, we observe that the FPR of the GWCox model, the Local Cox model, and the Local AFT model is relatively higher compared to others, already nearing 0.5.

Table 4 Performance of global AFT, global Cox, best selected GWAFT, best selected GWCox, local AFT, local Cox in Simulation 2 (with PH assumption)
Table 5 The FPR of global AFT, global Cox, best selected GWAFT, best selected GWCox, local AFT, local Cox in Simulation 2 (with PH assumption)

As for the comparison between GWAFT and GWCox, when the PH assumption holds, we can find that all measurements of GWCox are all outperform than GWAFT except MSD and FPR, as shown in Tables 4 and 5.

Result in Simulation 3

Regrettably, in situations with small sample sizes, the local model becomes ineffective and not even converge in some simulations. Consequently, we report the performances of GWAFT regression, GWCox regression, global AFT regression, and global Cox regression. Similarly, as observed in the above simulations, the four measurements of GWAFT regression and GWCox regression, namely MAB, MSD, and MMSE, exhibit variations across different bandwidths, as depicted in Fig. 4. It is evident that the performance of the GWAFT model is significantly superior to that of the GWCox regression model across all bandwidths.

Fig. 4
figure 4

Performance measures for the GWAFT model and GWCox model in Simulation 3; (a) MAB of the estimator of β1 along with bandwidth (b) MAB of the estimator of β3 along with bandwidth, (c) MSD of the estimator of β1 along with bandwidth, (d) MSD of the estimator of β3 along with bandwidth, (e) MMSE of the estimator of β1 along with bandwidth, (f) MMSE of the estimator of β3 along with bandwidth

When the bandwidth is set to 9, the QIC reaches its lowest value at 99.77. As shown in Table 6, it shows a comparison of the four models in terms of MAB, MSD, and MMSE. The GWAFT model consistently outperforms the other models. Moreover, similar to Simulation 1, the AFT family models demonstrate better performance compared to the Cox family models. Additionally, the MAB and MSD values of the Global AFT model are approximately twice those of the GWAFT model, while the MMSE value differs by an order of magnitude. Furthermore, as shown in Table 7, the GWAFT model exhibits a slightly superior robustness in terms of FPR, with respective values of 0.0574 and 0.0546. The MCP of GWAFT are more than 0.95.

Table 6 Performance of global AFT, global Cox, best selected GWAFT, best selected GWCox in Simulation 3 (without PH assumption)
Table 7 The FPR of global AFT, global Cox, best selected GWAFT, best selected GWCox in Simulation 3 (without PH assumption)

Above all, as for the comparison between GWAFT and GWCox, when the PH assumption is not satisfied, we can find that the MAB, MSD, MMSE, MCP, and FPR of GWAFT are all outperform than GWCox, as shown in Tables 6 and 7.

Result in Simulation 4

As seen from Fig. 5, we can find that the MAB and MMSE values of the GWCox model are consistently smaller than those of the GWAFT model across all bandwidths. However, the MSD values appear similar between the two models. Table 8 reveals that the measurements of the Cox family models are significantly smaller than their corresponding AFT family models, except for the MSD, which aligns with the findings in Simulation 2. Furthermore, in terms of β2 and β4, the GWCox model performs relatively worse compared to the other three models.

Fig. 5
figure 5

Performance measures for the GWAFT model and GWCox model in Simulation 4; (a) MAB of the estimator of β1 along with bandwidth (b) MAB of the estimator of β3 along with bandwidth, (c) MSD of the estimator of β1 along with bandwidth, (d) MSD of the estimator of β3 along with bandwidth, (e) MMSE of the estimator of β1 along with bandwidth, (f) MMSE of the estimator of β3 along with bandwidth

Table 8 Performance of global AFT, global Cox, best selected GWAFT, best selected GWCox in Simulation 4 (with PH assumption)

Furthermore, when the bandwidth is set to 19, the GWCox model achieves the lowest QIC value of 159.72. Regarding the significant coefficients β1 and β3, the GWCox model outperforms all other models. However, for β2 and β4, the GWCox model demonstrates relatively worse performance compared to the other three models. Detailed results can be found in Tables 8 and 9.

Table 9 The FPR of global AFT, global Cox, best selected GWAFT, best selected GWCox in Simulation 4 (with PH assumption)

As for the comparison between GWAFT and GWCox, when the PH assumption holds, we can find that all measurements of GWCox are all outperform than GWAFT except MSD and FPR, as shown in Tables 8 and 9.

Result in Simulation 5

In Simulation 5, when dealing with extreme imbalanced sample distributions, the local models struggle to provide stable solutions in locations with small sample sizes. As depicted in Fig. 6, the GWAFT model consistently outperforms the GWCox regression model across all bandwidths.

Fig. 6
figure 6

Performance measures for the GWAFT model and GWCox model in Simulation 5; (a) MAB of the estimator of β1 along with bandwidth (b) MAB of the estimator of β3 along with bandwidth, (c) MSD of the estimator of β1 along with bandwidth, (d) MSD of the estimator of β3 along with bandwidth, (e) MMSE of the estimator of β1 along with bandwidth, (f) MMSE of the estimator of β3 along with bandwidth

Notably, when the bandwidth is set to 10, the GWAFT model demonstrates the lowest QIC. Table 10 presents a comparison of the four models (MAB, MSD, MMSE), indicating that the GWAFT model exhibits the best performance. Additionally, Table 11 shows that the GWAFT model boasts robustness in terms of FPR, with values of 0.0276 and 0.0286, respectively, surpassing all other models. It is worth mentioning that the performance of GWAFT in Simulation 5 falls slightly short of that in Simulation 1, by nearly an order of magnitude. The MCP of GWAFT are around 0.95.

Table 10 Performance of global AFT, global Cox, best selected GWAFT, best selected GWCox in Simulation 5 (without PH assumption)
Table 11 The FPR of global AFT, global Cox, best selected GWAFT, best selected GWCox in Simulation 5 (without PH assumption)

Above all, as for the comparison between GWAFT and GWCox, when the PH assumption is not satisfied, we can find that the MAB, MSD, MMSE, MCP, and FPR of GWAFT are all outperform than GWCox, as shown in Tables 10 and 11.

Empirical study

True to its original mandate, the mission of the Surveillance, Epidemiology, and End Results (SEER) program is to provide comprehensive information on cancer statistics in order to assist in reducing the cancer burden. Currently, SEER collects and publishes data on cancer incidence, prevalence, and survival from population-based cancer registries that cover approximately 48% of the United States population. In this section, we utilized the ovarian cancer surveillance data from the SEER program, focusing on New Jersey, USA, in order to demonstrate the applicability of our proposed methods. The analysis encompassed a total of 21 counties, and exclude observations that have unknown ending statuses or unknown survival times, resulting in 1439 complete patients diagnosed at 2005. Because of the Hurricane Katrina, the number of patients diagnosed between the second half of 2005 is relatively limited [22]. The primary objective of the empirical study is to utilize the GWAFT model to explore spatial heterogeneity within sparse survival data across multiple regions, and to subsequently provide coefficient estimation for each county.

We selected race (Black, White, Other), marital status (Married/Not Married), and surgery (Received/Not Received) as covariates, based on their identification as relevant factors affecting the survival of ovarian cancer patients in previously published studies [23,24,25]. However, these three covariates do not satisfy the PH assumption, whose p-values of the PH assumption are 0.030 for race, 0.015 for marital status, and 0.001 for surgery. As for the geographical heterogeneity, the Moran’s I values of the three covariates were -0.176, -0.217, and -0.200, respectively, and all three z tests showed significance. In this case, these survival data not only contained geographical heterogeneity but did not satisfy the PH assumption. Together with the covariates, survival times, final statuses, and county locations of these observations are also included. Given the presence of both geographical heterogeneity and violation of the PH assumption in the survival data, we employed our proposed method to solve the problems cause by geographically heterogeneity and violation of PH assumption. Specifically, we applied our proposed GWAFT regression to estimate parameters for each of the 21 counties. The distance matrix of the 21 counties of New Jersey in our study is provided in Fig. 7 for illustration.

Fig. 7
figure 7

Visualization of graph distances for 21 counties in New Jersey. Darker colors indicate greater graph distances

According to the algorithm of GWAFT, after sorting data and getting the graph distance, the following task is to select the best bandwidth which makes the QIC value the lowest. The maximum graph distance observed in our study was 8, with a corresponding maximum bandwidth of 18. According to the QIC values varying with the bandwidth as shown in Fig. 8, we found that the optimal bandwidth for our analysis is 2.

Fig. 8
figure 8

The QIC value of GWAFT varying with bandwidth in the empirical study

As shown in Fig. 9, the parameter estimates for Race are consistently negative across all counties, which suggests that there are racial disparities in the outcome of ovarian cancer in New Jersey. The white women tend to have longer survival times than the black women. Furthermore, married patients tend to have longer survival times compared to others in most counties. The parameter estimates for Surgery are positive in all counties, suggesting that patients who undergo surgery are less likely to experience an event than those who do not. On the whole, the pattern of the three covariates is consistent with previous studies [23,24,25]. The difference is that there is clear spatial variation in the effects of all three covariates. By leveraging the proposed GWAFT model, we can discern pronounced geographical variations in the distribution of three covariates across various counties. Notably, the impact of the surgery covariate shows substantial geographic variation across different counties—a significant finding that cannot be obtained using traditional global methods. The observed disparities in the Surgery covariate could be attributed to differing socio-economic profiles and varying levels of access to healthcare services across regions [26, 27].

Fig. 9
figure 9

Parameter estimates obtained from the geographically AFT model for New Jersey counties using h = 2

Discussion

In this study, we introduce a GWAFT model to analyze spatial survival data with varying coefficients at the local or subregional level. The key concept of this model is to effectively utilize surrounding information to estimate coefficients at each location.

Firstly, our proposed model addresses the issue of errors introduced by spatial heterogeneity, which can lead to biased effect estimates. Through Simulations 1, 3, and 5, our model demonstrates superior performance compared to other models, including global and local models. Particularly, the results of Simulations 1, 3, and 5 show that the GWAFT model outperforms the GWCox model, providing an alternative approach for modeling geographical survival data that does not rely on the proportional hazards assumption. This underscores the importance of evaluating the proportional hazards assumption, as its violation can cast doubt on the validity of Cox model results. Neglecting this assumption may lead to erroneous scientific conclusions. Differently with Bayesian framework accelerated failure time models proposed by Hu et al. (2021) [28], we conducted modeling work from the perspective of the frequentist school to overcome some potential drawbacks of Bayesian models.

Secondly, our method presents an effective strategy for establishing a survival model with geographically sparse data. Adequate sample sizes are crucial in survival data analysis using Cox regression or AFT regression to ensure robust parameter estimation and sufficient test power [29]. In scenarios with multiple regions where some areas have sparse samples or overall sample sizes are small, such as in rare disease cases, the GWAFT model can leverage surrounding observations to enhance statistical power. As evidenced in Simulations 3 and 5, even when local models falter, the GWAFT model can offer efficient estimation with sparse survival data. While there is a slight performance decrease in Simulation 1 compared to the GWAFT model, it remains the top-performing model among all models.

The above results indicate that our proposed method is effective even when simulated data does not meet the proportional hazards assumptions, regardless of local data adequacy. While global models are beneficial, they may not always detect non-stationarity. Unlike Xue [12], we incorporated both significant and non-significant coefficients in our simulations to evaluate the FPR of our method.

Thirdly, in Simulations 2 and 4, the GWCox model outperforms other models in most comparisons, but some results, such as the FPR in these simulations, are unsatisfactory. This suggests that GWAFT models may not always be suitable for every study, and when survival data satisfy proportional hazards assumptions, the GWCox model is still the first choice to be utilized. It is kindly similar with the choice of ordinary Cox proportional hazards model and ordinary AFT model. The ordinary AFT model is an alternative method when the PH assumption is not be satisfied. While, when the PH assumption holds, the Cox’s regression model is still the model of choice in the analysis of time to event data in survival analysis [30].

Fourthly, in the empirical study, while combining average effects offers a general understanding of relationships (referred to as the general average), it may overlook local distinctiveness [31]. By utilizing proposed GWAFT, we can identify a clear substantial geographic variation across different counties of three covariates. Especially, for Surgery covariate, it can be cause by the different area socioeconomic characteristics and access to care [26, 27]. This approach aligns with Precision Medicine by capturing local characteristics and finely modeling covariate effects to comprehend disease relationships and geographical distribution patterns [32].

Future research could explore allowing the bandwidth or smoothing factor in GWR to be derived separately for each covariate. Yu et al. (2020) [10] proposed a multiscale GWR (MGWR) framework, which treats GWR as a Generalized Additive Model and extends it to MGWR, deriving standard errors for local parameters. Extending our method to incorporate these aspects in future studies would be valuable.

Conclusion

In conclusion, our proposed GWAFT model employs a kernel function to assign weights to each observation based on graph distance, and uses appropriate bandwidth selection to address the problem caused by spatial heterogeneity and sparsity of multi-region spatial survival data. This approach serves as an alternative when the proportional hazards assumption is violated in certain scenarios. By incorporating surrounding information and considering spatial relationships, our method provides an efficient modeling solution for spatial survival data in big, small, extreme imbalance sample sizes situations.

Availability of data and materials

All data involved in the current study were obtained from SEER program (https://seer.cancer.gov/).

References

  1. Jerrett M, Gale S, Kontgis C. Spatial modeling in environmental and public health research. Int J Environ Res Public Health. 2010;7:1302–29.

    Article  PubMed  PubMed Central  Google Scholar 

  2. Duggan MA, Anderson WF, Altekruse S, Penberthy L, Sherman ME. The Surveillance, Epidemiology, and End Results (SEER) program and pathology: toward strengthening the critical relationship. Am J Surg Pathol. 2016;40:e94–102.

    Article  PubMed  PubMed Central  Google Scholar 

  3. Bristow RE, Chang J, Ziogas A, Gillen DL, Bai L, Vieira VM. Spatial analysis of advanced-stage ovarian cancer mortality in California. Am J Obstet Gynecol. 2015;213:43.e1–43.e8.

    Article  PubMed  Google Scholar 

  4. Nakaya T, Fotheringham AS, Brunsdon C, Charlton M. Geographically weighted Poisson regression for disease association mapping. Stat Med. 2005;24:2695–717.

    Article  CAS  PubMed  Google Scholar 

  5. Paciorek CJ. The importance of scale for spatial-confounding bias and precision of spatial regression estimators. Stat Sci. 2010;25:107–25.

    Article  PubMed  PubMed Central  Google Scholar 

  6. Boyle J, Ward MH, Cerhan JR, Rothman N, Wheeler DC. Modeling variation in mixture effects over space with a Bayesian spatially varying mixture model. Stat Med. 2024;43:1441–57.

    Article  PubMed  PubMed Central  Google Scholar 

  7. Sulekan A, Jamaludin SSS. Review on Geographically Weighted Regression (GWR) approach in spatial analysis. Malays J Fundam Appl Sci. 2020;16:173–7.

    Article  Google Scholar 

  8. Furková A. Simultaneous consideration of spatial heterogeneity and spatial autocorrelation in European innovation: a spatial econometric approach based on the MGWR-SAR estimation. Rev Reg Res. 2021;41:157–84.

    Article  Google Scholar 

  9. Li B, Griffith DA. The Moran Spectrum as a Geoinformatic Tupu: implications for the First Law of Geography. Ann GIS. 2022;28:69–83.

    Article  Google Scholar 

  10. Yu H, Fotheringham AS, Li Z, Oshan T, Kang W, Wolf LJ. Inference in Multiscale Geographically Weighted Regression. Geogr Anal. 2020;52:87–106.

    Article  Google Scholar 

  11. Taufiq A, Astuti AB, Rinaldo Fernandes AA. Geographically Weighted Regression in Cox Survival Analysis for Weibull Distributed Data with Bayesian Approach. IOP Conf Ser: Mater Sci Eng. 2019;546:052078.

    Article  Google Scholar 

  12. Xue Y, Schifano ED, Hu G. Geographically Weighted Cox Regression for Prostate Cancer Survival Data in Louisiana. Geogr Anal. 2020;52:570–87.

    Article  Google Scholar 

  13. Goerdten J, Carrière I, Muniz-Terrera G. Comparison of Cox proportional hazards regression and generalized Cox regression models applied in dementia risk prediction. A&D Transl Res Clin Interv. 2020;6:e12041.

    Article  Google Scholar 

  14. Saikia R, Barman MP. A review on accelerated failure time models. Int J Stat Syst. 2017;12:311–22.

    Google Scholar 

  15. Chiou SH, Kang S, Kim J, Yan J. Marginal semiparametric multivariate accelerated failure time model with generalized estimating equations. Lifetime Data Anal. 2014;20:599–618.

    Article  PubMed  Google Scholar 

  16. Bhattacharyya S, Bickel PJ. Community Detection in Networks using Graph Distance. Preprint at arXiv:1401.3915(2014).

  17. Porcu E, Bevilacqua M, Genton MG. Spatio-temporal covariance and cross-covariance functions of the great circle distance on a sphere. J Am Stat Assoc. 2016;111(514):888–98.

    Article  CAS  Google Scholar 

  18. Sadeghinaeenifard F, Dong P. A comparison of longitude–latitude and Euclidean distance shape descriptors for determining tree crown shapes derived from LiDAR data. Int J Remote Sens. 2019;40:8432–49.

    Article  Google Scholar 

  19. Brunsdon C, Fotheringham AS, Charlton ME. Geographically weighted regression: a method for exploring spatial nonstationarity. Geogr Anal. 1996;28:281–98.

    Article  Google Scholar 

  20. Pan W. Akaike’s information criterion in generalized estimating equations. Biometrics. 2001;57:120–5.

    Article  CAS  PubMed  Google Scholar 

  21. Xue F, Qu A. Integrating Multisource Block-Wise Missing Data in Model Selection. J Am Stat Assoc. 2021;116:1914–27.

    Article  CAS  Google Scholar 

  22. Huse E, Malone J, Ruesch E, Sulak T, Carroll R. An analysis of hurricane impact across multiple cancers: Accessing spatio-temporal variation in cancer-specific survival with Hurricane Katrina and Louisiana SEER data. Health & Place. 2020;63:102326.

    Article  Google Scholar 

  23. Wu J, Sun H, Yang L, Deng Y, Yan Y, Wang S, et al. Improved survival in ovarian cancer, with widening survival gaps of races and socioeconomic status: a period analysis, 1983–2012. J Cancer. 2018;9:3548–56.

    Article  PubMed  PubMed Central  Google Scholar 

  24. Kim S, Dolecek TA, Davis FG. Racial differences in stage at diagnosis and survival from epithelial ovarian cancer: a fundamental cause of disease approach. Social Sci Med. 2010;71:274–81.

    Article  Google Scholar 

  25. Gardner AB, Sanders BE, Mann AK, Liao C-I, Eskander RN, Kapp DS, et al. Relationship status and other demographic influences on survival in patients with ovarian cancer. Int J Gynecol Cancer. 2020;30:1922–7.

    Article  PubMed  Google Scholar 

  26. Burger J, Gochfeld M, Lacy C. Ethnic differences in risk: experiences, medical needs, and access to care after hurricane Sandy in New Jersey. J Toxicol Environ Health A. 2019;82:128–41.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  27. Henry KA, Sherman R, Roche LM. Colorectal cancer stage at diagnosis and area socioeconomic characteristics in New Jersey. Health & Place. 2009;15:505–13.

    Article  Google Scholar 

  28. Hu G, Xue Y, Huffer F. A comparison of Bayesian accelerated failure time models with spatially varying coefficients. Sankhya B. 2021;83:541–57.

    Article  Google Scholar 

  29. Infante G, Miceli R, Ambrogi F. Sample size and predictive performance of machine learning methods with survival data: a simulation study. Stat Med. 2023;42:5657–75.

    Article  PubMed  Google Scholar 

  30. Pang M, Platt RW, Schuster T, Abrahamowicz M. Spline-based accelerated failure time model. Stat Med. 2021;40:481–97.

    Article  PubMed  Google Scholar 

  31. Reich BJ. Spatiotemporal quantile regression for detecting distributional changes in environmental processes. J R Stat Soc C Appl Stat. 2012;61:535–53.

    Article  Google Scholar 

  32. Roberson ML. Precision in Language Regarding Geographic Region of Origin in Severe Cutaneous Adverse Drug Reaction Research. JAMA Dermatol. 2024;160:534.

    Article  PubMed  Google Scholar 

Download references

Acknowledgements

We sincerely thank Professor Guoyou Qin for his significant help.

Funding

This research was supported by the National Social Science Fund of China (21CTJ009, F.C.), the National Natural Science Foundation of China (81703325, F.C.), and the National Key Research and Development Program of China (2017YFC0907200 and 2017YFC0907201, H.Y.).

Author information

Authors and Affiliations

Authors

Contributions

Conceptualization, F.C., H.Y. and J.C.; Methodology, F.C., H.Y. and J.C.; Software, J.C.; Validation, J.C., Y.L.; Formal analysis, J.C., W.H.; Writing—original draft preparation, J.C.; Writing—review and editing: H.Y., F.C., J.C., B.M., L.P., Y.Z.; Visualization: J.C., H.J.; Critical revision: H.Y., F.C.; Supervision: F.C. and H.Y.; Funding acquisition: H.Y., F.C. All authors have read and agreed to the published version of the manuscript.

Corresponding authors

Correspondence to Hong Yan or Fangyao Chen.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Cai, J., Li, Y., Hu, W. et al. Geographically weighted accelerated failure time model for spatial survival data: application to ovarian cancer survival data in New Jersey. BMC Med Res Methodol 24, 239 (2024). https://doi.org/10.1186/s12874-024-02346-8

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12874-024-02346-8

Keywords