 Research
 Open access
 Published:
Outlier detection in spatial error models using modified thresholdingbased iterative procedure for outlier detection approach
BMC Medical Research Methodology volume 24, Article number: 89 (2024)
Abstract
Background
Outliers, data points that significantly deviate from the norm, can have a substantial impact on statistical inference and provide valuable insights in data analysis. Multiple methods have been developed for outlier detection, however, almost all available approaches fail to consider the spatial dependence and heterogeneity in spatial data. Spatial data has diverse formats and semantics, requiring specialized outlier detection methodology to handle these unique properties. For now, there is limited research exists on robust spatial outlier detection methods designed specifically under the spatial error model (SEM) structure.
Method
We propose the SpatialΘIterative Procedure for Outlier Detection (SpatialΘIPOD), which utilizes a meanshift vector to identify outliers within the SEM. Our method enables an effective detection of spatial outliers while also providing robust coefficient estimates. To assess the performance of our approach, we conducted extensive simulations and applied it to a realworld empirical study using life expectancy data from multiple countries.
Results
Simulation results showed that the masking and JD (Joint Detection) indicators of our SpatialΘIPOD method outperformed several commonly used methods, even in highdimensional scenarios, demonstrating stable performance. Conversely, the ΘIPOD method proved to be ineffective in detecting outliers when spatial correlation was present. Moreover, our model successfully provided reliable coefficient estimation alongside outlier detection. The proposed method consistently outperformed other models (both robust and nonrobust) in most cases. In the empirical study, our proposed model successfully detected outliers and provided valuable insights in the modeling process.
Conclusions
Our proposed SpatialΘIPOD offers an effective solution for detecting spatial outliers for SEM while providing robust coefficient estimates. Notably, our approach showcases its relative superiority even in the presence of high leverage points. By successfully identifying outliers, our method enhances the overall understanding of the data and provides valuable insights for further analysis.
Background
In general, an outlier refers to a data point that significantly deviates from the norm for a specific variable or population [1]. It is also characterized as an observation that is inconsistent with the remaining data [2]. Swersky et al. (2016) further defined an outlier as an observation that diverges to the extent of arousing suspicions [3]. Outliers are inevitable [4] and sometimes carry special information. While in practice, some outliers may simply be considered as “noise” or “dirty data”, more often than not, they have the potential to influence statistical inference and provide valuable insights within the dataset [5]. For instance, in a published breast cancer detection system, inliers may represent healthy patients, while outliers may indicate a higher probability of breast cancer [6]. As a result, incorrect or crude treatment of outliers often results in loss of information, inaccurate statistical inferences and biased estimates. Accurately identifying outliers, especially in the field of public health, is of significant importance for further analysis of outliers to provide additional insights in certain aspects. Therefore, the methodology for detecting outliers is an essential and urgent need in data analysis [5].
A dataset may contain multiple outliers, posing challenges in detecting and addressing the masking and swamping effects [7]. Various methods have been employed for multiple outlier detection, including the fully efficient onestep procedure (GY) proposed by Gervini and Yohai (2002) [8], the least trimmed squares (LTS) [9], and the MMestimators [10]. Moreover, other methods have also been developed to tackle different aspects of outlier detection. For instance, Kong et al. (2018) proposed a method based on the squared loss of the meanshift model with two penalty functions on the meanshift vector and the parameter vector, achieving both high breakdown points and high efficiency [11]. Jiang et al. (2020) introduced the penalized weighted LADLASSO (PWLADLASSO) estimator, which combines robust estimation and outlier detection properties [12]. Among these methods, we noticed that the ΘIPOD method proposed by She & Owen (2011) used a regression model with a mean shift parameter. They incorporated a softthresholding penalty and a hardthresholding penalty, which effectively counter the masking effects [13].
However, in recent years, the presence of spatial heterogeneity in data has become increasingly common in various fields such as survey studies, surveillance efforts, and longitudinal studies, particularly in cancerrelated research [14]. For instance, the Surveillance, Epidemiology, and End Results (SEER) Program, the China Health and Retirement Longitudinal Study (CHARLS), and the China Northwest Cohort (CNC) often involve the collection of data at small geographical levels (such as communities or counties), which are subsequently aggregated at larger levels. This introduces additional complexity to outlier detection tasks. The primary reason for this is that geographic data often exhibit spatial dependence [15]. Traditional methods for outlier detection fail to consider the spatial relationships among input variables, while spatial patterns often demonstrate spatial continuity and autocorrelation with neighboring samples. For instance, the ΘIPOD method relies on a linear structure with a meanshift vector. However, the existence of spatial dependence violates the assumptions of traditional ordinary least squares (OLS) estimation and can result in a decrease in the efficiency and increase in the bias of the OLS estimator for the regression model parameters [16]. There have been some approaches to spatial outlier exploration, however, due to the diverse formats and semantics of spatial data, there is still a urgent need for outlier detection methodology that can accommodate these unique properties especially spatial dependence and heterogeneity [17].
In the area of spatial analysis, one commonly used method is the spatial error model (SEM), which considers the covariance structure between error terms [18]. The SEM model is adept at effectively addressing challenges related to spatial correlation and heterogeneity. SEM has been successfully applied in various applications, providing valuable insights when the spatially autocorrelated error structure is welldefined [19]. Some robust spatial regression approaches have been proposed in recent years. José Montero et al. (2012) introduced a model incorporating a global spatial trend within a Spatial Autoregressive (SAR) framework to address both largescale spatial dependencies and local spatial autocorrelation. The utilization of penalized splines for model estimation was emphasized, leveraging their representation as mixed models [20]. Boente et al. (2012) presented a robust estimation framework encompassing parametric and nonparametric components within the context of a generalized partly linear singleindex model [21]. Additionally, Yildirim et al. (2020) proposed a robust estimation approach utilizing robustified likelihood equations specifically tailored for SEM [22]. However, it is important to highlight that there is limited research available on robust spatial outlier detection specifically tailored to the SEM structure. These spatial robust estimation methods do not yield explicit results identifying which observations are outliers, which is not conducive to our further analysis of outliers.
Therefore, in this study, we propose a novel outlier detection method SpatialΘIPOD for SEMstructure data. Considering the outstanding performance of the ΘIPOD method in detecting outliers under normal circumstances, we have decided to extend its application to the structure of the SEM model to address the task of spatial outlier detection.
The contributions of this paper are as follows:
(1) We proposed an extension of the IPOD method to incorporate the structure of the SEM model, calling SpatialΘIPOD, enabling the detection of spatial outliers while effectively addressing the challenges posed by masking and swamping effects.
(2) In addition to outlier detection, our approach also provided robust estimates for the coefficients.
(3) We evaluated the effectiveness of the proposed algorithms for spatial outlier detection by applying them to the analysis of Life Expectancy (LE) data from multiple countries. We conduct a comprehensive analysis of the detected outliers, providing valuable insights and robust estimated results.
Methods
The ΘIPOD method
The ΘIPOD is based on the meanshift model [13]:
where \({\mathbf{X}} = [{\mathbf{x}}_{{\mathbf{1}}} ,...,{\mathbf{x}}_{{\mathbf{n}}} ]^{T} \in {\mathbb{R}}^{n \times p}\),\({\mathbf{y}} = [y_{1} ,...,y_{n} ]^{T} \in {\mathbb{R}}^{n}\),\({{\varvec{\upbeta}}} = [\beta_{1} ,...,\beta_{p} ]^{T} \in {\mathbb{R}}^{p}\),\(\epsilon\in {\mathbb{R}}^{n}\) is a random error vector. \({{\varvec{\upgamma}}} = (\gamma_{1} ,...,\gamma_{n} )^{T} \in {\mathbb{R}}^{n}\) acts as a vector indicating the locations of outliers. If one γ_{i} does not equal 0, it means the corresponding observation is an outlier.
To deal with masking and swamping in the presence of multiple outliers mentioned before, λ is the regularization parameters, a general threshold function Θ was been used.\(\Theta (t;\lambda )\) is an odd monotone unbounded shrinkage rule for t, at any λ, which satisfies:

(1)
\(\Theta (  t;\lambda ) =  \Theta (t;\lambda )\)

(2)
\(\Theta (t;\lambda ) \le \Theta \left( {t^{\prime } ;\lambda } \right) \, for \, 0 \le t \le t^{\prime }\)

(3)
\(\mathop {\lim }\limits_{t \to \infty } \Theta (t;\lambda ) = \infty\)

(4)
\(0 \le \Theta (t;\lambda ) \le t \, for \, 0 \le t < \infty\)
In their study, they considered two version of threshold function Θ, which are:
For any threshold function Θ(·; λ), a penalty function \(P_{\Theta } ( \cdot ;\lambda )\) with the smallest curvature corresponding can be found by following threestep construction,

(a)
\(\Theta^{  1} (u;\lambda ) = \sup \{ t:\Theta (t;\lambda ) \le u\}\)

(b)
\(s(u;\lambda ) = \Theta^{  1} (u;\lambda )  u\)

(c)
\(P(\theta ;\lambda ) = P_{\Theta } (\theta ;\lambda ) = \int_{0}^{\theta } s (u;\lambda ){\text{d}}u\)
The ultimate goal is to optimize the following formula to obtain the robust estimate of \(({\hat{\mathbf{\beta }}},{\hat{\mathbf{\gamma }}})\) by iterative procedure.
The update of \({{\varvec{\upgamma}}}\) via \({{\varvec{\upgamma}}}^{(j + 1)} = \Theta \left( {{\mathbf{H\gamma }}^{(j)} + ({\mathbf{I}}  {\mathbf{H}}){\mathbf{y}};{{\varvec{\uplambda}}}} \right)\) at each iteration, where \(\lambda_{i} = \lambda \sqrt {1  h_{i} }\), the HatMatrix \({\mathbf{H}}\) can be defined as \({\mathbf{H}} = {\mathbf{H(X)}} = {\mathbf{X(X}}^{{\mathbf{T}}} {\mathbf{X)}}^{{{\mathbf{  1}}}} {\mathbf{X}}^{{\mathbf{T}}}\), \(h_{i}\) donates the ith diagonal entry of \({\mathbf{H}}\).
About the choice of the regularization parameter, the λ can be chosen via BIC (Bayesian information criterion) [23, 24]. To be more specific, it can be chosen by a slight modification BIC. Given \(\lambda\) and the corresponding estimate \(\widehat{\gamma }(\lambda )\), let \(nz(\lambda ) = \{ i:\widehat{\gamma }_{i} (\lambda ) \ne 0\}\),\(\widehat{\gamma }_{nz}\) is an OLS estimate with one parameter per detected outlier, and the degrees of freedom are given by \(DF(\lambda ) = \left {nz(\lambda )} \right\). The slight modification of BIC is as \({\text{BIC}}^{*} (\lambda ) = m\log ({\text{RSS}} /m) + k(\log (m) + 1)\),where \(\mathbf{\overset{\frown}{y}}= {\mathbf{A\gamma }} + \epsilon^{\prime } ,\quad \epsilon^{\prime } \sim {\mathcal{N}}\left( {{\mathbf{0}},\sigma^{2} {\mathbf{I}}_{(n  p) \times (n  p)} } \right)\), \(\mathbf{\overset{\frown}{y}}= {\mathbf{U}}{}_{{\mathbf{c}}}^{{\mathbf{T}}} {\mathbf{y}}\), \({\mathbf{A}}\) can be obtained by the spectral decomposition of HatMatrix \({\mathbf{H}}\), \({\mathbf{H = ADA}}^{{\mathbf{T}}}\), \(m = n  p\),\({\text{RSS}} = \mathbf{\overset{\frown}{y}} {\mathbf{A}}\widehat{{{\varvec{\upgamma}}}}_{2}^{2} = ({\mathbf{I}}  {\mathbf{H}})({\mathbf{y}}  \widehat{{{\varvec{\upgamma}}}})_{2}^{2}\), and k = degrees of freedom + 1.
The selection range of \(\lambda\) is decreasing from \(({\mathbf{I}}  {\mathbf{H}}){\mathbf{y}}/\sqrt {{\text{diag}} ({\mathbf{I}}  {\mathbf{H}})}_{\infty }\) to 0, and select the \(\lambda\) with the minimum \({\text{BIC}}^{*} (\lambda )\).
The detail algorithm is as follows:
Spatial error model
SEM has been extensively utilized in various fields such as econometrics, regional science, forest science, social science, and marketing research. More recently, it has also found applications in the field of public health [25]. SEM regression model involving the coefficient of spatial dependence or autocorrelation (μ) that captures the spatial dependence in the error terms, is presented as follows:
Normal SEM model can be described as
where \({\mathbf{y}}\) contains an n × 1 vector of dependent variables and \({\mathbf{X}}\) represents an n × p matrix of independent variables. \({{\varvec{\upbeta}}}\) is a vector of p × 1 vector of regression parameter to be estimated of the model. μ is the spatial autoregressive parameter needed to be estimated. \({\mathbf{W}}\) is the rowstandardized weight matrix, which is calculated based on the distance matrix indicating how locations are spatially interconnected. The lagerror term \({{\varvec{\upxi}}} = \mu {\mathbf{W\xi }} +\epsilon ,\quad \epsilon\sim {\mathcal{N}}\left( {0,\sigma^{2} {\mathbf{I}}} \right)\) effectively addresses spatial dependence within the error terms, thereby augmenting the conventional linear model. The Eq. (4) shows that the observations have a Gaussian distribution with \({\mathbf{y}}\sim {\mathcal{N}}({\mathbf{X\beta }},\sigma^{2} ({\mathbf{I}}_{n}  \mu {\mathbf{W}})^{  1} )\).
SpatialΘIPOD
As mentioned earlier, while ΘIPOD demonstrates excellent performance under normal regression assumptions, it is observed that the error term deviates from the ordinary linear model. Consequently, ΘIPOD may no longer be applicable in such cases.
To address this limitation, we propose a modified approach called SpatialΘIPOD, which incorporates a mean shift vector γ into the SEM to identify outliers and obtain robust coefficient estimations. This modification enables the method to be suitable for the SEM data structure. The model is described as follows:
Motivated by Yildirim (2020) [22], one possible approach for estimating the regression coefficients of the SEM is the generalized least squares (GLS) method. This method is applicable when the spatial autoregressive parameter μ is known or has been previously estimated. Therefore, we generalize Eq. (5) as follows:
where \({\tilde{\mathbf{y}}} = ({\mathbf{I}}_{n}  \hat{\mu }{\mathbf{W}}){\mathbf{y}},{\tilde{\mathbf{X}}} = ({\mathbf{I}}_{n}  \hat{\mu }{\mathbf{W}}){\mathbf{X}},{\tilde{\mathbf{\gamma }}} = ({\mathbf{I}}_{n}  \hat{\mu }{\mathbf{W}}){{\varvec{\upgamma}}}\).
Under this model setting, the optimization problem turns to
We utilize the iterative procedure to solve the optimization problem. Before that, if μ is known, it can directly be used for the optimization. If μ is unknown, it can be estimated previously by following method [22]:

(i) Choose \(\psi\) function

(ii) Choose initial values \(\beta\), \(\mu\) via OLS (Ordinary least square) or GMM (Generalized Moment Model)

(iii) Compute \({{\varvec{\upbeta}}}^{(i + 1)}\) from equation \({{\varvec{\upbeta}}}^{(i + 1)} = {{\varvec{\upbeta}}}^{(i)} + \left[ {I\left( {{{\varvec{\upbeta}}}^{(i)} } \right)} \right]^{  1} s_{\beta }^{(i)}\).

(iv) Compute residuals with the estimated \({{\varvec{\upbeta}}}^{(i + 1)}\).

(v) Compute \(\mu^{(i + 1)}\) from equation \(\mu^{(i + 1)} = \mu^{(i)} + [I(\mu^{(i)} )]^{  1} s_{\mu }^{(i)}\).

(vi) Repeat steps iiiv until convergence for \({{\varvec{\upbeta}}}\) and \(\mu\).
where \({\mathbf{r}} = \hat{\Omega }_{\lambda }^{  1/2} \frac{{({\mathbf{y}}  {\mathbf{X\beta }})}}{{\hat{\sigma }}}\),\(\Omega_{\lambda } = \left( {{\mathbf{I}}_{n}  \lambda {\mathbf{W}}} \right)^{  1} \left( {{\mathbf{I}}_{n}  \lambda {\mathbf{W}}^{\prime } } \right)^{  1}\),\(K = \int_{  \infty }^{\infty } {\psi^{2} } (r)f(r)dr\), \(\psi ( \cdot )\) is the influence function can be chosen, containing Cauchy function, Insha function, etc. The observed information matrix \(I( \cdot )\) can be obtained as minus the expected value of the second derivatives of the robust loglikelihood functions. The score functions are \(s_{\beta } = \, \frac{{\hat{\sigma }}}{{\sigma^{2} }}{\mathbf{X}}^{\prime } \left( {{\mathbf{I}}_{n}  \mu {\mathbf{W}}} \right)^{2} \left( {{\mathbf{I}}_{n}  \hat{\mu }{\mathbf{W}}} \right)^{  1} \psi (r) = 0\) and \(s_{\mu } =  K{\text{tr}} \left( {\left( {{\mathbf{I}}_{n}  \mu {\mathbf{W}}} \right)^{  1} {\mathbf{W}}} \right) + \frac{{\hat{\sigma }^{2} }}{{\sigma^{2} }}\psi (r)^{\prime } \left( {{\mathbf{I}}_{n}  \hat{\mu }{\mathbf{W}}} \right)^{  1} \left( {{\mathbf{I}}_{n}  \mu {\mathbf{W}}} \right) \times {\mathbf{W}}\left( {{\mathbf{I}}_{n}  \hat{\mu }{\mathbf{W}}} \right)^{  1} \psi (r) = 0\).
The SpatialΘIPOD algorithm is listed as follows:
Similar with IPOD, the regularization parameter of our proposed SpatialΘIPOD is tuned in a datadependent way by a slight modification of BIC, with decreasing \(\lambda\) from \(({\mathbf{I}}  {\tilde{\mathbf{H}}}){\tilde{\mathbf{y}}}/\sqrt {{\text{diag}} ({\mathbf{I}}  {\tilde{\mathbf{H}}})}_{\infty }\) to 0.
Simulation study
Simulation design
We carried out simulation experiments to test the performance of the SpatialΘIPOD. It is well known that the presence of leverage points can cause failure in outlier detection methods. To be more specific, a data point whose xvalue (independent) is unusual, yvalue follows the predicted regression line though. Thus, we considered different combinations of dimensions, outlier quantities, and leverage values.
The observations were generated according to
The predictor matrix \({\mathbf{X}}\) is constructed as follows. Firstly, let \({\mathbf{X}} = {\mathbf{U\Sigma }}^{1/2}\), where \(U_{ij} \mathop \sim \limits^{iid} U(  5,5)\) and \(\Sigma_{ij} = \rho^{{1_{i \ne j} }}\) with ρ = 0.5. The dimension of \({\mathbf{X}}\) is set \(p \in \{ 15,50\}\), n = 500. Next, we modify the first O rows to represent leverage points, which are given by \(L \cdot [1,...,1]\). We consider six cases, involving variations of \(L \in \{ 15,20\}\) and \(O \in \{ 10,20,50\}\). Additionally, three more cases involve additive outliers at O points that are not leverage points, meaning that no rows of \({\mathbf{X}}\) are changed. The β vector is set as [1,…,1]_{p}. The shift vector is generated by \({{\varvec{\upgamma}}} = (\{ 8\}^{O} ,\{ 0\}^{n  O} )\). In order to add spatial heterogeneity, we incorporate a spatial error term \({{\varvec{\upxi}}}\) into the model. The generation of the spatial error term \({{\varvec{\upxi}}}\) is constructed as follows, with λ set to 0.7.
The spatial contiguity matrix \({\mathbf{W}} = ({\mathbf{W}}_{ij} )\) can be generated based on \(w_{ij} = \left\{ \begin{gathered} r^{i  j} ,i \ne j \hfill \\ 0,i = j \hfill \\ \end{gathered} \right.\), where r = 0.5. Here, we assume that these observations are arranged in a linear sequence. Generally, it can be considered as a graph structure. The \(\left {i  j} \right\) donates the graph distance between observation i and j. The σ^{2} is set 0.2.
Our simulation experiments mainly contain two aspects:
The first part of our simulation experiments focuses on comparing the outlier detection performance of seven different methods: SpatialhardIPOD, SpatialsoftIPOD, hardIPOD, softIPOD, MMestimator, fully efficient onestep procedure proposed by Gervini and Yohai (donoted by GY), and the least trimmed squares (LTS). These methods are implemented in the robust package (R version 4.1.2) and available in the S–PLUS Robust library. To ensure a fair comparison with ΘIPOD, we evaluate their performance based on three benchmark measures: the mean masking probability (M), the mean swamping probability (S), and the joint outlier detection rate (JD).
The mean masking probability (M) represents the fraction of true outliers that go undetected. The mean swamping probability (S) indicates the fraction of nonoutliers that are incorrectly labeled as outliers. The JD is the joint outlier detection rate, which measures the fraction of simulations with no masking (false negatives). In outlier detection, masking is considered more serious than swamping as it can lead to significant distortions. Swamping, on the other hand, typically results in a loss of efficiency. Ideally, we aim for M to be close to 0, S to be close to 0, and JD to be close to 100%. The joint outlier detection rate (JD) is particularly important for easier problems, while the mean masking probability (M) is more relevant for harder problems.
In the second part of our experiments, we compare the Mean Squared Error (MSE) of the estimated parameter β among 13 methods. These include the seven outlier detection methods mentioned earlier, as well as several robust methods for spatial estimation regression such as RoMLE (Robust estimation approach for spatial error model), including (RoMLE_Cauchy, RoMLE_Welsch, RoMLE_Insha, and RoMLE_Logistic). Because the RoMLE for SEM has smaller mean squared errors and exhibits more robust empirical influence function than the classical methods, when there are outliers in the dataset, we also conclude in our comparison. The difference between the four RoMLE method is that they choose different ψ function. The ψ function is introduced in Method section. Additionally, we consider nonrobust methods, such as MLE (Maximum Likelihood Estimation) and GMM (Generalized Moments Method).
All calculations were performed in R. The code and scripts reproducing the examples in this simulation study are publicly available online at GitHub (https://github.com/Justin0607/spatialoutlierdetection).
Simulation results
Tables 1 and 2 present the outlier identification performances of seven models in various simulation scenarios. Figs. 1 and 2 illustrate the results of Masking and JD for p = 15 and 50 respectively. While our main objective is to identify outliers, our proposed SpatialΘIPOD model also provides a robust coefficient estimate \(\hat{\beta }\).
The MSE in β for p equals 15 and 50 can be found in Tables 3 and 4 respectively, with corresponding trends shown in Figs. 3 and 4. Because our model significantly outperforms other models, even by several orders of magnitude, we have applied a logarithmic transformation to the MSE for ease of visualization and to better illustrate the trend.
In terms of masking, our proposed model consistently outperforms the other models across all simulation scenarios when p equals 15. We also notice that both our SpatialhardIPOD and SpatialsoftIPOD models exhibit similar performance (Tables 1 and 2, Figs. 1 and 2).
Additionally, we compare our models with three standard methods (MM, GY, and LTS) from the SPLUS Robust library. Among these, the GYestimator ranks second in terms of performance. However, the MMestimator, despite its popularity in robust analysis, and SpatialsoftIPOD show relatively weaker performance. When p equals 50, the overall results remain largely consistent, with a slight improvement in MM's performance, although it still falls in the middle when compared to other models (Tables 1 and 2, Figs. 1 and 2).
In terms of the JD indicator, when p equals 15, our proposed model consistently outperforms the other models in most scenarios, except for one scenario with a small number of outliers and no leverage. In this particular scenario, the SpatialsoftIPOD model falls slightly behind the softIPOD, but the SpatialhardIPOD still remains the topperforming model among all. In contrast, the performance of the hardIPOD, softIPOD, MM, and LTS models is not as satisfactory. Notably, the GYestimator performs poorly across all cases, indicating limited effectiveness in outlier detection even with a large number of simulations. When p = 50, we find that the performance of our proposed method is not significantly affected, as the JD indicators continue to remain at a high level (Tables 1 and 2, Figs. 1 and 2).
Regarding swamping, it is worth mentioning that although our proposed SpatialΘIPOD model excels in masking, it shows slightly weaker performance in terms of swamping. However, this tradeoff is acceptable, as masking poses a greater risk and harm.
Overall, the softIPOD, hardIPOD, MM, LTS, and GY models demonstrate high masking probabilities and low joint detection rates, particularly when the dimensionality (p) is high. However, our proposed SpatialΘIPOD method surpasses all of these models in terms of both masking probability and joint detection rate.
We also present the MSE of \(\widehat{\beta }\). As depicted in Table 3, when p equals 15, it is evident that our method significantly outperforms other methods in most cases. The hardIPOD, softIPOD, MM, LTS, and GY models exhibit considerably poorer performance, with a magnitude difference that is much larger compared to other models. The MLE and GMM models demonstrate better performance than the aforementioned five methods but still have room for improvement. Among all the models, the four RoMLE models are the closest to our MSE, but generally, our method still yields lower MSE, except in two scenarios (Outliers = 10, no leverage and Outliers = 50, leverage = 20) where we slightly lag behind. In terms of our proposed SpatialhardIPOD and SpatialsoftIPOD models, the SpatialsoftIPOD consistently outperforms the SpatialhardIPOD in all situations, while the MSEs of both methods increase as the number of outliers increases (Tables 3 and 4, Figs. 3 and 4).
When p equals 50, the overall performance situation remains largely unchanged, with our proposed SpatialΘIPOD model still exhibiting the best MSE performance among all the models. The only difference is that the MSE of SpatialΘIPOD is slightly larger compared to that of p equals 15 (Tables 3 and 4, Figs. 3 and 4).
Empirical study
In this section, we conducted a multicountry crosssectional study using public data from the World Bank (https://data.worldbank.org/) among 267 countries and regions to detect outliers in life expectancy (LE) measurement for the year 2020. In order to ensure that missing values will not affect the results of our empirical study, we excluded data with missing values from some countries, resulting in a selection of 82 countries and regions. The adjacency matrix for these countries was obtained using GeoDa (Luc Anselin 1.22.0.2).
Following the variables chosen by Ranabhat (2018) [26], the dependent variable in our study is the life expectancy of each country, while the independent variables include economic growth rate, child immunization rate, outofpocket expenditure percentage, domestic private health expenditure percentage, and access to improved sanitation percentage.
The fitting model is
\({\mathbf{W}}^{{\mathbf{*}}}\) is spatial contiguity matrix which contains the distance between each country. Because the performance of SpatialsoftIPOD is slightly better than SpatialhardIPOD in our simulation, we apply our proposed SpatialsoftIPOD to conduct this empirical study.
The results shows that the \(\gamma_{10} = 11.82855\), while other \(\gamma_{i} = 0\), it indicates that the 10th observation is an outlier in this situation, which is Suriname, a country in South America. The corresponding map of these countries with one outlier observation (red dot) is shown as Supplementary Fig. 1.
Accurately detecting outliers has many implications, including detecting outliers often provides valuable insights about the dataset. We furthermore conducted a thorough check of all variables for this country. Suriname's life expectancy ranks 42nd among the 82 countries, while its rankings for the remaining five indicators all fall behind 43rd. Specifically, the rankings for the other indicators are as follows: economic growth rate (81st), child immunization rate (82nd), outofpocket expenditure percentage (55th), domestic private health expenditure percentage (47th), and access to improved sanitation percentage (65th).
Generally, these factors all have a positive correlation with life expectancy. Under this assumption, the Suriname’s life expectancy should not rank as high as 42nd. However, the life expectancy of Suriname does not seem to align with the general trend. Therefore, it has been identified as an outlier based on these five variables.
Subsequently, we endeavored to determine the reasons behind the occurrence of this outlier. We examined other predictors related with life expectancy but not included in the study. For instance, Suriname's rankings in current health expenditure, enrollment, external health expenditure, and population growth are 39th, 22nd, 41st, and 42nd, respectively, which are higher than life expectancy ranks 42nd. Therefore, in the study, Suriname has been identified as an outlier, which may be associated with our choice of variables.
Discussion
In this study, we proposed SpatialΘIPOD for detecting spatial data outliers in SEM structures, while providing robust coefficient estimation results. We extended the IPOD method to incorporate spatial data structures, allowing for consideration of spatial error lag effects and inheriting the desirable properties of IPOD in combating masking.
In addition, due to the potential inadequacy of relying solely on raw residuals for effectively detecting outliers occurring at leverage points. Therefore, we not only examined the impact of outliers but also investigated the influence of leverage points on outlier detection, an aspect that has been rarely addressed in previous spatial outlier detection studies. Our simulation results demonstrated that the original IPOD method was not effective in detecting outliers in the presence of spatial correlation. Our masking and JD indicators outperformed several commonly used methods, both robust and nonrobust, even in highdimensional settings, with stable algorithm performance. While outlier detection was our primary objective, our model also provided stable coefficient estimation. Simulation study showed that our algorithm performed better than other models in the majority of cases, with only slight inferiority to the RoMLE model in a few instances. Furthermore, the MSE of our method slightly increased with increasing data contamination, which is consistent with general knowledge.
Accurately detecting outliers is important because it provides valuable insights about the dataset. The empirical study given is of Suriname being identified as an outlier observation in a study. The rankings of Suriname in various indicators, such as life expectancy and other variables, do not align with the general trend. This exemplifies one aspect of the significance of outlier detection, as analyzing outlier points can provide additional information. As demonstrated in this example, it indicates that the selected variables cannot fully explain all observations. When other four relevant variables are included in the model, Suriname is no longer classified as an outlier. Outliers offer valuable insights for uncovering hidden knowledge and enhancing healthcare services. Medical professionals can utilize these results to make informed predictions from extensive medical databases.
A limitation of this study is that in our simulation study, we have not considered the case of p > n. Currently, there are some issues with inadequate sample sizes in existing research, which will be the focus of our future studies. Another limitation of this study is that we tailored for crosssectional data analysis rather than longitudinal data. The longitudinal data offers benefits such as capturing temporal trends and changes over time. We intend to extend our model to longitudinal data in future research.
Conclusion
In conclusion, we proposed a SpatialΘIPOD method that effectively detects spatial outliers in the context of SEM structure and provides robust estimates of coefficients. Our method demonstrates relative superiority even in the presence of high leverage points. The detection of outliers offers valuable insights and enhances our understanding of the data.
Availability of data and materials
All data involved in the current empirical study were obtained from World Bank program (https://data.worldbank.org/).
References
Foorthuis R. On the nature and types of anomalies: a review of deviations in data. Int J Data Sci Anal. 2021;12:297–331.
Aguinis H, Gottfredson RK, Joo H. BestPractice Recommendations for Defining, Identifying, and Handling Outliers. Organ Res Methods. 2013;16:270–301.
Swersky L, Marques HO, Sander J, Campello RJGB, Zimek A. On the Evaluation of Outlier Detection and OneClass Classification Methods. In Proceedings of the 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Montreal, QC, Canada. 2016;1–10.
Wang T, Li Q, Chen B, Li Z. Multiple outliers detection in sparse highdimensional regression. J Stat Comput Simul. 2018;88:89–107.
Smiti A. A critical overview of outlier detection methods. Computer Science Review. 2020;38: 100306.
SchellerKreinsen D, Quentin W, Geissler A, Busse R. Breast cancer surgery and diagnosisrelated groups (DRGs): Patient classification and hospital reimbursement in 11 European countries. The Breast. 2013;22:723–32.
Mohammed Rashid A, Midi H, Dhhan W, Arasan J. Detection of outliers in highdimensional data using nusupport vector regression. J Appl Stat. 2022;49:2550–69.
Gervini D, Yohai VJ. A class of robust and fully efficient regression estimators. Ann Statist. 2002;30(2):583–616.
Rousseeuw PJ, Leroy AM. Robust regression and outlier detection. Hoboken, NJ: WileyInterscience; 2003.
Yohai VJ. High BreakdownPoint and High Efficiency Robust Estimates for Regression. Ann Stat. 1987;15:642–56.
Kong D, Bondell HD, Wu Y. Fully Efficient Robust Estimation, Outlier Detection and Variable Selection Via Penalized Regression. Stat Sin. 2018;28:1031–52.
Jiang Y, Wang Y, Zhang J, Xie B, Liao J, Liao W. Outlier detection and robust variable selection via the penalized weighted LADLASSO method. J Appl Stat. 2021;48:234–46.
She Y, Owen AB. Outlier Detection Using Nonconvex Penalized Regression. J Am Stat Assoc. 2011;106:626–39.
Xu B, Zhou F. The Roles of CloudBased Systems on the CancerRelated Studies: A Systematic Literature Review. IEEE Access. 2022;10:64126–45.
Cartone A, Postiglione P. Principal component analysis for geographical data: the role of spatial effects in the definition of composite indicators. Spat Econ Anal. 2021;16:126–47.
Bhatti SH, Khan FW, Irfan M, Raza MA. An effective approach towards efficient estimation of general linear model in case of heteroscedastic errors. Communications in Statistics  Simulation and Computation. 2023;52:392–403.
Kou Y, Lu CT, Chen D. Spatial Weighted Outlier Detection. In Proceedings of the 2006 SIAM International Conference on Data Mining (SDM). Society for Industrial and Applied Mathematics, Bethesda, Maryland, US. 2006;614–618.
LopezHernandez FA. Secondorder polynomial spatial error model. Global and local spatial dependence in unemployment in Andalusia. Econ Model. 2013;33:270–9.
Comber A, Brunsdon C, Charlton M, Dong G, Harris R, Lu B, et al. A Route Map for Successful Applications of Geographically Weighted Regression. Geogr Anal. 2023;55:155–78.
Montero JM, Mínguez R. SAR models with nonparametric spatial trends. A Pspline approach. Estadística Española. 2012;54(177):89–111.
Boente G, Rodriguez D. Robust estimates in generalized partially linear singleindex models. TEST. 2012;21:386–411.
Yildirim V, Mert KY. Robust estimation approach for spatial error model. J Stat Comput Simul. 2020;90:1618–38.
Antoniadis A. Wavelet methods in statistics: some recent developments and their applications. Stat Surv. 2007;1 none:16–55.
She Y. Thresholdingbased iterative selection procedures for model selection and shrinkage. Electron J Stat. 2009;3 none:384–415.
Dutta I, Basu T, Das A. Spatial analysis of COVID19 incidence and its determinants using spatial modeling: A study on India. Environmental Challenges. 2021;4:100096.
Ranabhat CL, Atkinson J, Park MB, Kim CB, Jakovljevic M. The Influence of Universal Health Coverage on Life Expectancy at Birth (LEAB) and Healthy Life Expectancy (HALE): A MultiCountry CrossSectional Study. Front Pharmacol. 2018;9:960.
Acknowledgements
We sincerely thank Professor Guoyou Qin for his significant help and guidance in statistical modelling, statistical simulations, and interpretation of results in this study.
Funding
This research was supported by the National Social Science Fund of China (21CTJ009, F.C.), the National Natural Science Foundation of China (81703325, F.C.), the Natural Science Basic Research Program of Shaanxi Province (2022JQ769, F.C.) and the National Key Research and Development Program of China (2017YFC0907200 and 2017YFC0907201, H.Y.).
Author information
Authors and Affiliations
Contributions
J.C., H.Y., and F.C. proposed the study concept and design; J.C., W.H. and Y.Y. drafted the manuscript; J.C. performed statistical analysis; H.Y., and F.C. conducted study supervision; and all authors critically revised the manuscript for important intellectual content.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no conflict of interest.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Cai, J., Hu, W., Yang, Y. et al. Outlier detection in spatial error models using modified thresholdingbased iterative procedure for outlier detection approach. BMC Med Res Methodol 24, 89 (2024). https://doi.org/10.1186/s12874024022083
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12874024022083