 Research article
 Open Access
Managing distance and covariate information with point-based clustering
BMC Medical Research Methodology volume 16, Article number: 115 (2016)
Abstract
Background
Geographic perspectives of disease and the human condition often involve point-based observations and questions of clustering or dispersion within a spatial context. These problems involve a finite set of point observations and are constrained by a larger, but finite, set of locations where the observations could occur. Developing a rigorous method for pattern analysis in this context requires handling spatial covariates, a method for constrained finite spatial clustering, and addressing bias in geographic distance measures. An approach, based on Ripley’s K and applied to the problem of clustering of deliberate self-harm (DSH), is presented.
Methods
Point-based Monte Carlo simulation of Ripley’s K, accounting for socioeconomic deprivation and sources of distance measurement bias, was developed to estimate clustering of DSH at a range of spatial scales. A rotated Minkowski L_{1} distance metric allowed variation in physical distance and clustering to be assessed. Self-harm data were derived from an audit of 2 years’ emergency hospital presentations (n = 136) in a New Zealand town (population ~50,000). The study area was defined by residential (housing) land parcels representing a finite set of possible point addresses.
Results
Area-based deprivation was spatially correlated. Accounting for deprivation and distance bias showed evidence for clustering of DSH for spatial scales up to 500 m with a one-sided 95 % CI, suggesting that social contagion may be present for this urban cohort.
Conclusions
Many problems involve finite locations in geographic space that require estimates of distance-based clustering at many scales. A Monte Carlo approach to Ripley’s K, incorporating covariates and models for distance bias, is crucial when assessing health-related clustering. The case study showed that social network structure defined at the neighbourhood level may account for aspects of neighbourhood clustering of DSH. Accounting for covariate measures that exhibit spatial clustering, such as deprivation, is crucial when assessing point-based clustering.
Background
Point pattern analysis to assess clustering or dispersion of a set of events in a bounded spatial region is commonly based on quadrat-based sampling aggregations or point-based measures such as the empty space function, pairwise and nearest neighbour distance [1–3]. This work extends the pairwise distance measure of Ripley’s K [2, 4, 5] to assess clustering over a range of spatial scales, while taking into account covariate and metric bias [3, 6]. Ripley’s K function [4] was originally designed for characterising stationary point patterns for a homogeneous Poisson process. The K function is a cumulative function defined over a range of pairwise distance counts that can distinguish clustered, random and dispersed spatial point patterns as a comparison against complete spatial randomness (CSR). Theoretical comparisons against CSR require an estimate of the intensity of points within a study region. The example presented in this paper evaluates clustering of episodes of deliberate self-harm (DSH) over 2 years in an urban environment. Because the study contains a finite set of points representing residential addresses, the distance measure may not be planar and the placement of points in the study area is not continuous. Similar difficulties have been addressed with modelling point distributions on a network [7, 8], although in this case distance was well defined by network connectivity. The approach presented here also addresses similar issues to the second-order analysis of clustering for inhomogeneous populations [5], where a set of control cases is randomly selected to form a comparative K estimate. However, that approach does not consider the influence of clustering due to covariate relationships in the observed point pattern. The spatial variation of disease presents similar issues, but is normally handled by kernel-based regression methods [6], without consideration of the influence of metrics on observed clustering.
Here we present a method to examine clustering for a finite set of point locations, together with a way to examine uncertainty in the planar distance measurement. In addition, the observed point data are correlated with a spatial variable that is itself clustered, and this must be accounted for when assessing clustering via an estimate of Ripley’s K.
There has been a long history examining the relationship between social behaviour and the patterning of societal structure [9–11]. Two main theories are generally proposed [11]: behaviour is characterised by the underlying structure of the environment that defines the living conditions of individuals; or behaviour is influenced by social interaction (often described as social contagion) that results in behaviours being shared and amplified between individuals. Both theories have been used to explain patterns of behavioural clustering and it is generally acknowledged that one cannot occur without the other. For example, Baller and Richardson [11] examined the patterning of suicide within the historical context of French departments from 1872 to 1876, and data for U.S. counties from 1990. Using area-based spatial analysis methods they concluded that the French example showed clustering after social integration was accounted for in the data, while the U.S. example did not show any residual clustering once social integration was incorporated in the model. They concluded that both concepts of social structure and contagion through imitation were responsible for the spatial patterning of suicide.
Previous clustering methods for self-harm behaviour used area-based counts for index events which aligned with other area-based covariates (such as social deprivation). This allowed regression approaches to be used to account for covariates and spatial lag [12]. Moran’s I or other simple count-based methods were then used to assess clustering [9, 10, 13, 14]. However, the ability to now collect and manage point-based data and incorporate this directly into spatial analysis means there is a need to develop appropriate clustering measures that handle covariate measures with points and address distance bias when Euclidean distance may not be appropriate, or where a distance metric is difficult to define.
Methods
The second-order moment (Ripley’s K) for an unlabelled, homogeneous, isotropic point process observed as a set of points x_{i} ∈ ℝ^{2} is defined as [5]:
\[ K(r) = \lambda^{-1}\, E\left[\text{number of further points within distance } r \text{ of an arbitrary point}\right] \]
where λ is the intensity of the point process per unit area. For an isotropic process, comparisons with K(r) are normally based on the homogeneous Poisson process K_{pois}(r) = πr^{2} [2]. For this type of process λ is approximated as the number of points/observed region area. For our derivation of K(r), observations are constrained to a finite set of possible locations. Hence λ is set to the number of points/(maximum observed Euclidean distance between any two points in x_{i}).
Consider a set of n possible point locations in a finite region of the plane W. Note that W is not explicitly used; however, the locations are bounded. This unordered set of points may be defined as:
\[ x = \{x_1, x_2, \ldots, x_n\}, \quad x_i \in W \subset \mathbb{R}^2 \]
Each point x_{i} has an associated mark from a finite set of marks M, defining a marked point pattern:
\[ y = \{(x_i, m_i) : x_i \in x,\ m_i \in M\} \]
We observe a subset of q marked points from y, where q ≪ n, and want to determine if this observed set deviates from complete spatial randomness. In addition, since the marked pattern may be spatially correlated to the process generating the point pattern, the distribution of the observed marks must be taken into account when simulating a random sample from y. Initially (since q is fixed), we construct the discrete cumulative distribution function for the q observed marks as:
\[ F(m) = \frac{1}{q}\,\#\{\, j : m_j \le m \,\}, \quad m \in M \]
where the count runs over the marks m_{j} of the q observed points.
K(r) is now defined over a set of distance thresholds r_{i} ∈ ℝ for one Monte Carlo simulation as follows:
For each distance threshold r_{i}:
1. P = {}
2. Repeat until q points have been selected:
2.1 Draw a uniform random number ρ ∈ [0,1).
2.2 Determine the mark m_{k} for F(ρ). This corresponds to a proportional selection of a mark value from the frequency distribution of marks for the observed pattern.
2.3 Select the subset of points t = {(x_{i}, m_{i}) ∈ y : m_{i} = m_{k}} that correspond to this mark.
2.4 Randomly select a point s ∈ t.
2.5 P = P ⋃ s
3. Count the number of points from P that lie within the Minkowski distance L_{2} (Euclidean distance) r_{i} of each point in P; K(r_{i}) is defined as λ^{−1} times this count.
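One simulation run of the procedure above can be sketched in Python as follows. This is a minimal illustration rather than the authors’ code (which was written in R): the function names `sample_by_marks` and `pair_counts` are hypothetical, candidate locations are assumed to be `((x, y), mark)` tuples, every observed mark is assumed to occur among the candidates, and n is assumed to be much larger than q so that q distinct points can always be drawn. The count is returned unscaled; multiplying by λ^{−1} is left to the caller.

```python
import math
import random
from collections import Counter, defaultdict


def sample_by_marks(candidates, observed_marks, rng):
    """Steps 1-2: draw q distinct candidate points whose marks follow
    the frequency distribution of the observed marks."""
    q = len(observed_marks)
    counts = Counter(observed_marks)
    marks = sorted(counts)
    # Discrete cumulative distribution F over the observed marks.
    cdf, running = [], 0
    for m in marks:
        running += counts[m]
        cdf.append(running / q)
    # Candidate locations grouped by mark (the subset t in step 2.3).
    by_mark = defaultdict(list)
    for point, mark in candidates:
        by_mark[mark].append(point)
    chosen = set()                                   # step 1: P = {}
    while len(chosen) < q:                           # step 2
        rho = rng.random()                           # step 2.1
        m_k = next(m for m, f in zip(marks, cdf) if rho < f)  # step 2.2
        chosen.add(rng.choice(by_mark[m_k]))         # steps 2.3-2.5
    return sorted(chosen)


def pair_counts(points, radii):
    """Step 3 (unscaled): for each radius r, the number of ordered pairs
    (i, j), i != j, whose Euclidean distance is at most r."""
    counts = [0] * len(radii)
    for i, a in enumerate(points):
        for b in points[i + 1:]:
            d = math.dist(a, b)
            for k, r in enumerate(radii):
                if d <= r:
                    counts[k] += 2   # each pair is seen from both ends
    return counts
```

Because the marks are drawn from the observed frequency distribution, a simulated pattern inherits the deprivation profile of the observed episodes without copying their locations.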
This method does not assume that the observed marks are clustered, but takes into account their spatial structure when determining K(r). For our case study we show that taking socioeconomic structure (defined as a deprivation index) into account has a significant effect on the estimate of clustering (see Results section).
Multiple simulation runs allow an envelope to be constructed. For a one-sided 5 % significance level for q observed points, the above simulation is performed 1000 times to define a reference set \( \widehat{K}\left({r}_i\right) \). For each distance r_{i} the \( \widehat{K}\left({r}_i\right) \) are sorted in ascending order. A 5 % significance level for the clustering of observed K(r_{i}) means that K(r_{i}) is greater than the 951st value of K from the sorted reference set [7, 15].
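The comparison against the reference set can be written down directly. The sketch below is illustrative only: `significant_clustering` is a hypothetical name, and the simulated values are assumed to be stored as one list per run, aligned with the observed values by distance threshold.

```python
def significant_clustering(observed_k, simulated_k, alpha=0.05):
    """Return, per distance threshold, whether the observed K exceeds the
    one-sided Monte Carlo critical value from the simulated reference set.

    simulated_k is a list of runs; each run is a list of K values aligned
    with observed_k."""
    n_sims = len(simulated_k)
    # 1-based rank of the critical value, e.g. 951 for 1000 runs at 5 %.
    rank = n_sims - round(alpha * n_sims) + 1
    if not 1 <= rank <= n_sims:
        raise ValueError("too few simulations for this significance level")
    flags = []
    for i, k_obs in enumerate(observed_k):
        reference = sorted(run[i] for run in simulated_k)
        flags.append(k_obs > reference[rank - 1])
    return flags
```

The test is applied separately at each distance threshold, so the result is a list of flags rather than a single decision.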
Addressing distance bias
The use of L_{2} distance on the plane (step 3 above) assumes a barrier-free, isotropic measure for the distance between points. From a social contagion perspective it is difficult to know what, if any, planar distance is appropriate for the connection between any two index events. In addition, social media and other forms of communication mean that a spatial distance may not be appropriate. Since Ripley’s K requires a distance measure, we would like to confirm that the choice of L_{2} distance does not significantly influence the results.
Consider the Minkowski distance L_{1} (Manhattan or rectilinear distance) defined between two points a(x_{1}, y_{1}) and b(x_{2}, y_{2}):
\[ L_1(a, b) = \left|x_1 - x_2\right| + \left|y_1 - y_2\right| \]
Although L_{2} is invariant under rotation, the L_{1} distance between two points changes as the point set x is rotated about the origin, because the split of their separation into x-axis and y-axis differences depends on the orientation of the axes. Hence, to examine the influence of distance bias, step 3 can be extended by considering a set of rotations θ between 0 and 90° using L_{1}:

5. For each rotation θ_{i} ∈ θ:
a. Rotate the original observed points q by θ_{i} and compute Ripley’s K using L_{1} distance.
b. Rotate the set of points s by θ_{i} about the origin to form the set s’.
c. For each distance threshold r_{i} count the number of points in s’ within the Minkowski distance L_{1} (Manhattan distance) r_{i} of each point in s’.
This metric is clearly justified for grid-like road patterns but may also be used when the geographic distance between points is difficult to define or involves some uncertainty.
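The effect of rotation on L_{1} distance can be seen with a small Python sketch. The helper names (`rotate`, `manhattan`, `rotated_l1_counts`) are hypothetical and the two-point pattern is invented purely for illustration; it is not study data.

```python
import math


def rotate(points, theta_deg):
    """Rotate (x, y) points anticlockwise about the origin."""
    t = math.radians(theta_deg)
    c, s = math.cos(t), math.sin(t)
    return [(x * c - y * s, x * s + y * c) for x, y in points]


def manhattan(a, b):
    """Minkowski L1 distance between two points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def rotated_l1_counts(points, radii, theta_deg):
    """Ordered-pair counts within each radius, using L1 after rotation."""
    pts = rotate(points, theta_deg)
    counts = [0] * len(radii)
    for i, a in enumerate(pts):
        for b in pts[i + 1:]:
            d = manhattan(a, b)
            for k, r in enumerate(radii):
                if d <= r:
                    counts[k] += 2   # each pair is seen from both ends
    return counts


# The same two points are 2.0 apart at 0 degrees but about 1.414 apart at 45.
pair = [(0.0, 0.0), (1.0, 1.0)]
by_angle = {theta: rotated_l1_counts(pair, [1.5], theta)
            for theta in (0.0, 22.5, 45.0)}
```

With a threshold of 1.5, the pair is counted only at 45°, which is why a range of rotations is examined rather than a single orientation.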
Case study: clustering of deliberate self-harm in an urban environment
This case study is based on data obtained from Invercargill, a small urban centre (population = 51,696 [16]) in the south of New Zealand. This was a retrospective 2-year audit based on file review of all patients who presented with DSH of any type to the Emergency Department or Emergency Psychiatric Service Team between January 2011 and December 2012. The audit was approved by the University of Otago Ethics Committee (H13/033). Data collected included demographic and clinical details and residential address.
Land parcels for Invercargill were obtained from the Land Information New Zealand online database [17]. The residential parcels were identified by selecting parcels where parcel_int = “DCDB” or “Fee Simple Title” AND statutory = NULL and survey_area > 0. The parcels were then assessed by area, with the smallest 5 % and largest 5 % of parcels removed. This excluded schools, recreational areas and other parcels that filled space but were not available as polygons for residences. The set was further reduced by using Google Maps to visually assess and remove those parcels that were shops or industrial areas. This resulted in 16,516 residential parcels that could be used as possible residential addresses. Note that every effort was made to reduce the number of residential parcels, since this would reduce the likelihood of false clustering due to an oversized study area. Figure 1 shows the location of Invercargill (upper panels) and a section of the final residential parcels used in this study (lower panel).
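The attribute filter and area trimming described above might look like the following Python sketch. It assumes each parcel is a dictionary keyed by the attribute names quoted in the text, that NULL is represented as `None`, and that the selection reads as (parcel_int is one of the two values) AND statutory is NULL AND survey_area > 0; the function names are hypothetical, and the manual Google Maps check is not represented.

```python
def is_residential_candidate(parcel):
    """Attribute filter; the grouping of OR and AND is an assumption."""
    return (parcel["parcel_int"] in ("DCDB", "Fee Simple Title")
            and parcel["statutory"] is None
            and parcel["survey_area"] > 0)


def trim_by_area(parcels, fraction=0.05):
    """Drop the smallest and largest `fraction` of parcels by survey_area."""
    ordered = sorted(parcels, key=lambda p: p["survey_area"])
    k = int(len(ordered) * fraction)
    return ordered[k:len(ordered) - k]
```

Trimming by rank rather than by a fixed area threshold keeps the rule independent of the units used for survey_area.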
The initial individual DSH data (n = 291, of which there were 245 unique individuals) was reduced to those that intersected the residential parcels (n = 164 with 134 unique individuals; data that were not included were for individuals who lived outside of the urban boundaries). Since we were interested in evidence for clustering and social contagion, only index episodes for a given location were kept. This meant that individuals with repeat DSH at the same address were removed; however the same individual who repeated DSH at different addresses, or a different individual at the same address, were kept in the dataset. The final DSH data consisted of 136 index episodes, with two repeat individuals. A measure of socioeconomic quality of life, the New Zealand Deprivation Index (NZDep) was obtained based on the New Zealand Census data of 2006. NZDep is based on proportional measures of nine variables and constructed as a weighted sum determined by a principal component analysis of variable importance [18]. Deprivation index is a small area measure ranging from 1 (high quality) to 10 (poorest).
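The index-episode rule above amounts to keeping the first record for each combination of individual and address. A minimal Python sketch, assuming hypothetical field names `person_id`, `address_id` and an ISO-formatted `date` string for each episode:

```python
def index_episodes(episodes):
    """Keep the earliest episode for each (person, address) pair."""
    seen = set()
    kept = []
    for ep in sorted(episodes, key=lambda e: e["date"]):
        key = (ep["person_id"], ep["address_id"])
        if key not in seen:
            seen.add(key)
            kept.append(ep)
    return kept
```

Under this rule a person who presents from two different addresses contributes two index episodes, as do two different people at one address.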
Figure 2 shows the meshblocks for the Invercargill urban area, and the spatial Moran’s I and autocorrelation measures [19] for the deprivation index associated with each meshblock (Panels A, B and C). Since observed DSH episodes are not uniform across deprivation (Panel D), spatial clustering of DSH will be observed due to the underlying clustered social structure of the urban environment. A similar relationship has been previously observed for suicide [14] and in a DSH young cohort study in New Zealand [20].
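For readers unfamiliar with the statistic, global Moran’s I with binary adjacency weights can be computed as in the sketch below. The function name and the adjacency-list representation are assumptions made for illustration; this is not the implementation used to produce Figure 2.

```python
def morans_i(values, neighbours):
    """Global Moran's I with binary weights.

    neighbours maps each unit index to the indices of adjacent units."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(len(adj) for adj in neighbours.values())   # sum of all weights
    cross = sum(z[i] * z[j] for i, adj in neighbours.items() for j in adj)
    return (n / s0) * cross / sum(d * d for d in z)
```

Values near +1 indicate that similar deprivation scores sit next to each other, while values near −1 indicate an alternating pattern.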
Although previous work on attempted suicide in New Zealand [10] suggested the existence of social contagion for space-time patterns, no account was made for the inherent clustering of social structure. Normally social structure is accounted for by incorporating its description into a regression model (see for example [11, 13]); however, the use of a distance-based metric for clustering has no formal model for this type of integration. Hence a Monte Carlo simulation is appropriate for determining the null hypothesis [21], while removing the social clustering of deprivation as a model for DSH.
For the Invercargill DSH data, the set x corresponds to the centroids of each residential parcel, the set y corresponds to the observed index events, and the marks M = {1…10} are the deprivation index. The grid-like pattern of roads within urban Invercargill (Fig. 3) justifies the use of rotated L_{1} distance measures to reduce the bias associated with Euclidean distance and increase confidence in any observed clustering of DSH.
Results
Figure 4 shows K(r) estimated with uniform mark distribution (Panel A) and when the covariate distribution of deprivation index is taken into account (Panel B). Note that the y-axis is calculated as λ^{2}K(r) which gives the expected number of points within r of an observed point. Panel A shows that without accounting for social structure (deprivation) clustering of index episodes is significant for all distances up to ~800 m. However, Panel B shows that evidence for clustering is only apparent up to ~500 m once deprivation is accounted for when estimating K(r).
Figure 5 shows a range of K(r, θ) values using L_{1} distance. Although some rotations (such as 22.5°) were below the 5 % threshold of evidence for clustering, it is apparent that for almost all rotations, clustering was significant up to ~500 m.
Discussion
The original formulation of the second-order estimate for clustering K(r) assumes a stationary process generating the intensity of observed points and no constraint regarding the placement of points in the study area. However, there are many situations where the possible observation of a point in space is constrained due to the nature of the observed process, or through explicit constraints in the way that the defining space is created. For example, a residential analysis of patterns assumes that people live at valid addresses that do not include parks, businesses, etc., while an analysis of road accident clustering is constrained to locations on a road network. The use of individual data for health analysis will increase with improved data collection and the linked integration of datasets. The method presented here addresses some aspects of how to consider spatial clustering when individual data are used within a constrained spatial region and where a clustered covariate relationship exists. The results for DSH clustering, as shown in Fig. 4, show that once social structure is accounted for there still exists evidence for clustering up to ~500 m. This clustering may suggest aspects of social contagion [11], especially given that evidence for clustering is also demonstrated with the rotated L_{1} metric.
The issue of clustering and a distance metric is a difficult concept to manage and quantify with the increased use of social media as a tool for communication. Physically being close is no longer a requirement for proximity and social influence [22]. However since social networking tools are independent of location, DSH that derives from these influences should be spatially random once clustered covariates are managed.
The evidence for clustering presented in Fig. 5 suggests that there is a physical (geographic) relationship between individuals and DSH, although the study here has a number of limitations. The dataset is restricted to just 2 years of observations, and for only a single community. Both of these aspects limit any generalisation but do suggest that further work extending the data collection period and range of urban settings would be useful. In addition, the clustering method assumed a single spatial covariate (deprivation); however, there could be other clustered covariates such as alcohol outlets [23, 24] that are creating the observed pattern for DSH clustering. This problem can be handled by extending the marked point pattern probability method to incorporate a multivariate density analysis [25] to create a probability surface for selecting fixed locations. Because many physical constructions, such as alcohol outlets, are also often correlated with deprivation [23], handling a single variable that captures socioeconomic structure may be sufficient for estimating DSH clustering. Further work is required to determine the impact on clustering estimation with other configurations in the urban environment.
The concept of stationarity in space and time did not need to be considered here given the short timeframe and small urban area. However, although a longer time period and/or larger urban centre would produce a greater number of cases this would also increase the possibility of nonstationarity in the clustering behaviour. This would require additional methods for both detection and handling. Concepts such as nonstationarity in space and time are difficult to manage when assessing clustering and a likely solution would be to treat the clustering algorithm as a set of local statistics [26]. This is clearly future work but should be considered when large areas or long time frames are used in any assessment of spatial patterning.
Finally, extensions to Ripley’s K include a cross K function [5], which examines the relationship between two sets of finite (but differently marked) point observations. Extensions of the finite method to a cross function would allow questions of clustering to be related to point data that was not associated with the attributes of individuals and therefore extend the possible applications within the health domain.
Conclusions
Point-based analysis is normally considered for a planar space with no placement constraints. In addition, since health-related data are often correlated with other social patterns that may have spatial structure (e.g. deprivation), there is a requirement to take these into account to handle bias in estimating clustering at a range of scales. The finite-location method presented here is simple to implement and allows any point-based health-related problem to be assessed for clustering. In addition, the use of a rotated L_{1} distance metric allows a more rigorous assessment of the observed clustering effect by determining the influence of the assumption of Euclidean distance when assessing K(r). This paper supports previous work on the influence of social deprivation on clustering of DSH in a small urban centre [8]. In addition, evidence for social contagion has been demonstrated for DSH at small distance scales.
The presented finite-point Ripley’s K approach allows an assessment of point-based observations, while handling a spatially clustered covariate and addressing distance bias. This paper is the first work to demonstrate social contagion as a likely influence for DSH at small distance scales within an urban centre. Whether this relationship can be generalised across different communities will require further studies of DSH in other urban environments. In addition, the relationship between covariates, clustering and health measures needs to be examined in more detail. It will therefore be important to confirm the utility of this approach in other urban settings using different outcome measures and covariates.
Abbreviations
 CI:

Confidence interval
 CSR:

Complete spatial randomness
 DSH:

Deliberate self-harm
 NZDep:

New Zealand Deprivation Index
References
 1.
Bailey T, Gatrell A. Interactive Spatial Data Analysis. Essex: Person Education Limited; 1995.
 2.
Anselin L, Rey S, editors. Perspectives on Spatial Data Analysis. Advances in Spatial Science: Springer-Verlag Berlin Heidelberg; 2010.
 3.
Helbich M, Arsanjani JJ. Spatial eigenvector filtering for spatiotemporal crime mapping and spatial crime analysis. Cartogr Geogr Inf Sci. 2015;42(2):134–48.
 4.
Ripley B. Modelling spatial patterns (with discussion). J R Stat Soc Ser B. 1977;39:172–212.
 5.
Diggle P, Chetwynd A. Second-order Analysis of Spatial Clustering for Inhomogeneous Populations. Biometrics. 1991;47(3):1155–63.
 6.
Kelsall J, Diggle P. Spatial variation in risk of disease: a nonparametric binary regression approach. Appl Stat. 1998;47(Part 4):559–73.
 7.
Yamada I, Thill JC. Comparison of planar and network K-functions in traffic accident analysis. J Transp Geogr. 2004;12:149–58.
 8.
Okabe A, Yamada I. The K-Function Method on a Network and Its Computational Implementation. Geogr Anal. 2001;33(3):271–90.
 9.
Hawton K, Fortune S. Suicidal Behavior and Deliberate Self-Harm. In: Rutter M, Bishop D, Pine D, Scott S, Stevenson J, Taylor E, et al., editors. Rutter’s Child and Adolescent Psychiatry. 5th ed. 2008. p. 648–69.
 10.
Gould M, Petrie K, Kleinman MH, Wallenstein S. Clustering of Attempted Suicide: New Zealand National Data. Int J Epidemiol. 1994;23(6):1185–9.
 11.
Baller RD, Richardson KK. Social Integration, Imitation, and the Geographic Patterning of Suicide. Am Sociol Rev. 2002;67:873–88.
 12.
Anselin L, Bera AK, Florax R, Yoon MJ. Simple diagnostic tests for spatial dependence. Reg Sci Urban Econ. 1996;26:77–104.
 13.
Evans E, Hawton K, Rodham K. Factors associated with suicidal phenomena in adolescents: a systematic review of population-based studies. Clin Psychol Rev. 2004;24:957–79.
 14.
Rehkopf DH, Buka SL. The association between suicide and the socioeconomic characteristics of geographical areas: a systematic review. Psychol Med. 2005;36:145–57.
 15.
Hope A. A Simplified Monte Carlo Significance Test Procedure. J R Stat Soc. 1968;30(3):582–98.
 16.
Population Data. Statistics New Zealand. 2013. Available from: http://www.stats.govt.nz/Census. Accessed 25 Jan 2016.
 17.
New Zealand Primary Parcels [database on the Internet] 2011. Available from: https://data.linz.govt.nz. Accessed: 23 Oct 2015
 18.
Salmond S, Crampton P. Development of New Zealand’s Deprivation Index (NZDep) and Its Uptake as a National Policy Tool. Can J Public Health. 2012;103 Suppl 2:S7–S11.
 19.
Cliff AD, Ord JK. Spatial Processes: Models and Applications. London: Pion Ltd.; 1981.
 20.
de Graaf B, Srivastava R, Whigham PA, Baxter J, Glue P. Deliberate Self-Harm in Under-15-Year-Olds: 5 Year National Trends in New Zealand. 2016. Manuscript in preparation.
 21.
Besag J, Diggle P. Simple Monte Carlo tests for Spatial Pattern. J R Stat Soc. 1977;26(3):327–33.
 22.
Duggan JM, Whitlock J. Self-injury Behaviors in Cyber Space. In: Yan Z, editor. Encyclopedia of Cyber Behavior. Information Science Reference, IGI Global; 2012. p. 768–81.
 23.
Hay G, Whigham PA, Kypri K, Langley JD. Neighbourhood deprivation and access to alcohol outlets: a national study. Health Place. 2009;15(4):1086–93.
 24.
Huckle T, Huadau J, Sweetsur P, Huisman O, Cassell S. Density of alcohol outlets and teenage drinking: living in an alcogenic environment is associated with higher consumption in a metropolitan setting. Addiction. 2008;103(10):1614–21.
 25.
Scott D. Multivariate density estimation: theory, practice and visualization. New York: Wiley; 1992.
 26.
Anselin L. Local Indicators of Spatial Association–LISA. Geogr Anal. 1995;27(2):93–115.
Acknowledgements
The authors would like to thank Dr David Eyers, Computer Science Dept., University of Otago, for suggestions regarding distance bias and rotated space.
Funding
No specific funding.
Availability of data and materials
The dataset analysed during the current study is not publicly available because it identifies individuals via a specific home address and date.
Authors’ contributions
PW developed the finite-point clustering method, wrote the “R” code, ran the simulations and drafted the original manuscript. PG organised resources for the project, supervised RS for the data collection, gave statistical and methodological support, and contributed to the manuscript. RS collected the data and gave advice on interpretation. BG assisted with developing the Ripley K method and Minkowski rotation approach and data preparation. PW, PG and BG contributed to the writing of the manuscript. All authors have approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Patient consent was not sought as this was a retrospective audit of data already collected in clinical files. Data were de-identified after collection and before analysis.
Ethics approval and consent to participate
Ethics approval was obtained from the University of Otago Ethics Committee (H13/033). Data collected included demographic, clinical details and residential address.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Whigham, P.A., de Graaf, B., Srivastava, R. et al. Managing distance and covariate information with point-based clustering. BMC Med Res Methodol 16, 115 (2016). https://doi.org/10.1186/s12874-016-0218-z
Keywords
 Deliberate self-harm
 Clustering
 Ripley’s K
 Deprivation
 Social contagion
 Monte Carlo simulation
 Minkowski distance