ST-DBSCAN: An algorithm for clustering spatial–temporal data
Abstract
This paper presents a new density-based clustering algorithm, ST-DBSCAN, which is based on DBSCAN. We propose three marginal extensions to DBSCAN related to the identification of (i) core objects, (ii) noise objects, and (iii) adjacent clusters. In contrast to existing density-based clustering algorithms, our algorithm can discover clusters according to the non-spatial, spatial, and temporal values of the objects. In this paper, we also present a spatial–temporal data warehouse system designed for storing and clustering a wide range of spatial–temporal data. We show an implementation of our algorithm using this data warehouse and present the data mining results.
Keywords:
Data mining; Cluster analysis;
Spatial–temporal data; Cluster visualization; Algorithms
1. Introduction
Clustering is one of the major data mining methods for knowledge discovery in large databases. It is the process of grouping large data sets according to their similarity. Cluster analysis is a major tool in many areas of engineering and scientific applications, including data segmentation, discretization of continuous attributes, data reduction, outlier detection, noise filtering, pattern recognition, and image processing. In the field of Knowledge Discovery in Databases (KDD), cluster analysis is known as an unsupervised learning process, since there is no a priori knowledge about the data set.
Most studies in KDD [12] focus on discovering clusters from ordinary data (non-spatial and non-temporal data), so they are impractical for clustering spatial–temporal data. Spatial–temporal data refers to data that is stored as temporal slices of a spatial dataset. Knowledge discovery from spatial–temporal data is a very promising subfield of data mining, because increasingly large volumes of such data are collected and need to be analyzed. The knowledge discovery process for spatial–temporal data is more complex than for non-spatial and non-temporal data, because spatial–temporal clustering algorithms have to consider the spatial and temporal neighbors of objects in order to extract useful knowledge. Clustering algorithms designed for spatial–temporal data can be used in many applications such as geographic information systems, medical imaging, and weather forecasting. This paper presents a new density-based clustering algorithm, ST-DBSCAN, which is based on the algorithm DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [5]. In DBSCAN, the density associated with a point is obtained by counting the number of points in a region of specified radius around the point. Points with a density above a specified threshold are grouped into clusters. Among the existing clustering algorithms, we have chosen DBSCAN because it can discover clusters of arbitrary shape, such as linear, concave, oval, etc. Furthermore, in contrast to some clustering algorithms, it does not require the number of clusters to be determined in advance. DBSCAN has also proven its ability to process very large databases [3,5,6]. We have improved the DBSCAN
algorithm in three important directions. First, unlike existing density-based clustering algorithms, our algorithm can cluster spatial–temporal data according to its non-spatial, spatial, and temporal attributes. Second, DBSCAN cannot detect some noise points when clusters of different densities exist; our algorithm solves this problem by assigning a density factor to each cluster. Third, if the non-spatial values of neighboring objects differ only slightly and the clusters are adjacent to each other, the values of border objects in one cluster may be very different from the values of border objects on the opposite side. Our algorithm solves this problem by comparing the average value of a cluster with the value of each newly coming object. In addition to the new clustering algorithm, this paper also presents a spatial data warehouse system designed for storing and clustering a wide range of spatial–temporal data. Environmental data from a variety of sources were integrated as coverages, grids, shape files, and tables. Special functions were developed for data integration, data conversion, visualization, analysis, and management. User-friendly interfaces were also developed, allowing relatively inexperienced users to operate the system. In order to demonstrate the applicability of our algorithm to real-world problems, we applied it to the data warehouse, and then presented and discussed the data mining results.
Spatial–temporal data is indexed and retrieved according to spatial and time dimensions. A time period attached to the spatial data expresses when it was valid or stored in the database. A temporal database may support valid time, transaction time, or both. Valid time denotes the time period during which a fact is true with respect to the real world. Transaction time is the time period during which a fact is stored in the database. This study focuses on the valid time aspect of temporal data.
The rest of the paper is organized as follows. Section 2 summarizes the existing clustering algorithms and gives basic concepts of density-based clustering algorithms. Section 3 describes the drawbacks of existing density-based clustering algorithms and our efforts to overcome these problems. Section 4 explains our algorithm in detail and presents its performance. Section 5 presents three applications which are implemented to demonstrate the applicability of the algorithm to real-world problems, and shows and discusses the data mining results. Finally, a conclusion and some directions for future work are given in Section 6.
2. Related works and basic concepts
This section summarizes and discusses the existing clustering algorithms and then gives basic concepts of density-based algorithms.
2.1. Density-based clustering
The problem of clustering can be
defined as follows:
Definition 1. Given a database of n data objects D = {o1, o2, ..., on}, the process of partitioning D into C = {C1, C2, ..., Ck} based on a certain similarity measure is called clustering. The Ci's are called clusters, where Ci ⊆ D (i = 1, 2, ..., k), ⋂_{i=1..k} Ci = ∅, and ⋃_{i=1..k} Ci = D.
Clustering algorithms can be categorized into five main types [13]: Partitional, Hierarchical, Grid-based, Model-based, and Density-based clustering algorithms. In Partitional algorithms, cluster similarity is measured with respect to the mean value of the objects in a cluster, its center of gravity (K-Means [19]), or each cluster is represented by one of its objects located near its center (K-Medoid [26]). K is an input parameter for these algorithms; unfortunately, it is not available in many applications. CLARANS [20] is an improved version of the K-Medoid algorithm. Hierarchical algorithms such as CURE [9] and BIRCH [31] produce a set of nested clusters organized as a hierarchical tree, where each node of the tree represents a cluster of a database D. Grid-based algorithms such as STING [28] and WaveCluster [23] are based on a multiple-level grid structure on which all clustering operations are performed. In Model-based algorithms (COBWEB [8], etc.), a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to that model. They are often based on the assumption that the data are generated by a mixture of underlying probability distributions.
The density-based notion is a common approach for clustering. Density-based clustering algorithms are based on the idea that objects which form a dense region should be grouped together into one cluster. They use a fixed threshold value to determine dense regions, and search for regions of high density in a feature space that are separated by regions of lower density.
Density-based clustering algorithms such as DBSCAN [5], OPTICS [2], DENCLUE [15], and CURD [18] are to some extent capable of clustering databases [21]. One drawback of these algorithms is that they capture only certain kinds of noise points when clusters of different densities exist. Furthermore, they are adequate if the clusters are distant from each other, but not satisfactory when clusters are adjacent to each other. A detailed description of these problems and our solutions is given in Section 3.
In our study, we have chosen the DBSCAN algorithm because it can discover clusters of arbitrary shape, such as linear, concave, oval, etc. Furthermore, in contrast to some clustering algorithms, it does not require the number of clusters to be determined in advance. DBSCAN has also proven its ability to process very large databases [3,6].
In the literature, the DBSCAN algorithm has been used in many studies. For example, the other popular density-based algorithm OPTICS (Ordering Points To Identify the Clustering Structure) [2] is based on the concepts of the DBSCAN algorithm and identifies nested clusters and the structure of clusters. The Incremental DBSCAN [7] algorithm is also based on DBSCAN and is used for incremental updates of a clustering after the insertion of a new object into the database or the deletion of an existing object from the database. Based on the formal notion of clusters, the incremental algorithm yields the same result as the non-incremental DBSCAN algorithm. The SDBDC (Scalable Density-Based Distributed Clustering) [16] method also uses the DBSCAN algorithm on both local sites and the global site to cluster distributed objects. In this method, DBSCAN is first carried out on each local site. Based on these local clustering results, cluster representatives are determined; then, based on these local representatives, the standard DBSCAN algorithm is carried out on the global site to construct the distributed clustering. This study proposes the usage of different Eps-values for each local representative. Wen et al. [29] adopted DBSCAN and Incremental DBSCAN as the core algorithms of their query clustering tool; they used DBSCAN to cluster frequently asked questions and the most popular topics on a search engine. Spieth et al. [24] applied DBSCAN to identify solutions for the inference of regulatory networks. Finally, the SNN density-based clustering algorithm [25] is also based on DBSCAN and is applicable to high-dimensional data, such as time series of atmospheric pressure at various points on the earth.
2.2. Basic concepts
DBSCAN is designed to discover arbitrary-shaped clusters in any database D and, at the same time, to distinguish noise points. More specifically, DBSCAN accepts a radius value Eps (ε) based on a user-defined distance measure, and a value MinPts for the minimum number of points that should occur within the Eps radius. Some concepts and terms used to explain the DBSCAN algorithm can be defined as follows [5].
Definition 2 (Neighborhood). It is determined by a distance function (e.g., Manhattan distance, Euclidean distance) for two points p and q, denoted by dist(p, q).
Definition 3 (Eps-neighborhood). The Eps-neighborhood of a point p is defined by {q ∈ D ∣ dist(p, q) ≤ Eps}.
Definition 4 (Core object). A core object is a point whose neighborhood of a given radius (Eps) contains at least a minimum number (MinPts) of other points (Fig. 1c).
Definition 5 (Directly density-reachable). An object p is directly density-reachable from an object q if p is within the Eps-neighborhood of q and q is a core object.
Definition 6 (Density-reachable). An object p is density-reachable from an object q with respect to Eps and MinPts if there is a chain of objects p1, ..., pn with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi with respect to Eps and MinPts, for 1 ≤ i < n, pi ∈ D (Fig. 1a).
Definition 7 (Density-connected). An object p is density-connected to an object q with respect to Eps and MinPts if there is an object o ∈ D such that both p and q are density-reachable from o with respect to Eps and MinPts (Fig. 1b).
Definition 8 (Density-based cluster). A cluster C is a non-empty subset of D satisfying the following "maximality" and "connectivity" requirements: (1) ∀p, q: if q ∈ C and p is density-reachable from q with respect to Eps and MinPts, then p ∈ C. (2) ∀p, q ∈ C: p is density-connected to q with respect to Eps and MinPts.
Definition 9 (Border object). An object p is a border object if it is not a core object but is density-reachable from another core object.
The algorithm starts with the first point p in database D and retrieves all neighbors of point p within Eps distance. If the total number of these neighbors is greater than MinPts, i.e., if p is a core object, a new cluster is created. The point p and its neighbors are assigned to this new cluster. The algorithm then iteratively collects the neighbors within Eps distance of the core points. The process is repeated until all of the points have been processed.
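The expansion loop just described can be sketched as follows. This is a minimal illustrative implementation, not the authors' code; the names `dbscan` and `region_query` are ours:

```python
import math

def region_query(points, p, eps):
    """Indices of all points within Eps distance of points[p] (its Eps-neighborhood)."""
    return [q for q in range(len(points))
            if math.dist(points[p], points[q]) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    UNSEEN, NOISE = None, -1
    labels = [UNSEEN] * len(points)
    cluster = 0
    for p in range(len(points)):
        if labels[p] is not UNSEEN:
            continue
        neighbors = region_query(points, p, eps)
        if len(neighbors) < min_pts:       # p is not a core object
            labels[p] = NOISE
            continue
        cluster += 1                        # p is a core object: start a new cluster
        labels[p] = cluster
        seeds = list(neighbors)
        while seeds:                        # iteratively collect density-reachable points
            q = seeds.pop()
            if labels[q] == NOISE:          # border object reached from a core object
                labels[q] = cluster
            if labels[q] is not UNSEEN:
                continue
            labels[q] = cluster
            q_neighbors = region_query(points, q, eps)
            if len(q_neighbors) >= min_pts: # expand only from core objects
                seeds.extend(q_neighbors)
    return labels

# Four nearby points form one cluster; (5, 5) is left isolated as noise.
labels = dbscan([(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)], eps=1.5, min_pts=3)
print(labels)  # [1, 1, 1, 1, -1]
```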
3. Problems of existing approaches
3.1. Problem of clustering spatial–temporal data
In order to determine whether a set of points is similar enough to be considered a cluster or not, we need a distance measure dist(i, j) that tells how far apart points i and j are. The most common distance measures used are the Manhattan, Euclidean, and Minkowski distances. Euclidean distance is defined as in Eq. (1):
(1) dist(i, j) = sqrt((xi1 − xj1)² + (xi2 − xj2)² + … + (xin − xjn)²)
where i = (xi1, xi2, …, xin) and j = (xj1, xj2, …, xjn) are two n-dimensional data objects. For example, the Euclidean distance between the two data objects A(1, 2) and B(5, 3) is 4.12.
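As a quick check of Eq. (1) and the example above, the computation can be written directly; the helper name `euclidean` is ours:

```python
import math

def euclidean(i, j):
    """Euclidean distance between two n-dimensional points, as in Eq. (1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(i, j)))

# The example from the text: A(1, 2) and B(5, 3).
print(round(euclidean((1, 2), (5, 3)), 2))  # 4.12 (= sqrt(17))
```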
The DBSCAN algorithm uses only one distance parameter, Eps, to measure the similarity of spatial data. In order to support two-dimensional spatial data together with non-spatial values, we propose two distance metrics, Eps1 and Eps2, and define similarity by a conjunction of two density tests. Eps1 is used for spatial values to measure the closeness of two points geographically. Eps2 is used to measure the similarity of non-spatial values. For example, let A(x1, y1) and B(x2, y2) be two points (spatial values), and let t1, t2 (DayTimeTemperature, NightTimeTemperature) and t3, t4 be the four temperature values of these points, respectively (non-spatial values). In this example, Eps1 is used to measure the closeness of the two points geographically, while Eps2 is used to measure the similarity of the temperature values. If A(x1, y1, t1, t2) and B(x2, y2, t3, t4) are two points, Eps1 and Eps2 are calculated by the formulas in Eq. (2):
(2) Eps1 = sqrt((x1 − x2)² + (y1 − y2)²)
Eps2 = sqrt((t1 − t3)² + (t2 − t4)²)
In order to support temporal aspects, spatial–temporal data is first filtered by retaining only the temporal neighbors and their corresponding spatial values. Two objects are temporal neighbors if their values are observed in consecutive time units, such as consecutive days in the same year or the same day in consecutive years.
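The conjunction of the two density tests can be sketched as follows; the function name, the point layout `(x, y, non-spatial values...)`, and the sample values are our assumptions, not the paper's:

```python
import math

def st_neighbors(points, p, eps1, eps2):
    """Indices of points that pass BOTH density tests: within Eps1 of points[p]
    spatially and within Eps2 in their non-spatial (e.g., temperature) values.
    Each point is laid out as (x, y, *non_spatial_values)."""
    result = []
    for q in range(len(points)):
        spatial = math.dist(points[p][:2], points[q][:2])
        non_spatial = math.dist(points[p][2:], points[q][2:])
        if spatial <= eps1 and non_spatial <= eps2:
            result.append(q)
    return result

# A(x1, y1, t1, t2) and B(x2, y2, t3, t4) as in the text; the values are ours.
points = [(0.0, 0.0, 20.0, 10.0),   # A: day 20 deg, night 10 deg
          (3.0, 4.0, 21.0, 11.0),   # B: 5.0 away spatially, ~1.41 away thermally
          (0.5, 0.5, 30.0, 25.0)]   # close in space, far in temperature
print(st_neighbors(points, 0, eps1=6.0, eps2=2.0))  # [0, 1]
```

The third point is spatially closer to A than B is, yet it is excluded: with a single Eps it would be a neighbor, which is exactly what the second density test prevents.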
3.2. Problem of identifying noise objects
From the viewpoint of a clustering algorithm, noise is a set of objects not located in clusters of a database. More formally, noise can be defined as follows:
Definition 10 (Noise). Let C1, …, Ck be the clusters of the database D. Then noise is the set of points in D not belonging to any cluster Ci, where i = 1, …, k, i.e., noise = {p ∈ D ∣ ∀i: p ∉ Ci}.
Existing
density-based clustering algorithms [14] produce meaningful and adequate
results under certain conditions, but their results are not satisfactory when
clusters of different densities exist. To illustrate, consider the example
given in Fig. 2.
Fig. 2. Clusters of different densities.
This is a
simple dataset containing 52 objects. There are 25 objects in the first cluster C1, 25 objects in
the second cluster C2,
and two additional noise objects o1 and o2.
In this example, C2 forms a denser cluster than C1; in other words, the densities of the two clusters differ. The DBSCAN algorithm identifies only one noise object, o1, because for approximately every object p in C1, the distance between p and its nearest neighbor is greater than the distance between o2 and C2. For this reason, we cannot determine an appropriate value for the input parameter Eps. If the Eps value is less than the distance between o2 and C2, some objects in C1 are assigned as noise objects. If the Eps value is greater than the distance between o2 and C2, the object o2 is not assigned as a noise object.
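The dilemma above can be checked numerically on a toy dataset; the coordinates below are illustrative and are not taken from Fig. 2:

```python
import math

# A sparse cluster C1 (spacing 2.0), a dense cluster C2 (spacing 0.3),
# and a noise object o2 sitting 1.0 away from C2.
c1 = [(0.0, 2.0 * i) for i in range(5)]
c2 = [(10.0, 0.3 * i) for i in range(5)]
o2 = (11.0, 0.0)

def nearest_neighbor_dist(p, others):
    return min(math.dist(p, q) for q in others if q != p)

sparse_nn = min(nearest_neighbor_dist(p, c1) for p in c1)   # 2.0
o2_to_c2 = min(math.dist(o2, q) for q in c2)                # 1.0

# Any Eps < 1.0 leaves o2 as noise but also disconnects every point of C1;
# any Eps >= 1.0 keeps C1 intact but pulls o2 within reach of C2.
print(sparse_nn > o2_to_c2)  # True: no single Eps handles both clusters
```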
4. ST-DBSCAN algorithm
4.1. The description of the algorithm
While the DBSCAN algorithm needs two inputs, our algorithm ST-DBSCAN requires four parameters, Eps1, Eps2, MinPts, and Δε, because of the extensions described in Section 3. Eps1 is the distance parameter for spatial attributes (latitude and longitude). Eps2 is the distance parameter for non-spatial attributes. A distance metric such as the Euclidean, Manhattan, or Minkowski distance metric can be used for Eps1 and Eps2. MinPts is the minimum number of points within Eps1 and Eps2 distance of a point: if a region is dense, it should contain more points than the MinPts value. In [5], a simple heuristic is presented which is effective in many cases for determining the parameters Eps and MinPts. The heuristic suggests MinPts ≈ ln(n), where n is the size of the database, and Eps must be picked depending on the value of MinPts. The first step of the heuristic is to determine the distances to the k-nearest neighbors for each object, where k is equal to MinPts. These k-distance values are then sorted in descending order, and the threshold point is determined as the first "valley" of the sorted graph. Eps should be selected to be less than the distance defined by this first valley. The last parameter, Δε, is used to prevent clusters from being combined due to small differences in the non-spatial values of neighboring locations.
The algorithm starts with the first point p in database D and retrieves all points density-reachable from p with respect to Eps1 and Eps2. If p is a core object (see Definition 4), a cluster is formed. If p is a border object (see Definition 9), no points are density-reachable from p and the algorithm visits the next point of the database. The process is repeated until all of the points have been processed.
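The main loop just described can be sketched as follows. This is our simplified reading, not the published pseudocode: it applies the conjunction of the two density tests, but omits the Δε average-value comparison and the per-cluster density factor, and all names are ours:

```python
import math

def retrieve_neighbors(points, p, eps1, eps2):
    """Candidates directly reachable from p: within Eps1 spatially and
    within Eps2 in the non-spatial value. Each point is (x, y, value)."""
    return [q for q in range(len(points))
            if math.dist(points[p][:2], points[q][:2]) <= eps1
            and abs(points[p][2] - points[q][2]) <= eps2]

def st_dbscan(points, eps1, eps2, min_pts):
    """Simplified ST-DBSCAN main loop; -1 marks noise. The published algorithm
    additionally compares candidates against the cluster average using a
    delta-epsilon threshold, which is omitted here for brevity."""
    labels = [None] * len(points)
    cluster = 0
    for p in range(len(points)):
        if labels[p] is not None:
            continue
        neighbors = retrieve_neighbors(points, p, eps1, eps2)
        if len(neighbors) < min_pts:        # noise, or a border object visited later
            labels[p] = -1
            continue
        cluster += 1                         # p is a core object: form a cluster
        for q in neighbors:
            labels[q] = cluster
        seeds = list(neighbors)
        while seeds:
            q = seeds.pop()
            q_neighbors = retrieve_neighbors(points, q, eps1, eps2)
            if len(q_neighbors) >= min_pts:  # expand only from core objects
                for r in q_neighbors:
                    if labels[r] is None:    # unseen: absorb and keep expanding
                        labels[r] = cluster
                        seeds.append(r)
                    elif labels[r] == -1:    # border object: absorb, do not expand
                        labels[r] = cluster
    return labels
```

On points laid out as (x, y, temperature), two spatially separate groups with internally similar temperatures form two clusters, while a point that is spatially close but thermally distant is left as noise.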
4.2. Performance evaluation
The average runtime complexity of the DBSCAN algorithm is O(n log n), where n is the number of objects in the database. Our modifications do not change the runtime complexity of the algorithm. DBSCAN has proven its ability to process very large databases: the paper [6] shows that the runtime of other clustering algorithms such as CLARANS [20] and DBCLASD [30] is between 1.5 and 3 times the runtime of DBSCAN, and this factor increases with the size of the database.
6. Conclusions and future work
Clustering is a major method in many areas, including data mining and knowledge discovery, statistics, and machine learning. This study presents a new density-based clustering algorithm, ST-DBSCAN, which is constructed by modifying the DBSCAN algorithm. The first modification makes it possible to discover clusters in spatial–temporal data. The second modification is necessary to find noise objects when clusters of different densities exist; for this we introduce a new concept, the density factor, which is assigned to each cluster and expresses the degree of the density of the cluster. The third modification provides a comparison of the average value of a cluster with the value of each newly coming object. In order to demonstrate the applicability of our algorithm to real-world problems, we present an application using a spatial–temporal data warehouse. Experimental results demonstrate that our modifications are very promising for clustering spatial–temporal data.
Very large databases need extreme computing power. In future studies, we intend to run the algorithm in parallel in order to improve performance. In addition, more useful heuristics may be found to determine the input parameters Eps and MinPts.
References
[1] T. Abraham, J.F. Roddick, Survey of spatio-temporal databases, GeoInformatica 3 (1) (1999), pp. 61–99.
[2] M. Ankerst, M.M. Breunig, H.-P. Kriegel, J. Sander, OPTICS: Ordering points to identify the clustering structure, in: Proceedings of ACM SIGMOD International Conference on Management of Data, Philadelphia, PA, 1999, pp. 49–60.
[3] Z. Aoying, Z. Shuigeng, Approaches for scaling DBSCAN algorithm to large spatial databases, Journal of Computer Science and Technology 15 (6) (2000), pp. 509–526.
[4] C. Böhm, S. Berchtold, H.-P. Kriegel, U. Michel, Multidimensional index structures in relational databases, Journal of Intelligent Information Systems (JIIS) 15 (1) (2000), pp. 51–70.
[5] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proceedings of Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, 1996, pp. 226–231.
[6] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, Clustering for mining in large spatial databases, KI-Journal (Artificial Intelligence), Special Issue on Data Mining, 12 (1) (1998), pp. 18–24.
[7] M. Ester, H.-P. Kriegel, J. Sander, M. Wimmer, X. Xu, Incremental clustering for mining in a data warehousing environment, in: Proceedings of International Conference on Very Large Databases (VLDB'98), New York, USA, 1998, pp. 323–333.
[8] D. Fisher, Knowledge acquisition via incremental conceptual clustering, Machine Learning 2 (2) (1987), pp. 139–172.