ST-DBSCAN: An algorithm for clustering spatial–temporal data
Abstract
This paper presents a new density-based clustering algorithm, ST-DBSCAN, which is based on DBSCAN. We propose three marginal extensions to DBSCAN related to the identification of (i) core objects, (ii) noise objects, and (iii) adjacent clusters. In contrast to existing density-based clustering algorithms, our algorithm can discover clusters according to the non-spatial, spatial, and temporal values of the objects. In this paper, we also present a spatial–temporal data warehouse system designed for storing and clustering a wide range of spatial–temporal data. We show an implementation of our algorithm using this data warehouse and present the data mining results.
Keywords:
Data mining; Cluster analysis;
Spatial–temporal data; Cluster visualization; Algorithms
1. Introduction
Clustering is one of the major data mining methods for knowledge discovery in large databases. It is the process of grouping large data sets according to their similarity. Cluster analysis is a major tool in many areas of engineering and scientific applications, including data segmentation, discretization of continuous attributes, data reduction, outlier detection, noise filtering, pattern recognition, and image processing. In the field of Knowledge Discovery in Databases (KDD), cluster analysis is known as an unsupervised learning process, since there is no a priori knowledge about the data set.
Most studies in KDD [12] focus on discovering clusters from ordinary data (non-spatial and non-temporal data), so they are impractical for clustering spatial–temporal data. Spatial–temporal data refers to data that is stored as temporal slices of a spatial dataset. Knowledge discovery from spatial–temporal data is a very promising subfield of data mining, because increasingly large volumes of such data are collected and need to be analyzed. The knowledge discovery process for spatial–temporal data is more complex than for non-spatial and non-temporal data, because spatial–temporal clustering algorithms have to consider the spatial and temporal neighbors of objects in order to extract useful knowledge. Clustering algorithms designed for spatial–temporal data can be used in many applications such as geographic information systems, medical imaging, and weather forecasting. This paper presents a new density-based clustering algorithm, ST-DBSCAN, which is based on the algorithm DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [5]. In DBSCAN, the density associated with a point is obtained by counting the number of points in a region of specified radius around the point. Points with a density above a specified threshold are grouped into clusters. Among the existing clustering algorithms, we have chosen DBSCAN because it can discover clusters of arbitrary shape, such as linear, concave, oval, etc. Furthermore, in contrast to some clustering algorithms, it does not require the number of clusters to be determined in advance. DBSCAN has also proven its ability to process very large databases [3,5,6]. We have improved the DBSCAN
algorithm in three important directions. First, unlike existing density-based clustering algorithms, our algorithm can cluster spatial–temporal data according to its non-spatial, spatial, and temporal attributes. Second, DBSCAN cannot detect some noise points when clusters of different densities exist; our algorithm solves this problem by assigning a density factor to each cluster. Third, if the non-spatial values of neighboring objects differ only slightly and the clusters are adjacent to each other, the values of border objects in one cluster may be very different from the values of border objects on the opposite side. Our algorithm solves this problem by comparing the average value of a cluster with the value of each newly coming object. In addition to the new clustering algorithm, this paper also presents a spatial data warehouse system designed for storing and clustering a wide range of spatial–temporal data. Environmental data from a variety of sources were integrated as coverages, grids, shape files, and tables. Special functions were developed for data integration, data conversion, visualization, analysis, and management. User-friendly interfaces were also developed, allowing relatively inexperienced users to operate the system. In order to demonstrate the applicability of our algorithm to real-world problems, we applied it to the data warehouse, and then presented and discussed the data mining results.
Spatial–temporal data is indexed and retrieved according to spatial and time dimensions. A time period attached to the spatial data expresses when it was valid or stored in the database. A temporal database may support valid time, transaction time, or both. Valid time denotes the time period during which a fact is true with respect to the real world. Transaction time is the time period during which a fact is stored in the database. This study focuses on the valid time aspect of temporal data.
The rest of the paper is organized as follows. Section 2 summarizes the existing clustering algorithms and gives basic concepts of density-based clustering algorithms. Section 3 describes the drawbacks of existing density-based clustering algorithms and our efforts to overcome these problems. Section 4 explains our algorithm in detail and presents its performance. Section 5 presents three applications which are implemented to demonstrate the applicability of the algorithm to real-world problems, and shows and discusses the data mining results. Finally, a conclusion and some directions for future work are given in Section 6.
2. Related works and basic concepts
This section summarizes and discusses the existing clustering algorithms and then gives basic concepts of density-based algorithms.
2.1. Density-based clustering
The problem of clustering can be
defined as follows:
Definition 1. Given a database of n data objects D = {o1, o2, ..., on}, the process of partitioning D into C = {C1, C2, ..., Ck} based on a certain similarity measure is called clustering. The Ci's are called clusters, where Ci ⊆ D (i = 1, 2, ..., k), ⋂_{i=1..k} Ci = ∅, and ⋃_{i=1..k} Ci = D.
Clustering algorithms can be categorized into five main types [13]: Partitional, Hierarchical, Grid-based, Model-based, and Density-based clustering algorithms. In Partitional algorithms, cluster similarity is measured with respect to the mean value of the objects in a cluster, its center of gravity (K-Means [19]), or each cluster is represented by one of its objects located near its center (K-Medoid [26]). K is an input parameter for these algorithms; unfortunately, it is not available in many applications. CLARANS [20] is an improved version of the K-Medoid algorithm. Hierarchical algorithms such as CURE [9] and BIRCH [31] produce a set of nested clusters organized as a hierarchical tree, where each node of the tree represents a cluster of a database D. Grid-based algorithms such as STING [28] and WaveCluster [23] are based on a multiple-level grid structure on which all clustering operations are performed. In Model-based algorithms (COBWEB [8], etc.), a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to that model. They are often based on the assumption that the data are generated by a mixture of underlying probability distributions.
The density-based notion is a common approach for clustering. Density-based clustering algorithms are based on the idea that objects which form a dense region should be grouped together into one cluster. They use a fixed threshold value to determine dense regions, and search for regions of high density in a feature space that are separated by regions of lower density.
Density-based clustering algorithms such as DBSCAN [5], OPTICS [2], DENCLUE [15], and CURD [18] are to some extent capable of clustering databases [21]. One drawback of these algorithms is that they capture only certain kinds of noise points when clusters of different densities exist. Furthermore, they are adequate if the clusters are distant from each other, but not satisfactory when clusters are adjacent to each other. A detailed description of these problems and our solutions is given in Section 3.
In our study, we have chosen the DBSCAN algorithm because it can discover clusters of arbitrary shape, such as linear, concave, oval, etc. Furthermore, in contrast to some clustering algorithms, it does not require the number of clusters to be determined in advance. DBSCAN has also proven its ability to process very large databases [3,6].
In the literature, the DBSCAN algorithm has been used in many studies. For example, the other popular density-based algorithm OPTICS (Ordering Points To Identify the Clustering Structure) [2] is based on the concepts of the DBSCAN algorithm and identifies nested clusters and the structure of clusters. The Incremental DBSCAN [7] algorithm is also based on DBSCAN and is used for incremental updates of a clustering after the insertion of a new object into the database or the deletion of an existing object from the database. Based on the formal notion of clusters, the incremental algorithm yields the same result as the non-incremental DBSCAN algorithm. The SDBDC (Scalable Density-Based Distributed Clustering) [16] method also uses the DBSCAN algorithm on both local sites and the global site to cluster distributed objects. In this method, DBSCAN is first carried out on each local site. Based on these local clustering results, cluster representatives are determined; then, based on these local representatives, the standard DBSCAN algorithm is carried out on the global site to construct the distributed clustering. This study proposes the usage of different Eps-values for each local representative. Wen et al. [29] adopted DBSCAN and Incremental DBSCAN as the core algorithms of their query clustering tool; they used DBSCAN to cluster frequently asked questions and the most popular topics on a search engine. Spieth et al. [24] applied DBSCAN to identify solutions for the inference of regulatory networks. Finally, the SNN density-based clustering algorithm [25] is also based on DBSCAN and is applicable to high-dimensional data, such as time series of atmospheric pressure at various points on the earth.
2.2. Basic concepts
DBSCAN is designed to discover arbitrary-shaped clusters in any database D and, at the same time, to distinguish noise points. More specifically, DBSCAN accepts a radius value Eps (ε) based on a user-defined distance measure, and a value MinPts for the minimum number of points that should occur within the Eps radius. Some concepts and terms used to explain the DBSCAN algorithm can be defined as follows [5].
Definition 2 (Neighborhood). It is determined by a distance function (e.g., Manhattan distance, Euclidean distance) for two points p and q, denoted by dist(p, q).
Definition 3 (Eps-neighborhood). The Eps-neighborhood of a point p is defined by {q ∈ D ∣ dist(p, q) ≤ Eps}.
Definition 4 (Core object). A core object is a point whose neighborhood of a given radius (Eps) contains at least a minimum number (MinPts) of other points (Fig. 1c).
Definition 5 (Directly density-reachable). An object p is directly density-reachable from an object q if p is within the Eps-neighborhood of q and q is a core object.
Definition 6 (Density-reachable). An object p is density-reachable from an object q with respect to Eps and MinPts if there is a chain of objects p1, ..., pn with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi with respect to Eps and MinPts, for 1 ≤ i < n, pi ∈ D (Fig. 1a).
Definition 7 (Density-connected). An object p is density-connected to an object q with respect to Eps and MinPts if there is an object o ∈ D such that both p and q are density-reachable from o with respect to Eps and MinPts (Fig. 1b).
Definition 8 (Density-based cluster). A cluster C is a non-empty subset of D satisfying the following "maximality" and "connectivity" requirements: (1) ∀p, q: if q ∈ C and p is density-reachable from q with respect to Eps and MinPts, then p ∈ C. (2) ∀p, q ∈ C: p is density-connected to q with respect to Eps and MinPts.
Definition 9 (Border object). An object p is a border object if it is not a core object but is density-reachable from another core object.
The algorithm starts with the first point p in database D and retrieves all neighbors of point p within Eps distance. If the total number of these neighbors is greater than MinPts, i.e., if p is a core object, a new cluster is created. The point p and its neighbors are assigned to this new cluster. The algorithm then iteratively collects the neighbors within Eps distance of the core points. The process is repeated until all of the points have been processed.
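The expansion loop just described can be sketched as follows. This is a minimal illustrative implementation, not the authors' code; the names `dbscan` and `region_query` are ours:

```python
import math

def region_query(points, p, eps):
    """Indices of all points within Eps distance of points[p] (its Eps-neighborhood)."""
    return [q for q in range(len(points))
            if math.dist(points[p], points[q]) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    UNSEEN, NOISE = None, -1
    labels = [UNSEEN] * len(points)
    cluster = 0
    for p in range(len(points)):
        if labels[p] is not UNSEEN:
            continue
        neighbors = region_query(points, p, eps)
        if len(neighbors) < min_pts:       # p is not a core object
            labels[p] = NOISE
            continue
        cluster += 1                        # p is a core object: start a new cluster
        labels[p] = cluster
        seeds = list(neighbors)
        while seeds:                        # iteratively collect density-reachable points
            q = seeds.pop()
            if labels[q] == NOISE:          # border object reached from a core object
                labels[q] = cluster
            if labels[q] is not UNSEEN:
                continue
            labels[q] = cluster
            q_neighbors = region_query(points, q, eps)
            if len(q_neighbors) >= min_pts: # expand only from core objects
                seeds.extend(q_neighbors)
    return labels

# Four nearby points form one cluster; (5, 5) is left isolated as noise.
labels = dbscan([(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)], eps=1.5, min_pts=3)
print(labels)  # [1, 1, 1, 1, -1]
```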
3. Problems of existing approaches
3.1. Problem of clustering spatial–temporal data
In order to determine whether a set of points is similar enough to be considered a cluster or not, we need a distance measure dist(i, j) that tells how far apart points i and j are. The most common distance measures used are the Manhattan, Euclidean, and Minkowski distances. Euclidean distance is defined as in Eq. (1):
(1) dist(i, j) = sqrt((xi1 − xj1)² + (xi2 − xj2)² + … + (xin − xjn)²)
where i = (xi1, xi2, …, xin) and j = (xj1, xj2, …, xjn) are two n-dimensional data objects. For example, the Euclidean distance between the two data objects A(1, 2) and B(5, 3) is 4.12.
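As a quick check of Eq. (1) and the example above, the computation can be written directly; the helper name `euclidean` is ours:

```python
import math

def euclidean(i, j):
    """Euclidean distance between two n-dimensional points, as in Eq. (1)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(i, j)))

# The example from the text: A(1, 2) and B(5, 3).
print(round(euclidean((1, 2), (5, 3)), 2))  # 4.12 (= sqrt(17))
```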
The DBSCAN algorithm uses only one distance parameter, Eps, to measure the similarity of spatial data. In order to support two-dimensional spatial data together with non-spatial values, we propose two distance metrics, Eps1 and Eps2, and define similarity by a conjunction of two density tests. Eps1 is used for spatial values to measure the closeness of two points geographically. Eps2 is used to measure the similarity of non-spatial values. For example, let A(x1, y1) and B(x2, y2) be two points (spatial values), and let t1, t2 (DayTimeTemperature, NightTimeTemperature) and t3, t4 be the four temperature values of these points, respectively (non-spatial values). In this example, Eps1 is used to measure the closeness of the two points geographically, while Eps2 is used to measure the similarity of the temperature values. If A(x1, y1, t1, t2) and B(x2, y2, t3, t4) are two points, Eps1 and Eps2 are calculated by the formulas in Eq. (2):
(2) Eps1 = sqrt((x1 − x2)² + (y1 − y2)²)
Eps2 = sqrt((t1 − t3)² + (t2 − t4)²)
In order to support temporal aspects, spatial–temporal data is first filtered by retaining only the temporal neighbors and their corresponding spatial values. Two objects are temporal neighbors if their values are observed in consecutive time units, such as consecutive days in the same year or the same day in consecutive years.
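The conjunction of the two density tests can be sketched as follows; the function name, the point layout `(x, y, non-spatial values...)`, and the sample values are our assumptions, not the paper's:

```python
import math

def st_neighbors(points, p, eps1, eps2):
    """Indices of points that pass BOTH density tests: within Eps1 of points[p]
    spatially and within Eps2 in their non-spatial (e.g., temperature) values.
    Each point is laid out as (x, y, *non_spatial_values)."""
    result = []
    for q in range(len(points)):
        spatial = math.dist(points[p][:2], points[q][:2])
        non_spatial = math.dist(points[p][2:], points[q][2:])
        if spatial <= eps1 and non_spatial <= eps2:
            result.append(q)
    return result

# A(x1, y1, t1, t2) and B(x2, y2, t3, t4) as in the text; the values are ours.
points = [(0.0, 0.0, 20.0, 10.0),   # A: day 20 deg, night 10 deg
          (3.0, 4.0, 21.0, 11.0),   # B: 5.0 away spatially, ~1.41 away thermally
          (0.5, 0.5, 30.0, 25.0)]   # close in space, far in temperature
print(st_neighbors(points, 0, eps1=6.0, eps2=2.0))  # [0, 1]
```

The third point is spatially closer to A than B is, yet it is excluded: with a single Eps it would be a neighbor, which is exactly what the second density test prevents.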
3.2. Problem of identifying noise objects
From the viewpoint of a clustering algorithm, noise is a set of objects not located in clusters of a database. More formally, noise can be defined as follows:
Definition 10 (Noise). Let C1, …, Ck be the clusters of the database D. Then noise is the set of points in D not belonging to any cluster Ci, where i = 1, …, k, i.e., noise = {p ∈ D ∣ ∀i: p ∉ Ci}.
Existing
density-based clustering algorithms [14] produce meaningful and adequate
results under certain conditions, but their results are not satisfactory when
clusters of different densities exist. To illustrate, consider the example
given in Fig. 2.
Fig. 2. Clusters of different densities.
This is a
simple dataset containing 52 objects. There are 25 objects in the first cluster C1, 25 objects in
the second cluster C2,
and two additional noise objects o1 and o2.
In this example, C2 forms a denser cluster than C1; in other words, the densities of the two clusters differ. The DBSCAN algorithm identifies only one noise object, o1, because for approximately every object p in C1, the distance between p and its nearest neighbor is greater than the distance between o2 and C2. For this reason, we cannot determine an appropriate value for the input parameter Eps. If the Eps value is less than the distance between o2 and C2, some objects in C1 are assigned as noise objects. If the Eps value is greater than the distance between o2 and C2, the object o2 is not assigned as a noise object.
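The dilemma above can be checked numerically on a toy dataset; the coordinates below are illustrative and are not taken from Fig. 2:

```python
import math

# A sparse cluster C1 (spacing 2.0), a dense cluster C2 (spacing 0.3),
# and a noise object o2 sitting 1.0 away from C2.
c1 = [(0.0, 2.0 * i) for i in range(5)]
c2 = [(10.0, 0.3 * i) for i in range(5)]
o2 = (11.0, 0.0)

def nearest_neighbor_dist(p, others):
    return min(math.dist(p, q) for q in others if q != p)

sparse_nn = min(nearest_neighbor_dist(p, c1) for p in c1)   # 2.0
o2_to_c2 = min(math.dist(o2, q) for q in c2)                # 1.0

# Any Eps < 1.0 leaves o2 as noise but also disconnects every point of C1;
# any Eps >= 1.0 keeps C1 intact but pulls o2 within reach of C2.
print(sparse_nn > o2_to_c2)  # True: no single Eps handles both clusters
```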
4. ST-DBSCAN algorithm
4.1. The description of the algorithm
While the DBSCAN algorithm needs two inputs, our algorithm ST-DBSCAN requires four parameters, Eps1, Eps2, MinPts, and Δε, because of the extensions described in Section 3. Eps1 is the distance parameter for spatial attributes (latitude and longitude). Eps2 is the distance parameter for non-spatial attributes. A distance metric such as the Euclidean, Manhattan, or Minkowski distance metric can be used for Eps1 and Eps2. MinPts is the minimum number of points within Eps1 and Eps2 distance of a point: if a region is dense, it should contain more points than the MinPts value. In [5], a simple heuristic is presented which is effective in many cases for determining the parameters Eps and MinPts. The heuristic suggests MinPts ≈ ln(n), where n is the size of the database, and Eps must be picked depending on the value of MinPts. The first step of the heuristic is to determine the distances to the k-nearest neighbors for each object, where k is equal to MinPts. These k-distance values are then sorted in descending order, and the threshold point is determined as the first "valley" of the sorted graph. Eps should be selected to be less than the distance defined by this first valley. The last parameter, Δε, is used to prevent clusters from being combined due to small differences in the non-spatial values of neighboring locations.
The algorithm starts with the first point p in database D and retrieves all points density-reachable from p with respect to Eps1 and Eps2. If p is a core object (see Definition 4), a cluster is formed. If p is a border object (see Definition 9), no points are density-reachable from p and the algorithm visits the next point of the database. The process is repeated until all of the points have been processed.
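The main loop just described can be sketched as follows. This is our simplified reading, not the published pseudocode: it applies the conjunction of the two density tests, but omits the Δε average-value comparison and the per-cluster density factor, and all names are ours:

```python
import math

def retrieve_neighbors(points, p, eps1, eps2):
    """Candidates directly reachable from p: within Eps1 spatially and
    within Eps2 in the non-spatial value. Each point is (x, y, value)."""
    return [q for q in range(len(points))
            if math.dist(points[p][:2], points[q][:2]) <= eps1
            and abs(points[p][2] - points[q][2]) <= eps2]

def st_dbscan(points, eps1, eps2, min_pts):
    """Simplified ST-DBSCAN main loop; -1 marks noise. The published algorithm
    additionally compares candidates against the cluster average using a
    delta-epsilon threshold, which is omitted here for brevity."""
    labels = [None] * len(points)
    cluster = 0
    for p in range(len(points)):
        if labels[p] is not None:
            continue
        neighbors = retrieve_neighbors(points, p, eps1, eps2)
        if len(neighbors) < min_pts:        # noise, or a border object visited later
            labels[p] = -1
            continue
        cluster += 1                         # p is a core object: form a cluster
        for q in neighbors:
            labels[q] = cluster
        seeds = list(neighbors)
        while seeds:
            q = seeds.pop()
            q_neighbors = retrieve_neighbors(points, q, eps1, eps2)
            if len(q_neighbors) >= min_pts:  # expand only from core objects
                for r in q_neighbors:
                    if labels[r] is None:    # unseen: absorb and keep expanding
                        labels[r] = cluster
                        seeds.append(r)
                    elif labels[r] == -1:    # border object: absorb, do not expand
                        labels[r] = cluster
    return labels
```

On points laid out as (x, y, temperature), two spatially separate groups with internally similar temperatures form two clusters, while a point that is spatially close but thermally distant is left as noise.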
4.2. Performance evaluation
The average runtime complexity of the DBSCAN algorithm is O(n log n), where n is the number of objects in the database. Our modifications do not change the runtime complexity of the algorithm. DBSCAN has proven its ability to process very large databases: the paper [6] shows that the runtime of other clustering algorithms such as CLARANS [20] and DBCLASD [30] is between 1.5 and 3 times the runtime of DBSCAN, and this factor increases with the size of the database.
6. Conclusions and future work
Clustering is a major method in many areas, including data mining and knowledge discovery, statistics, and machine learning. This study presents a new density-based clustering algorithm, ST-DBSCAN, which is constructed by modifying the DBSCAN algorithm. The first modification makes it possible to discover clusters in spatial–temporal data. The second modification is necessary to find noise objects when clusters of different densities exist; for this we introduce a new concept, the density factor, which is assigned to each cluster and expresses the degree of the density of the cluster. The third modification provides a comparison of the average value of a cluster with the value of each newly coming object. In order to demonstrate the applicability of our algorithm to real-world problems, we present an application using a spatial–temporal data warehouse. Experimental results demonstrate that our modifications are very promising for clustering spatial–temporal data.
Very large databases need extreme computing power. In future studies, we intend to run the algorithm in parallel in order to improve performance. In addition, more useful heuristics may be found to determine the input parameters Eps and MinPts.
References
[1] T. Abraham, J.F. Roddick, Survey of spatio-temporal databases, GeoInformatica 3 (1) (1999), pp. 61–99.
[2] M. Ankerst, M.M. Breunig, H.-P. Kriegel, J. Sander, OPTICS: Ordering points to identify the clustering structure, in: Proceedings of ACM SIGMOD International Conference on Management of Data, Philadelphia, PA, 1999, pp. 49–60.
[3] Z. Aoying, Z. Shuigeng, Approaches for scaling DBSCAN algorithm to large spatial databases, Journal of Computer Science and Technology 15 (6) (2000), pp. 509–526.
[4] C. Böhm, S. Berchtold, H.-P. Kriegel, U. Michel, Multidimensional index structures in relational databases, Journal of Intelligent Information Systems (JIIS) 15 (1) (2000), pp. 51–70.
[5] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proceedings of Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, 1996, pp. 226–231.
[6] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, Clustering for mining in large spatial databases, KI-Journal (Artificial Intelligence), Special Issue on Data Mining, 12 (1) (1998), pp. 18–24.
[7] M. Ester, H.-P. Kriegel, J. Sander, M. Wimmer, X. Xu, Incremental clustering for mining in a data warehousing environment, in: Proceedings of International Conference on Very Large Databases (VLDB'98), New York, USA, 1998, pp. 323–333.
[8] D. Fisher, Knowledge acquisition via incremental conceptual clustering, Machine Learning 2 (2) (1987), pp. 139–172.