Multi-Density Datasets Clustering Using K-Nearest Neighbors and Chebyshev’s Inequality

Received


Introduction
Clustering is a powerful machine learning tool for detecting structures in datasets.Detecting, analyzing, and describing natural clusters within a dataset, is of fundamental importance to a number of various fields [1][2][3][4].Such fields concern bioinformatics to identify similar genes, marketing to segment customers to establish market research, social sciences, psychology, biology, security, computer vision and image processing in various areas such as the medical or industrial fields, and so on.
As among the most popular density based clustering algorithm, DBSCAN [10] has several key advantages.First, it groups data into clusters of arbitrary forms.Second, it does not require that the number of clusters be given as input.The number of clusters is determined by the nature of the data and the values of Eps and minPts.Third, it is insensitive to the order of entry of points in the dataset.All these strengths are very important for any clustering algorithm.
However, DBSCAN requires the user to provide a global separation threshold Eps and the minimum number of points for any cluster.DPC (density-based clustering) also requires the user to enter the distance threshold dc.
For these two types of algorithms and their variants, the clustering results depend on these input parameters which are difficult to determine by the user, in the absence of precise rules for their calculation.Though they provided important advances in the clustering techniques and more and more widely used, they are still exhibiting limits to efficiently deal with complex datasets having clusters of varying shapes, sizes and densities [19][20][21].
In this article, we propose a new density-based clustering approach that overcomes these limitations.It is based on the main idea of characterizing the data points in the datasets with their local densities and using Chebyshev's inequality [22][23] to estimate local Eps for every cluster.In doing so, we not only automatically determine Eps, but also we efficiently handle clusters with various densities and shapes.
We define the concept of local neighbors of a given data point x, from which it determines the local threshold Eps for the current cluster.Indeed, for each data point x in the dataset, we calculate the local density so that it is the smallest in a dense region and the largest in a sparse region.Next, we sort the dataset on the density key, in descending order.
In doing so, the first data point with the highest density will initiate the construction of the first cluster by calculating its local separation threshold Eps using the Chebyshev inequality [22] and its distances, applied to the k-nearest neighbors.Following which, our algorithm extends the cluster initially built from these k-neighbors in the same way as DBSCAN.
When no new Eps-neighbor is found, the algorithm closes the current cluster and repeats the same procedure for unclassified data points.It stops when the dataset is empty and thus, all the clusters are automatically determined, without requiring any distance threshold nor number of clusters.In doing so, the local Eps can be adapted to a specific cluster density, which is an advantage of the proposed method compared to DBSCAN and its improvements.
The rest of the article is organized as follows.Section 2 presents the related work.Section 3 Presents DBSCAN and DPC algorithms.Section 4 describes our approach to cluster datasets with various densities, shapes and sizes.Section 5 presents the experimental evaluation.Section 6 presents a discussion.Finally, section 7 concludes the paper and outlines the future work.

Related work
The literature presents several data clustering techniques, such as hierarchical, model-based, and density-based algorithms.Partitioning-based clustering algorithms such as k-means and the likes [5] specify an initial number of clusters at the start and represent each cluster by a centroid, and objects close to the same centroid are considered to be similar.These algorithms determine all the clusters at the same time by aiming for a strong intragroup similarity.They require a number of user-defined clusters that are often unknown in practice.In addition, they build spherical clusters, which are not suitable for arbitrarily shaped clusters.
Hierarchical clustering algorithms [8] are bottom-up approaches such that each data point begins in a separate cluster, and the pairs of clusters at the bottom are merged as we move up the hierarchy.In hierarchical clustering algorithms, clustering process handles clusters of various shapes well but they are not suitable for clusters of varying density.
Model-based clustering methods [9] estimate a model for data by incorporating a measure of probability or uncertainty into cluster assignments.Model-based clustering attempts to address this concern and provide flexible allocation when observations are likely to belong to each cluster.In addition, model-based clustering offers the added benefit of automatically identifying the optimal number of clusters.
In density-based clustering methods, Rodriguez and Laio [16] proposed a new clustering algorithm by finding density peaks which are the potential centers of the cluster.In this algorithm, each data point can obtain its own local density and its distance to the nearest neighbor which has a higher density.Depending on the density and distance of the points, the result of the grouping can be obtained in a single step without further iteration.Since clustering methods based on the search for density peaks produce clusters of arbitrary shapes, the notion of "center" is somewhat misleading [17][18].These clustering algorithms based on density peaks have a deficiency in the allocation process, which is likely to trigger a domino effect.They require a userdefined parameter (cut-off distance) and visual judgment to estimate cluster centers based on the decision diagram.In addition, they cannot process certain non-spherical data sets such as Spiral.
DBSCAN [10] is probably the most popular densitybased clustering algorithm for building clusters of arbitrary shape.A cluster is created if there is a minPts of objects in a dense region with an Eps (distance threshold); minPts (minimum cluster size), which are user defined.DBSCAN starts with an arbitrary starting point.The Eps-neighborhood of this point is retrieved and a cluster is created, if it contains a sufficient number of points.Otherwise, the point is considered as noise.The selected point and its accessible points are considered as a cluster.The cluster will increase several times by adding other points which are in the distance Eps of the accessible points as new accessible points.The algorithm continues until all the points have either a cluster label or there are less minPts points around them and they are therefore considered as noises.Table 1 summarizes the comparison between ones of the most notable works on density-based clustering.

Method
Key features limitation reference DBSCAN Users need to define two parameters: What distance Eps to determine for each observation the Eps-neighborhood?What is the minimum number of neighbor's minPts necessary to consider that an observation is a core observation?
DBSCAN fails to discover clusters of varying densities and requires two user-defined input parameters Eps and minPts.
[10] DPC User needs to select cluster centers from a decision graph to identify clusters Sensitive to user input and results are dependent on the center selected by user [16] GMDBSCAN uses a space division technique and considers each grid as a separate part, then it estimates the independent minPts for each grid according to its density applies several DBSCANs on every grid and finally, it uses a distance-based method to improve the boundaries [11] DPCSA User needs to select cluster centers from a decision graph to identify clusters in two stages Sensitive to user input [27] Table DBSCAN and its aforementioned variants [9-10][12-15] have significant advantages: i) discovering clusters with various shapes, sizes and noises, ii) robust with outliers, iii) no need to specify the number of clusters in the data, iv) insensitive to the order of the points.However, these algorithms fail to discover clusters of varying densities and require two user-defined input parameters Eps and minPts.
The GMDBSCAN method [11] uses a space division technique and considers each grid as a separate part, then it estimates the independent minPts for each grid according to its density, after which it applies several DBSCANs on every grid and finally, it uses a distance-based method to improve the boundaries.
Hou et al. proposed a clustering algorithm without parameters based on the combination of DSets (dominant sets) and DBSCAN algorithms [10].In the AGED algorithm [15], the value of the Eps as defined in DBSCAN is determined according to local densities.Wang et al. proposed a modified DBSCAN algorithm to automatically determine the Eps values for different data distributions [13].
In [24], the author introduced a method which estimates the local density -for each point of the dataset -as the sum of the distances to the k-nearest neighbors.By arranging the data points in ascending order according to the local density, the proposed algorithm determines clusters of various densities.However, the density threshold associated with the cluster initiator sample lacks precise definition, which leads to clustering failures in certain situations.
In [27], the authors proposed the DPCSA algorithm as an improvement of DPC [16], however, users are still need to select cluster centers from a decision graph to identify clusters in two stages.

Density-based clustering
A clustering process partitions a dataset into groups of data points according to their proximity to each other.It takes as input a set of data points and a measure of similarity between them, and produces as output a set of partitions describing the general structure of the dataset.Density-based clustering techniques are one of the most popular unsupervised learning techniques used in machine learning algorithms.They proceed by detecting areas where data points are concentrated and where they are separated by areas that are sparse or empty.Points that do not belong to any cluster are labeled as noise.
Density-based clustering analysis is well known and widely appreciated for two desirable features.DBSCAN is a simple algorithm that defines clusters using local density estimation.It can be divided into 4 steps: Step 1: For each observation, we look at the number of points at most a distance Eps from it.This area is called the Eps -neighborhood of the observation.
Step 2: If an observation has at least a number of neighbors including itself, it is considered a core observation.A high-density observation was then detected.
Step 3: All observations in the vicinity of a core observation belong to the same cluster.There may be core observations close to each other.Consequently, step by step, we obtain a long sequence of core observations which constitutes a single cluster.
Step 4: Any observation that is not a core observation and that does not include a core observation in its vicinity is considered an anomaly.
Therefore, to apply DBSCAN, users need to define two parameters: What distance Eps to determine for each observation the Epsneighborhood?What is the minimum number of neighbor's minPts necessary to consider that an observation is a core observation?These two parameters are provided freely by the user.
Unlike the k-means algorithm or hierarchical ascending classification, there is no need to define the number of clusters, which makes the algorithm less rigid.Another advantage of DBSCAN is that it also allows to manage outliers or anomalies.
On the other hand, the concept of distance and choice of Eps and minPts are critical for the clustering quality.In this algorithm two points are fundamental: What is the metric used to evaluate the distance between an observation and its neighbors?What is the ideal Eps?
In the DBSCAN we generally use the Euclidean distance, let  = ( 1 , … ., ) and  = ( 1 , … ., ): At each observation, to count the number of neighbors at a distance Eps at most, we calculate the Euclidean distance between the neighbor and the observation and check if it is less than Eps.

Proposed method
In this section, we present our method to address the abovementioned problems in clustering datasets with various densities.The main idea is to use local density concept and the statistical Chebyshev's inequality [22][23] to automatically determine a separation threshold for every density class in a dataset, and we build every cluster with its associated distance threshold, leading to clusters with various densities and shapes.Our algorithm we call MDC (Multi-Density Clustering) uses the K-nearest-neighbors' information to define a local density of every data point as follows: Where   is the local density of a data point i, and   is the distance between data point i and a neighboring data point j in the neighborhood ().In doing so, the density value is reduced from the global density to the local density of the k neighbors closest to the data point, and the greater the distance between a data point and its neighborhood k, the smaller is the local density.This finding will better reflect the local information in the dataset.

Chebyshev's inequality
Chebyshev's inequality is adequate for completely arbitrary distributions [22].It constitutes an efficient estimation technique facilitating the calculation of the probability solely on the basis of the mathematical expectation and the variance, without requiring information on the distribution.Chebyshev's inequality is highly useful in that it can be applied to completely arbitrary distributions [2].
We can express a variant of Chebyshev's inequality as the following expression: (3) and we reformulate in the following inequality (Figure 1) : It means that "greater than (1 − 1  2 ) of population falls within  ( > 1) standard deviations (i.e.) from the population mean ".In other words, it states that for a distribution, the percentage of observations that lie within  standard deviations is at least: 1 − 1  2 (Figure 1).

The proposed algorithm
In the clustering step, the proposed clustering algorithm is used to extract spatial clusters from the dataset.Finally, the outputs are evaluated and visualized.Algorithm 1 takes as input a dataset and produces clusters with different shapes and various densities.
It takes the point of the highest density as a cluster initiator  that absorbs its  nearest neighbors {1, 2, … , } to build a new cluster.Thus the current initiated cluster is composed of data points  = {, 1, 2, … , }.At this step, we use Chebyshev's inequality to estimate the local Eps of the cluster.We calculate pairwise distances of C as distances between every two data points in C to have  = {1, 2, … , }.Eps is estimated as the bound of D. To estimate the maximum extending of the current cluster, we consider the collection of distances D and we apply the following Chebyshev's formula: Where mean is the mean of D and std is the standard deviation of D. Afterward, we remove the classified data points from the dataset.We initiate a new cluster by taking out a new initiator and proceed in the same manner, while removing the classified data points from the dataset, while using minPts to control the minimum density allowed within a cluster, and so on till obtaining an empty dataset.In doing so, an Eps value is estimated for every new cluster, leading to a set of clusters with various densities.
Algorithm 1: multi-density datasets clustering (MDC) Input: dataset X, k (for the k-nearest neighbors) and minPts (may be 4 or 5) Output: a set of multi-density clusters Step 1: Calculate the density of every data point in the dataset X using equation ( 1), store the result in the vector RO and create an empty list Clusters.
Step 2: Sort RO in decreasing order of densities.
Step 3: Take out the first data point p1 in RO as a cluster initiator.
Step 7: Using RO, expand C using the local Eps and k, while controlling the size of the cluster using minPts.
Step 8: Append C to Clusters, and remove from RO all the data points in C.
Step 9: If RO is not empty, repeat from step3 to step 9.
In step 1, for every data point in the dataset, it calculates its distances from the  neighboring data points, calculates the local density according to equation (1) for every data point, stores the result in RO and creates an empty list Clusters.In step 2, it sorts RO in decreasing order of their local densities.In step 3, it takes out the first data point p1 (it has the highest density) as a new cluster initiator.In step 4, algorithm 1 builds a new cluster C = {p1,v1,.., vk}, where v1,v2,…,vk are the k-neighbors of p1, and removes these points from RO.
To be able to search the remaining members of the cluster from RO, we need to estimate a local Eps for this cluster.To this end, at step 5, algorithm 1 constructs an initial set of pairwise distances, (between every two data points in the current cluster) and stores the result in .In step 6, it applies Chebyshev's inequality on D to determine the local  of the current cluster using equation (5).
In step 7, it expands the current cluster  by searching reachable data points using the calculated  in the same manner as in DBSCAN, while removing from the dataset all the data points that compose the current cluster.Moreover,  is utilized to control the minimum size of the cluster. value may be 4 or 5.
In step 8, it appends  to Clusters and removes all new classified data points from RO.In step 9, it checks the datasets, if it is empty it stops clustering and execute step 10 to output clusters, otherwise, it initiates a new cluster in the same manner and so on.

Experimental results
In order to validate the efficiency of the proposed MDC algorithm, we conducted dataset clustering experiments not only on synthetic datasets, but also on real data, with clusters of various densities, shapes and sizes [25][26], while comparing the results with those obtained by DBSCAN and DPC.The algorithms are implemented in the Integrated Development Environment of Python (Python IDE version 3.7.5).For DBSCAN, we used the scikit-learn library in Python.
In DPC the required dc is calculated using percentage value 2% as advised in [16].We looked for the optimal parameter value of 0.7 to 3.0 with a step of 0.3.In order to simplify the implementation of DPC, we select M points with the first values of  × , directly as cluster centers.The Eps value varies in the range of 0.5 to 3.0 with the step is 0.  2 shows the quality measures of the algorithms, when applied to the datasets in Figure 2. By using two cluster validity indexes, F-measure with  = 1 (F1) and adjusted rand index (ARI), we evaluate and discuss the performances of these three methods.F-measure is the weighted average (or harmonic mean) of Precision and Recall.Using F1 score as a metric, we are sure that if the F1 score is high, both precision and recall of the classifier indicate good results.ARI has the maximum value 1, and its expected value is 0 in the case of random clusters.A larger adjusted rand index means a higher agreement between two partitions.ARI is recommended for measuring agreement even when the partitions compared have different numbers of clusters.
Below we show through Figure 2, the results of clustering produced by different algorithms.Flame dataset has two components with similar densities but different shapes.And, they are very close to each other.MDC and DPC clustered it correctly, but DBSCAN identified seven outliers as noise (Figure 2).Jain is divided in two components with different densities.DBSCAN, DPC, and MDC correctly identified the two components.
It is difficult to correctly distinguish the components in Compound, composed of six components with varying densities and shapes, with complex spatial relationships.We see in the upper left corner, two contiguous components.A small disc-like component is surrounded by a ring-like component.They are located at the bottom left.On the right, an irregularly shaped component is surrounded by a set of scattered, spatially overlapping points.
As shown in Figure 2, MDC correctly identified the six clusters of the compound dataset.Unlike DBSCAN and DPC where the first detected only five clusters, and a number of points considered as noise, while the second divided the ring component into two parts and merged the scattered component with the dense one.In the aggregation dataset, there are seven components with various sizes and shapes.Except the two slightly connected component pairs, the other components are independent of each other.DBSCAN groups each of these two pairs in one cluster.MDC and DPC produced correct results.

Discussion
Density-based clustering proceeds by calculating the density of each sample point.DBSCAN and DPC algorithms [16] are ones of the most emerging density based clustering algorithms.However, these clustering algorithms have certain defects.
In DBSCAN, required Eps and minPts are difficult to be determined and only global values of such parameters are used, leading to inaccurate clustering result of multi-density datasets.
In DPC algorithm, subjective selection of cluster centers using a decision graph, and the dc parameter with only a unique value does not handle the multidensity data.Aiming at the problem of the selection of the parameter dc of the DPC algorithm, in [27] the authors proposed the use of the parameters optimized according to the local distance of data points.However, with multi-density data, this cannot obtain stable clustering effect.To tackle these limitations, our method addresses the abovementioned problems in clustering datasets with various densities.Thanks to the Chebyshev's inequality, our approach allows to automatically determine a separation threshold for every density class in a dataset, and we build every cluster with its associated distance threshold.

Conclusion and future work
This paper shows how to automatically cluster data into groups with multiple densities and shapes through determining multiple set of parameters Eps and minPts.To execute DBSCAN, users need to introduce Eps and minPts that are difficult to be determined.Moreover, DBSCAN simply uses the global Eps and minPts so that the clustering result of multi-density data is inaccurate.
To tackle such shortcomings, a new dataset clustering technique to efficiently handle clusters with various densities and shapes is presented in this paper.It introduced simple modification on DBSCAN.The main idea lies on the definition of local density of data points in the dataset and the use of Chebyshev's inequality to define a local threshold separation at the initiating step of each cluster, while using minPts to control the minimum density allowed within a cluster.The experiments conducted on synthetic and realworld datasets showed that the proposed algorithm provides important enhancements of clustering techniques on datasets with various density levels and shapes.As a future work, we plan to focus on applying MDC to solving practical problems in digital image processing such as segmentation and edge detection.
First, it does not require a predefined number of clusters as an input parameter, and is robust to noise.Second, it can discover clusters with arbitrary shapes, and not just ball-shaped clusters.Density based clustering algorithm has played a vital role in finding non linear shapes structure based on the density.Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is the most widely used density-based algorithm.It uses the concept of density reachability and density connectivity.

Figure 2 :
Figure 2: Clustering results of DBSCAN, DPC and MDC through a) flame, b) jain, c) compound, and d) aggregation

Table 2 :
Comparison using ARI and F-measure metrics Table