OMCOKE: A Machine Learning Outlier-based Overlapping Clustering Technique for Multi-Label Data Analysis

Said Baadel, Fadi Thabtah, Joan Lu, Saida Harguem

Abstract


Clustering is among the more challenging machine learning tasks because it is unsupervised. While many clustering algorithms constrain each object to a single cluster, overlapping partitioning methods based on K-means relax this constraint and allow an object to belong to more than one cluster, which can better fit hidden structures in the data. However, when a dataset contains outliers, they can significantly distort the mean distance of the data objects to their respective clusters. Most researchers address this problem by simply removing the outliers, which can be problematic in applications such as fraud detection or cybersecurity risk analysis, where the outliers themselves may be of interest. This study proposes an alternative that captures outliers and stores them on the fly in a new cluster instead of discarding them. The new algorithm is named Outlier-based Multi-Cluster Overlapping K-Means Extension (OMCOKE). OMCOKE was compared empirically with other common overlapping clustering techniques on real-life multi-label datasets. The results show that OMCOKE achieved a higher precision rate than the other clustering algorithms considered. The method can benefit various stakeholders, since the retained outliers could be useful in real-life applications such as cybersecurity, fraud detection, and website anti-phishing.
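
The general idea in the abstract, overlapping assignment plus a dedicated cluster that retains outliers, can be illustrated with a minimal Python sketch. This is not the authors' OMCOKE implementation: the function name `assign_with_outliers`, the fixed centroids, and the two thresholds `overlap_radius` and `outlier_radius` are hypothetical choices made only for illustration, and the abstract does not specify how the actual algorithm sets its thresholds or updates centroids.

```python
import math


def assign_with_outliers(points, centroids, overlap_radius, outlier_radius):
    """One assignment pass over fixed centroids (illustrative assumption only).

    Both thresholds are hypothetical parameters, not values from the paper.
    """
    clusters = {i: [] for i in range(len(centroids))}
    outliers = []  # extra cluster that keeps outliers instead of discarding them

    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        nearest = min(range(len(centroids)), key=dists.__getitem__)

        # Too far from every centroid: store the point in the outlier cluster.
        if dists[nearest] > outlier_radius:
            outliers.append(p)
            continue

        # Always join the nearest cluster, plus any other cluster whose
        # centroid lies within the overlap radius (the relaxed constraint).
        for i, d in enumerate(dists):
            if i == nearest or d <= overlap_radius:
                clusters[i].append(p)

    return clusters, outliers


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (4.0, 0.0), (50.0, 50.0)]
centroids = [(0.0, 0.0), (4.0, 0.0)]
clusters, outliers = assign_with_outliers(
    points, centroids, overlap_radius=2.0, outlier_radius=10.0
)
```

In this toy input, the point `(2.0, 0.0)` is equidistant from both centroids and lands in both clusters, while `(50.0, 50.0)` is kept in the separate outlier list rather than being dropped or pulling a cluster mean toward it.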



DOI: https://doi.org/10.31449/inf.v46i4.3476

This work is licensed under a Creative Commons Attribution 3.0 License.