An Ensemble of Clustering Algorithms Using Different Distance Metrics for Network Traffic Analysis of Companies

Bissarinov Baituma; Valeriy Lakhno; Valeriy Kozlovskyi; Denys Redko; Alona Desiatko; Valentyn Yaremych

doi:10.31449/inf.v49i26.10652

Abstract

The problem of analyzing network traffic in large companies that work with Big Data is considered. An ensemble of clustering algorithms is proposed that combines Bayesian probability updating, several distance metrics, and adaptive weighting. The weights are computed with an exponential dependence, which sharpens the differentiation between algorithms and makes the method more sensitive to the quality of the individual models. The approach was tested on the open datasets CIC-IDS2017, UNSW-NB15, and CTU-13. The results showed a consistent improvement in clustering quality, with ARI and NMI values reaching 0.78 and 0.75, respectively, surpassing the baseline methods (K-means, DBSCAN, and classical ensembles). The proposed method demonstrated linear scalability and is applicable to the analysis of high-volume corporate network traffic. These results confirm the practical value of integrating the method into monitoring and anomaly detection systems.
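
The abstract does not state the weighting formula, so the following is only a minimal sketch of how an exponential dependence can separate ensemble members by quality. It assumes a softmax-style rule, w_i proportional to exp(beta * q_i), and combines the base clusterings through a weighted co-association matrix, a common consensus device that is not necessarily the authors' Bayesian update. The names `exponential_weights`, `weighted_coassociation`, the sharpness parameter `beta`, and the toy quality scores are all hypothetical.

```python
import math
from typing import List, Sequence


def exponential_weights(scores: Sequence[float], beta: float = 5.0) -> List[float]:
    """Return normalized weights proportional to exp(beta * score).

    `beta` is an assumed sharpness parameter: larger values push more
    weight toward the best-scoring base clusterings.
    """
    top = max(scores)  # subtract the maximum for numerical stability
    raw = [math.exp(beta * (s - top)) for s in scores]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_coassociation(
    labelings: Sequence[Sequence[int]], weights: Sequence[float]
) -> List[List[float]]:
    """Accumulate, for every pair of points, the total weight of the
    base clusterings that place both points in the same cluster."""
    n = len(labelings[0])
    matrix = [[0.0] * n for _ in range(n)]
    for labels, w in zip(labelings, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += w
    return matrix


# Toy data: three base clusterings of four flows, e.g. produced with
# different distance metrics, plus assumed per-clustering quality scores.
labelings = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],  # same partition as the first, with labels swapped
]
scores = [0.6, 0.2, 0.6]

weights = exponential_weights(scores, beta=5.0)
coassoc = weighted_coassociation(labelings, weights)
```

In this toy setup the low-scoring second clustering receives only a small share of the total weight, so the pairs it alone groups together (such as flows 0 and 2) get a low co-association value. A final partition could then be obtained by clustering the matrix, for instance with hierarchical clustering on `1 - coassoc`.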

Author Biographies

Bissarinov Baituma, Al-Farabi Kazakh National University

PhD, Department of Information Systems

Valeriy Lakhno, National University of Life and Environmental Sciences of Ukraine

Professor at the Department of Computer Systems

Valeriy Kozlovskyi, National Aviation University

PhD at the Department of Information Protection System

Denys Redko, State University of Trade and Economics

Postgraduate Student of the Department of Software Engineering and Cybersecurity

Alona Desiatko, State University of Trade and Economics

Doctor of Philosophy in Computer Science, Head of the Department of Software Engineering and Cybersecurity

Valentyn Yaremych, State University of Trade and Economics

Postgraduate Student of the Department of Software Engineering and Cybersecurity
