An Ensemble of Clustering Algorithms Using Different Distance Metrics for Network Traffic Analysis of Companies
Abstract
This paper considers the problem of network traffic analysis in large companies that work with Big Data. It proposes an ensemble of clustering algorithms based on Bayesian probability updating, in which the base algorithms employ various distance metrics and are combined through adaptive weighting. The weights are computed with an exponential dependence, which sharpens the differentiation between algorithms and makes the method more sensitive to the quality of individual models. The approach was tested on the open datasets CIC-IDS2017, UNSW-NB15, and CTU-13. The results showed a consistent improvement in clustering quality, with Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) values reaching 0.78 and 0.75, respectively, surpassing the baseline methods (K-means, DBSCAN, and classical ensembles). The proposed method demonstrated linear scalability and is applicable to the analysis of high-volume corporate network traffic. These results support the practical value of integrating the method into monitoring and anomaly detection systems.
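The abstract does not give the weighting formula, so the following is only a minimal sketch of how exponential weighting could separate strong and weak ensemble members. It assumes each base clustering already has a scalar quality score, that the weights are proportional to exp(beta * score), and that the members are combined through a weighted co-association matrix. The function names, the `beta` parameter, the scores, and the thresholded consensus step are all hypothetical, and the Bayesian probability updating of the paper is not reproduced here.

```python
import numpy as np

def exponential_weights(scores, beta=5.0):
    """Map quality scores to normalized weights w_i proportional to exp(beta * s_i).

    beta is an assumed sharpness parameter: larger values widen the gap
    between good and poor base clusterings.
    """
    s = np.asarray(scores, dtype=float)
    z = np.exp(beta * (s - s.max()))  # subtract the max for numerical stability
    return z / z.sum()

def weighted_coassociation(labelings, weights):
    """Weighted share of ensemble members that put each pair in the same cluster."""
    labelings = np.asarray(labelings)
    n = labelings.shape[1]
    co = np.zeros((n, n))
    for labels, w in zip(labelings, weights):
        co += w * (labels[:, None] == labels[None, :])
    return co

def consensus_labels(co, threshold=0.5):
    """Group points linked by co-association above the threshold (connected components)."""
    n = co.shape[0]
    labels = -np.ones(n, dtype=int)
    current = 0
    for i in range(n):
        if labels[i] >= 0:
            continue
        labels[i] = current
        stack = [i]
        while stack:
            u = stack.pop()
            for v in np.flatnonzero((co[u] > threshold) & (labels < 0)):
                labels[v] = current
                stack.append(v)
        current += 1
    return labels

# Toy example: three base clusterings of six flows (hypothetical data).
# The first two agree up to label permutation; the third is a poor clustering.
labelings = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 1],
]
scores = [0.70, 0.65, 0.10]  # assumed per-algorithm quality scores

weights = exponential_weights(scores)
co = weighted_coassociation(labelings, weights)
consensus = consensus_labels(co)
print(weights.round(3), consensus)
```

With these assumed scores the low-quality third clustering receives a weight close to zero, so the consensus follows the two clusterings that agree. In practice the base labelings would come from algorithms run with different distance metrics, and the scores from an internal validity index.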
DOI: https://doi.org/10.31449/inf.v49i26.10652
This work is licensed under a Creative Commons Attribution 3.0 License.








