Stacking and Voting-Based Boosting Ensembles for Robust Malicious URL Classification
Abstract
The rising prevalence of malicious URLs poses serious risks to cybersecurity, enabling phishing, malware delivery, and data theft. Conventional blacklist and heuristic-based detection methods struggle to identify emerging and obfuscated attacks. To address this gap, we present an ensemble learning framework that integrates stacking and voting strategies with multiple boosting algorithms for reliable malicious URL classification. The system employs six advanced learners (XGBoost, AdaBoost, Gradient Boosting, LightGBM, CatBoost, and LogitBoost), whose outputs are combined through majority voting and through a two-layer stacking scheme in which logistic regression serves as the meta-learner. Evaluation was carried out on a Kaggle dataset containing 1,043,311 URLs (817,986 benign and 225,325 malicious), using a stratified 70:30 train/test split to preserve the class distribution. The proposed ensembles surpassed individual boosting models and conventional ensembles in accuracy, precision, recall, F1-score, and AUC. Stacking achieved 93.44% across all metrics, while voting achieved 93.25%. In addition to strong predictive performance, the approach shows low prediction latency and effective handling of imbalanced data, making it practical for large-scale, near real-time deployment. This work demonstrates that combining stacking and voting ensembles offers a robust defense against evolving malicious URL threats.
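The three mechanisms named in the abstract (a stratified 70:30 split, majority voting over six base learners, and a logistic-regression meta-learner stacked on the base outputs) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the function names, the toy score vectors, the tie-breaking rule, and the gradient-descent settings are all assumptions made for illustration, and the six boosting models are represented only by hypothetical output scores.

```python
import math
import random
from collections import Counter


def stratified_split(labels, test_frac=0.3, seed=0):
    """Return (train_idx, test_idx) so each class keeps its share in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return train, test


def majority_vote(votes):
    """Hard voting over 0/1 labels, one label per base learner."""
    counts = Counter(votes)
    # Assumption: a 3-3 tie among six voters is resolved as malicious (1).
    return 1 if counts[1] >= counts[0] else 0


def fit_logistic_meta(X, y, lr=0.5, epochs=500):
    """Fit a logistic-regression meta-learner by batch gradient descent.

    Each row of X holds the base learners' scores for one URL. In practice
    these are usually out-of-fold predictions; the abstract does not say
    which scheme the authors used.
    """
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def stack_predict(w, b, scores):
    """Second-layer decision: 1 (malicious) when the logit is non-negative."""
    return 1 if b + sum(wj * xj for wj, xj in zip(w, scores)) >= 0 else 0


# Toy labels (0 = benign, 1 = malicious); the real dataset is far larger.
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)

# Hypothetical malicious-class scores from six base learners per URL.
meta_X = [
    [0.9, 0.8, 0.7, 0.9, 0.8, 0.7],
    [0.8, 0.9, 0.9, 0.7, 0.7, 0.8],
    [0.1, 0.2, 0.3, 0.1, 0.2, 0.3],
    [0.2, 0.1, 0.1, 0.3, 0.3, 0.2],
]
meta_y = [1, 1, 0, 0]
w, b = fit_logistic_meta(meta_X, meta_y)
```

Voting treats every learner equally, whereas the stacked meta-learner fits one weight per base learner, so it can discount a model whose scores are less informative.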
DOI: https://doi.org/10.31449/inf.v49i35.7762
License
I assign to Informatica, An International Journal of Computing and Informatics ("Journal") the copyright in the manuscript identified above and any additional material (figures, tables, illustrations, software or other information intended for publication) submitted as part of or as a supplement to the manuscript ("Paper") in all forms and media throughout the world, in all languages, for the full term of copyright, effective when and if the article is accepted for publication. This transfer includes the right to reproduce and/or to distribute the Paper to other journals or digital libraries in electronic and online forms and systems.
I understand that I retain the rights to use the pre-prints, off-prints, accepted manuscript and published journal Paper for personal use, scholarly purposes and internal institutional use.
In certain cases, I can ask for retaining the publishing rights of the Paper. The Journal can permit or deny the request for publishing rights, to which I fully agree.
I declare that the submitted Paper is original, has been written by the stated authors and has not been published elsewhere nor is currently being considered for publication by any other journal and will not be submitted for such review while under review by this Journal. The Paper contains no material that violates proprietary rights of any other person or entity. I have obtained written permission from copyright owners for any excerpts from copyrighted works that are included and have credited the sources in my article. I have informed the co-author(s) of the terms of this publishing agreement.
Copyright © Slovenian Society Informatika







