Enhancing Phishing Website Detection via Feature Selection in URL-Based Analysis

Marwa-Aqa Abd Al Hussein Qasim, Nahla A. Flayh


Accurate detection of phishing websites is crucial to the safety of online users and to a secure digital environment. This research examines how a new dataset generation method can improve phishing website detection. The method transforms a raw dataset obtained from Mendeley, using regular expressions to extract the important features from its URLs so that detection can be performed accurately. Based on the proposed features, we selected the best-performing machine learning algorithm.
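
As a minimal sketch of how regular expressions can turn a raw URL into numeric features, consider the following. The abstract does not list the actual feature set, so every feature name here (`url_length`, `has_ip_host`, and so on) and the function `extract_features` are illustrative assumptions rather than the authors' method.

```python
import re

# Hypothetical lexical features; the abstract does not specify the real feature set.
# Matches a scheme followed by a dotted-quad IPv4 host, e.g. http://192.168.0.1/
IP_HOST = re.compile(
    r"^[a-z][a-z0-9+.-]*://(?:\d{1,3}\.){3}\d{1,3}(?=[:/?#]|$)", re.IGNORECASE
)
# Captures everything between "scheme://" and the first "/", "?" or "#".
HOST = re.compile(r"^[a-z][a-z0-9+.-]*://([^/?#]*)", re.IGNORECASE)


def extract_features(url: str) -> dict:
    """Map one URL string to a dictionary of simple regex-derived features."""
    host_match = HOST.match(url)
    host = host_match.group(1) if host_match else ""
    return {
        "url_length": len(url),
        "num_dots": len(re.findall(r"\.", url)),
        "num_digits": len(re.findall(r"\d", url)),
        "num_hyphens_in_host": len(re.findall(r"-", host)),
        "has_at_symbol": int(bool(re.search(r"@", url))),
        "has_ip_host": int(bool(IP_HOST.match(url))),
        "uses_https": int(bool(re.match(r"https://", url, re.IGNORECASE))),
    }


if __name__ == "__main__":
    print(extract_features("http://192.168.0.1/login@secure-update.example.com"))
```

Applying such a function to every URL in a dataset yields one feature row per URL, which is the tabular form that classifiers expect.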

We performed a rigorous evaluation using three prominent machine learning algorithms: Decision Trees, Support Vector Machines (SVM), and Random Forests. The accuracies achieved were 0.96 for Decision Tree, 0.97 for SVM, and 0.98 for Random Forest.
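
A comparison of this kind can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the authors' experimental setup: the data is synthetic (`make_classification` stands in for the extracted URL features), and the train/test split, hyperparameters, and the scaling step in front of the SVM are our own choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a table of URL features with binary labels (assumption).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    # SVMs are sensitive to feature scale, so standardize first.
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in accuracies.items():
    print(f"{name}: {acc:.3f}")
```

The numbers printed for synthetic data have no relation to the accuracies reported in the abstract; only the workflow carries over.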

One of the critical contributions of this research is the deliberate selection of features. We have leveraged regular expressions to create a feature set that captures salient aspects of URLs and optimizes the algorithms' detection capabilities.

This research also examines how feature selection affects the performance of each algorithm, highlighting each algorithm's strengths and weaknesses.
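
One way to probe the effect of feature selection is to compare cross-validated accuracy with and without a selection step. The sketch below uses univariate selection (`SelectKBest` with `f_classif`) purely as a stand-in, since the abstract does not state which selection procedure was used; the synthetic data and the choice of `k=5` are likewise assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data: 20 features, of which only 5 carry signal (assumption).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, n_redundant=0, random_state=0
)


def mean_cv_accuracy(model):
    """Mean accuracy over 5-fold cross-validation."""
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()


all_features = RandomForestClassifier(n_estimators=50, random_state=0)
# Selection sits inside the pipeline so it is refit on each training fold,
# which keeps the held-out fold from leaking into the selection.
top_k = make_pipeline(
    SelectKBest(f_classif, k=5),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

scores = {
    "all 20 features": mean_cv_accuracy(all_features),
    "top 5 features": mean_cv_accuracy(top_k),
}
for label, score in scores.items():
    print(f"{label}: {score:.3f}")
```

Repeating the comparison for each classifier shows which algorithms gain from a smaller feature set and which are largely insensitive to it.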

Summary: The main contribution of this research is the deliberate selection of features. We used regular expressions to create a feature set that captures important aspects of URLs and optimizes the algorithms' detection capabilities.

DOI: https://doi.org/10.31449/inf.v47i9.5177

This work is licensed under a Creative Commons Attribution 3.0 License.