Static Malware Detection through Ensemble Feature Selection and Supervised Classification

Isai Moreno-Lara; Alejandra Silva-Trujillo; Juan C. Cuevas-Tello; Jose Nunez-Varela

doi:10.31449/inf.v49i37.10728

Abstract

In a digital landscape where malicious software evolves faster than traditional defenses, intelligent and proactive detection has become essential. This study presents a machine learning framework for static malware detection based on the analysis of 138,047 Portable Executable samples, including both malware and benign files. The dataset comprises 56 static structural features extracted without code execution. Four supervised classifiers (Backpropagation Neural Network, Decision Tree, Random Forest, and Support Vector Machine) were evaluated following the Knowledge Discovery in Databases process. Ensemble-based feature selection methods (Random Forest and Extra Trees) were applied to identify the most informative attributes, while random undersampling was used to mitigate class imbalance. Experimental results show that the Random Forest classifier achieved the best performance, reaching 99.45% accuracy and a 0.9909 F1-score on imbalanced data, and 99.32% accuracy on the balanced dataset. These findings highlight the reliability and scalability of tree-based models for static malware detection. Overall, the proposed framework demonstrates that careful feature selection and balance adjustment can significantly enhance the performance and interpretability of cybersecurity classification systems.
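The workflow summarized in the abstract (random undersampling, ensemble-based feature selection, then a Random Forest classifier) can be illustrated with a minimal scikit-learn sketch. This is not the authors' implementation. The data is synthetic, generated with `make_classification` as a stand-in for the 56-feature Portable Executable dataset. The helper names `random_undersample` and `run_pipeline`, the mean-importance threshold, the 70/30 class ratio, the choice to balance only the training split, and all hyperparameters are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split


def random_undersample(X, y, seed=0):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]


def run_pipeline(seed=0):
    # Synthetic stand-in for the PE dataset: 56 features, imbalanced classes
    # (assumed 70/30 split; the real class ratio is not stated in the abstract).
    X, y = make_classification(
        n_samples=2000, n_features=56, n_informative=10,
        weights=[0.7, 0.3], random_state=seed,
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )

    # Assumption: balance only the training split, leaving the test split untouched.
    X_bal, y_bal = random_undersample(X_train, y_train, seed)

    # Ensemble-based feature selection: keep features whose Extra Trees
    # importance is at least the mean importance (assumed threshold).
    selector = SelectFromModel(
        ExtraTreesClassifier(n_estimators=100, random_state=seed),
        threshold="mean",
    ).fit(X_bal, y_bal)

    # Train a Random Forest on the selected features and score the held-out split.
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(selector.transform(X_bal), y_bal)
    y_pred = clf.predict(selector.transform(X_test))

    return {
        "n_selected": int(selector.get_support().sum()),
        "train_class_counts": np.bincount(y_bal).tolist(),
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }


if __name__ == "__main__":
    print(run_pipeline())
```

Scores obtained on this synthetic data say nothing about the figures reported in the abstract; the sketch only shows how the three stages fit together. Swapping `ExtraTreesClassifier` for `RandomForestClassifier` inside `SelectFromModel` would give the other selection method the abstract names.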

