Static Malware Detection through Ensemble Feature Selection and Supervised Classification
Abstract
In a digital landscape where malicious software evolves faster than traditional defenses, intelligent and proactive detection has become essential. This study presents a machine learning framework for static malware detection based on the analysis of 138,047 Portable Executable samples, including both malware and benign files. The dataset comprises 56 static structural features extracted without code execution. Four supervised classifiers (Backpropagation Neural Network, Decision Tree, Random Forest, and Support Vector Machine) were evaluated following the Knowledge Discovery in Databases process. Ensemble-based feature selection methods (Random Forest and Extra Trees) were applied to identify the most informative attributes, while random undersampling was used to mitigate class imbalance. Experimental results show that the Random Forest classifier achieved the best performance, reaching 99.45% accuracy and a 0.9909 F1-score on imbalanced data, and 99.32% accuracy on the balanced dataset. These findings highlight the reliability and scalability of tree-based models for static malware detection. Overall, the proposed framework demonstrates that careful feature selection and balance adjustment can significantly enhance the performance and interpretability of cybersecurity classification systems.
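The abstract names two steps that are easy to illustrate in isolation: random undersampling and scoring with both accuracy and F1. The following is a minimal sketch using only the Python standard library, not the authors' implementation. The function names `random_undersample` and `accuracy_and_f1`, the fixed seed, and the toy 90/10 class split are all assumptions made for illustration and do not reflect the class ratio of the actual dataset.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(members) for members in by_class.values())
    balanced = []
    for label, members in by_class.items():
        # rng.sample draws without replacement, so no row is duplicated.
        balanced.extend((row, label) for row in rng.sample(members, target))
    rng.shuffle(balanced)
    return [row for row, _ in balanced], [label for _, label in balanced]


def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) with `positive` treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy data (hypothetical): 90 samples of class 0 and 10 of class 1.
rows = list(range(100))
labels = [0] * 90 + [1] * 10

balanced_rows, balanced_labels = random_undersample(rows, labels)
print(Counter(balanced_labels))  # 10 samples per class

# A degenerate model that always predicts the majority class.
always_majority = [0] * len(labels)
print(accuracy_and_f1(labels, always_majority))  # (0.9, 0.0)
```

On the toy data, a model that never predicts the minority class still reaches 90% accuracy while its F1 is zero, which is one reason to report F1 alongside accuracy on imbalanced data.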
DOI: https://doi.org/10.31449/inf.v49i37.10728
License
Authors retain copyright in their work. By submitting to and publishing with Informatica, authors grant the publisher (Slovene Society Informatika) the non-exclusive right to publish, reproduce, and distribute the article and to identify itself as the original publisher.
All articles are published under the Creative Commons Attribution license CC BY 3.0. Under this license, others may share and adapt the work for any purpose, provided appropriate credit is given and changes (if any) are indicated.
Authors may deposit and share the submitted version, accepted manuscript, and published version, provided the original publication in Informatica is properly cited.







