Performance Evaluation of the Filter, Wrapper, Mutual Information Theory, and Machine Learning Feature Selection Methods for XGBoost-Based Classification Tasks
Abstract
Feature selection is a process for mining datasets to identify and retain the meaningful attributes and values required for building high-performance classification or regression models. Particularly noteworthy are the relevance and interactions of features, and the reduction of noise and redundancy through associations with ground-truth values. Feature selection is most valuable for large, complex datasets in which a subset of attributes and their values contributes significantly to the decisions made by machines or human agents. This paper compares the performance of machine learning (embedded), wrapper, filter, and mutual information methods for feature selection. The Diabetes dataset, acquired from the Pima Indians Diabetes Database hosted by the National Institute of Diabetes and Digestive and Kidney Diseases, was adopted for validation. The outcomes revealed that the XGBoost model's classification accuracy of 75.76%, precision of 64.63%, and F1-score of 65.43% were the best among the compared methods. The mutual information (embedded) technique offered the best recall score of 71.25%, trailed by the filter technique. Mutual information also produced the fewest false positives at 23, followed by the filter technique at 27. The filter technique yielded a two-tailed significance test score of p = 0.059, reported as statistically significant at the 95% confidence level. The filter feature selection technique further reduces dimensionality and redundancy in the variables while maintaining the data variance. Moreover, overfitting of the model is minimized, although the degrees of freedom of the base model increase during classification tasks.
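The three families of feature selection methods compared above can be illustrated with a minimal sketch. The code below uses scikit-learn's selectors on synthetic stand-in data (768 rows and 8 features, mirroring the shape of the Pima Indians Diabetes set, which is not bundled here), and a scikit-learn gradient boosting classifier stands in for XGBoost so the example is self-contained; the selector choices, `k=5`, and the synthetic data are illustrative assumptions, not the paper's exact protocol.

```python
# Sketch: filter vs. mutual-information vs. wrapper feature selection.
# Synthetic data stands in for the Pima dataset; GradientBoostingClassifier
# stands in for XGBoost. All specifics here are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif, mutual_info_classif
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 768 samples x 8 features, echoing the Pima dataset's shape.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

selectors = {
    "filter (ANOVA F-test)": SelectKBest(f_classif, k=5),
    "mutual information": SelectKBest(mutual_info_classif, k=5),
    "wrapper (RFE)": RFE(GradientBoostingClassifier(random_state=42),
                         n_features_to_select=5),
}

results = {}
for name, sel in selectors.items():
    X_tr_sel = sel.fit_transform(X_tr, y_tr)   # keep only selected features
    X_te_sel = sel.transform(X_te)
    clf = GradientBoostingClassifier(random_state=42).fit(X_tr_sel, y_tr)
    results[name] = accuracy_score(y_te, clf.predict(X_te_sel))

for name, acc in results.items():
    print(f"{name}: accuracy = {acc:.3f}")
```

The filter and mutual-information selectors score each feature independently of the downstream model, while the wrapper (recursive feature elimination) repeatedly refits the classifier itself, which is why wrappers are typically the most expensive of the three.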
DOI: https://doi.org/10.31449/inf.v49i24.8241

This work is licensed under a Creative Commons Attribution 3.0 License.