Enhanced Leukemia Subtype Classification Using SMOTE and Hybrid Feature Selection in Microarray Data

Abstract

Blood cancer has become a growing concern over the past decade, and early detection is essential for early intervention. Traditional techniques for diagnosing blood cancer are expensive and slow, and they require medical professionals and a variety of tests. Hence, an effective prediction model with high accuracy is needed. This study presents a robust leukemia multiclass classification framework that leverages machine learning (ML) techniques to address key challenges such as class imbalance, high-dimensional gene expression data, and feature selection. For data balancing, the study uses an integrated approach that combines the Synthetic Minority Oversampling Technique (SMOTE) with nonlinear interpolation. A hybrid feature selection model combining Principal Component Analysis (PCA) with Linear Discriminant Analysis (LDA) is implemented to enhance classification performance. Experimental results indicate that SMOTE with PCA+LDA feeding a Random Forest classifier outperforms traditional methods, achieving 98% accuracy in leukemia multiclass classification.
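To make the data-balancing idea concrete, the following is a minimal sketch of standard SMOTE, which creates each synthetic minority sample by linear interpolation between a minority point and one of its nearest minority neighbours. It is not the authors' implementation: the abstract does not specify the nonlinear interpolation variant, so plain linear interpolation is shown instead. The function name `smote_oversample`, its parameters, and the toy values are hypothetical and chosen only for illustration.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points with standard (linear) SMOTE.

    Assumes `minority` holds at least two samples of equal length.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        # Choose one of the k nearest neighbours at random.
        neighbour = rng.choice(others[:k])
        # Interpolate: x_new = x_base + lam * (x_neighbour - x_base), lam in [0, 1).
        lam = rng.random()
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two features (made-up values, not real expression data).
minority_class = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [1.2, 3.0]]
new_samples = smote_oversample(minority_class, n_new=6)
balanced_minority = minority_class + new_samples
```

Because every synthetic point lies on a segment between two existing minority samples, the new samples stay inside the region the minority class already occupies. In a full workflow, oversampling of this kind would normally be applied to the training split only, so that synthetic points do not leak into the evaluation data.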

Author Biography

Chaitra P C, Computer Science and Engineering, Dayananda Sagar Academy of Technology and Management, Visvesvaraya Technological University, Belagavi, 590018

Authors

  • Chaitra P C, Computer Science and Engineering, Dayananda Sagar Academy of Technology and Management, Visvesvaraya Technological University, Belagavi, 590018
  • R Saravana Kumar, Dayananda Sagar Academy of Technology and Management, Visvesvaraya Technological University, Belagavi, 590018, India

DOI:

https://doi.org/10.31449/inf.v50i6.11956

Published

02/21/2026

How to Cite

P C, C., & Kumar, R. S. (2026). Enhanced Leukemia Subtype Classification Using SMOTE and Hybrid Feature Selection in Microarray Data. Informatica, 50(6). https://doi.org/10.31449/inf.v50i6.11956