A Random Forest-Based Machine Learning Framework with PCA, SMOTE, and SHAP for Efficient and Interpretable Coronary Artery Disease Prediction
Abstract
Coronary artery disease (CAD) is a major global cause of morbidity and mortality, so precise and scalable diagnostic tools are urgently needed. Conventional machine learning (ML) models such as XGBoost and Gradient Boosting have shown good predictive performance, but they are limited by weak handling of class imbalance, redundant feature spaces, and a lack of interpretability. To address these gaps, this work proposes an optimized Random Forest-based framework for CAD prediction that integrates feature engineering and optimization techniques. Specifically, principal component analysis (PCA) reduces dimensionality, the Synthetic Minority Oversampling Technique (SMOTE) handles class imbalance, and GridSearchCV tunes hyperparameters such as the number of estimators, maximum depth, and minimum samples per split. In addition, SHAP (SHapley Additive exPlanations) values improve interpretability by showing how much each feature contributes to the model's predictions; for example, chest pain type and cholesterol level are shown to influence CAD outcomes significantly. The framework is evaluated on the UCI Heart Disease dataset, which comprises 303 samples. In the reported experiments, the optimized Random Forest model achieves an accuracy of 95.0%, outperforming the Gradient Boosting (93.08%) and XGBoost (92.4%) classifiers. The framework offers a clinically relevant, interpretable, and scalable solution for CAD prediction, bridging the gap between technical advances and their practical deployment in healthcare environments.
DOI: https://doi.org/10.31449/inf.v49i22.7998
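The pipeline outlined in the abstract (oversampling, PCA, and a grid-searched Random Forest) can be illustrated with a minimal sketch. It is not the authors' implementation. Everything specific in it is an assumption made for illustration: the data is a synthetic stand-in generated with scikit-learn (303 rows to match the stated sample count, 13 features chosen arbitrarily), `smote_like_oversample` is a hypothetical, simplified SMOTE-style helper written here rather than a library routine, and the 95% variance threshold and grid values are placeholders. The SHAP step is left out.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def smote_like_oversample(X, y, k=5, seed=0):
    """Hypothetical helper: balance a binary dataset by interpolating
    between minority-class points and their nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_new = counts.max() - counts.min()
    if n_new == 0:
        return X, y
    X_min = X[y == minority]
    k = min(k, len(X_min) - 1)
    # The first neighbour returned is normally the point itself, so skip it.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    X_new = X_min[base] + gap * (X_min[picked] - X_min[base])
    return np.vstack([X, X_new]), np.concatenate([y, np.full(n_new, minority)])


# Synthetic stand-in data (assumption): 303 rows, 13 features, mild imbalance.
X, y = make_classification(n_samples=303, n_features=13, n_informative=6,
                           weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Oversample the training split only, so no synthetic points reach the test set.
X_bal, y_bal = smote_like_oversample(X_train, y_train)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    # Keep enough components to explain 95% of the variance (assumed threshold).
    ("pca", PCA(n_components=0.95, svd_solver="full")),
    ("rf", RandomForestClassifier(random_state=42)),
])

# Placeholder grid over the three hyperparameters named in the abstract.
param_grid = {
    "rf__n_estimators": [50, 100],
    "rf__max_depth": [None, 5],
    "rf__min_samples_split": [2, 5],
}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_bal, y_bal)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)
print(best_params, round(test_accuracy, 3))
```

Because the oversampling here happens before the grid search, the cross-validation folds contain synthetic points, which can make the validation scores optimistic. A resampling-aware pipeline, such as the one provided by the imbalanced-learn package, would resample inside each fold instead. The scores this sketch prints describe the synthetic data only and say nothing about the accuracies reported in the abstract.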
License
I assign to Informatica, An International Journal of Computing and Informatics ("Journal") the copyright in the manuscript identified above and any additional material (figures, tables, illustrations, software or other information intended for publication) submitted as part of or as a supplement to the manuscript ("Paper") in all forms and media throughout the world, in all languages, for the full term of copyright, effective when and if the article is accepted for publication. This transfer includes the right to reproduce and/or to distribute the Paper to other journals or digital libraries in electronic and online forms and systems.
I understand that I retain the rights to use the pre-prints, off-prints, accepted manuscript and published journal Paper for personal use, scholarly purposes and internal institutional use.
In certain cases, I can ask for retaining the publishing rights of the Paper. The Journal can permit or deny the request for publishing rights, to which I fully agree.
I declare that the submitted Paper is original, has been written by the stated authors and has not been published elsewhere nor is currently being considered for publication by any other journal and will not be submitted for such review while under review by this Journal. The Paper contains no material that violates proprietary rights of any other person or entity. I have obtained written permission from copyright owners for any excerpts from copyrighted works that are included and have credited the sources in my article. I have informed the co-author(s) of the terms of this publishing agreement.
Copyright © Slovenian Society Informatika







