Interpretable Machine Learning Framework for Early Depression Detection Using Socio-Demographic Features with Dual Feature Selection and SMOTE
Abstract
Depression is the most widespread psychological disorder globally, affecting individuals across all age groups; when left undiagnosed or untreated, it significantly elevates the risk of severe outcomes, including suicidality. This study evaluates eight machine learning (ML) classifiers that use socio-demographic and psychosocial data to detect signs of depression. A depression dataset available on GitHub was used, comprising 604 instances with 30 predictors and 1 target variable indicating depression status. Preprocessing included normalization, handling of missing values, and encoding of categorical variables. Two feature selection methods, Analysis of Variance (ANOVA) and Boruta, were employed to extract pertinent features: ANOVA selected 19 features, while Boruta retained 13 for model training. To address class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) was applied with the aim of improving prediction accuracy (ACC). Results show that Logistic Regression (LR) combined with ANOVA feature selection performed best, achieving an ACC of 92.56% and an AUC of 92.69%. With Boruta, LR achieved an ACC of 91.74% and an AUC of 91.65%. Without feature selection, LR yielded an ACC of 87.75%, a precision of 91.73%, and an AUC of 89.98%. SHapley Additive exPlanations (SHAP) analysis revealed that anxiety (ANXI) is the most influential predictor in the ML model for depression prediction. This study identifies the most effective model for predicting depression on the basis of these evaluation metrics, while also addressing societal biases and supporting clinicians with interpretable insights for early intervention.
DOI: https://doi.org/10.31449/inf.v49i4.10245
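The ANOVA-based selection step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which implementation the authors used, so the code below is only an assumption: a plain-Python one-way ANOVA F-statistic that scores each feature by how well it separates the classes and keeps the top-scoring ones. The function names (`anova_f_score`, `rank_features`), the `NOISE` column, and the toy values are hypothetical and are not taken from the study's dataset.

```python
from collections import defaultdict


def anova_f_score(values, labels):
    """One-way ANOVA F-statistic of one numeric feature grouped by class label."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(float(value))

    n = len(values)   # total number of samples
    k = len(groups)   # number of classes
    if k < 2 or n <= k:
        raise ValueError("need at least two classes and more samples than classes")

    grand_mean = sum(float(v) for v in values) / n
    means = {label: sum(g) / len(g) for label, g in groups.items()}

    # Variation explained by class membership (between groups).
    ss_between = sum(len(g) * (means[label] - grand_mean) ** 2
                     for label, g in groups.items())
    # Residual variation inside each class (within groups).
    ss_within = sum((x - means[label]) ** 2
                    for label, g in groups.items() for x in g)

    if ss_within == 0.0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_features(columns, labels, top_k):
    """Return the names of the top_k features with the highest F-scores."""
    scores = {name: anova_f_score(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data for illustration only (not from the study's dataset).
toy_labels = [0, 0, 0, 1, 1, 1]
toy_columns = {
    "ANXI": [1, 2, 1, 4, 5, 4],   # differs clearly between the two classes
    "NOISE": [3, 1, 2, 2, 3, 1],  # same mean in both classes
}
selected = rank_features(toy_columns, toy_labels, top_k=1)
print(selected)
```

In this toy case the feature whose class means differ receives a large F-score and is retained, while the feature with identical class means scores zero. In a full pipeline, a selection step of this kind would typically be fitted on the training split only, so that information from the test data does not leak into the choice of features.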
License
Authors retain copyright in their work. By submitting to and publishing with Informatica, authors grant the publisher (Slovene Society Informatika) the non-exclusive right to publish, reproduce, and distribute the article and to identify itself as the original publisher.
All articles are published under the Creative Commons Attribution license CC BY 3.0. Under this license, others may share and adapt the work for any purpose, provided appropriate credit is given and changes (if any) are indicated.
Authors may deposit and share the submitted version, accepted manuscript, and published version, provided the original publication in Informatica is properly cited.







