SAGE: A Unified Evaluation Framework for Data Augmentation and Few-Shot Learning on Small and Imbalanced Tabular Datasets

Abstract

In critical domains such as healthcare and finance, structured data often suffers from sample scarcity and class imbalance, undermining the standard machine learning assumption that training data adequately reflects the true distribution. To address this challenge, this study proposes SAGE (Small-sample Adaptive Generalization Evaluation), a unified framework for systematically comparing data-driven augmentation methods with model-driven few-shot learning (FSL) approaches. The framework integrates a standardized data conditioning pipeline, a suite of 12 models (6 classical classifiers and 6 FSL architectures), and multi-dimensional evaluation metrics. Experimental validation was conducted on three datasets from the medical and forensic domains: UCI Heart Disease (297 samples), Hepatitis (155 samples), and Glass Identification (214 samples). The results demonstrate that the two paradigms have complementary strengths. Among data-driven methods, CatBoost trained on data augmented by Large Language Models (LLMs) achieved a Macro-F1 of 0.4219 on the heart disease dataset, significantly outperforming traditional oversampling methods such as SMOTE (p < 0.001). Under extreme scarcity, however, model-driven approaches proved superior: Siamese Networks achieved the highest Macro-F1 of 0.5959 on heart disease and remained robust across datasets, attaining an F1-score of 0.64 on the rarest class (Class 4). Furthermore, SHAP analysis confirmed that the best-performing models captured clinically relevant features, such as albumin levels in hepatitis prediction. The SAGE framework thus provides empirical evidence to guide paradigm selection: FSL for extreme scarcity, and LLM-based augmentation for enhancing ensemble classifiers.
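
The abstract reports Macro-F1 and a per-class F1 on the rarest class. As a rough illustration of why these metrics are preferred over accuracy on imbalanced data, the sketch below computes both on a toy label set. The labels, predictions, and function names are hypothetical and are not taken from the paper or from the SAGE code; the sketch uses only the standard definition of F1.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical toy data: 8 samples of a majority class 0, 2 of a rare class 4.
y_true = [0] * 8 + [4] * 2
# Classifier A always predicts the majority class.
y_pred_majority = [0] * 10
# Classifier B finds one of the two rare samples, at the cost of one false alarm.
y_pred_mixed = [0] * 7 + [4] + [4, 0]

for name, y_pred in [("A", y_pred_majority), ("B", y_pred_mixed)]:
    print(name, accuracy(y_true, y_pred), round(macro_f1(y_true, y_pred), 4),
          per_class_f1(y_true, y_pred))
```

Both toy classifiers reach the same accuracy, yet only the second earns any credit on the rare class, which is the behaviour that Macro-F1 and the rare-class F1 are meant to expose.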

Authors

  • Yuhao Yan
  • Linlu Chen
  • Houyan Zhang
  • Chong Chen
  • Leran Liang
  • Meng Yang

DOI:

https://doi.org/10.31449/inf.v50i13.12581

Published

05/18/2026

How to Cite

Yan, Y., Chen, L., Zhang, H., Chen, C., Liang, L., & Yang, M. (2026). SAGE: A Unified Evaluation Framework for Data Augmentation and Few-Shot Learning on Small and Imbalanced Tabular Datasets. Informatica, 50(13). https://doi.org/10.31449/inf.v50i13.12581