Supervised Audio-Based Classification Framework for Fluency and Pronunciation Evaluation in Non-Native English Speech

Beifeng Wu, Yongjin Hu

Abstract


This paper presents a supervised machine learning framework for the automated assessment of second-language (L2) English fluency and pronunciation, using audio recordings from Latin American English learners. The framework extracts key acoustic features, including Mel-Frequency Cepstral Coefficients (MFCCs), Zero Crossing Rate (ZCR), root-mean-square (RMS) energy, and spectral features, from segmented audio samples. Multiple classification models, including Support Vector Machines (SVM), Random Forests (RF), k-Nearest Neighbors (kNN), and Convolutional Neural Networks (CNN), were trained to classify proficiency levels (basic, intermediate, advanced) and to detect specific pronunciation errors. In addition, regression models, including a Random Forest Regressor, were applied to predict continuous pronunciation quality scores. The study used a carefully curated dataset comprising over 18,000 audio segments, expanded through data augmentation techniques such as time shifting and playback speed variation. Experimental results show that the SVM classifier achieved over 94% accuracy in fluency classification, while the kNN model reached up to 99.9% accuracy in pronunciation evaluation. The Random Forest Regressor achieved a coefficient of determination (R²) exceeding 0.92 for predicting continuous pronunciation scores, demonstrating the framework's robustness and scalability. These findings highlight the potential of data-driven, non-speech-recognition-based approaches for scalable, automated, and accurate L2 speech assessment.
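As a minimal illustration of two of the frame-level features named above (ZCR and RMS energy) and of time-shifting augmentation, the following pure-Python sketch computes them on a toy sine wave. The function names and parameters are our own; the paper does not publish its implementation, and a real pipeline would operate on framed, windowed audio.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:])
        if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

def rms_energy(frame):
    """Root-mean-square amplitude of the frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def time_shift(samples, shift):
    """Circularly shift the waveform by `shift` samples (a simple
    augmentation, as in the time-shifting step described above)."""
    shift %= len(samples)
    return samples[-shift:] + samples[:-shift]

# Toy signal: five full cycles of a 100 Hz sine sampled at 8 kHz.
sr = 8000
frame = [math.sin(2 * math.pi * 100 * n / sr) for n in range(400)]
print(round(zero_crossing_rate(frame), 4))
print(round(rms_energy(frame), 4))  # ~0.7071 for a full-cycle sine
```

The MFCC and spectral features mentioned in the abstract require an FFT and filter-bank machinery and are typically taken from an audio library rather than written by hand.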




DOI: https://doi.org/10.31449/inf.v49i5.8667

This work is licensed under a Creative Commons Attribution 3.0 License.