Human Speech Emotion Recognition Model Based on FCN-LSTM Model

Abstract

With the popularization of smart devices and the growing demand for mental health monitoring, speech emotion recognition (SER) is becoming increasingly important in intelligent interaction. This study proposes FLA-SER, a hybrid architecture for robust SER. The architecture consists of three components: an FCN backbone for spectral-spatial feature extraction, a Bi-LSTM for modeling temporal dependencies, and a Transformer enhanced by a dynamic memory pool to capture global dependencies. A hierarchical attention framework adaptively fuses the resulting spatio-temporal features. Experimental results on the RAVDESS dataset show that the model achieves 95.3% accuracy on the 'anger' emotion, a 15% improvement over a traditional LSTM model, and it reaches an average accuracy of 94.2% on the cross-lingual CMU-MOSI dataset. FLA-SER thus offers a robust solution for SER across languages and noisy environments, with practical value in mental health monitoring and intelligent interaction scenarios.
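
As a rough illustration of the kind of pipeline the abstract describes, the sketch below chains a small FCN over log-mel spectrograms into a Bi-LSTM and a Transformer encoder, with a simple attention-pooling stage standing in for the hierarchical fusion. All layer sizes, the 8-class output, and the FLASERSketch name are illustrative assumptions rather than the authors' exact configuration, and the dynamic memory pool is not reproduced.

```python
# Minimal sketch, assuming a PyTorch implementation of the FCN -> Bi-LSTM ->
# Transformer pipeline described in the abstract. Layer sizes, the 8-class
# output, and the attention-pooling fusion are illustrative assumptions;
# the dynamic memory pool of the paper is omitted.
import torch
import torch.nn as nn


class FLASERSketch(nn.Module):
    def __init__(self, n_mels: int = 64, n_classes: int = 8):
        super().__init__()
        # FCN backbone: stacked conv blocks over the (frequency, time) plane.
        self.fcn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        feat_dim = 64 * (n_mels // 4)  # channels x pooled frequency bins
        # Bi-LSTM over the time axis of the convolutional feature map.
        self.bilstm = nn.LSTM(feat_dim, 128, batch_first=True, bidirectional=True)
        # Transformer encoder for global temporal dependencies.
        enc_layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Attention pooling as a stand-in for the hierarchical fusion stage.
        self.attn = nn.Linear(256, 1)
        self.classifier = nn.Linear(256, n_classes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, n_mels, time) log-mel spectrogram.
        x = self.fcn(spec)                               # (B, 64, n_mels/4, T/4)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)   # (B, T', feat_dim)
        x, _ = self.bilstm(x)                            # (B, T', 256)
        x = self.transformer(x)                          # (B, T', 256)
        w = torch.softmax(self.attn(x), dim=1)           # attention weights over time
        pooled = (w * x).sum(dim=1)                      # (B, 256)
        return self.classifier(pooled)                   # emotion logits


if __name__ == "__main__":
    model = FLASERSketch()
    dummy = torch.randn(2, 1, 64, 200)                   # 2 clips, 64 mel bins, 200 frames
    print(model(dummy).shape)                            # torch.Size([2, 8])
```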


Authors

  • Mingjie Wang
  • Hexi Wang

DOI:

https://doi.org/10.31449/inf.v49i36.12208

Published

12/20/2025

How to Cite

Wang, M., & Wang, H. (2025). Human Speech Emotion Recognition Model Based on FCN-LSTM Model. Informatica, 49(36). https://doi.org/10.31449/inf.v49i36.12208