Human Speech Emotion Recognition Model Based on FCN-LSTM Model
Abstract
With the popularization of smart devices and the growing demand for mental health monitoring, speech emotion recognition (SER) is becoming increasingly important in intelligent interaction. This study proposes FLA-SER, a hybrid architecture for robust SER. The architecture consists of three components: an FCN backbone for spectral-spatial feature extraction, a Bi-LSTM for modeling temporal dependencies, and a Transformer enhanced with a dynamic memory pool to capture global dependencies. A hierarchical attention framework then adaptively fuses the resulting spatio-temporal features. Experiments on the RAVDESS dataset show that the model achieves 95.3% accuracy on the 'anger' class, a 15% improvement over a traditional LSTM baseline, and it reaches an average accuracy of 94.2% on the cross-lingual CMU-MOSI dataset. FLA-SER is thus a robust solution for SER across languages and in noisy environments, and it demonstrates significant practical value in mental health monitoring and intelligent interaction scenarios.
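To make the described pipeline concrete, the following is a minimal PyTorch sketch of the three-stage design summarized in the abstract (FCN backbone, Bi-LSTM, Transformer encoder, attention-based fusion). The class name FLASERSketch, all layer sizes, and the simple attention pooling used here in place of the dynamic memory pool and hierarchical attention fusion are illustrative assumptions, not the authors' published configuration.

import torch
import torch.nn as nn

class FLASERSketch(nn.Module):
    # Illustrative sketch: FCN backbone -> Bi-LSTM -> Transformer encoder -> attention pooling.
    # Input: log-Mel spectrograms shaped (batch, 1, n_mels, frames). Sizes are placeholders.
    def __init__(self, hidden=128, n_heads=4, n_classes=8):
        super().__init__()
        # FCN backbone: stacked convolutions extract spectral-spatial features.
        self.fcn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse the frequency axis, keep the time axis
        )
        # Bi-LSTM models temporal dependencies over the frame sequence.
        self.bilstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        # Transformer encoder layer captures global dependencies across frames.
        self.transformer = nn.TransformerEncoderLayer(
            d_model=2 * hidden, nhead=n_heads, batch_first=True
        )
        # Attention pooling as a simple stand-in for the hierarchical attention fusion.
        self.attn = nn.Linear(2 * hidden, 1)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):
        feats = self.fcn(x)                       # (batch, 64, 1, frames)
        feats = feats.squeeze(2).transpose(1, 2)  # (batch, frames, 64)
        seq, _ = self.bilstm(feats)               # (batch, frames, 2*hidden)
        seq = self.transformer(seq)               # (batch, frames, 2*hidden)
        weights = torch.softmax(self.attn(seq), dim=1)
        pooled = (weights * seq).sum(dim=1)       # attention-weighted utterance summary
        return self.classifier(pooled)            # emotion logits

# Usage example: eight output classes, matching the eight emotion categories in RAVDESS.
logits = FLASERSketch()(torch.randn(2, 1, 64, 300))
print(logits.shape)  # torch.Size([2, 8])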
DOI: https://doi.org/10.31449/inf.v49i36.12208