Cross-Modal Transformer with Dynamic Attention Fusion for Emotion Recognition in Music via Audio-Lyrics Alignment
Abstract
Emotion recognition from multimodal signals remains challenging due to annotation subjectivity and heterogeneous feature spaces. To address these issues, this study proposes a cross-modal Transformer architecture with dynamic attention fusion for robust emotion classification. Raw acoustic signals are converted into time–frequency spectrograms, from which hierarchical features are extracted by a deep convolutional network. In parallel, textual data (e.g., lyrics or aligned semantic content) are encoded with a pre-trained language model to obtain context-aware embeddings. A cross-modal attention mechanism embedded in the Transformer encoder adaptively models inter-modal associations, enabling semantically guided acoustic representation learning. The fused joint representation is aggregated by pooling and passed to a fully connected classifier, yielding multi-category emotion probabilities. Experimental evaluations demonstrate that the proposed model outperforms CNN, CRNN, and standard Transformer baselines under noisy conditions (average accuracy = 0.58; macro F1 = 0.55 at 0 dB SNR) and generalizes better across datasets (AUC = 0.832–0.887). Furthermore, with only 30% labeled data, the model maintains reliable emotion continuity (CCC = 0.635; ICC = 0.584), highlighting its effectiveness in low-resource scenarios. These results confirm the potential of cross-modal Transformer fusion for advancing emotion-aware intelligent systems in multimodal perception applications.
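The abstract's fusion pipeline — text embeddings querying audio frame features through cross-modal attention, followed by a gated ("dynamic") combination and pooling — can be sketched in plain NumPy. This is a minimal illustrative sketch, not the paper's actual architecture: the feature dimensions, the sigmoid gate form, and all names (`cross_modal_attention`, `dynamic_fusion`, `w_gate`) are assumptions introduced here for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(audio_feats, text_feats):
    """Text-guided attention over audio frames: queries come from the text
    embeddings, keys/values from the audio feature sequence, so each token
    attends to the acoustically relevant frames."""
    d_k = audio_feats.shape[-1]
    scores = text_feats @ audio_feats.T / np.sqrt(d_k)  # (T_text, T_audio)
    weights = softmax(scores, axis=-1)                  # rows sum to 1
    return weights @ audio_feats                        # (T_text, d)

def dynamic_fusion(audio_ctx, text_feats, w_gate):
    """Gated fusion (an illustrative stand-in for 'dynamic attention fusion'):
    a per-position sigmoid gate weighs attended audio context against the
    raw text embedding when forming the joint representation."""
    both = np.concatenate([audio_ctx, text_feats], axis=-1)  # (T_text, 2d)
    gate = 1.0 / (1.0 + np.exp(-(both @ w_gate)))            # (T_text, 1)
    return gate * audio_ctx + (1.0 - gate) * text_feats

rng = np.random.default_rng(0)
audio = rng.standard_normal((120, 64))  # 120 spectrogram frames, 64-d CNN features
text = rng.standard_normal((20, 64))    # 20 token embeddings, projected to 64-d
w_gate = rng.standard_normal((128, 1)) * 0.1  # toy gate weights (untrained)

ctx = cross_modal_attention(audio, text)   # (20, 64) audio context per token
fused = dynamic_fusion(ctx, text, w_gate)  # (20, 64) joint representation
pooled = fused.mean(axis=0)                # mean pooling before the classifier head
```

In the full model these operations would be multi-head, learned end to end inside the Transformer encoder, and followed by a fully connected softmax classifier over the emotion categories.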
DOI: https://doi.org/10.31449/inf.v49i28.11516