Advanced Optimal Cross-Modal Fusion Mechanism for Audio-Video Based Artificial Emotion Recognition
Abstract
Advances in artificial emotional intelligence have contributed greatly to the multimodal emotion recognition task. Emotion recognition plays a crucial role in many domains, such as communication, e-learning, mental healthcare, contextual awareness, and customer satisfaction. As real-time data continues to expand, the problem of emotion recognition has become both critical and complex. A key challenge lies in recognizing emotions from heterogeneous multimodal input sources, aligning the extracted features, and developing robust emotion recognition models. In this study, we explore a cross-modal (audio and video) fusion mechanism for emotion recognition that effectively addresses these feature complexities. We use 2D-CNN and 3D-CNN deep learning models for audio and video feature extraction, respectively, and build robust emotion recognition models on top of them. The study emphasizes the importance of the Compact Bilinear Gated Pooling (CBGP) cross-modal fusion mechanism and highlights the contribution of fusing features from the audio and video modalities for emotion recognition. It also discusses the working principle of CBGP and compares its performance with peer cross-modal fusion techniques such as FBP and CBP. The advanced cross-modal fusion is further compared with baseline cross-modal fusion mechanisms, including EF-LSTM, LF-LSTM, Graph-MFN, and hybrid fusion, as well as transformer-based fusion mechanisms such as attention fusion and transformer fusion. Experiments on the benchmark CMU-MOSEI dataset achieve an accuracy of 80.3%, an F1-score of 79.2%, and an MAE of 54.2%.
DOI: https://doi.org/10.31449/inf.v49i12.7392
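
The abstract does not give implementation details of CBGP. The sketch below is only an illustrative assumption of how a compact bilinear pooling step (the Count Sketch/FFT formulation of Gao et al., 2016) combined with a sigmoid gating layer might fuse audio and video feature vectors in PyTorch. The class names CBGPFusion and CompactBilinearPooling, the gating design driven by concatenated unimodal features, and the feature/class dimensions are hypothetical and are not the authors' implementation.

    import torch
    import torch.nn as nn


    class CompactBilinearPooling(nn.Module):
        """Count Sketch approximation of the bilinear (outer) product of two
        feature vectors, computed via FFT (standard CBP formulation)."""

        def __init__(self, dim_a, dim_v, output_dim):
            super().__init__()
            self.output_dim = output_dim
            # Fixed random hash indices and signs for each modality.
            self.register_buffer("h_a", torch.randint(output_dim, (dim_a,)))
            self.register_buffer("s_a", 2 * torch.randint(2, (dim_a,)).float() - 1)
            self.register_buffer("h_v", torch.randint(output_dim, (dim_v,)))
            self.register_buffer("s_v", 2 * torch.randint(2, (dim_v,)).float() - 1)

        def _sketch(self, x, h, s):
            # Project x of shape (batch, dim) into the sketch space (batch, output_dim).
            sketch = x.new_zeros(x.size(0), self.output_dim)
            sketch.index_add_(1, h, x * s)
            return sketch

        def forward(self, audio_feat, video_feat):
            sa = torch.fft.rfft(self._sketch(audio_feat, self.h_a, self.s_a))
            sv = torch.fft.rfft(self._sketch(video_feat, self.h_v, self.s_v))
            # Element-wise product in the frequency domain approximates the
            # circular convolution of the two sketches.
            return torch.fft.irfft(sa * sv, n=self.output_dim)


    class CBGPFusion(nn.Module):
        """Hypothetical sketch: compact bilinear pooling followed by a sigmoid
        gate that re-weights the fused representation before classification."""

        def __init__(self, dim_a, dim_v, fused_dim=1024, num_classes=6):
            super().__init__()
            self.cbp = CompactBilinearPooling(dim_a, dim_v, fused_dim)
            # Assumed gating layer driven by the concatenated unimodal features.
            self.gate = nn.Linear(dim_a + dim_v, fused_dim)
            self.classifier = nn.Linear(fused_dim, num_classes)

        def forward(self, audio_feat, video_feat):
            fused = self.cbp(audio_feat, video_feat)
            g = torch.sigmoid(self.gate(torch.cat([audio_feat, video_feat], dim=-1)))
            return self.classifier(g * fused)  # gated fused features -> emotion logits


    # Example usage with hypothetical feature sizes:
    # audio = torch.randn(8, 128)   # e.g. 2D-CNN audio features
    # video = torch.randn(8, 256)   # e.g. 3D-CNN video features
    # logits = CBGPFusion(128, 256)(audio, video)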