CMMF and STAM-FNet: Multimodal Fusion Architectures for Complex Scene Understanding in Dynamic Environments
Abstract
Multimodal perception has emerged as a vital strategy for understanding complex and dynamic environments, where traditional unimodal approaches fail to handle data heterogeneity and occlusion. This paper proposes two multimodal fusion frameworks, CMMF (Cross-Modal Matching Fusion) and STAM-FNet (Spatio-Temporal Attention Multimodal Fusion Network), to address structural and temporal challenges in complex scene understanding. CMMF adopts a three-stage architecture with cross-modal semantic alignment and dynamic weighting, while STAM-FNet introduces spatio-temporal attention layers and 3D convolutions to enhance feature discrimination in dynamic environments. Experiments are conducted on a dataset of 120,000 samples covering three application scenarios: urban monitoring, indoor interaction, and transportation hubs. Evaluation uses standardized metrics including Top-1 accuracy, F1-score, AUC, Modal Gain Index, and inference delay. Compared with state-of-the-art baselines such as ResNet50, Two-Stream Transformer, and MMBT, STAM-FNet achieves up to a 15.8% improvement in accuracy and a 20% robustness gain under high-occlusion conditions, while CMMF maintains superior performance on static tasks with a low parameter count (24.3M). This work demonstrates the effectiveness of adaptive multimodal fusion in improving the accuracy, efficiency, and fault tolerance of real-world perception systems.
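The two fusion ideas summarized above lend themselves to a brief illustration. The following is a minimal PyTorch sketch, not the authors' implementation: a CMMF-style dynamic weighting module that projects two modality embeddings into a shared space and learns per-sample fusion weights, and a STAM-FNet-style block that pairs a 3D convolution with self-attention over space-time tokens. All module names, dimensions, and the two-modality setup are illustrative assumptions.

# Minimal sketch of the two fusion ideas described in the abstract.
# Module names, dimensions, and the two-modality setup are illustrative
# assumptions, not the paper's released implementation.
import torch
import torch.nn as nn


class DynamicWeightedFusion(nn.Module):
    """CMMF-style dynamic weighting: align two modality embeddings in a
    shared space, then learn per-sample, per-feature fusion weights."""

    def __init__(self, dim_a: int, dim_b: int, dim_shared: int = 256):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, dim_shared)   # align modality A
        self.proj_b = nn.Linear(dim_b, dim_shared)   # align modality B
        self.gate = nn.Sequential(
            nn.Linear(2 * dim_shared, dim_shared),
            nn.Sigmoid(),                            # weights in (0, 1)
        )

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        a = self.proj_a(feat_a)
        b = self.proj_b(feat_b)
        w = self.gate(torch.cat([a, b], dim=-1))     # sample-dependent weights
        return w * a + (1.0 - w) * b                 # convex combination


class SpatioTemporalAttentionBlock(nn.Module):
    """STAM-FNet-style block: a 3D convolution extracts local spatio-temporal
    features, then self-attention is applied over flattened space-time tokens."""

    def __init__(self, channels: int = 64, num_heads: int = 4):
        super().__init__()
        self.conv3d = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        x = torch.relu(self.conv3d(x))
        b, c, t, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)        # (batch, t*h*w, channels)
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)        # residual + layer norm
        return tokens.transpose(1, 2).reshape(b, c, t, h, w)


if __name__ == "__main__":
    fusion = DynamicWeightedFusion(dim_a=512, dim_b=128)
    fused = fusion(torch.randn(8, 512), torch.randn(8, 128))
    block = SpatioTemporalAttentionBlock(channels=64)
    out = block(torch.randn(2, 64, 8, 16, 16))
    print(fused.shape, out.shape)  # (8, 256) and (2, 64, 8, 16, 16)

The convex combination in DynamicWeightedFusion keeps the fused feature bounded by the two aligned embeddings, which is one simple way to realize the dynamic weighting the abstract describes; a full implementation would add the cross-modal matching stage and task head.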
DOI: https://doi.org/10.31449/inf.v49i9.9830







