Graph-Attention Fusion with VAE Cross-Modal Mapping and Reinforcement-Learning Visualization for Real-Time AR
Abstract
In AR scenarios, the intelligent generation and visualization of multimodal perception information face challenges such as feature heterogeneity, insufficient semantic alignment, and unstable real-time performance. To address these issues, this study proposes a feature modeling method that integrates an Attention-GCN for multimodal fusion, a variational autoencoder (VAE) with geometric and temporal constraints for cross-modal mapping, and a reinforcement-learning (PPO) driven optimization mechanism, forming a "perception–generation–presentation–feedback" closed-loop system. Experiments are conducted on a self-built multimodal dataset of 28,000 sequences, with results evaluated on a held-out test set to ensure reliability. Baseline comparisons include a unimodal CNN and a heuristic fusion model under the same computational conditions. Results demonstrate that the proposed framework achieves an average delay of 1.42 ± 0.08 s, a frame rate of 57 ± 1.5 fps, a semantic alignment rate of 92.4 ± 1.1%, and an interaction interruption rate of 3.5 ± 0.4%, outperforming the baselines in efficiency, semantic consistency, and rendering stability. These findings highlight the framework's feasibility for real-time multimodal interaction in AR scenarios and its scalability across mid-range devices.
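Because the abstract names the attention-based GCN fusion component only at a high level and the article publishes no code, the following is a minimal, illustrative PyTorch sketch of what attention-weighted graph fusion over per-modality node features could look like. The module name ModalityGraphFusion, the shared hidden dimension, and the adjacency-mask handling are assumptions made for illustration, not the authors' implementation.

# Illustrative sketch only: an assumed rendering of attention-weighted graph
# fusion over per-modality node features; names and shapes are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityGraphFusion(nn.Module):
    """Fuse per-modality node embeddings via attention-weighted graph propagation."""
    def __init__(self, in_dims, hidden_dim):
        super().__init__()
        # One linear projection per modality into a shared hidden space.
        self.proj = nn.ModuleList([nn.Linear(d, hidden_dim) for d in in_dims])
        # Scalar attention score per node pair, from concatenated node features.
        self.attn = nn.Linear(2 * hidden_dim, 1)

    def forward(self, feats, adj):
        # feats: list of [N, d_m] tensors, one per modality; adj: [N, N] graph mask.
        h = torch.stack([p(f) for p, f in zip(self.proj, feats)]).mean(0)   # [N, H]
        n = h.size(0)
        pair = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                          h.unsqueeze(0).expand(n, n, -1)], dim=-1)         # [N, N, 2H]
        scores = self.attn(pair).squeeze(-1).masked_fill(adj == 0, float('-inf'))
        alpha = torch.softmax(scores, dim=-1)                               # attention weights
        return F.relu(alpha @ h)                                            # fused node features

# Usage: two modalities (e.g. 128-d visual, 64-d depth/pose) over a 5-node scene graph.
fused = ModalityGraphFusion([128, 64], 32)(
    [torch.randn(5, 128), torch.randn(5, 64)], torch.ones(5, 5))
print(fused.shape)  # torch.Size([5, 32])

In this sketch, each modality is projected into a shared space, pairwise attention scores are masked by the scene-graph adjacency, and the softmax-normalized weights aggregate neighbor features, which is one common way to realize attention-based graph fusion; the paper's actual formulation, constraints, and hyperparameters may differ.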
DOI: https://doi.org/10.31449/inf.v49i14.11191