Graph-Attention Fusion with VAE Cross-Modal Mapping and Reinforcement-Learning Visualization for Real-Time AR
Abstract
In AR scenarios, the intelligent generation and visualization of multimodal perception information face challenges such as feature heterogeneity, insufficient semantic alignment, and unstable real-time performance. To address these issues, this study proposes a feature modeling method that integrates an Attention-GCN for multimodal fusion, a variational autoencoder (VAE) with geometric and temporal constraints for cross-modal mapping, and a reinforcement-learning-driven optimization mechanism based on proximal policy optimization (PPO), forming a "perception–generation–presentation–feedback" closed-loop system. Experiments are conducted on a self-built multimodal dataset of 28,000 sequences, with results evaluated on a held-out test set to ensure reliability. Baseline comparisons include a unimodal CNN and a heuristic fusion model under the same computational conditions. Results demonstrate that the proposed framework achieves an average delay of 1.42 ± 0.08 s, a frame rate of 57 ± 1.5 fps, a semantic alignment rate of 92.4 ± 1.1%, and an interaction interruption rate of 3.5 ± 0.4%, outperforming both baselines in efficiency, semantic consistency, and rendering stability. These findings highlight the framework's feasibility for real-time multimodal interaction in AR scenarios and its scalability across mid-range devices.
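As a rough, self-contained sketch of the fusion front end described above, the PyTorch snippet below shows a single graph-attention layer over per-modality feature nodes followed by a reparameterized VAE-style encoder head. The module names (GraphAttentionFusion, FusionVAEHead), feature dimensions, single-layer design, and mean pooling are illustrative assumptions for clarity, not the authors' implementation; the geometric/temporal constraints and the PPO-driven presentation loop are omitted.

```python
# Illustrative sketch only: attention-based graph fusion over modality nodes
# plus a VAE encoder head. All dimensions and module names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphAttentionFusion(nn.Module):
    """Fuses M modality embeddings (one graph node per modality) with GAT-style attention."""

    def __init__(self, in_dim: int, hid_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, hid_dim, bias=False)
        self.attn = nn.Linear(2 * hid_dim, 1, bias=False)  # scores each node pair

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, M, in_dim) -- one feature vector per modality
        h = self.proj(x)                                    # (B, M, H)
        B, M, H = h.shape
        hi = h.unsqueeze(2).expand(B, M, M, H)              # node i repeated along dim 2
        hj = h.unsqueeze(1).expand(B, M, M, H)              # node j repeated along dim 1
        e = F.leaky_relu(self.attn(torch.cat([hi, hj], dim=-1)).squeeze(-1))  # (B, M, M)
        alpha = torch.softmax(e, dim=-1)                    # attention over neighbor nodes
        fused = torch.bmm(alpha, h)                         # attention-weighted aggregation
        return fused.mean(dim=1)                            # pooled fused feature (B, H)


class FusionVAEHead(nn.Module):
    """Maps the fused feature to a latent code via a reparameterized VAE encoder."""

    def __init__(self, hid_dim: int, z_dim: int):
        super().__init__()
        self.mu = nn.Linear(hid_dim, z_dim)
        self.logvar = nn.Linear(hid_dim, z_dim)

    def forward(self, fused: torch.Tensor):
        mu, logvar = self.mu(fused), self.logvar(fused)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return z, mu, logvar


if __name__ == "__main__":
    # Three hypothetical modalities (e.g., vision, depth, audio), 256-dim embeddings each.
    x = torch.randn(4, 3, 256)
    fused = GraphAttentionFusion(256, 128)(x)
    z, mu, logvar = FusionVAEHead(128, 32)(fused)
    print(fused.shape, z.shape)  # torch.Size([4, 128]) torch.Size([4, 32])
```

In this sketch the latent code z would feed the downstream cross-modal generator, while a separate policy (e.g., PPO) would adjust presentation parameters from runtime feedback, closing the loop the abstract describes.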
DOI: https://doi.org/10.31449/inf.v49i14.11191
License
Authors retain copyright in their work. By submitting to and publishing with Informatica, authors grant the publisher (Slovene Society Informatika) the non-exclusive right to publish, reproduce, and distribute the article and to identify itself as the original publisher.
All articles are published under the Creative Commons Attribution license CC BY 3.0. Under this license, others may share and adapt the work for any purpose, provided appropriate credit is given and changes (if any) are indicated.
Authors may deposit and share the submitted version, accepted manuscript, and published version, provided the original publication in Informatica is properly cited.







