RT-AVTC: A Real-Time Audio-Visual Tone Correction Network Using Multimodal Deep Learning and Causal Convolution
Abstract
Real-time feedback and robustness against environmental noise are critical challenges in computer-aided Chinese tone learning. This paper introduces RT-AVTC, a novel real-time audio-visual tone correction network designed to address these challenges through multimodal deep learning. The proposed architecture integrates a Multi-task Cascaded Convolutional Network (MTCNN) for audio-visual feature alignment, a causal convolution module and a Bidirectional Long Short-Term Memory (BiLSTM) network for robust temporal sequence classification, and a feedback module incorporating Dynamic Time Warping (DTW) for quantitative error analysis. The model was evaluated on the public LRS3 dataset. Experimental results show that RT-AVTC achieves a state-of-the-art accuracy of 94.26%, significantly outperforming strong baselines including Conformer and Whisper. Notably, in a challenging -5 dB signal-to-noise ratio environment, the multimodal approach yields a 17.41% performance improvement over its audio-only counterpart. Furthermore, the error correction module provides targeted feedback, improving learners' average fundamental frequency (F0) curve deviation by over 55%. With its high accuracy, real-time processing capability, and low computational overhead, the RT-AVTC framework presents a practical solution for effective computer-aided language learning.
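As an illustration of how DTW can quantify the deviation between a learner's F0 contour and a reference contour, the following is a minimal sketch rather than the authors' implementation. The abstract does not specify the local cost or step pattern used in RT-AVTC, so the absolute-difference cost, the standard three-way step pattern, the function name `dtw_distance`, and the F0 values below are all assumptions made for this example.

```python
from math import inf


def dtw_distance(reference, learner):
    """Return the cumulative DTW cost between two F0 contours (values in Hz).

    Assumed formulation: absolute-difference local cost and the classic
    (insert, delete, match) step pattern. Not taken from the paper.
    """
    n, m = len(reference), len(learner)
    if n == 0 or m == 0:
        raise ValueError("both contours must be non-empty")

    # cost[i][j] = minimal cumulative cost of aligning
    # reference[:i] with learner[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(reference[i - 1] - learner[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # reference frame repeated
                cost[i][j - 1],      # learner frame repeated
                cost[i - 1][j - 1],  # one-to-one match
            )
    return cost[n][m]


# Hypothetical F0 contours (Hz) for a rising tone.
reference_f0 = [180.0, 190.0, 205.0, 225.0, 250.0]
slow_but_correct = [180.0, 180.0, 190.0, 205.0, 205.0, 225.0, 250.0]
flat_attempt = [200.0] * 5

print(dtw_distance(reference_f0, slow_but_correct))  # same shape, slower tempo
print(dtw_distance(reference_f0, flat_attempt))      # wrong (flat) contour
```

Because DTW warps the time axis before comparing values, a learner who produces the correct pitch shape at a different speed incurs little or no cost, while a contour with the wrong shape is penalised. This property is what makes DTW a reasonable basis for tone feedback.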
DOI: https://doi.org/10.31449/inf.v49i29.10582
Copyright © Slovenian Society Informatika







