RT-AVTC: A Real-Time Audio-Visual Tone Correction Network Using Multimodal Deep Learning and Causal Convolution

Min Wang, Yehui Duan

Abstract


Real-time feedback and robustness to environmental noise are critical challenges in computer-aided Chinese tone learning. This paper introduces RT-AVTC, a novel real-time audio-visual tone correction network designed to address these challenges through multimodal deep learning. The proposed architecture integrates a Multi-task Cascaded Convolutional Network (MTCNN) for audio-visual feature alignment, a causal convolution module and a Bidirectional Long Short-Term Memory (BiLSTM) network for robust temporal sequence classification, and a feedback module that uses Dynamic Time Warping (DTW) for quantitative error analysis. The model was evaluated on the public LRS3 dataset. Experimental results show that RT-AVTC achieved a state-of-the-art accuracy of 94.26%, significantly outperforming strong baselines including Conformer and Whisper. Notably, in a challenging -5 dB signal-to-noise ratio environment, the multimodal approach yielded a 17.41% performance improvement over its audio-only counterpart. Furthermore, the error correction module provided targeted feedback, reducing learners' average fundamental frequency (F0) curve deviation by over 55%. With its high accuracy, real-time processing capability, and low computational overhead, the RT-AVTC framework offers a practical solution for computer-aided language learning.
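To make the DTW-based feedback idea concrete, the following is a minimal sketch of how a learner's F0 contour could be compared with a reference contour. It is not the authors' implementation: the function name `dtw_distance`, the absolute-difference frame cost, and the sample F0 values are all assumptions chosen for illustration.

```python
from math import inf


def dtw_distance(learner_f0, reference_f0):
    """Return the DTW alignment cost between two F0 contours.

    Both arguments are sequences of F0 values (e.g. in Hz), one per frame.
    The per-frame cost is the absolute difference, an assumption made here
    for simplicity.
    """
    n, m = len(learner_f0), len(reference_f0)
    if n == 0 or m == 0:
        raise ValueError("both contours must be non-empty")

    # cost[i][j] is the minimal cumulative cost of aligning the first i
    # learner frames with the first j reference frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(learner_f0[i - 1] - reference_f0[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(
                cost[i - 1][j],      # learner frame repeats a reference frame
                cost[i][j - 1],      # reference frame repeats a learner frame
                cost[i - 1][j - 1],  # both contours advance together
            )
    return cost[n][m]


if __name__ == "__main__":
    # Hypothetical contours: a rising reference tone and a flatter attempt.
    reference = [200.0, 220.0, 240.0, 260.0]
    learner = [200.0, 200.0, 210.0, 220.0]
    print(dtw_distance(learner, reference))
```

Because DTW allows frames to be stretched or compressed in time, a learner who produces the right contour shape at a slower pace is not penalized for the timing difference, only for the pitch deviation that remains after alignment.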







DOI: https://doi.org/10.31449/inf.v49i29.10582

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.