Two-Way Classroom Interaction Analysis via a Coupled ConvNeXt–Multimodal Transformer for Fine-grained Behavior Recognition
Abstract
As the digital transformation of education deepens, intelligent analysis of classroom teaching behavior has become key to improving teaching quality. Traditional methods struggle to integrate the multi-source, heterogeneous data found in classrooms and are limited in jointly modeling spatiotemporal features. To address this, a bidirectional analysis framework coupling a multimodal Transformer with a convolutional neural network (CNN) is proposed: ConvNeXt-T serves as the CNN backbone for extracting spatial features of teachers' body movements, students' postures, and scene layout, while the multimodal Transformer captures the temporal dependence and cross-modal global correlations of teacher-student verbal interaction. The core dataset comprises 500 minutes of multimodal data from 10 real classrooms, recorded with a 4K camera at 30 frames per second (900,000 frames in total) and annotated with seven behavior classes, including teacher lecturing, teacher questioning, and student answering. The model was trained in PyTorch on an NVIDIA RTX 4090 GPU using the AdamW optimizer, a mixed loss function, and a batch size of 8; the loss stabilized at about 0.17 after 80 epochs. Results show that the multimodal fusion model reaches 90.2% accuracy on the behavior-recognition task, significantly higher than single-modal models. The spatiotemporal feature-interaction module raises the cross-modal correlation detection rate by 6.0% and effectively identifies the linkage between teachers' gesture pointing and students' responses. On teacher-student interaction classification, the model achieves an F1-score of 88.4%, significantly above the baseline models. The model also generalizes well to public datasets, with 96.54% accuracy on NTU RGB+D 60 cross-view (NTU60-CV), 98.30% accuracy on UTD-MHAD behavior recognition, and an AUC of 0.7478. This framework offers a new approach to fine-grained behavior analysis in educational scenarios and provides technical support for intelligent teaching evaluation.
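A minimal PyTorch sketch of the coupled design described above, under stated assumptions: the module names (VisualBranch, CoupledModel), the 512-d token width, the number of encoder layers, the learning rate and weight decay, and the plain cross-entropy loss are illustrative placeholders, not the paper's released code or its mixed loss.

```python
import torch
import torch.nn as nn
from torchvision.models import convnext_tiny


class VisualBranch(nn.Module):
    """ConvNeXt-T backbone extracting per-frame spatial features
    (teacher gestures, student postures, scene layout)."""

    def __init__(self, out_dim: int = 512):
        super().__init__()
        self.backbone = convnext_tiny(weights=None)   # pretrained weights optional
        self.backbone.classifier = nn.Flatten(1)      # drop head, keep 768-d pooled features
        self.proj = nn.Linear(768, out_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, H, W) -> per-frame tokens (B, T, out_dim)
        b, t, c, h, w = frames.shape
        feats = self.backbone(frames.reshape(b * t, c, h, w))
        return self.proj(feats).reshape(b, t, -1)


class CoupledModel(nn.Module):
    """Couples the CNN branch with a Transformer encoder over the
    concatenated visual and language token sequences, so temporal and
    cross-modal correlations are modeled jointly."""

    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 4,
                 num_classes: int = 7):               # 7 annotated behavior classes
        super().__init__()
        self.visual = VisualBranch(dim)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frames, lang_tokens):
        # lang_tokens: (B, L, dim) embeddings of teacher-student speech
        vis = self.visual(frames)                     # (B, T, dim)
        cls = self.cls_token.expand(vis.size(0), -1, -1)
        encoded = self.encoder(torch.cat([cls, vis, lang_tokens], dim=1))
        return self.head(encoded[:, 0])               # classify from the [CLS] token


# Training setup per the abstract: AdamW, batch size 8, ~80 epochs.
model = CoupledModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
criterion = nn.CrossEntropyLoss()                     # stand-in for the mixed loss

def train_epoch(loader):
    for frames, lang_tokens, labels in loader:        # DataLoader(batch_size=8)
        optimizer.zero_grad()
        loss = criterion(model(frames, lang_tokens), labels)
        loss.backward()
        optimizer.step()
```

In this coupling, the CNN supplies per-frame spatial tokens while the Transformer attends jointly across frames and speech tokens, which is what lets the model relate a teacher's gesture pointing to subsequent student responses.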
DOI: https://doi.org/10.31449/inf.v49i20.10585







