Two-Way Classroom Interaction Analysis via a Coupled ConvNeXt–Multimodal Transformer for Fine-grained Behavior Recognition
Abstract
As the digital transformation of education deepens, intelligent analysis of classroom teaching behavior has become key to improving teaching quality. Traditional methods struggle to integrate the multi-source, heterogeneous data found in classrooms and are limited in jointly modeling spatiotemporal features. To address this, a bidirectional analysis framework coupling a multimodal Transformer with a convolutional neural network (CNN) is proposed: ConvNeXt-T serves as the CNN backbone to extract spatial features of teachers' body movements, students' postures, and scene layout, while a multimodal Transformer captures the temporal dependencies and cross-modal global correlations of teacher–student language interaction. The study uses 500 minutes of multimodal data from 10 real classrooms (4K video at 30 frames per second, 900,000 frames in total) as the core dataset, annotated with 7 behavior classes such as teacher lecturing, questioning, and student answering. The model is trained in PyTorch on an NVIDIA RTX 4090 GPU with the AdamW optimizer, a mixed loss function, and a batch size of 8; the loss stabilizes at about 0.17 after 80 training epochs. The results show that the multimodal fusion model reaches 90.2% accuracy on the behavior recognition task, significantly higher than single-modal models. The spatiotemporal feature interaction module raises the detection rate of cross-modal correlations by 6.0% and effectively identifies the linkage between teachers' gesture pointing and students' responses. In teacher–student interaction classification, the model attains an F1 score of 88.4%, significantly higher than the baseline models. In addition, the model generalizes well to public datasets, achieving 96.54% accuracy on NTU60-CV (cross-view), 98.30% accuracy on UTD-MHAD behavior recognition, and an AUC of 0.7478.
This framework offers new ideas for fine-grained behavior analysis in educational scenarios and provides technical support for intelligent teaching evaluation.

DOI: https://doi.org/10.31449/inf.v49i20.10585
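The coupling the abstract describes — spatial features from a ConvNeXt-T backbone combined with a cross-modal temporal embedding, then classified into 7 behavior classes — can be illustrated with a minimal NumPy sketch. This assumes simple late fusion by temporal pooling and concatenation; the paper's actual interaction module is not specified here, and every dimension and function name below is hypothetical except ConvNeXt-T's 768-dimensional output width.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration (only D_SPATIAL matches ConvNeXt-T).
N_FRAMES = 16      # frames sampled from a classroom clip
D_SPATIAL = 768    # ConvNeXt-T final feature width
D_TEMPORAL = 256   # assumed width of the transformer's cross-modal embedding
N_CLASSES = 7      # the 7 annotated behavior classes

def fuse_and_classify(spatial_feats, temporal_feat, w, b):
    """Late-fusion sketch: average per-frame spatial features over time,
    concatenate with the temporal embedding, then apply a linear
    classifier with a numerically stable softmax over the 7 classes."""
    pooled = spatial_feats.mean(axis=0)              # (D_SPATIAL,)
    fused = np.concatenate([pooled, temporal_feat])  # (D_SPATIAL + D_TEMPORAL,)
    logits = fused @ w + b                           # (N_CLASSES,)
    exp = np.exp(logits - logits.max())              # subtract max for stability
    return exp / exp.sum()

# Random stand-ins for backbone outputs; a trained model supplies these.
spatial = rng.standard_normal((N_FRAMES, D_SPATIAL))
temporal = rng.standard_normal(D_TEMPORAL)
w = rng.standard_normal((D_SPATIAL + D_TEMPORAL, N_CLASSES)) * 0.01
b = np.zeros(N_CLASSES)

probs = fuse_and_classify(spatial, temporal, w, b)
print(probs.shape)  # (7,) — one probability per behavior class
```

In the paper's framework, the concatenation step would be replaced by the spatiotemporal feature interaction module, which models cross-modal correlations rather than simply stacking the two feature vectors.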
License
Authors retain copyright in their work. By submitting to and publishing with Informatica, authors grant the publisher (Slovene Society Informatika) the non-exclusive right to publish, reproduce, and distribute the article and to identify itself as the original publisher.
All articles are published under the Creative Commons Attribution license CC BY 3.0. Under this license, others may share and adapt the work for any purpose, provided appropriate credit is given and changes (if any) are indicated.
Authors may deposit and share the submitted version, accepted manuscript, and published version, provided the original publication in Informatica is properly cited.