Multi-modal Video Forgery Detection via Improved Efficient-Net With Attention and Transformer Fusion

Zheng Ji, Luhao Cao

Abstract


With the continuous advancement of deep learning technology, video forgery poses serious negative social impacts, yet existing video forgery detection methods suffer from low detection accuracy, weak feature extraction, and insufficient robustness. This study therefore proposes two video forgery detection models, one based on an Improved Efficient-Net and one on multi-modal feature fusion. The Improved Efficient-Net model uses structural similarity coefficients to enhance the video frames and introduces a hybrid attention module into the Efficient-Net backbone. The multi-modal feature fusion model fuses the red, green, and blue channels of each frame with frequency-domain and optical-flow-field features, and uses a hybrid loss function to weight the individual loss terms. Experiments show that the Improved Efficient-Net reaches a maximum recognition accuracy of 98.57% on the FaceForensics++ dataset, 6.24% and 9.53% higher than the baseline Efficient-Net and Convolutional Visual Transformer models, respectively. On FaceForensics++, the multi-modal feature fusion model achieves a recognition accuracy of 99.26%. On the BioDeepAV dataset, its recognition accuracy drops by at most 20.57%, a decline 2.81% smaller than that of the baseline Efficient-Net model, and it still achieves the highest recognition accuracy among all compared models. The improved models can therefore effectively increase the accuracy of forged-video identification, improve the efficiency of Internet supervision, and reduce the social harm of video forgery.
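The abstract gives no implementation details, so the Python (PyTorch/torchvision) sketch below illustrates only one plausible reading of the multi-modal fusion model and its hybrid loss: three Efficient-Net branches over RGB, frequency-domain, and optical-flow inputs, concatenated features, and a weighted sum of per-branch and fused cross-entropy losses. The branch layout, the Efficient-Net-B0 encoders, the log-magnitude frequency transform, and the loss weights are illustrative assumptions, not the authors' code.

    # Hypothetical sketch of the multi-modal fusion model described in the
    # abstract. All names, shapes, and weights here are assumptions.
    import torch
    import torch.nn as nn
    import torchvision.models as models

    def efficientnet_encoder() -> nn.Module:
        """Efficient-Net-B0 feature extractor (1280-d pooled output)."""
        backbone = models.efficientnet_b0(weights=None)
        return nn.Sequential(backbone.features, backbone.avgpool, nn.Flatten())

    class MultiModalFusionNet(nn.Module):
        def __init__(self, num_classes: int = 2):
            super().__init__()
            self.rgb_enc = efficientnet_encoder()
            self.freq_enc = efficientnet_encoder()
            self.flow_enc = efficientnet_encoder()
            # Dense optical flow has 2 channels (dx, dy); project to 3
            # channels so it fits the Efficient-Net stem.
            self.flow_proj = nn.Conv2d(2, 3, kernel_size=1)
            # Per-branch heads, used only for the auxiliary loss terms.
            self.heads = nn.ModuleList(
                [nn.Linear(1280, num_classes) for _ in range(3)])
            self.fused_head = nn.Linear(3 * 1280, num_classes)

        @staticmethod
        def to_frequency(x: torch.Tensor) -> torch.Tensor:
            # Log-magnitude spectrum as the frequency-domain view of a frame.
            return torch.log1p(torch.abs(torch.fft.fft2(x, norm="ortho")))

        def forward(self, rgb: torch.Tensor, flow: torch.Tensor):
            f_rgb = self.rgb_enc(rgb)
            f_freq = self.freq_enc(self.to_frequency(rgb))
            f_flow = self.flow_enc(self.flow_proj(flow))
            branch_logits = [h(f) for h, f in
                             zip(self.heads, (f_rgb, f_freq, f_flow))]
            fused = torch.cat([f_rgb, f_freq, f_flow], dim=1)
            return self.fused_head(fused), branch_logits

    def hybrid_loss(fused_logits, branch_logits, target,
                    weights=(1.0, 0.3, 0.3, 0.3)):
        """Weighted ("hybrid") sum of the fused loss and the three branch
        losses; the weights are placeholders, not the paper's values."""
        ce = nn.functional.cross_entropy
        loss = weights[0] * ce(fused_logits, target)
        for w, logits in zip(weights[1:], branch_logits):
            loss = loss + w * ce(logits, target)
        return loss

    if __name__ == "__main__":
        model = MultiModalFusionNet()
        rgb = torch.randn(2, 3, 224, 224)   # batch of video frames
        flow = torch.randn(2, 2, 224, 224)  # dense optical flow (dx, dy)
        y = torch.tensor([0, 1])            # 0 = real, 1 = forged
        fused, branches = model(rgb, flow)
        print(hybrid_loss(fused, branches, y))

Training all three branches jointly with auxiliary losses, as sketched here, is one common way to "weight all the loss function errors"; the paper's actual weighting scheme may differ.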



DOI: https://doi.org/10.31449/inf.v49i30.8831

This work is licensed under a Creative Commons Attribution 3.0 License.