Multi-modal Video Forgery Detection via Improved Efficient-Net With Attention and Transformer Fusion

Zheng Ji, Luhao Cao

Abstract


With the continuous advancement of deep learning technology, video forgery poses serious negative social impacts, yet existing video forgery detection methods suffer from low detection accuracy, weak feature extraction, and insufficient robustness. This study therefore proposes two video forgery detection models, one based on an Improved Efficient-Net and one on multi-modal feature fusion. The Improved Efficient-Net model uses structural similarity coefficients to enhance the video frames and introduces a hybrid attention module into the Efficient-Net backbone. The multi-modal feature fusion model fuses the red, green, and blue channels of each frame with frequency-domain and optical-flow-field features, and uses a hybrid loss function to weight the individual loss terms. Experiments show that the Improved Efficient-Net reaches a maximum recognition accuracy of 98.57% on the FaceForensics++ dataset, 6.24% and 9.53% higher than the baseline Efficient-Net and Convolutional Visual Transformer models, respectively. On FaceForensics++, the multi-modal feature fusion model achieves a recognition accuracy of 99.26%. On the BioDeepAV dataset, its recognition accuracy drops by at most 20.57%, a decline 2.81% smaller than that of the baseline Efficient-Net model, and it still achieves the highest recognition accuracy among all compared models. The improved models can therefore effectively increase the accuracy of forged-video identification, improve the efficiency of Internet supervision, and reduce the social harm of video forgery.
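The abstract gives no implementation details, so the Python (PyTorch/torchvision) sketch below illustrates only one plausible reading of the multi-modal fusion model and its hybrid loss: three Efficient-Net branches over RGB, frequency-domain, and optical-flow inputs, concatenated features, and a weighted sum of per-branch and fused cross-entropy losses. The branch layout, the Efficient-Net-B0 encoders, the log-magnitude frequency transform, and the loss weights are illustrative assumptions, not the authors' code.

    # Hypothetical sketch of the multi-modal fusion model described in the
    # abstract. All names, shapes, and weights here are assumptions.
    import torch
    import torch.nn as nn
    import torchvision.models as models

    def efficientnet_encoder() -> nn.Module:
        """Efficient-Net-B0 feature extractor (1280-d pooled output)."""
        backbone = models.efficientnet_b0(weights=None)
        return nn.Sequential(backbone.features, backbone.avgpool, nn.Flatten())

    class MultiModalFusionNet(nn.Module):
        def __init__(self, num_classes: int = 2):
            super().__init__()
            self.rgb_enc = efficientnet_encoder()
            self.freq_enc = efficientnet_encoder()
            self.flow_enc = efficientnet_encoder()
            # Dense optical flow has 2 channels (dx, dy); project to 3
            # channels so it fits the Efficient-Net stem.
            self.flow_proj = nn.Conv2d(2, 3, kernel_size=1)
            # Per-branch heads, used only for the auxiliary loss terms.
            self.heads = nn.ModuleList(
                [nn.Linear(1280, num_classes) for _ in range(3)])
            self.fused_head = nn.Linear(3 * 1280, num_classes)

        @staticmethod
        def to_frequency(x: torch.Tensor) -> torch.Tensor:
            # Log-magnitude spectrum as the frequency-domain view of a frame.
            return torch.log1p(torch.abs(torch.fft.fft2(x, norm="ortho")))

        def forward(self, rgb: torch.Tensor, flow: torch.Tensor):
            f_rgb = self.rgb_enc(rgb)
            f_freq = self.freq_enc(self.to_frequency(rgb))
            f_flow = self.flow_enc(self.flow_proj(flow))
            branch_logits = [h(f) for h, f in
                             zip(self.heads, (f_rgb, f_freq, f_flow))]
            fused = torch.cat([f_rgb, f_freq, f_flow], dim=1)
            return self.fused_head(fused), branch_logits

    def hybrid_loss(fused_logits, branch_logits, target,
                    weights=(1.0, 0.3, 0.3, 0.3)):
        """Weighted ("hybrid") sum of the fused loss and the three branch
        losses; the weights are placeholders, not the paper's values."""
        ce = nn.functional.cross_entropy
        loss = weights[0] * ce(fused_logits, target)
        for w, logits in zip(weights[1:], branch_logits):
            loss = loss + w * ce(logits, target)
        return loss

    if __name__ == "__main__":
        model = MultiModalFusionNet()
        rgb = torch.randn(2, 3, 224, 224)   # batch of video frames
        flow = torch.randn(2, 2, 224, 224)  # dense optical flow (dx, dy)
        y = torch.tensor([0, 1])            # 0 = real, 1 = forged
        fused, branches = model(rgb, flow)
        print(hybrid_loss(fused, branches, y))

Training all three branches jointly with auxiliary losses, as sketched here, is one common way to "weight all the loss function errors"; the paper's actual weighting scheme may differ.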



DOI: https://doi.org/10.31449/inf.v49i30.8831

This work is licensed under a Creative Commons Attribution 3.0 License.