ISTVT: Interpretable Spatial-Temporal Video Transformer for Deepfake Detection

Cited by: 52
Authors
Zhao, Cairong [1 ]
Wang, Chutian
Hu, Guosheng [2 ]
Chen, Haonan [3 ]
Liu, Chun [4 ]
Tang, Jinhui [5 ]
Affiliations
[1] Tongji Univ, Dept Comp Sci & Technol, Shanghai 201804, Peoples R China
[2] Oosto, Belfast 38330, Northern Ireland
[3] Alibaba Grp, Hangzhou 310052, Peoples R China
[4] Tongji Univ, Coll Surveying & Geoinformat, Shanghai 200092, Peoples R China
[5] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Nanjing 210094, Peoples R China
Keywords
Deepfakes; Transformers; Visualization; Task analysis; Electronic mail; Shape; Robustness; Deepfake detection; video transformer; deep learning interpretability
DOI
10.1109/TIFS.2023.3239223
CLC number
TP301 [Theory and Methods]
Subject classification code
081202
Abstract
With the rapid development of Deepfake synthesis technology, our information security and personal privacy have been severely threatened in recent years. To achieve robust Deepfake detection, researchers attempt to exploit the joint spatial-temporal information in videos, for example with recurrent networks and 3D convolutional networks. However, these spatial-temporal models still leave room for improvement. Another general challenge for spatial-temporal models is that it is not clearly understood what these models really learn. To address these two challenges, in this paper we propose an Interpretable Spatial-Temporal Video Transformer (ISTVT), which consists of a novel decomposed spatial-temporal self-attention and a self-subtract mechanism to capture spatial artifacts and temporal inconsistency for robust Deepfake detection. Thanks to this decomposition, we propose to interpret ISTVT by visualizing the discriminative regions along both the spatial and temporal dimensions via a relevance (pixel-wise importance on the input) propagation algorithm. We conduct extensive experiments on large-scale datasets, including the FaceForensics++, FaceShifter, DeeperForensics, Celeb-DF, and DFDC datasets. Our strong performance on intra-dataset and cross-dataset Deepfake detection demonstrates the effectiveness and robustness of our method, and our visualization-based interpretability offers insights into our model.
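The decomposition the abstract describes, spatial self-attention within each frame, temporal self-attention across frames, preceded by a self-subtract step that highlights frame-to-frame inconsistency, can be sketched in NumPy. This is a minimal single-head illustration with identity query/key/value projections; the function names and the exact placement of the self-subtract step are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (tokens, dim). Single head, identity Q/K/V projections for brevity.
    d = x.shape[-1]
    attn = softmax(x @ x.T / np.sqrt(d))
    return attn @ x

def decomposed_st_attention(video):
    # video: (T, N, D) = T frames, N spatial tokens per frame, D channels.
    T, N, D = video.shape
    # Self-subtract: differences of consecutive frames emphasize
    # temporal inconsistency (the first frame is kept as-is).
    diff = np.concatenate([video[:1], video[1:] - video[:-1]], axis=0)
    # Spatial attention: tokens attend within their own frame.
    spatial = np.stack([self_attention(diff[t]) for t in range(T)], axis=0)
    # Temporal attention: each spatial position attends across frames.
    temporal = np.stack([self_attention(spatial[:, n]) for n in range(N)],
                        axis=1)
    return temporal  # (T, N, D)
```

Factorizing the full (T*N)-token attention into a spatial pass and a temporal pass in this way reduces the attention cost from O((TN)^2) to O(T*N^2 + N*T^2), which is what makes per-dimension relevance visualization tractable.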
Pages: 1335-1348 (14 pages)
Related Papers
50 items in total
  • [1] Exploring spatial-temporal features fusion model for Deepfake video detection
    Wu, Jiujiu
    Zhou, Jiyu
    Wang, Danyu
    Wang, Lin
    JOURNAL OF ELECTRONIC IMAGING, 2023, 32 (06)
  • [2] Deepfake Video Detection Model Based on Consistency of Spatial-Temporal Features
    Zhao L.
    Ge W.
    Mao Y.
    Han M.
    Li W.
    Li X.
    Gongcheng Kexue Yu Jishu/Advanced Engineering Sciences, 2020, 52 (04): 243-250
  • [3] Learning Complementary Spatial-Temporal Transformer for Video Salient Object Detection
    Liu, Nian
    Nan, Kepan
    Zhao, Wangbo
    Yao, Xiwen
    Han, Junwei
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (08): 10663-10673
  • [4] Collaborative spatial-temporal video salient object detection with cross attention transformer
    Su, Yuting
    Wang, Weikang
    Liu, Jing
    Jing, Peiguang
    SIGNAL PROCESSING, 2024, 224
  • [5] HierGAT: hierarchical spatial-temporal network with graph and transformer for video HOI detection
    Wu, Junxian
    Zhang, Yujia
    Kampffmeyer, Michael
    Pan, Yi
    Zhang, Chenyu
    Sun, Shiying
    Chang, Hui
    Zhao, Xiaoguang
    MULTIMEDIA SYSTEMS, 2025, 31 (01)
  • [6] Spatial-Temporal Transformer for Video Snapshot Compressive Imaging
    Wang, Lishun
    Cao, Miao
    Zhong, Yong
    Yuan, Xin
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (07): 9072-9089
  • [7] ShiftFormer: Spatial-Temporal Shift Operation in Video Transformer
    Yang, Beiying
    Zhu, Guibo
    Ge, Guojing
    Luo, Jinzhao
    Wang, Jinqiao
    2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME, 2023: 1895-1900
  • [8] Learning a spatial-temporal texture transformer network for video inpainting
    Ma, Pengsen
    Xue, Tao
    FRONTIERS IN NEUROROBOTICS, 2022, 16
  • [9] Spatial-temporal Graph Transformer Network for Spatial-temporal Forecasting
    Dao, Minh-Son
    Zetsu, Koji
    Hoang, Duy-Tang
    Proceedings - 2024 IEEE International Conference on Big Data, BigData 2024, 2024: 1276-1281
  • [10] Deepfake Video Detection with Spatiotemporal Dropout Transformer
    Zhang, Daichi
    Lin, Fanzhao
    Hua, Yingying
    Wang, Pengju
    Zeng, Dan
    Ge, Shiming
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022: 5833-5841