Spatio-Temporal Catcher: a Self-Supervised Transformer for Deepfake Video Detection

Cited by: 3
Authors
Li, Maosen [1, 2]
Li, Xurong [2]
Yu, Kun [2]
Deng, Cheng [1]
Huang, Heng [3]
Mao, Feng [2]
Xue, Hui [2]
Li, Minghao [2]
Affiliations
[1] Xidian Univ, Xian, Shaanxi, Peoples R China
[2] Alibaba Grp, Hangzhou, Zhejiang, Peoples R China
[3] Univ Maryland, College Pk, MD USA
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Keywords
deepfake video detection; self-supervised learning; video analysis
DOI
10.1145/3581783.3613842
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Deepfake technology has become increasingly sophisticated and accessible, making it easier for individuals with malicious intent to create convincing fake content, which has raised considerable concern in the multimedia and computer vision community. Despite significant advances in deepfake video detection, most existing methods focus mainly on model architecture and training processes and pay little attention to the data perspective. In this paper, we argue that data quality has become the main bottleneck of current research. Specifically, in the pre-training phase, the domain shift between the pre-training and target datasets may lead to poor generalization ability. Meanwhile, in the training phase, the low fidelity of existing datasets leads detectors to rely on specific low-level visual artifacts or inconsistencies. To overcome these shortcomings: (1) In the pre-training phase, we pre-train our model on high-quality facial videos using data-efficient reconstruction-based self-supervised learning to address the domain shift. (2) In the training phase, we develop a novel spatio-temporal generator that can synthesize various high-quality "fake" videos in large quantities at low cost, which enables our model to learn more general spatio-temporal representations in a self-supervised manner. (3) Additionally, to take full advantage of the synthetic "fake" videos, we adopt diversity losses at both the frame and video levels to explore the diversity of clues in "fake" videos. Our proposed framework is data-efficient and does not require any real-world deepfake videos. Extensive experiments demonstrate that our method significantly improves generalization capability. In particular, on the most challenging CDF and DFDC datasets, our method outperforms the baselines by 8.88 and 7.73 percentage points, respectively. Our code and appendix can be found at github.com/llosta/STC.
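The abstract does not specify how the spatio-temporal generator builds its "fake" videos. As a rough illustration only of the general idea (producing a labeled pseudo-fake clip from a real one, with no real deepfake data), the sketch below blends a shifted copy of each frame into a random rectangle over a random window of frames. The function name make_pseudo_fake, the parameters max_shift and alpha, and the blending scheme itself are all assumptions made for this example; this is not the authors' generator.

```python
import numpy as np

def make_pseudo_fake(clip, rng, max_shift=4, alpha=0.5):
    """Hypothetical helper: clip is a float array (T, H, W, C) in [0, 1], T >= 2."""
    t, h, w, _ = clip.shape
    fake = clip.copy()

    # Spatial part (assumption): a random rectangle stands in for a
    # manipulated face region.
    y0 = int(rng.integers(0, h // 2))
    x0 = int(rng.integers(0, w // 2))
    y1, x1 = y0 + h // 2, x0 + w // 2

    # Temporal part (assumption): only a contiguous window of frames is altered.
    start = int(rng.integers(0, t - 1))
    stop = int(rng.integers(start + 1, t + 1))

    frame_labels = np.zeros(t, dtype=np.int64)  # 1 marks an altered frame
    for i in range(start, stop):
        # A fresh shift per frame makes the artifact inconsistent over time.
        dy, dx = rng.integers(1, max_shift + 1, size=2)
        shifted = np.roll(clip[i], shift=(int(dy), int(dx)), axis=(0, 1))
        fake[i, y0:y1, x0:x1] = (
            (1 - alpha) * clip[i, y0:y1, x0:x1] + alpha * shifted[y0:y1, x0:x1]
        )
        frame_labels[i] = 1
    return fake, frame_labels

rng = np.random.default_rng(0)
clip = rng.random((8, 16, 16, 3))          # stand-in for a real face clip
fake, labels = make_pseudo_fake(clip, rng)
print(fake.shape, int(labels.max()))       # video-level label is 1 if any frame is altered
```

The per-frame labels and their maximum give supervision at the frame and video levels, which is the granularity at which the abstract says the losses operate; the diversity losses themselves are not sketched here because the abstract gives no formula for them.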
Pages: 8707-8718
Number of pages: 12