Detect Text Forgery with Non-forged Image Features: A Framework for Detection and Grounding of Image-Text Manipulation

Cited by: 0
Authors
Wang, Yangyang [1 ]
Miao, Changtao [1 ]
Chu, Qi [1 ]
Gong, Tao [1 ]
Sheng, Dianmo [1 ]
Wang, Jiazhen [1 ]
Liu, Bin [1 ]
Yu, Nenghai [1 ]
Affiliation
[1] Univ Sci & Technol China, CAS Key Lab Electromagnet Space Informat, Hefei, Peoples R China
Source
PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT XI | 2025 / Vol. 15041
Funding
National Natural Science Foundation of China;
Keywords
Multi-modal; Manipulation detection and grounding; Feature fusion; Gated network;
DOI
10.1007/978-981-97-8795-1_25
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With the rapid development of generative models, multimodal fake media has proliferated across the Internet. Detecting and grounding forged images and text is crucial to advancing cybersecurity. Most existing approaches rely on image-text inconsistency to detect and ground multimodal forgery. However, when the visual and textual modalities are manipulated simultaneously, the forged image and forged text may remain consistent with each other, which makes such forgeries hard to detect. To address this problem, we divide multimodal forgery detection into two sub-tasks: detecting image forgery, and detecting text forgery using non-forged image areas. Specifically, we propose a novel progressive reasoning framework that fuses text features only with features from authentic image areas, with the fusion controlled by the Forgery-Aware Feature Gate (FAFG). Additionally, we introduce Multi-Scale Feature Aggregation (MSFA) to enhance image forgery detection by aggregating multi-scale image features. Experimental results demonstrate that our method outperforms previous state-of-the-art methods while using fewer training epochs.
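The gating idea in the abstract can be illustrated with a toy example. The sketch below is not the authors' FAFG; it is a minimal stand-in that assumes a per-patch forgery probability is already available, drops patches at or above a hypothetical `threshold`, and lets the text feature attend only to the remaining patches. The function names, the hard threshold, and the residual sum are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fuse(text_vec, patch_feats, forgery_probs, threshold=0.5):
    """Fuse a text feature with image patches judged authentic.

    Hypothetical illustration only: patches whose forgery probability
    is at or above `threshold` are excluded before attention.
    """
    # Hard gate: keep indices of patches predicted to be non-forged.
    keep = [i for i, p in enumerate(forgery_probs) if p < threshold]
    if not keep:
        # No trustworthy image evidence: fall back to the text feature.
        return list(text_vec)
    # Dot-product attention of the text feature over the kept patches.
    scores = [sum(t * f for t, f in zip(text_vec, patch_feats[i])) for i in keep]
    weights = softmax(scores)
    dim = len(text_vec)
    context = [
        sum(w * patch_feats[i][d] for w, i in zip(weights, keep))
        for d in range(dim)
    ]
    # Residual sum of the text feature and the gated image context.
    return [t + c for t, c in zip(text_vec, context)]


fused = gated_fuse(
    text_vec=[1.0, 0.0],
    patch_feats=[[1.0, 0.0], [0.0, 5.0]],
    forgery_probs=[0.1, 0.9],
)
```

In this toy call the second patch is gated out, so the fused feature depends only on the first patch. A learned gate would more plausibly be soft and trained end to end; the hard threshold here is chosen only to keep the example short.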
Pages: 366-380
Page count: 15