Detect Text Forgery with Non-forged Image Features: A Framework for Detection and Grounding of Image-Text Manipulation

Cited by: 0

Authors:

Wang, Yangyang [1]
Miao, Changtao [1]
Chu, Qi [1]
Gong, Tao [1]
Sheng, Dianmo [1]
Wang, Jiazhen [1]
Liu, Bin [1]
Yu, Nenghai [1]

Affiliations:

[1] Univ Sci & Technol China, CAS Key Lab Electromagnet Space Informat, Hefei, Peoples R China

Source:

PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT XI | 2025 / Vol. 15041

Funding:

National Natural Science Foundation of China;

Keywords:

Multi-modal; Manipulation detection and grounding; Feature fusion; Gated network;

DOI:

10.1007/978-981-97-8795-1_25

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

With the rapid development of generative models, multimodal fake media has proliferated across the Internet. Detecting and grounding forged images and text is crucial to advancing cybersecurity. Most existing approaches rely on image-text inconsistency to detect and ground multi-modal forgery. However, simultaneous manipulations of the visual and textual modalities may preserve the consistency between forged images and forged text, which makes forgery detection challenging. To address this problem, we divide the task of detecting multi-modal forgery into two sub-tasks: detecting image forgery, and detecting text forgery with non-forged image areas. Specifically, we propose a novel progressive reasoning framework that fuses features only between the text and authentic image areas, controlled by the Forgery-Aware Feature Gate (FAFG). Additionally, we introduce Multi-Scale Feature Aggregation (MSFA) to enhance image forgery detection by aggregating multi-scale image features. Experimental results demonstrate that our method outperforms previous state-of-the-art methods while using fewer training epochs.
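The abstract does not specify how FAFG is implemented, so the following is only an illustrative sketch of the general idea of gating image features by predicted authenticity before fusing them with text. Everything in it is an assumption made for illustration: the function names `gated_image_summary` and `fuse_with_text`, the gate `1 - forgery probability`, the weighted-mean pooling, and the concatenation-based fusion are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def gated_image_summary(patch_feats: Sequence[Sequence[float]],
                        forgery_probs: Sequence[float]) -> List[float]:
    """Weighted mean of patch features, weighting each patch by (1 - forgery probability).

    Hypothetical stand-in for a forgery-aware gate: patches judged authentic
    contribute fully, patches judged forged contribute little or nothing.
    """
    if not patch_feats:
        raise ValueError("at least one patch feature is required")
    if len(patch_feats) != len(forgery_probs):
        raise ValueError("one forgery probability is needed per patch")

    gates = [1.0 - p for p in forgery_probs]  # assumed gate, not the paper's
    total = sum(gates)
    dim = len(patch_feats[0])
    if total == 0.0:
        # Every patch is judged forged, so no image evidence passes the gate.
        return [0.0] * dim
    return [
        sum(g * feat[d] for g, feat in zip(gates, patch_feats)) / total
        for d in range(dim)
    ]


def fuse_with_text(text_feat: Sequence[float],
                   patch_feats: Sequence[Sequence[float]],
                   forgery_probs: Sequence[float]) -> List[float]:
    """Concatenate the text feature with the gated image summary (assumed fusion)."""
    return list(text_feat) + gated_image_summary(patch_feats, forgery_probs)


# Toy input: the third patch is judged fully forged and is therefore ignored.
patches = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
probs = [0.0, 0.0, 1.0]
fused = fuse_with_text([0.5], patches, probs)
print(fused)  # [0.5, 0.5, 0.5]
```

In this toy case the forged patch `[4.0, 4.0]` has gate 0, so the fused vector depends only on the text feature and the two authentic patches. A learned gate and an attention-based fusion would replace these fixed rules in a real model.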


Pages: 366-380

Page count: 15

