Towards Multimodal Disinformation Detection by Vision-language Knowledge Interaction

Cited by: 13

Authors:

Li, Qilei [1,2]

Gao, Mingliang [1]

Zhang, Guisheng [1]

Zhai, Wenzhe [1]

Chen, Jinyong [1]

Jeon, Gwanggil [3]

Affiliations:

[1] Shandong Univ Technol, Sch Elect & Elect Engn, Zibo 255000, Peoples R China

[2] Queen Mary Univ London, Sch Elect Engn & Comp Sci, London E1 4NS, England

[3] Incheon Natl Univ, Dept Embedded Syst Engn, Incheon 22012, South Korea

Source:

INFORMATION FUSION | 2024 / Vol. 102

Keywords:

Multimodal disinformation; Deepfake detection; Manipulation grounding; Information aggregation; Cross-modality interaction; CONVOLUTIONAL NEURAL-NETWORKS;

DOI:

10.1016/j.inffus.2023.102037

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Disinformation created by artificial neural networks has become widespread with the recent rapid progress in multimodal learning and the rise of vision-language foundation models. This disinformation has had a substantial negative impact on society. To address this pressing issue, numerous efforts have been made to detect either image deepfakes or text manipulation. These methods generally focus on a single modality while ignoring the complementary knowledge provided by the other modalities. In this paper, we aim to detect multimodal disinformation and further identify manipulated image areas or text tokens. To this end, a novel framework termed Vision-language Knowledge Interaction (ViKI) is designed to explore the semantic correlation of an object in different modalities. Specifically, we propose a vision-language embedding regulator to build a joint feature space in which embeddings with the same semantics are well aligned. In addition, we perform cross-modality knowledge interaction to aggregate uni-modality embeddings by adaptively injecting cross-modality information. By exploring vision-language knowledge jointly, ViKI produces accurate predictions for detecting and grounding disinformation. We demonstrate the superiority of ViKI through ablation studies and comparisons with state-of-the-art methods on large-scale benchmarks. Notably, ViKI outperforms state-of-the-art methods by 3.71% in precision and 2.14% in CF1.


Pages: 11

