Towards Multimodal Disinformation Detection by Vision-language Knowledge Interaction

Cited by: 13
Authors
Li, Qilei [1 ,2 ]
Gao, Mingliang [1 ]
Zhang, Guisheng [1 ]
Zhai, Wenzhe [1 ]
Chen, Jinyong [1 ]
Jeon, Gwanggil [3 ]
Affiliations
[1] Shandong Univ Technol, Sch Elect & Elect Engn, Zibo 255000, Peoples R China
[2] Queen Mary Univ London, Sch Elect Engn & Comp Sci, London E1 4NS, England
[3] Incheon Natl Univ, Dept Embedded Syst Engn, Incheon 22012, South Korea
Keywords
Multimodal disinformation; Deepfake detection; Manipulation grounding; Information aggregation; Cross-modality interaction; Convolutional neural networks
DOI
10.1016/j.inffus.2023.102037
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Disinformation created by artificial neural networks has become widespread alongside the recent rapid progress in multimodal learning and the rise of vision-language foundation models, and it has had a substantial negative impact on society. To address this pressing issue, numerous efforts have been made to detect either image deepfakes or text manipulation. These methods generally focus on a single modality and ignore the complementary knowledge provided by the counterpart modality. In this paper, we aim to detect multimodal disinformation and further identify the manipulated image regions or text tokens. To this end, a novel framework termed Vision-language Knowledge Interaction (ViKI) is designed to explore the semantic correlation of an object across modalities. Specifically, we propose a vision-language embedding regulator that builds a joint feature space in which embeddings of the same semantics are well aligned. In addition, we perform cross-modality knowledge interaction to aggregate each uni-modality embedding by adaptively injecting cross-modality information. By exploiting vision-language knowledge jointly, ViKI produces accurate predictions for both detecting and grounding disinformation. We demonstrate the superiority of ViKI through ablation studies and comparisons with state-of-the-art methods on large-scale benchmarks. Notably, ViKI outperforms the state-of-the-art works by 3.71% in precision and 2.14% in CF1.
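The abstract names two components: an embedding regulator that aligns image and text embeddings in a joint space, and a cross-modality interaction step that injects knowledge from one modality into the other. The record does not include ViKI's implementation details, so the following is only a minimal NumPy sketch of the general techniques implied (a symmetric contrastive alignment loss and single-head cross-attention with a residual connection); all function names and shapes are illustrative assumptions, not the authors' code.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Normalize each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss: matched image/text pairs
    sit on the diagonal of the similarity matrix and are pulled together.
    A stand-in for the paper's vision-language embedding regulator."""
    img, txt = l2_normalize(img_emb), l2_normalize(txt_emb)
    logits = img @ txt.T / temperature          # (N, N) cosine similarities
    idx = np.arange(len(img))

    def ce(lg):  # cross-entropy with the diagonal as the target class
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    return 0.5 * (ce(logits) + ce(logits.T))    # image->text and text->image

def cross_modal_inject(query_tokens, context_tokens):
    """Single-head cross-attention: each query-modality token adaptively
    aggregates context-modality tokens, added back as a residual."""
    d = query_tokens.shape[-1]
    scores = query_tokens @ context_tokens.T / np.sqrt(d)   # (Tq, Tc)
    scores = scores - scores.max(axis=1, keepdims=True)     # numerical stability
    attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return query_tokens + attn @ context_tokens             # uni-modal + injected
```

In this sketch, perfectly aligned embeddings drive the contrastive loss toward zero, while the residual form of `cross_modal_inject` preserves the uni-modal embedding and only adds cross-modal information on top, mirroring the "adaptively injecting cross-modality information" described in the abstract.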
Pages: 11