Knowledge-Augmented Visual Question Answering With Natural Language Explanation

Cited by: 8
Authors
Xie, Jiayuan [1 ]
Cai, Yi [2 ,3 ]
Chen, Jiali [2 ,3 ]
Xu, Ruohang [2 ,3 ]
Wang, Jiexin [2 ,3 ]
Li, Qing [1 ]
Affiliations
[1] Hong Kong Polytech Univ, Dept Comp, Hong Kong, Peoples R China
[2] South China Univ Technol, Sch Software Engn, Guangzhou 510006, Peoples R China
[3] South China Univ Technol, Key Lab Big Data & Intelligent Robot, Minist Educ, Guangzhou 510006, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Task analysis; Visualization; Feature extraction; Question answering (information retrieval); Iterative methods; Predictive models; Natural languages; Visual question answering; natural language explanation; multimodal;
DOI
10.1109/TIP.2024.3379900
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Visual question answering with natural language explanation (VQA-NLE) is a challenging task that requires models not only to generate accurate answers but also to provide explanations that justify the underlying decision-making process. This is accomplished by generating natural language sentences conditioned on the given question-image pair. However, existing methods often struggle to ensure consistency between answers and explanations because they disregard the crucial interactions between the two. Moreover, they overlook the potential benefits of incorporating external knowledge, which hinders their ability to bridge the semantic gap between questions and images and leads to less accurate explanations. In this paper, we present a novel approach, the knowledge-based iterative consensus VQA-NLE (KICNLE) model, to address these limitations. To maintain consistency, our model incorporates an iterative consensus generator that adopts a multi-iteration generative method, producing multiple rounds of answers and explanations in each generation. In each iteration, the current answer is used to generate an explanation, which in turn guides the generation of a new answer. Additionally, a knowledge retrieval module provides potentially valid candidate knowledge that guides the generation process, effectively bridges the gap between questions and images, and enables the production of high-quality answer-explanation pairs. Extensive experiments on three datasets demonstrate the superiority of the proposed KICNLE model over competing state-of-the-art approaches. Our code is available at https://github.com/Gary-code/KICNLE.
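The control flow described in the abstract can be illustrated with a minimal sketch: retrieve candidate knowledge once, then alternate between generating an explanation from the current answer and refining the answer from that explanation. All function names and the toy string-based "models" below are hypothetical stand-ins for illustration only; they are not the authors' implementation, which is available at the repository linked above.

```python
def retrieve_knowledge(question):
    """Stand-in for the knowledge retrieval module: return candidate facts."""
    toy_kb = {"what animal is this": ["a dog is a domestic animal"]}
    return toy_kb.get(question, [])


def generate_explanation(question, answer, knowledge):
    """Toy explanation generator conditioned on the current answer and knowledge."""
    facts = "; ".join(knowledge) if knowledge else "the image content"
    return f"because {facts} supports the answer '{answer}'"


def generate_answer(question, explanation, prev_answer):
    """Toy answer generator guided by the latest explanation.

    A real model would re-decode the answer from multimodal features;
    here we simply carry the previous answer forward.
    """
    return prev_answer


def kicnle_iterate(question, init_answer, num_iters=3):
    """Run the iterative consensus loop for a fixed number of iterations."""
    knowledge = retrieve_knowledge(question)
    answer = init_answer
    trace = []
    for _ in range(num_iters):
        # Current answer guides the explanation, which guides the next answer.
        explanation = generate_explanation(question, answer, knowledge)
        answer = generate_answer(question, explanation, answer)
        trace.append((answer, explanation))
    return answer, trace
```

The point of the sketch is the dependency structure, not the toy generators: each iteration's explanation is conditioned on the previous answer, so answer and explanation are pushed toward mutual consistency rather than being produced independently.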
Pages: 2652-2664 (13 pages)