Knowledge-Augmented Visual Question Answering With Natural Language Explanation

Cited by: 9
Authors
Xie, Jiayuan [1]
Cai, Yi [2, 3]
Chen, Jiali [2, 3]
Xu, Ruohang [2, 3]
Wang, Jiexin [2, 3]
Li, Qing [1]
Affiliations
[1] Hong Kong Polytech Univ, Dept Comp, Hong Kong, Peoples R China
[2] South China Univ Technol, Sch Software Engn, Guangzhou 510006, Peoples R China
[3] South China Univ Technol, Key Lab Big Data & Intelligent Robot, Minist Educ, Guangzhou 510006, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Task analysis; Visualization; Feature extraction; Question answering (information retrieval); Iterative methods; Predictive models; Natural languages; Visual question answering; natural language explanation; multimodal
DOI
10.1109/TIP.2024.3379900
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Visual question answering with natural language explanation (VQA-NLE) is a challenging task that requires models to not only generate accurate answers but also to provide explanations that justify the relevant decision-making processes. This task is accomplished by generating natural language sentences based on the given question-image pair. However, existing methods often struggle to ensure consistency between the answers and explanations due to their disregard of the crucial interactions between these factors. Moreover, existing methods overlook the potential benefits of incorporating additional knowledge, which hinders their ability to effectively bridge the semantic gap between questions and images, leading to less accurate explanations. In this paper, we present a novel approach denoted the knowledge-based iterative consensus VQA-NLE (KICNLE) model to address these limitations. To maintain consistency, our model incorporates an iterative consensus generator that adopts a multi-iteration generative method, enabling multiple iterations of the answer and explanation in each generation. In each iteration, the current answer is utilized to generate an explanation, which in turn guides the generation of a new answer. Additionally, a knowledge retrieval module is introduced to provide potentially valid candidate knowledge, guide the generation process, effectively bridge the gap between questions and images, and enable the production of high-quality answer-explanation pairs. Extensive experiments conducted on three different datasets demonstrate the superiority of our proposed KICNLE model over competing state-of-the-art approaches. Our code is available at https://github.com/Gary-code/KICNLE.
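
The alternating answer/explanation loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The function names `iterative_consensus`, `answer_fn`, and `explain_fn`, the fixed `num_iterations` parameter, and the toy stand-in generators are all hypothetical, introduced here only to show the control flow: the current answer conditions the explanation, and the explanation conditions the next answer, with retrieved knowledge passed to both steps.

```python
from typing import Callable, List, Tuple

# Hypothetical signature: (question, image_features, knowledge, conditioning_text) -> text
Generator = Callable[[str, List[float], List[str], str], str]


def iterative_consensus(
    question: str,
    image_features: List[float],
    knowledge: List[str],
    answer_fn: Generator,
    explain_fn: Generator,
    num_iterations: int = 3,
) -> Tuple[str, str]:
    """Alternate between explanation and answer generation.

    Assumption: a fixed iteration count; the abstract does not state
    how many iterations are used or how the loop terminates.
    """
    # Initial answer, generated without any explanation to condition on.
    answer = answer_fn(question, image_features, knowledge, "")
    explanation = ""
    for _ in range(num_iterations):
        # The current answer is used to generate an explanation ...
        explanation = explain_fn(question, image_features, knowledge, answer)
        # ... which in turn guides the generation of a new answer.
        answer = answer_fn(question, image_features, knowledge, explanation)
    return answer, explanation


# Toy stand-ins for the learned generators (purely illustrative).
def toy_answer(question, image_features, knowledge, explanation):
    return "yes" if "umbrella" in explanation else "unsure"


def toy_explain(question, image_features, knowledge, answer):
    fact = knowledge[0] if knowledge else "no knowledge"
    return f"because {fact} (previous answer: {answer})"


answer, explanation = iterative_consensus(
    "Is it raining?",
    [0.1, 0.2],
    ["people hold an umbrella"],
    toy_answer,
    toy_explain,
    num_iterations=2,
)
print(answer)
print(explanation)
```

In this toy run the first answer is made without an explanation, and later iterations revise it once the knowledge-backed explanation is available; in a real model both generators would be learned decoders rather than string rules.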
Pages: 2652-2664
Number of pages: 13