Knowledge-Augmented Visual Question Answering With Natural Language Explanation

Cited by: 9

Authors:

Xie, Jiayuan [1]

Cai, Yi [2,3]

Chen, Jiali [2,3]

Xu, Ruohang [2,3]

Wang, Jiexin [2,3]

Li, Qing [1]

Affiliations:

[1] Hong Kong Polytech Univ, Dept Comp, Hong Kong, Peoples R China

[2] South China Univ Technol, Sch Software Engn, Guangzhou 510006, Peoples R China

[3] South China Univ Technol, Key Lab Big Data & Intelligent Robot, Minist Educ, Guangzhou 510006, Peoples R China

Source:

IEEE TRANSACTIONS ON IMAGE PROCESSING | 2024 / Vol. 33

Funding:

National Natural Science Foundation of China;

Keywords:

Task analysis; Visualization; Feature extraction; Question answering (information retrieval); Iterative methods; Predictive models; Natural languages; Visual question answering; natural language explanation; multimodal;

DOI:

10.1109/TIP.2024.3379900

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Visual question answering with natural language explanation (VQA-NLE) is a challenging task that requires models to not only generate accurate answers but also to provide explanations that justify the relevant decision-making processes. This task is accomplished by generating natural language sentences based on the given question-image pair. However, existing methods often struggle to ensure consistency between the answers and explanations due to their disregard of the crucial interactions between these factors. Moreover, existing methods overlook the potential benefits of incorporating additional knowledge, which hinders their ability to effectively bridge the semantic gap between questions and images, leading to less accurate explanations. In this paper, we present a novel approach denoted the knowledge-based iterative consensus VQA-NLE (KICNLE) model to address these limitations. To maintain consistency, our model incorporates an iterative consensus generator that adopts a multi-iteration generative method, enabling multiple iterations of the answer and explanation in each generation. In each iteration, the current answer is utilized to generate an explanation, which in turn guides the generation of a new answer. Additionally, a knowledge retrieval module is introduced to provide potentially valid candidate knowledge, guide the generation process, effectively bridge the gap between questions and images, and enable the production of high-quality answer-explanation pairs. Extensive experiments conducted on three different datasets demonstrate the superiority of our proposed KICNLE model over competing state-of-the-art approaches. Our code is available at https://github.com/Gary-code/KICNLE.
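The alternating answer/explanation loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' KICNLE implementation (see the repository linked above for that): the function names `iterative_consensus`, `toy_explain`, and `toy_answer`, the `max_iters` parameter, and the early-stopping rule are all assumptions made for illustration, and image features are omitted so that the example stays self-contained.

```python
from typing import Callable, List, Tuple

# Hypothetical generator signature: (question, text so far, retrieved knowledge) -> text
Generator = Callable[[str, str, List[str]], str]


def iterative_consensus(
    question: str,
    knowledge: List[str],
    explain: Generator,
    answer: Generator,
    initial_answer: str = "",
    max_iters: int = 3,
) -> Tuple[str, str]:
    """Alternate explanation and answer generation for a few rounds.

    Assumption: we stop early once the answer no longer changes; the
    abstract only states that several iterations are performed.
    """
    current_answer = initial_answer
    explanation = ""
    for _ in range(max_iters):
        # The current answer conditions the explanation ...
        explanation = explain(question, current_answer, knowledge)
        # ... and the explanation in turn guides a new answer.
        new_answer = answer(question, explanation, knowledge)
        if new_answer == current_answer:
            break  # answer and explanation agree
        current_answer = new_answer
    return current_answer, explanation


# Toy stand-ins for the learned generators (purely illustrative).
def toy_explain(question: str, ans: str, knowledge: List[str]) -> str:
    fact = knowledge[0] if knowledge else "the image"
    return f"because {fact}" if ans else f"unclear, but {fact}"


def toy_answer(question: str, explanation: str, knowledge: List[str]) -> str:
    return "yes" if "umbrella" in explanation else "no"


final_answer, final_explanation = iterative_consensus(
    "Is it raining?",
    ["people carry an umbrella"],
    toy_explain,
    toy_answer,
)
```

In this toy run the first round proposes an answer from a tentative explanation, and the second round regenerates the explanation from that answer and confirms it, so the loop stops with a matching pair. If the loop instead runs out of iterations, the returned explanation was produced from the previous answer, which is one reason a real system would need a more careful stopping rule.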


Pages: 2652-2664

Number of pages: 13
