VQA as a factoid question answering problem: A novel approach for knowledge-aware and explainable visual question answering

Cited by: 7
Authors
Narayanan, Abhishek [1 ]
Rao, Abijna [1 ]
Prasad, Abhishek [1 ]
Natarajan, S. [1 ]
Affiliations
[1] PES Univ, Dept Comp Sci & Engn, Bangalore, Karnataka, India
Keywords
Visual question answering; Factoid question answering; Knowledge based reasoning; Explainable VQA;
DOI
10.1016/j.imavis.2021.104328
CLC classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With recent advances in machine perception and scene understanding, Visual Question Answering (VQA) has attracted considerable research interest in training neural models that jointly analyze, ground, and reason over the multi-modal space of image content and natural language in order to answer natural-language questions about an image. However, although recent work has substantially improved the state of the art for questions answerable from the visual context alone, such models are often incapable of handling questions that require external world knowledge beyond the visible content. Research has recently begun to address knowledge-based VQA as well, but few studies exist in this area, leaving significant room for improvement. Motivated by these challenges, this paper aims to answer free-form, open-ended natural-language questions that draw not only on the visual context of an image but also on external world knowledge. Inspired by the human cognitive ability to comprehend and reason out answers from a given set of facts, the paper proposes a novel model architecture that casts VQA as a factoid question answering problem, leveraging state-of-the-art deep learning techniques to reason over and infer answers to free-form questions, in an attempt to improve the state of the art in open-ended visual question answering. (c) 2021 Elsevier B.V. All rights reserved.
Pages: 12