Question Modifiers in Visual Question Answering

Cited by: 0
Authors
Britton, William [1 ]
Sarkhel, Somdeb [2 ]
Venugopal, Deepak [1 ]
Affiliations
[1] Univ Memphis, Memphis, TN 38152 USA
[2] Adobe Res, Bangalore, Karnataka, India
Source
LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2022
Funding
National Science Foundation (USA);
Keywords
visual question answering; modifiers; deep models; perception;
DOI
Not available
CLC Number
TP39 [Computer Applications];
Discipline Codes
081203; 0835;
Abstract
Visual Question Answering (VQA) is a challenge problem that can advance AI by integrating several important sub-disciplines, including natural language understanding and computer vision. Large, publicly available VQA datasets for training and evaluation have driven the growth of VQA models that achieve increasingly high accuracy scores. However, it is also important to understand how well a model grasps the details provided in a question. For example, studies in psychology have shown that syntactic complexity places a greater cognitive load on humans. Analogously, we want to understand whether models have the perceptual capability to handle modifications to questions. We therefore develop a new dataset using Amazon Mechanical Turk, where we asked workers to add modifiers to questions based on object properties and spatial relationships. We evaluate this data on LXMERT, a state-of-the-art VQA model that focuses more extensively on language processing than comparable models. Our results indicate that the model's performance degrades significantly when questions are modified to include more detailed information.
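The kind of paired comparison the abstract describes (base question vs. modified question on the same image) can be probed with a short script. Below is a minimal sketch, not the authors' evaluation pipeline: it assumes the HuggingFace transformers port of LXMERT (the unc-nlp/lxmert-vqa-uncased checkpoint) and uses zero-filled placeholder visual features; the question strings and the predict helper are illustrative only. A real run would feed Faster R-CNN region features for the image, as in the standard LXMERT setup.

```python
# Minimal probing sketch, assuming the HuggingFace `transformers` LXMERT port.
# Visual features below are zero-filled placeholders for illustration only;
# the standard LXMERT setup uses Faster R-CNN region features.
import torch
from transformers import LxmertTokenizer, LxmertForQuestionAnswering

tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-vqa-uncased")
model = LxmertForQuestionAnswering.from_pretrained("unc-nlp/lxmert-vqa-uncased")
model.eval()

def predict(question: str, feats: torch.Tensor, boxes: torch.Tensor) -> int:
    """Return the index of LXMERT's top-scoring VQA answer label."""
    enc = tokenizer(question, return_tensors="pt")
    with torch.no_grad():
        out = model(
            input_ids=enc.input_ids,
            attention_mask=enc.attention_mask,
            visual_feats=feats,  # (1, num_boxes, 2048) region features
            visual_pos=boxes,    # (1, num_boxes, 4) normalized box coordinates
        )
    return out.question_answering_score.argmax(-1).item()

# Placeholder visual input (36 boxes is the conventional choice).
feats = torch.zeros(1, 36, 2048)
boxes = torch.zeros(1, 36, 4)

base = predict("What is the dog holding?", feats, boxes)
modified = predict("What is the small brown dog on the left holding?", feats, boxes)
print(base == modified)  # does the added modifier change the prediction?
```

Aggregating such agreements and disagreements over a dataset of (base, modified) question pairs yields exactly the matched-pairs outcome table on which a paired significance comparison can be run.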
Pages: 1472-1479
Page count: 8