Relation-Aware Graph Attention Network for Visual Question Answering

Cited by: 568

Authors:

Li, Linjie [1]

Gan, Zhe [1]

Cheng, Yu [1]

Liu, Jingjing [1]

Affiliations:

[1] Microsoft Dynamics 365 AI Research, Bellevue, WA, USA

Source:

2019 IEEE/CVF International Conference on Computer Vision (ICCV 2019) | 2019

Keywords:

DOI:

10.1109/ICCV.2019.01041

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

To answer semantically complicated questions about an image, a Visual Question Answering (VQA) model needs to fully understand the visual scene in the image, especially the interactive dynamics between different objects. We propose a Relation-aware Graph Attention Network (ReGAT), which encodes each image into a graph and models multi-type inter-object relations via a graph attention mechanism to learn question-adaptive relation representations. Two types of visual object relations are explored: (i) Explicit Relations, which represent geometric positions and semantic interactions between objects; and (ii) Implicit Relations, which capture the hidden dynamics between image regions. Experiments demonstrate that ReGAT outperforms prior state-of-the-art approaches on both the VQA 2.0 and VQA-CP v2 datasets. We further show that ReGAT is compatible with existing VQA architectures and can be used as a generic relation encoder to boost model performance for VQA.

Pages: 10312-10321

Page count: 10

References (63 in total):

[1] Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering. Agrawal, Aishwarya; Batra, Dhruv; Parikh, Devi; Kembhavi, Aniruddha. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018: 4971-4980.

[2] Anderson P, 2018, CVPR, V3, P6, DOI 10.1109/CVPR.2018.00636

[3] [Anonymous], 2018, arXiv:1807.09956

[4] [Anonymous], 2018, arXiv:1812.09681

[5] VQA: Visual Question Answering. Antol, Stanislaw; Agrawal, Aishwarya; Lu, Jiasen; Mitchell, Margaret; Batra, Dhruv; Zitnick, C. Lawrence; Parikh, Devi. 2015 IEEE International Conference on Computer Vision (ICCV), 2015: 2425-2433.

[6] MUTAN: Multimodal Tucker Fusion for Visual Question Answering. Ben-younes, Hedi; Cadene, Remi; Cord, Matthieu; Thome, Nicolas. 2017 IEEE International Conference on Computer Vision (ICCV), 2017: 2631-2639.

[7] Scene Perception: Detecting and Judging Objects Undergoing Relational Violations. Biederman, I; Mezzanotte, RJ; Rabinowitz, JC. Cognitive Psychology, 1982, 14(2): 143-177.

[8] Boski M, 2017, 2017 10th International Workshop on Multidimensional (nD) Systems (nDS)

[9] MUREL: Multimodal Relational Reasoning for Visual Question Answering. Cadene, Remi; Ben-younes, Hedi; Cord, Matthieu; Thome, Nicolas. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), 2019: 1989-1998.

[10] Choi Myung Jin, 2012, PAMI
