Relation-Aware Graph Attention Network for Visual Question Answering

Cited by: 568
Authors
Li, Linjie [1]
Gan, Zhe [1]
Cheng, Yu [1]
Liu, Jingjing [1]
Affiliations
[1] Microsoft Dynamics 365 AI Research, Bellevue, WA, USA
Source
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019) | 2019
DOI
10.1109/ICCV.2019.01041
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
To answer semantically complicated questions about an image, a Visual Question Answering (VQA) model needs to fully understand the visual scene in the image, especially the interactive dynamics between different objects. We propose a Relation-aware Graph Attention Network (ReGAT), which encodes each image into a graph and models multi-type inter-object relations via a graph attention mechanism to learn question-adaptive relation representations. Two types of visual object relations are explored: (i) Explicit Relations that represent geometric positions and semantic interactions between objects; and (ii) Implicit Relations that capture the hidden dynamics between image regions. Experiments demonstrate that ReGAT outperforms prior state-of-the-art approaches on both the VQA 2.0 and VQA-CP v2 datasets. We further show that ReGAT is compatible with existing VQA architectures and can be used as a generic relation encoder to boost model performance for VQA.
Pages: 10312-10321
Number of pages: 10