Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering

Cited by: 241
Authors
Duy-Kien Nguyen [1 ]
Okatani, Takayuki [1 ,2 ]
Affiliations
[1] Tohoku Univ, Sendai, Miyagi, Japan
[2] RIKEN, Ctr AIP, Wako, Saitama, Japan
Source
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) | 2018
Keywords
DOI
10.1109/CVPR.2018.00637
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
A key to visual question answering (VQA) lies in how to fuse the visual and language features extracted from an input image and question. We show that an attention mechanism that enables dense, bi-directional interactions between the two modalities boosts the accuracy of answer prediction. Specifically, we present a simple architecture that is fully symmetric between the visual and language representations, in which each question word attends to image regions and each image region attends to question words. It can be stacked to form a hierarchy for multi-step interactions between an image-question pair. We show through experiments that the proposed architecture achieves a new state-of-the-art on VQA and VQA 2.0 despite its small size. We also present a qualitative evaluation, demonstrating how the proposed attention mechanism generates reasonable attention maps on images and questions, leading to correct answer prediction.
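The core of the dense symmetric co-attention described in the abstract can be illustrated with a minimal NumPy sketch: every question word attends over all image regions, and, symmetrically, every image region attends over all question words, via a single shared affinity matrix. This is a simplified illustration, not the paper's full model, which adds learned projections, multi-glimpse attention, and stacked layers; the function and variable names here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dense_coattention(Q, V):
    """One dense, symmetric co-attention step (simplified sketch).

    Q: (T, d) question-word features; V: (N, d) image-region features.
    Returns region-attended word features and word-attended region features.
    """
    d = Q.shape[1]
    A = Q @ V.T / np.sqrt(d)        # (T, N) affinity of every word/region pair
    attn_q2v = softmax(A, axis=1)   # each word attends over all image regions
    attn_v2q = softmax(A.T, axis=1) # each region attends over all question words
    Q_att = attn_q2v @ V            # (T, d) visual context gathered per word
    V_att = attn_v2q @ Q            # (N, d) language context gathered per region
    return Q_att, V_att

# Toy example: 3 question words, 5 image regions, 4-dim features.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
V = rng.normal(size=(5, 4))
Q_att, V_att = dense_coattention(Q, V)
```

Because the two directions share one affinity matrix and the same functional form, the fusion is fully symmetric; stacking such steps yields the multi-step image-question interaction hierarchy the abstract describes.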
Pages: 6087-6096
Page count: 10