Visual Grounding Via Accumulated Attention

Cited by: 9
Authors
Deng, Chaorui [1 ,2 ]
Wu, Qi [3 ]
Wu, Qingyao [1 ]
Hu, Fuyuan [4 ]
Lyu, Fan [5 ]
Tan, Mingkui [1 ]
Affiliations
[1] South China Univ Technol, Sch Software Engn, Guangzhou 510006, Peoples R China
[2] Pazhou Lab, Guangzhou 510335, Peoples R China
[3] Univ Adelaide, Sch Comp Sci, Adelaide, SA 5005, Australia
[4] Suzhou Univ Sci & Technol, Sch Elect & Informat Engn, Suzhou 215009, Peoples R China
[5] Tianjin Univ, Coll Intelligence & Comp, Tianjin 300350, Peoples R China
Funding
Australian Research Council; National Natural Science Foundation of China
Keywords
Proposals; Visualization; Training; Feature extraction; Task analysis; Grounding; Cognition; Visual grounding; accumulated attention; noised training strategy; bounding box regression;
DOI
10.1109/TPAMI.2020.3023438
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Visual grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. Generally, it requires the machine to first understand the query, identify the key concepts in the image, and then locate the target object by specifying its bounding box. However, many real-world visual grounding applications involve ambiguous queries and images with complicated scene structures. Identifying the target from highly redundant and correlated information can be very challenging and often leads to unsatisfactory performance. To tackle this, in this paper we exploit an attention module for each kind of information to reduce internal redundancies. We then propose an accumulated attention (A-ATT) mechanism to reason over all the attention modules jointly. In this way, the relations among different kinds of information can be captured explicitly. Moreover, to improve the performance and robustness of our VG models, we additionally introduce noise into the training procedure to bridge the distribution gap between human-labeled training data and poor-quality real-world data. With this "noised" training strategy, we can further learn a bounding-box regressor, which can be used to refine the bounding box of the target object. We evaluate the proposed methods on four popular datasets (namely ReferCOCO, ReferCOCO+, ReferCOCOg, and GuessWhat?!). The experimental results show that our methods significantly outperform all previous works on every dataset in terms of accuracy.
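The abstract's core idea, one attention module per information source (query words, image regions, object proposals) whose summaries are accumulated and fed back so each module re-attends conditioned on the others, can be sketched roughly as follows. This is a minimal illustrative sketch: the dot-product scoring, the additive context, the random features, and the round count are simplifying assumptions for clarity, not the paper's exact formulation (which uses learned projections over real features).

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(feats, context):
    """One attention module: weight each feature vector by its
    similarity to the context, then return the weighted summary."""
    weights = softmax([dot(f, context) for f in feats])
    d = len(feats[0])
    summary = [sum(w * f[k] for w, f in zip(weights, feats)) for k in range(d)]
    return weights, summary

def accumulated_attention(query, image, objects, rounds=3):
    """Jointly refine three attention modules over several rounds,
    each conditioned on the accumulated summaries of the others."""
    d = len(query[0])
    sq = si = so = [0.0] * d                   # summaries start empty
    for _ in range(rounds):
        _, sq = attend(query, [a + b for a, b in zip(si, so)])
        _, si = attend(image, [a + b for a, b in zip(sq, so)])
        a_obj, so = attend(objects, [a + b for a, b in zip(sq, si)])
    return a_obj                               # final proposal attention

# Toy run with random features (5 words, 6 regions, 4 proposals, dim 8).
random.seed(0)
feats = lambda n, d: [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
attn = accumulated_attention(feats(5, 8), feats(6, 8), feats(4, 8))
target = max(range(len(attn)), key=attn.__getitem__)  # grounded proposal
```

The grounding decision is simply the proposal with the highest final attention weight; in the paper's full pipeline this selection would then be refined by the bounding-box regressor trained with the noised strategy.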
Pages: 1670-1684
Page count: 15
References
66 total