Learning visual relationship and context-aware attention for image captioning

Cited by: 107
Authors
Wang, Junbo [1 ,3 ]
Wang, Wei [1 ,3 ]
Wang, Liang [1 ,2 ,3 ]
Wang, Zhiyong [4 ]
Feng, David Dagan [4 ]
Tan, Tieniu [1 ,2 ,3 ]
Affiliations
[1] Chinese Acad Sci CASIA, Inst Automat, Natl Lab Pattern Recognit NLPR, CRIPAC, Beijing, Peoples R China
[2] CASIA, CEBSIT, Beijing, Peoples R China
[3] UCAS, Beijing, Peoples R China
[4] Univ Sydney, Sch Informat Technol, Sydney, NSW, Australia
Funding
National Natural Science Foundation of China; Australian Research Council
Keywords
Image captioning; Relational reasoning; Context-aware attention; Recognition
DOI
10.1016/j.patcog.2019.107075
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Image captioning, which automatically generates natural language descriptions for images, has attracted considerable research attention, and substantial progress has been made with attention-based captioning methods. However, most attention-based image captioning methods focus on extracting visual information from regions of interest for sentence generation, and usually ignore relational reasoning among those regions. Moreover, these methods do not take into account previously attended regions, which can be used to guide subsequent attention selection. In this paper, we propose a novel method that implicitly models the relationships among regions of interest in an image with a graph neural network, together with a novel context-aware attention mechanism that guides attention selection by fully memorizing previously attended visual content. Compared with existing attention-based image captioning methods, our method not only learns relation-aware visual representations for image captioning but also exploits the historical context of previous attention. We perform extensive experiments on two public benchmark datasets, MS COCO and Flickr30K, and the results show that the proposed method outperforms various state-of-the-art methods on the widely used evaluation metrics. (C) 2019 Elsevier Ltd. All rights reserved.
Pages: 11
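
The abstract describes two mechanisms: implicit relational reasoning over regions of interest with a graph neural network, and a context-aware attention step conditioned on a memory of previously attended content. The record contains no code, so the following is a minimal PyTorch sketch of how such components are commonly built; all class names, dimensions, and design choices (dot-product affinities over a fully connected region graph, a GRU cell as the attention memory) are illustrative assumptions, not the authors' implementation.

    # Minimal sketch (hypothetical names/dimensions; not the authors' code).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ImplicitRelationModule(nn.Module):
        """Refines region features by message passing over a learned,
        fully connected graph of regions (one flavor of implicit
        relational reasoning among regions of interest)."""
        def __init__(self, dim):
            super().__init__()
            self.query = nn.Linear(dim, dim)
            self.key = nn.Linear(dim, dim)
            self.update = nn.Linear(dim, dim)

        def forward(self, regions):                     # regions: (B, N, D)
            # Pairwise affinities define soft edges between every region pair.
            affinity = self.query(regions) @ self.key(regions).transpose(1, 2)
            weights = F.softmax(affinity / regions.size(-1) ** 0.5, dim=-1)
            messages = weights @ regions                # aggregate neighbors
            # Residual update yields relation-aware region features.
            return regions + torch.relu(self.update(messages))

    class ContextAwareAttention(nn.Module):
        """Attention whose scores are conditioned on a memory of
        previously attended vectors, so past selections can guide
        the next one."""
        def __init__(self, dim):
            super().__init__()
            self.memory_rnn = nn.GRUCell(dim, dim)      # accumulates attended content
            self.score = nn.Linear(3 * dim, 1)

        def forward(self, regions, hidden, memory):     # (B, N, D), (B, D), (B, D)
            B, N, D = regions.shape
            expand = lambda x: x.unsqueeze(1).expand(B, N, D)
            # Score each region against the decoder state and the attention memory.
            logits = self.score(torch.cat([regions, expand(hidden), expand(memory)], -1))
            alpha = F.softmax(logits.squeeze(-1), dim=-1)       # (B, N)
            attended = (alpha.unsqueeze(-1) * regions).sum(1)   # (B, D)
            memory = self.memory_rnn(attended, memory)          # remember what was attended
            return attended, memory

In such a design, a captioning decoder would call ContextAwareAttention once per generated word, carrying the memory state across time steps so that earlier attended regions inform later attention selections.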