RSTNet: Captioning with Adaptive Attention on Visual and Non-Visual Words

Cited by: 159
Authors
Zhang, Xuying [1]
Sun, Xiaoshuai [1,2]
Luo, Yunpeng [1]
Ji, Jiayi [1]
Zhou, Yiyi [1]
Wu, Yongjian [2]
Huang, Feiyue [2]
Ji, Rongrong [1,2,3]
Affiliations
[1] Xiamen Univ, Sch Informat, Dept Artificial Intelligence, Media Analyt & Comp Lab, Xiamen 361005, Peoples R China
[2] Xiamen Univ, Inst Artificial Intelligence, Xiamen, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
Source
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 | 2021
Funding
National Natural Science Foundation of China;
DOI
10.1109/CVPR46437.2021.01521
CLC Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent progress on visual question answering has explored the merits of grid features for vision-language tasks. Meanwhile, transformer-based models have shown remarkable performance on various sequence prediction problems. However, the spatial information that grid features lose through the flattening operation, as well as the transformer's inability to distinguish visual words from non-visual words, remain unexplored. In this paper, we first propose a Grid-Augmented (GA) module, in which relative geometry features between grids are incorporated to enhance visual representations. We then build a BERT-based language model to extract language context and propose an Adaptive-Attention (AA) module on top of the transformer decoder to adaptively measure the contributions of visual and language cues before each word prediction. To demonstrate the generality of our proposals, we apply the two modules to the vanilla transformer to build our Relationship-Sensitive Transformer (RSTNet) for image captioning. The proposed model is evaluated on the MSCOCO benchmark, where it achieves new state-of-the-art results on both the Karpathy test split and the online test server. Source code is available at GitHub(1).
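To make the Adaptive-Attention idea concrete, below is a minimal PyTorch sketch of the mechanism the abstract describes: the decoder hidden state queries a candidate set made of the grid features plus one language-context slot, and the attention mass falling on the visual slots indicates how "visual" the next word is. The class name, shapes, and the visualness score are illustrative assumptions, not the authors' released implementation (see the linked GitHub repository for that).

import torch
import torch.nn as nn


class AdaptiveAttention(nn.Module):
    """Hypothetical sketch: attend over grid features plus one
    language-context slot, so the softmax itself decides how much a
    word prediction relies on visual versus non-visual cues."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, hidden, visual, language):
        # hidden:   (B, d)    decoder hidden state, used as the query
        # visual:   (B, N, d) N flattened grid features
        # language: (B, d)    context vector from the language model
        cand = torch.cat([visual, language.unsqueeze(1)], dim=1)  # (B, N+1, d)
        q = self.q(hidden).unsqueeze(1)                           # (B, 1, d)
        logits = q @ self.k(cand).transpose(1, 2) * self.scale    # (B, 1, N+1)
        att = torch.softmax(logits, dim=-1)
        out = (att @ self.v(cand)).squeeze(1)                     # (B, d)
        # Attention mass on the visual slots; a low value suggests the
        # next word is a non-visual (function) word.
        visualness = att[..., :-1].sum(dim=-1).squeeze(1)         # (B,)
        return out, visualness


# Toy usage: batch of 2, a 7x7 grid (49 features), d_model = 512.
aa = AdaptiveAttention(512)
out, vis = aa(torch.randn(2, 512), torch.randn(2, 49, 512), torch.randn(2, 512))
print(out.shape, vis.shape)  # torch.Size([2, 512]) torch.Size([2])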
Pages: 15460-15469
Number of pages: 10