Attention-Aligned Transformer for Image Captioning

Cited by: 0
Authors
Fei, Zhengcong [1,2]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
Source
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE | 2022
Keywords
REPRESENTATION
DOI
Not available
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Recently, attention-based image captioning models, which are expected to ground the correct image regions when generating each word, have achieved remarkable performance. However, some researchers have pointed to a "deviated focus" problem: existing attention mechanisms often fail to identify the image features that are actually effective and influential. In this paper, we present A², an attention-aligned Transformer for image captioning, which guides attention learning in a perturbation-based, self-supervised manner without any annotation overhead. Specifically, we apply a mask operation to image regions through a learnable network to estimate each region's true contribution to the final description generation. We hypothesize that the essential image-region features, those for which a small disturbance causes an obvious performance degradation, deserve larger attention weights. We then propose four alignment strategies that use this information to refine the attention weight distribution. Under this scheme, image regions are attended to correctly for the output words. Extensive experiments conducted on the MS COCO dataset demonstrate that the proposed A² Transformer consistently outperforms baselines in both automatic metrics and human evaluation. Trained models and code for reproducing the experiments are publicly available.
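As a rough, illustrative sketch of the perturbation idea described in the abstract (the paper learns the mask through a network, whereas this stand-in simply zeroes one region at a time), the following PyTorch-style snippet estimates per-region importance as the drop in caption log-likelihood under masking. The names captioner.log_likelihood, region_feats, and caption_ids are hypothetical assumptions for this sketch, not the authors' A² implementation.

import torch

def region_importance(captioner, region_feats, caption_ids):
    # region_feats: (num_regions, feat_dim) image-region features
    # caption_ids:  (seq_len,) ground-truth caption token ids
    # Returns a distribution over regions: larger mass = more important region.
    with torch.no_grad():
        base_ll = captioner.log_likelihood(region_feats, caption_ids)
        drops = []
        for r in range(region_feats.size(0)):
            masked = region_feats.clone()
            masked[r] = 0.0                     # perturb one region (brute-force mask)
            ll = captioner.log_likelihood(masked, caption_ids)
            drops.append(base_ll - ll)          # bigger likelihood drop => more important
        drops = torch.stack(drops)
    # Normalize the drops into a target distribution over regions.
    return torch.softmax(drops, dim=0)

A target distribution like this could then be aligned with the model's attention weights, for instance through a KL-divergence penalty during training, which is in the spirit of the alignment strategies the abstract mentions.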
Pages: 607-615
Page count: 9