Attention-Aligned Transformer for Image Captioning

Cited by: 0
Authors
Fei, Zhengcong [1,2]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
Source
THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE | 2022
Keywords
REPRESENTATION
DOI
Not available
CLC Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Recently, attention-based image captioning models, which are expected to ground the correct image regions when generating each word, have achieved remarkable performance. However, some researchers have pointed to a "deviated focus" problem: existing attention mechanisms often fail to identify the image features that are actually effective and influential. In this paper, we present A², an attention-aligned Transformer for image captioning, which guides attention learning in a perturbation-based, self-supervised manner without any annotation overhead. Specifically, we apply a mask operation to image regions through a learnable network to estimate each region's true contribution to the final description generation. We hypothesize that the essential image-region features, those for which a small disturbance causes an obvious performance degradation, deserve larger attention weights. We then propose four alignment strategies that use this information to refine the attention weight distribution. Under this scheme, image regions are attended to correctly for the output words. Extensive experiments conducted on the MS COCO dataset demonstrate that the proposed A² Transformer consistently outperforms baselines in both automatic metrics and human evaluation. Trained models and code for reproducing the experiments are publicly available.
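As a rough, illustrative sketch of the perturbation idea described in the abstract (the paper learns the mask through a network, whereas this stand-in simply zeroes one region at a time), the following PyTorch-style snippet estimates per-region importance as the drop in caption log-likelihood under masking. The names captioner.log_likelihood, region_feats, and caption_ids are hypothetical assumptions for this sketch, not the authors' A² implementation.

import torch

def region_importance(captioner, region_feats, caption_ids):
    # region_feats: (num_regions, feat_dim) image-region features
    # caption_ids:  (seq_len,) ground-truth caption token ids
    # Returns a distribution over regions: larger mass = more important region.
    with torch.no_grad():
        base_ll = captioner.log_likelihood(region_feats, caption_ids)
        drops = []
        for r in range(region_feats.size(0)):
            masked = region_feats.clone()
            masked[r] = 0.0                     # perturb one region (brute-force mask)
            ll = captioner.log_likelihood(masked, caption_ids)
            drops.append(base_ll - ll)          # bigger likelihood drop => more important
        drops = torch.stack(drops)
    # Normalize the drops into a target distribution over regions.
    return torch.softmax(drops, dim=0)

A target distribution like this could then be aligned with the model's attention weights, for instance through a KL-divergence penalty during training, which is in the spirit of the alignment strategies the abstract mentions.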
Pages: 607-615
Page count: 9