Direction Relation Transformer for Image Captioning

Cited by: 17

Authors:

Song, Zeliang [1,2]

Zhou, Xiaofei [1,2]

Dong, Linhua [1,2]

Tan, Jianlong [1,2]

Guo, Li [1,2]

Affiliations:

[1] Chinese Acad Sci, Inst Informat Engn, Beijing, Peoples R China

[2] Univ Chinese Acad Sci, Sch Cyber Secur, Beijing, Peoples R China

Source:

PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021 | 2021

Funding:

National Natural Science Foundation of China

Keywords:

Image Captioning; Direction Relation Transformer; Multi-Head Attention; Direction Embedding;

DOI:

10.1145/3474085.3475607

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Image captioning is a challenging task that combines computer vision and natural language processing to generate a textual description of the content of an image. Recently, Transformer-based encoder-decoder architectures have shown great success in image captioning, where the multi-head attention mechanism is used to capture the contextual interactions between object regions. However, such methods treat region features as a bag of tokens without considering the directional relationships between them, which makes it hard to understand the relative positions of objects in the image and to generate correct captions. In this paper, we propose a novel Direction Relation Transformer, termed DRT, which improves the orientation perception between visual features by incorporating a relative direction embedding into multi-head attention. We first generate the relative direction matrix from the positional information of the object regions, and then explore three forms of direction-aware multi-head attention to integrate the direction embedding into the Transformer architecture. We conduct experiments on the challenging Microsoft COCO image captioning benchmark. The quantitative and qualitative results demonstrate that, by integrating the relative directional relation, our proposed approach achieves significant improvements on all evaluation metrics compared with the baseline model; e.g., DRT improves the task-specific CIDEr score from 129.7% to 133.2% on the offline "Karpathy" test split.
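The abstract does not give the exact construction of the relative direction matrix or the three attention variants. The NumPy sketch below is only one plausible reading, written to make the idea concrete: it quantizes the angle between region centers into direction bins and adds a per-direction bias to the attention logits. The function names, the choice of 8 bins, the extra "same region" id on the diagonal, and the additive-bias form are all assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def relative_direction_matrix(boxes, num_bins=8):
    """Quantize pairwise directions between box centers (illustrative only).

    boxes: (N, 4) array-like of [x1, y1, x2, y2] region boxes.
    Returns an (N, N) integer matrix where entry [i, j] is the direction
    bin of region j as seen from region i. The diagonal uses the extra
    id `num_bins` to mark "same region" (an assumption of this sketch).
    """
    boxes = np.asarray(boxes, dtype=np.float64)
    cx = (boxes[:, 0] + boxes[:, 2]) / 2.0
    cy = (boxes[:, 1] + boxes[:, 3]) / 2.0
    dx = cx[None, :] - cx[:, None]  # dx[i, j] = cx[j] - cx[i]
    dy = cy[None, :] - cy[:, None]  # image coordinates: y grows downward
    angle = np.arctan2(dy, dx)      # angle in [-pi, pi]
    width = 2.0 * np.pi / num_bins
    # Center bin 0 on angle 0 (region j directly to the right of region i).
    bins = np.floor((angle + width / 2.0) / width).astype(np.int64) % num_bins
    np.fill_diagonal(bins, num_bins)
    return bins


def direction_aware_attention(q, k, v, direction_ids, direction_bias):
    """Single-head scaled dot-product attention with an additive direction bias.

    q, k, v: (N, d) arrays; direction_ids: (N, N) integer bins;
    direction_bias: (num_bins + 1,) array with one scalar per direction id
    (learned in a real model, supplied by the caller here).
    """
    logits = q @ k.T / np.sqrt(q.shape[-1]) + direction_bias[direction_ids]
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With a zero bias vector this reduces to ordinary scaled dot-product attention, so the direction term can be read as a learned correction on top of the content-based scores. Embedding the direction ids into vectors and combining them with queries or keys would be another way to inject the same information.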


Pages: 5056-5064

Page count: 9
