RSTNet: Captioning with Adaptive Attention on Visual and Non-Visual Words

Cited by: 159
Authors
Zhang, Xuying [1]
Sun, Xiaoshuai [1,2]
Luo, Yunpeng [1]
Ji, Jiayi [1]
Zhou, Yiyi [1]
Wu, Yongjian [2]
Huang, Feiyue [2]
Ji, Rongrong [1,2,3]
Affiliations
[1] Xiamen Univ, Sch Informat, Dept Artificial Intelligence, Media Analyt & Comp Lab, Xiamen 361005, Peoples R China
[2] Xiamen Univ, Inst Artificial Intelligence, Xiamen, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
Source
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 | 2021
Funding
National Natural Science Foundation of China;
DOI
10.1109/CVPR46437.2021.01521
CLC Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent progress on visual question answering has explored the merits of grid features for vision-language tasks. Meanwhile, transformer-based models have shown remarkable performance on various sequence prediction problems. However, the spatial information that grid features lose through the flattening operation, as well as the transformer's inability to distinguish visual words from non-visual words, remain unexplored. In this paper, we first propose a Grid-Augmented (GA) module, in which relative geometry features between grids are incorporated to enhance visual representations. We then build a BERT-based language model to extract language context and propose an Adaptive-Attention (AA) module on top of the transformer decoder to adaptively measure the contributions of visual and language cues before each word prediction. To demonstrate the generality of our proposals, we apply the two modules to the vanilla transformer to build our Relationship-Sensitive Transformer (RSTNet) for image captioning. The proposed model is evaluated on the MSCOCO benchmark, where it achieves new state-of-the-art results on both the Karpathy test split and the online test server. Source code is available at GitHub(1).
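To make the Adaptive-Attention idea concrete, below is a minimal PyTorch sketch of the mechanism the abstract describes: the decoder hidden state queries a candidate set made of the grid features plus one language-context slot, and the attention mass falling on the visual slots indicates how "visual" the next word is. The class name, shapes, and the visualness score are illustrative assumptions, not the authors' released implementation (see the linked GitHub repository for that).

import torch
import torch.nn as nn


class AdaptiveAttention(nn.Module):
    """Hypothetical sketch: attend over grid features plus one
    language-context slot, so the softmax itself decides how much a
    word prediction relies on visual versus non-visual cues."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, hidden, visual, language):
        # hidden:   (B, d)    decoder hidden state, used as the query
        # visual:   (B, N, d) N flattened grid features
        # language: (B, d)    context vector from the language model
        cand = torch.cat([visual, language.unsqueeze(1)], dim=1)  # (B, N+1, d)
        q = self.q(hidden).unsqueeze(1)                           # (B, 1, d)
        logits = q @ self.k(cand).transpose(1, 2) * self.scale    # (B, 1, N+1)
        att = torch.softmax(logits, dim=-1)
        out = (att @ self.v(cand)).squeeze(1)                     # (B, d)
        # Attention mass on the visual slots; a low value suggests the
        # next word is a non-visual (function) word.
        visualness = att[..., :-1].sum(dim=-1).squeeze(1)         # (B,)
        return out, visualness


# Toy usage: batch of 2, a 7x7 grid (49 features), d_model = 512.
aa = AdaptiveAttention(512)
out, vis = aa(torch.randn(2, 512), torch.randn(2, 49, 512), torch.randn(2, 512))
print(out.shape, vis.shape)  # torch.Size([2, 512]) torch.Size([2])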
Pages: 15460-15469
Number of pages: 10