X-Linear Attention Networks for Image Captioning

Cited by: 497
Authors
Pan, Yingwei [1 ]
Yao, Ting [1 ]
Li, Yehao [1 ]
Mei, Tao [1 ]
Affiliations
[1] JD AI Res, Beijing, Peoples R China
Source
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020) | 2020
DOI: 10.1109/CVPR42600.2020.01098
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent progress on fine-grained visual recognition and visual question answering has featured Bilinear Pooling, which effectively models the 2nd order interactions across multi-modal inputs. Nevertheless, there has been no evidence in support of building such interactions concurrently with attention mechanisms for image captioning. In this paper, we introduce a unified attention block - the X-Linear attention block - that fully employs bilinear pooling to selectively capitalize on visual information or perform multi-modal reasoning. Technically, the X-Linear attention block simultaneously exploits both spatial and channel-wise bilinear attention distributions to capture the 2nd order interactions between the input single-modal or multi-modal features. Higher and even infinite-order feature interactions are readily modeled by stacking multiple X-Linear attention blocks and by equipping the block with the Exponential Linear Unit (ELU) in a parameter-free fashion, respectively. Furthermore, we present X-Linear Attention Networks (dubbed X-LAN), which integrate X-Linear attention blocks into the image encoder and sentence decoder of an image captioning model to leverage higher order intra- and inter-modal interactions. Experiments on the COCO benchmark demonstrate that our X-LAN achieves the best published CIDEr performance to date, 132.0%, on the COCO Karpathy test split. When the Transformer is further endowed with X-Linear attention blocks, CIDEr is boosted to 132.8%. Source code is available at https://github.com/Panda-Peter/image-captioning.
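The abstract's mechanism can be illustrated with a minimal NumPy sketch of a single X-Linear attention step, under stated assumptions: the weight names (`Wq`, `Wk`, `Wv`, `We`, `wb`) and shapes are illustrative, not the paper's notation. Bilinear pooling is realized as an element-wise product of ELU-embedded query and key/value features, followed by a spatial softmax attention and a channel-wise sigmoid gate; per the abstract, the ELU embedding is what lets stacked blocks approximate higher-order interactions.

```python
import numpy as np

rng = np.random.default_rng(0)

def elu(x, alpha=1.0):
    # Exponential Linear Unit, applied element-wise.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def x_linear_attention(q, K, V, Wq, Wk, Wv, We, wb):
    """One X-Linear attention step (illustrative sketch, not the paper's code).
    q: (D,) query; K, V: (N, D) region keys/values; N regions, D channels."""
    q_emb = elu(q @ Wq.T)                    # (Dm,) embedded query
    # Bilinear pooling: element-wise product of embedded query and keys/values
    # captures 2nd-order query-key interactions per region.
    Bk = elu(K @ Wk.T) * q_emb               # (N, Dm) joint key representations
    Bv = elu(V @ Wv.T) * q_emb               # (N, Dm) joint value representations
    # Spatial attention: one normalized weight per region.
    beta = softmax(Bk @ wb)                  # (N,)
    # Channel-wise attention: one gate per channel, from the pooled joint rep.
    lam = sigmoid(Bk.mean(axis=0) @ We.T)    # (Dm,)
    # Attended feature: channel gate times spatially weighted sum of values.
    return lam * (beta @ Bv)                 # (Dm,)

D, Dm, N = 8, 16, 5
Wq, Wk, Wv = (rng.standard_normal((Dm, D)) * 0.1 for _ in range(3))
We = rng.standard_normal((Dm, Dm)) * 0.1
wb = rng.standard_normal(Dm) * 0.1
q = rng.standard_normal(D)
K = rng.standard_normal((N, D))
out = x_linear_attention(q, K, K, Wq, Wk, Wv, We, wb)
print(out.shape)  # (16,)
```

Stacking such blocks (feeding the output back as the next query) is how the paper's encoder/decoder would obtain higher-order interactions; this sketch shows only one step.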
Pages: 10968-10977
Page count: 10
Related Papers
42 in total
[31] Sutskever I, 2014, ADV NEUR IN, V27
[32] Vaswani A, 2017, ADV NEUR IN, V30
[33] Vedantam R, 2015, PROC CVPR IEEE, P4566, DOI 10.1109/CVPR.2015.7299087
[34] Vinyals O, 2015, PROC CVPR IEEE, P3156, DOI 10.1109/CVPR.2015.7298935
[35] Wang J, 2019, PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, P940
[36] Xu K, 2015, PR MACH LEARN RES, V37, P2048
[37] Yao T, Pan Y, Li Y, Mei T. Hierarchy Parsing for Image Captioning. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019: 2621-2629
[38] Yao T, Pan Y, Li Y, Mei T. Exploring Visual Relationship for Image Captioning. COMPUTER VISION - ECCV 2018, PT XIV, 2018, 11218: 711-727
[39] Yao T, Pan Y, Li Y, Qiu Z, Mei T. Boosting Image Captioning with Attributes. 2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017: 4904-4912
[40] Yao T, Pan Y, Li Y, Mei T. Incorporating Copying Mechanism in Image Captioning for Learning Novel Objects. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017: 5263-5271