X-Linear Attention Networks for Image Captioning

Cited by: 497
Authors
Pan, Yingwei [1]
Yao, Ting [1]
Li, Yehao [1]
Mei, Tao [1]
Affiliations
[1] JD AI Research, Beijing, People's Republic of China
Source
2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 2020
DOI
10.1109/CVPR42600.2020.01098
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent progress on fine-grained visual recognition and visual question answering has featured Bilinear Pooling, which effectively models the 2nd-order interactions across multi-modal inputs. Nevertheless, there has been no evidence in support of building such interactions concurrently with the attention mechanism for image captioning. In this paper, we introduce a unified attention block, the X-Linear attention block, that fully employs bilinear pooling to selectively capitalize on visual information or perform multi-modal reasoning. Technically, the X-Linear attention block simultaneously exploits both spatial and channel-wise bilinear attention distributions to capture the 2nd-order interactions between the input single-modal or multi-modal features. Higher and even infinite-order feature interactions are readily modeled by stacking multiple X-Linear attention blocks and by equipping the block with the Exponential Linear Unit (ELU) in a parameter-free fashion, respectively. Furthermore, we present X-Linear Attention Networks (dubbed X-LAN), which integrate X-Linear attention block(s) into both the image encoder and the sentence decoder of the image captioning model to leverage higher-order intra- and inter-modal interactions. Experiments on the COCO benchmark demonstrate that our X-LAN obtains the best published CIDEr performance to date of 132.0% on the COCO Karpathy test split. When the Transformer is further endowed with X-Linear attention blocks, CIDEr is boosted to 132.8%. Source code is available at https://github.com/Panda-Peter/image-captioning.
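The spatial and channel-wise bilinear attention described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' released implementation: the low-rank bilinear pooling (element-wise product of projected query and keys), the mean-pooling feeding the channel gate, and all dimensions, weight initialisations, and names are assumptions made for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def x_linear_attention(q, K, V, d_b=64, seed=0):
    """Sketch of one X-Linear attention step.

    q: (d_q,) query; K: (m, d_k) region keys; V: (m, d_v) region values.
    Returns a (d_b,) attended feature. All projections are random here,
    standing in for learned weights.
    """
    rng = np.random.default_rng(seed)
    d_q, (m, d_k) = q.shape[0], K.shape
    d_v = V.shape[1]
    # Low-rank bilinear pooling: element-wise product of projected query/keys
    # approximates the full 2nd-order (outer-product) interaction.
    Wq = rng.standard_normal((d_b, d_q)) / np.sqrt(d_q)
    Wk = rng.standard_normal((d_b, d_k)) / np.sqrt(d_k)
    B_k = np.maximum(Wq @ q, 0) * np.maximum(K @ Wk.T, 0)    # (m, d_b)
    # Spatial attention: one scalar per region, softmax-normalised.
    w_s = rng.standard_normal(d_b) / np.sqrt(d_b)
    beta_s = softmax(B_k @ w_s)                               # (m,)
    # Channel attention: sigmoid gate over pooled bilinear features.
    Wc = rng.standard_normal((d_b, d_b)) / np.sqrt(d_b)
    beta_c = 1.0 / (1.0 + np.exp(-(Wc @ B_k.mean(axis=0))))  # (d_b,)
    # Bilinear query-value features, reweighted spatially then channel-wise.
    Wq2 = rng.standard_normal((d_b, d_q)) / np.sqrt(d_q)
    Wv = rng.standard_normal((d_b, d_v)) / np.sqrt(d_v)
    B_v = np.maximum(Wq2 @ q, 0) * np.maximum(V @ Wv.T, 0)   # (m, d_b)
    return beta_c * (beta_s @ B_v)                            # (d_b,)
```

Stacking such blocks (feeding each output back as the next query) would model higher-order interactions, and replacing the ReLU-style `np.maximum(., 0)` with ELU is the parameter-free route to the infinite-order behaviour the abstract mentions.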
Pages: 10968-10977 (10 pages)