Improving image captioning with Pyramid Attention and SC-GAN

Cited by: 25
Authors
Chen, Tianyu [1 ]
Li, Zhixin [1 ]
Wu, Jingli [1 ]
Ma, Huifang [2 ]
Su, Bianping [3 ]
Affiliations
[1] Guangxi Normal Univ, Guangxi Key Lab Multisource Informat Min & Secur, Guilin 541004, Peoples R China
[2] Northwest Normal Univ, Coll Comp Sci & Engn, Lanzhou 730070, Peoples R China
[3] Xian Univ Architecture & Technol, Coll Sci, Xian 710055, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image captioning; Pyramid Attention network; Self-critical training; Reinforcement learning; Generative adversarial network; Sequence-level learning;
DOI
10.1016/j.imavis.2021.104340
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Most existing image captioning models rely on global attention over whole-image features, local attention over object features, or a combination of the two; few models integrate the relationship information between the various object regions of an image. Yet this relationship information is highly instructive for caption generation: if a football appears, for example, there is a high probability that the image also contains people near it. In this article, the relationship feature is embedded into global-local attention to construct a new Pyramid Attention mechanism, which can explore the internal visual and semantic relationships between different object regions. In addition, to alleviate the exposure bias problem and make training more efficient, we propose a new method for applying a Generative Adversarial Network to sequence generation, in which greedy decoding provides an efficient baseline reward for self-critical training. Finally, experiments on the MSCOCO dataset show that the model generates more accurate and vivid captions and outperforms many recent advanced models on the prevailing evaluation metrics for both the local and online test sets. (c) 2021 Elsevier B.V. All rights reserved.
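The self-critical training the abstract describes uses the score of a greedy-decoded caption as a baseline reward. A minimal sketch of that loss (assumed SCST-style formulation, not the paper's exact code; the reward values stand in for sequence-level scores such as CIDEr):

```python
def self_critical_loss(sample_logprob, sample_reward, greedy_reward):
    """SCST-style policy-gradient loss (illustrative sketch only).

    sample_logprob: summed log-probability of a caption sampled from the model
    sample_reward:  sequence-level score (e.g. CIDEr) of the sampled caption
    greedy_reward:  score of the greedy-decoded caption, used as the baseline
    """
    # Advantage: how much better the sampled caption scored than the
    # greedy baseline; no learned value network is needed.
    advantage = sample_reward - greedy_reward
    # REINFORCE update: minimizing this raises the log-probability of
    # samples that beat the baseline and lowers it for samples that don't.
    return -advantage * sample_logprob

# Toy numbers: sampled caption scores 0.8 vs. a greedy baseline of 0.5,
# with a summed log-probability of -3.0 for the sampled caption.
loss = self_critical_loss(sample_logprob=-3.0, sample_reward=0.8, greedy_reward=0.5)
print(loss)  # 0.9
```

Because the baseline comes from the model's own greedy decode, training directly optimizes the test-time (greedy) behavior, which is what makes the baseline "efficient" compared to estimating it with extra sampled rollouts.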
Pages: 12