Improving image captioning with Pyramid Attention and SC-GAN

Cited by: 25
Authors
Chen, Tianyu [1 ]
Li, Zhixin [1 ]
Wu, Jingli [1 ]
Ma, Huifang [2 ]
Su, Bianping [3 ]
Affiliations
[1] Guangxi Normal Univ, Guangxi Key Lab Multisource Informat Min & Secur, Guilin 541004, Peoples R China
[2] Northwest Normal Univ, Coll Comp Sci & Engn, Lanzhou 730070, Peoples R China
[3] Xian Univ Architecture & Technol, Coll Sci, Xian 710055, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image captioning; Pyramid Attention network; Self-critical training; Reinforcement learning; Generative adversarial network; Sequence-level learning;
DOI
10.1016/j.imavis.2021.104340
CLC Classification Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Most existing image captioning models mainly use global attention, which represents whole-image features; local attention, which represents object features; or a combination of the two. Few models integrate the relationship information between the various object regions of an image, yet this information is also highly instructive for caption generation: if a football appears, for example, there is a high probability that the image also contains people near it. In this article, the relationship feature is embedded into global-local attention to construct a new Pyramid Attention mechanism, which can explore the internal visual and semantic relationships between different object regions. In addition, to alleviate the exposure bias problem and make training more efficient, we propose a new method for applying a Generative Adversarial Network to sequence generation, in which greedy decoding is used to produce an efficient baseline reward for self-critical training. Finally, experiments on the MSCOCO dataset show that the model generates more accurate and vivid captions and outperforms many recent advanced models on various prevailing evaluation metrics on both local and online test sets. (c) 2021 Elsevier B.V. All rights reserved.
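The abstract does not spell out how the three feature streams are fused, so the following is only a minimal sketch of one plausible reading: a shared decoder-state query attends separately over global, local (object), and relationship features, and the three attended contexts are concatenated and projected. The module names, shapes, and fusion-by-concatenation choice are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PyramidAttentionSketch(nn.Module):
    """Illustrative three-level attention: global, object (local), and
    relationship features attended with a shared query, then fused.
    This wiring is an assumption; the paper defines the real mechanism."""

    def __init__(self, d_model: int):
        super().__init__()
        self.levels = ("global", "local", "relation")
        self.attn = nn.ModuleDict({
            name: nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
            for name in self.levels
        })
        self.fuse = nn.Linear(len(self.levels) * d_model, d_model)

    def forward(self, query, global_f, local_f, rel_f):
        # query: (batch, 1, d_model) decoder hidden state
        # *_f:   (batch, n_regions, d_model) pre-projected feature maps
        contexts = []
        for name, feats in zip(self.levels, (global_f, local_f, rel_f)):
            ctx, _ = self.attn[name](query, feats, feats)  # attended context
            contexts.append(ctx)
        # Concatenate the three attended contexts and project back.
        return self.fuse(torch.cat(contexts, dim=-1))
```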
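The greedy baseline mentioned in the abstract follows the standard self-critical sequence training idea: the sentence-level reward of the greedy-decoded caption is subtracted from the reward of a sampled caption, so no learned value baseline is needed. A minimal sketch of that objective is shown below; the tensor names and shapes are assumptions for illustration.

```python
import torch

def self_critical_loss(sample_logprobs, sample_reward, greedy_reward):
    """Policy-gradient loss with the greedy caption as baseline.

    sample_logprobs: (batch, seq_len) log-probs of the sampled tokens
    sample_reward:   (batch,) sentence score (e.g. CIDEr) of sampled captions
    greedy_reward:   (batch,) sentence score of greedy-decoded captions
    """
    # Advantage: only samples that beat the greedy baseline are reinforced.
    advantage = (sample_reward - greedy_reward).unsqueeze(1)
    # REINFORCE: maximize advantage-weighted log-likelihood of the sample.
    return -(advantage * sample_logprobs).sum(dim=1).mean()
```

Because the baseline comes from the model's own test-time (greedy) decoding, the gradient pushes sampled captions to outperform the model's inference behavior, which is what makes the scheme "self-critical".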
Pages: 12