Improving image captioning with Pyramid Attention and SC-GAN

Cited by: 25
Authors
Chen, Tianyu [1 ]
Li, Zhixin [1 ]
Wu, Jingli [1 ]
Ma, Huifang [2 ]
Su, Bianping [3 ]
Affiliations
[1] Guangxi Normal Univ, Guangxi Key Lab Multisource Informat Min & Secur, Guilin 541004, Peoples R China
[2] Northwest Normal Univ, Coll Comp Sci & Engn, Lanzhou 730070, Peoples R China
[3] Xian Univ Architecture & Technol, Coll Sci, Xian 710055, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image captioning; Pyramid Attention network; Self-critical training; Reinforcement learning; Generative adversarial network; Sequence-level learning;
DOI
10.1016/j.imavis.2021.104340
CLC Classification Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Most existing image captioning models mainly use global attention, which represents whole-image features; local attention, which represents object features; or a combination of the two. Few models integrate the relationship information between the various object regions of an image, yet this information is also highly instructive for caption generation: if a football appears, for example, there is a high probability that the image also contains people near it. In this article, the relationship feature is embedded into global-local attention to construct a new Pyramid Attention mechanism, which can explore the internal visual and semantic relationships between different object regions. In addition, to alleviate the exposure bias problem and make training more efficient, we propose a new method for applying a Generative Adversarial Network to sequence generation, in which greedy decoding is used to produce an efficient baseline reward for self-critical training. Finally, experiments on the MSCOCO dataset show that the model generates more accurate and vivid captions and outperforms many recent advanced models on various prevailing evaluation metrics on both local and online test sets. (c) 2021 Elsevier B.V. All rights reserved.
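The abstract does not spell out how the three feature streams are fused, so the following is only a minimal sketch of one plausible reading: a shared decoder-state query attends separately over global, local (object), and relationship features, and the three attended contexts are concatenated and projected. The module names, shapes, and fusion-by-concatenation choice are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PyramidAttentionSketch(nn.Module):
    """Illustrative three-level attention: global, object (local), and
    relationship features attended with a shared query, then fused.
    This wiring is an assumption; the paper defines the real mechanism."""

    def __init__(self, d_model: int):
        super().__init__()
        self.levels = ("global", "local", "relation")
        self.attn = nn.ModuleDict({
            name: nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
            for name in self.levels
        })
        self.fuse = nn.Linear(len(self.levels) * d_model, d_model)

    def forward(self, query, global_f, local_f, rel_f):
        # query: (batch, 1, d_model) decoder hidden state
        # *_f:   (batch, n_regions, d_model) pre-projected feature maps
        contexts = []
        for name, feats in zip(self.levels, (global_f, local_f, rel_f)):
            ctx, _ = self.attn[name](query, feats, feats)  # attended context
            contexts.append(ctx)
        # Concatenate the three attended contexts and project back.
        return self.fuse(torch.cat(contexts, dim=-1))
```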
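The greedy baseline mentioned in the abstract follows the standard self-critical sequence training idea: the sentence-level reward of the greedy-decoded caption is subtracted from the reward of a sampled caption, so no learned value baseline is needed. A minimal sketch of that objective is shown below; the tensor names and shapes are assumptions for illustration.

```python
import torch

def self_critical_loss(sample_logprobs, sample_reward, greedy_reward):
    """Policy-gradient loss with the greedy caption as baseline.

    sample_logprobs: (batch, seq_len) log-probs of the sampled tokens
    sample_reward:   (batch,) sentence score (e.g. CIDEr) of sampled captions
    greedy_reward:   (batch,) sentence score of greedy-decoded captions
    """
    # Advantage: only samples that beat the greedy baseline are reinforced.
    advantage = (sample_reward - greedy_reward).unsqueeze(1)
    # REINFORCE: maximize advantage-weighted log-likelihood of the sample.
    return -(advantage * sample_logprobs).sum(dim=1).mean()
```

Because the baseline comes from the model's own test-time (greedy) decoding, the gradient pushes sampled captions to outperform the model's inference behavior, which is what makes the scheme "self-critical".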
Pages: 12