Improving image captioning with Pyramid Attention and SC-GAN

Cited by: 25
Authors
Chen, Tianyu [1 ]
Li, Zhixin [1 ]
Wu, Jingli [1 ]
Ma, Huifang [2 ]
Su, Bianping [3 ]
Affiliations
[1] Guangxi Normal Univ, Guangxi Key Lab Multisource Informat Min & Secur, Guilin 541004, Peoples R China
[2] Northwest Normal Univ, Coll Comp Sci & Engn, Lanzhou 730070, Peoples R China
[3] Xian Univ Architecture & Technol, Coll Sci, Xian 710055, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image captioning; Pyramid Attention network; Self-critical training; Reinforcement learning; Generative adversarial network; Sequence-level learning;
DOI
10.1016/j.imavis.2021.104340
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Most existing image captioning models rely on global attention over whole-image features, local attention over object features, or a combination of the two; few models integrate the relationship information between the various object regions of an image. Yet this relationship information is highly instructive for caption generation: if a football appears, for example, there is a high probability that the image also contains people near it. In this article, the relationship feature is embedded into global-local attention to construct a new Pyramid Attention mechanism, which can explore the internal visual and semantic relationships between different object regions. In addition, to alleviate the exposure bias problem and make training more efficient, we propose a new method for applying a Generative Adversarial Network to sequence generation, in which greedy decoding provides an efficient baseline reward for self-critical training. Finally, experiments on the MSCOCO dataset show that the model generates more accurate and vivid captions and outperforms many recent advanced models on the prevailing evaluation metrics for both the local and online test sets. (c) 2021 Elsevier B.V. All rights reserved.
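The self-critical training the abstract describes uses the score of a greedy-decoded caption as a baseline reward. A minimal sketch of that loss (assumed SCST-style formulation, not the paper's exact code; the reward values stand in for sequence-level scores such as CIDEr):

```python
def self_critical_loss(sample_logprob, sample_reward, greedy_reward):
    """SCST-style policy-gradient loss (illustrative sketch only).

    sample_logprob: summed log-probability of a caption sampled from the model
    sample_reward:  sequence-level score (e.g. CIDEr) of the sampled caption
    greedy_reward:  score of the greedy-decoded caption, used as the baseline
    """
    # Advantage: how much better the sampled caption scored than the
    # greedy baseline; no learned value network is needed.
    advantage = sample_reward - greedy_reward
    # REINFORCE update: minimizing this raises the log-probability of
    # samples that beat the baseline and lowers it for samples that don't.
    return -advantage * sample_logprob

# Toy numbers: sampled caption scores 0.8 vs. a greedy baseline of 0.5,
# with a summed log-probability of -3.0 for the sampled caption.
loss = self_critical_loss(sample_logprob=-3.0, sample_reward=0.8, greedy_reward=0.5)
print(loss)  # 0.9
```

Because the baseline comes from the model's own greedy decode, training directly optimizes the test-time (greedy) behavior, which is what makes the baseline "efficient" compared to estimating it with extra sampled rollouts.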
Pages: 12