Local-to-Global Semantic Supervised Learning for Image Captioning

Times Cited: 0
Authors
Wang, Juan [1 ]
Duan, Yiping [1 ]
Tao, Xiaoming [1 ]
Lu, Jianhua [1 ]
Affiliations
[1] Tsinghua Univ, Beijing Natl Res Ctr Informat Sci & Technol (BNRist), Dept Elect Engn, Beijing, Peoples R China
Source
ICC 2020 - 2020 IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC) | 2020
Funding
National Natural Science Foundation of China;
Keywords
image caption; semantic supervised learning; attention mechanism; ATTENTION;
DOI
10.1109/icc40277.2020.9149264
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Discipline Classification Codes
0808 ; 0809 ;
Abstract
Image captioning is a challenging problem owing to the complexity of image content and the diverse ways of describing that content in natural language. Although current methods have made substantial progress on objective metrics (such as BLEU, METEOR, ROUGE-L, and CIDEr), problems remain. Specifically, most of these methods are trained to maximize the log-likelihood or the objective metrics, and as a result they often generate rigid and semantically incomplete captions. In this paper, we develop a new model that aims to generate captions conforming to human evaluation. The core idea is local-to-global semantic supervised learning, realized through two levels of optimization objective functions. At the word level, we match each word to the image regions using a local attention objective function; at the sentence level, we align the entire sentence with the image using a global semantic objective function. Experimentally, we compare the proposed model with current methods on the MSCOCO dataset. Ablation studies show that both local attention supervision and global semantic supervision are necessary components of our model. Furthermore, combining the two supervision objectives achieves state-of-the-art performance in terms of both standard evaluation metrics and human judgment.
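The abstract describes a composite training objective: the usual log-likelihood captioning loss augmented with a word-level local attention term and a sentence-level global semantic alignment term. The sketch below illustrates how such a combined objective could be wired up in PyTorch. All function names, tensor shapes, loss forms, and the weights `lam_local` and `lam_global` are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of a two-level (local + global) supervised captioning loss,
# under our own assumptions about shapes and loss forms.
import torch
import torch.nn.functional as F

def local_attention_loss(attn_weights, region_word_targets):
    """Word-level supervision: push the predicted attention over image regions
    toward an assumed soft word-to-region alignment target.
    attn_weights, region_word_targets: [batch, T, num_regions], rows sum to 1."""
    # KL divergence between predicted attention and the target alignment.
    return F.kl_div(attn_weights.clamp_min(1e-8).log(),
                    region_word_targets, reduction="batchmean")

def global_semantic_loss(img_emb, sent_emb):
    """Sentence-level supervision: align the whole-caption embedding with the
    global image embedding (a cosine-similarity alignment term)."""
    img_emb = F.normalize(img_emb, dim=-1)
    sent_emb = F.normalize(sent_emb, dim=-1)
    return (1.0 - (img_emb * sent_emb).sum(dim=-1)).mean()

def total_loss(logits, targets, attn_weights, region_word_targets,
               img_emb, sent_emb, lam_local=1.0, lam_global=1.0):
    """Combined objective: cross-entropy (log-likelihood) captioning loss
    plus the weighted local and global supervision terms.
    logits: [batch, T, vocab]; targets: [batch, T] word indices."""
    xe = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    return (xe
            + lam_local * local_attention_loss(attn_weights, region_word_targets)
            + lam_global * global_semantic_loss(img_emb, sent_emb))
```

One plausible reading of the ablation result is that the cross-entropy term alone drives fluency, while the two auxiliary terms separately anchor word-region grounding and sentence-image semantics; dropping either weight to zero removes one kind of anchoring.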
Pages: 6