Local-to-Global Semantic Supervised Learning for Image Captioning

Times Cited: 0
Authors
Wang, Juan [1 ]
Duan, Yiping [1 ]
Tao, Xiaoming [1 ]
Lu, Jianhua [1 ]
Affiliations
[1] Tsinghua Univ, Beijing Natl Res Ctr Informat Sci & Technol (BNRist), Dept Elect Engn, Beijing, Peoples R China
Source
ICC 2020 - 2020 IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC) | 2020
Funding
National Natural Science Foundation of China;
Keywords
image caption; semantic supervised learning; attention mechanism; ATTENTION;
DOI
10.1109/icc40277.2020.9149264
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Discipline Classification Codes
0808 ; 0809 ;
Abstract
Image captioning is a challenging problem owing to the complexity of image content and the diverse ways of describing that content in natural language. Although current methods have made substantial progress on objective metrics (such as BLEU, METEOR, ROUGE-L, and CIDEr), problems remain. Specifically, most of these methods are trained to maximize the log-likelihood or the objective metrics, and as a result they often generate rigid and semantically incomplete captions. In this paper, we develop a new model that aims to generate captions conforming to human evaluation. The core idea is local-to-global semantic supervised learning, realized through two levels of optimization objective functions. At the word level, we match each word to the image regions using a local attention objective function; at the sentence level, we align the entire sentence with the image using a global semantic objective function. Experimentally, we compare the proposed model with current methods on the MSCOCO dataset. Ablation studies show that both local attention supervision and global semantic supervision are necessary components of our model. Furthermore, combining the two supervision objectives achieves state-of-the-art performance in terms of both standard evaluation metrics and human judgment.
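The abstract describes a composite training objective: the usual log-likelihood captioning loss augmented with a word-level local attention term and a sentence-level global semantic alignment term. The sketch below illustrates how such a combined objective could be wired up in PyTorch. All function names, tensor shapes, loss forms, and the weights `lam_local` and `lam_global` are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of a two-level (local + global) supervised captioning loss,
# under our own assumptions about shapes and loss forms.
import torch
import torch.nn.functional as F

def local_attention_loss(attn_weights, region_word_targets):
    """Word-level supervision: push the predicted attention over image regions
    toward an assumed soft word-to-region alignment target.
    attn_weights, region_word_targets: [batch, T, num_regions], rows sum to 1."""
    # KL divergence between predicted attention and the target alignment.
    return F.kl_div(attn_weights.clamp_min(1e-8).log(),
                    region_word_targets, reduction="batchmean")

def global_semantic_loss(img_emb, sent_emb):
    """Sentence-level supervision: align the whole-caption embedding with the
    global image embedding (a cosine-similarity alignment term)."""
    img_emb = F.normalize(img_emb, dim=-1)
    sent_emb = F.normalize(sent_emb, dim=-1)
    return (1.0 - (img_emb * sent_emb).sum(dim=-1)).mean()

def total_loss(logits, targets, attn_weights, region_word_targets,
               img_emb, sent_emb, lam_local=1.0, lam_global=1.0):
    """Combined objective: cross-entropy (log-likelihood) captioning loss
    plus the weighted local and global supervision terms.
    logits: [batch, T, vocab]; targets: [batch, T] word indices."""
    xe = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    return (xe
            + lam_local * local_attention_loss(attn_weights, region_word_targets)
            + lam_global * global_semantic_loss(img_emb, sent_emb))
```

One plausible reading of the ablation result is that the cross-entropy term alone drives fluency, while the two auxiliary terms separately anchor word-region grounding and sentence-image semantics; dropping either weight to zero removes one kind of anchoring.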
Pages: 6