Fine-grained attention for image caption generation

Cited by: 15
Authors
Chang, Yan-Shuo [1, 2]
Affiliations
[1] China Xian Inst Silk Rd Res, Xian 710100, Shaanxi, Peoples R China
[2] Xian Univ Finance & Econ, Sch Informat, Xian 710100, Shaanxi, Peoples R China
Keywords
Fine-grained attention; Image caption generation; Attention generation;
DOI
10.1007/s11042-017-4593-1
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Despite recent progress, generating natural language descriptions for images remains a challenging task. Most state-of-the-art methods apply existing deep convolutional neural network (CNN) models to extract a visual representation of the entire image, and then exploit the parallel structures between images and sentences using recurrent neural networks. An inherent drawback is that such models may attend to a partial view of a visual element or to a conglomeration of several concepts. In this paper, we present a fine-grained attention model built on a deep recurrent architecture that combines recent advances in computer vision and machine translation. The model contains three sub-networks: a deep recurrent neural network for sentences, a deep convolutional network for images, and a region proposal network for nearly cost-free region proposals. The model automatically learns to fix its gaze on salient region proposals, so that generating the next word, given the previously generated ones, is aligned with this visual perception experience. We validate the proposed model on three benchmark datasets (Flickr 8K, Flickr 30K and MS COCO), and the experimental results confirm its effectiveness.
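The idea of fixing the gaze on salient region proposals can be illustrated with a generic soft-attention step. The sketch below is not the paper's implementation: the abstract does not give the scoring function, so dot-product scoring between each region feature and the decoder hidden state is an assumption, and the names `attend`, `softmax`, `regions`, and `hidden` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, hidden_state):
    """Soft attention over region proposals.

    Assumption: each region is scored by its dot product with the
    decoder hidden state; the abstract does not specify the scoring.
    Returns the attention weights and the weighted context vector.
    """
    scores = [
        sum(f * h for f, h in zip(feat, hidden_state))
        for feat in region_features
    ]
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context


# Toy example: three region proposals with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
hidden = [2.0, 0.0]  # hypothetical decoder state before the next word
weights, context = attend(regions, hidden)
```

In a full captioning model, a context vector of this kind would typically be fed to the recurrent decoder together with the previously generated word; how the paper combines them is not stated in the abstract.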
Pages: 2959-2971
Page count: 13