StyleNet: Generating Attractive Visual Captions with Styles

Cited by: 191
Authors
Gan, Chuang [1 ]
Gan, Zhe [2 ]
He, Xiaodong [3 ]
Gao, Jianfeng [3 ]
Deng, Li [3 ]
Affiliations
[1] Tsinghua Univ, IIIS, Beijing, Peoples R China
[2] Duke Univ, Durham, NC 27706 USA
[3] Microsoft Res Redmond, Redmond, WA USA
Source
30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017) | 2017
Funding
National Natural Science Foundation of China
DOI
10.1109/CVPR.2017.108
CLC classification
TP18 [Theory of Artificial Intelligence]
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We propose a novel framework named StyleNet to address the task of generating attractive captions for images and videos with different styles. To this end, we devise a novel model component, named factored LSTM, which automatically distills the style factors in the monolingual text corpus. Then at runtime, we can explicitly control the style in the caption generation process so as to produce attractive visual captions with the desired style. Our approach achieves this goal by leveraging two sets of data: 1) factual image/video-caption paired data, and 2) stylized monolingual text data (e.g., romantic and humorous sentences). We show experimentally that StyleNet outperforms existing approaches for generating visual captions with different styles, as measured by both automatic metrics and human evaluation on the newly collected FlickrStyle10K image caption dataset, which contains 10K Flickr images with corresponding humorous and romantic captions.
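The core idea of the factored LSTM described in the abstract is to factorize the input-to-hidden weight matrix into shared components and a style-specific factor, so that swapping the factor changes the style of the generated caption. The sketch below is a minimal toy illustration of that idea in NumPy, not the authors' implementation: all dimensions, variable names (`U`, `S`, `V`, `W_h`), and the choice of a diagonal style factor are illustrative assumptions.

```python
import numpy as np

# Toy sketch of a factored-LSTM step: the input weights are recomposed as
# W_x = U @ S @ V, where U and V are shared across styles and S is a
# style-specific factor (here a diagonal matrix, an assumption for brevity).

rng = np.random.default_rng(0)
d_in, d_factor, d_hid = 8, 6, 10  # toy dimensions (assumptions)

U = rng.normal(size=(4 * d_hid, d_factor))       # shared left factor
V = rng.normal(size=(d_factor, d_in))            # shared right factor
S = {style: np.diag(rng.normal(size=d_factor))   # one factor per style
     for style in ("factual", "romantic", "humorous")}
W_h = rng.normal(size=(4 * d_hid, d_hid))        # shared recurrent weights
b = np.zeros(4 * d_hid)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def factored_lstm_step(x, h, c, style):
    """One LSTM step whose input weights depend on the chosen style."""
    W_x = U @ S[style] @ V                       # recompose weights for style
    z = W_x @ x + W_h @ h + b
    i, f, o, g = np.split(z, 4)                  # input/forget/output/cell gates
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

x = rng.normal(size=d_in)
h, c = np.zeros(d_hid), np.zeros(d_hid)
h_rom, _ = factored_lstm_step(x, h, c, "romantic")
h_fac, _ = factored_lstm_step(x, h, c, "factual")
```

With shared `U`, `V`, and `W_h` trained on paired factual data and each `S` trained on the corresponding monolingual corpus, the same input produces different hidden states (and hence different captions) purely by selecting the style factor at runtime.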
Pages: 955-964
Page count: 10