Image Captioning With Visual-Semantic Double Attention

Cited by: 12
Authors
He, Chen [1]
Hu, Haifeng [1]
Affiliation
[1] Sun Yat Sen Univ, Sch Elect & Informat Technol, Guangzhou 510006, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visual-semantic double attention; image captioning; semantic attention;
DOI
10.1145/3292058
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
In this article, we propose a novel Visual-Semantic Double Attention (VSDA) model for image captioning. VSDA consists of two parts: a modified visual attention model that extracts sub-region image features, and a new SEmantic Attention (SEA) model that distills semantic features. Traditional attribute-based models neglect the differing importance of individual attribute words and fuse all of them into the recurrent neural network, which introduces many irrelevant semantic features. In contrast, at each timestep our model selects the most relevant word that aligns with the current context. The strength of VSDA therefore lies not only in leveraging semantic features but also in eliminating the influence of irrelevant attribute words, which makes the semantic guidance more precise. Furthermore, our approach addresses the problem that visual attention models cannot improve the generation of non-visual words. Because visual and semantic features are complementary, our model leverages both to strengthen the generation of visual and non-visual words. Extensive experiments are conducted on two widely used datasets, MS COCO and Flickr30k. The results show that VSDA outperforms other methods and achieves promising performance.
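The idea of weighting attribute words by their relevance to the current decoding context can be illustrated with a minimal sketch. This is not the SEA model from the article: the dot-product scoring, the toy embeddings, and the names `semantic_attention`, `attributes`, `vectors`, and `context` are all assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def semantic_attention(context, attribute_vectors):
    """Weight attribute embeddings by relevance to the current context.

    Dot-product scoring is an assumption of this sketch, not the
    article's formulation. Returns the attention weights, the weighted
    sum of attribute vectors, and the index of the most relevant word.
    """
    scores = [
        sum(c * a for c, a in zip(context, vec)) for vec in attribute_vectors
    ]
    weights = softmax(scores)
    dim = len(context)
    attended = [
        sum(w * vec[d] for w, vec in zip(weights, attribute_vectors))
        for d in range(dim)
    ]
    best = max(range(len(weights)), key=weights.__getitem__)
    return weights, attended, best


# Hypothetical toy data: three attribute words with 2-d embeddings.
attributes = ["dog", "grass", "running"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
# Stand-in for the decoder hidden state at a single timestep.
context = [2.0, 0.1]

weights, attended, best = semantic_attention(context, vectors)
most_relevant_word = attributes[best]
```

In this toy setup the attribute whose embedding aligns best with the context receives the largest weight, while poorly aligned attributes are down-weighted rather than fused in equally.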
Pages: 16