Sequential Dual Attention: Coarse-to-Fine-Grained Hierarchical Generation for Image Captioning

Cited by: 2
Authors
Guan, Zhibin [1]
Liu, Kang [1]
Ma, Yan [1]
Qian, Xu [1]
Ji, Tongkai [1, 2]
Affiliations
[1] China Univ Min & Technol Beijing, Sch Mech Elect & Informat Engn, Beijing 100083, Peoples R China
[2] Chinese Acad Sci, Cloud Comp Ctr, G Cloud Technol Corp, Dongguan 523808, Peoples R China
Source
SYMMETRY-BASEL | 2018 / Volume 10 / Issue 11
Keywords
image caption generation; sequential dual attention; coarse-to-fine-grained; SDA-CFGHG;
DOI
10.3390/sym10110626
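Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code
07; 0710; 09;
Abstract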
Image caption generation is a fundamental task that bridges an image and its textual description, and it is drawing increasing interest in artificial intelligence. Images and textual sentences can be viewed as two different carriers of information that are symmetric and unified in the content of the same visual scene. Existing image captioning methods rarely generate the final description sentence in a coarse-grained to fine-grained way, which is how humans understand surrounding scenes, and the generated sentence sometimes describes only coarse-grained image content. We therefore propose a coarse-to-fine-grained hierarchical generation method for image captioning, named SDA-CFGHG, to address these two problems. The core of SDA-CFGHG is a sequential dual attention that sequentially fuses visual information of different granularities. The advantage of SDA-CFGHG is that it captions images in a coarse-to-fine-grained way, and the generated sentence can capture details of the raw image to some degree. Moreover, we validate the performance of our method on benchmark datasets (MS COCO, Flickr) with several popular evaluation metrics (CIDEr, SPICE, METEOR, ROUGE-L, and BLEU).
Pages: 17