Deep multimodal embedding for video captioning

Cited by: 9
Author
Lee, Jin Young [1 ]
Affiliation
[1] Sejong Univ, Sch Intelligent Mechatron Engn, Seoul, South Korea
Keywords
Deep embedding; LSTM network; Multimodal features; Video captioning;
DOI
10.1007/s11042-019-08011-3
CLC number
TP [Automation technology, computer technology]
Discipline code
0812
Abstract
Automatically generating natural language descriptions of videos, commonly called video captioning, is a very challenging task in computer vision. Thanks to the success of image captioning, rapid progress has been made in video captioning in recent years. Unlike images, videos carry several modalities of information, such as frames, motion, and audio. Since each modality has different characteristics, how these modalities are embedded in a multimodal video captioning network is very important. This paper proposes a deep multimodal embedding network based on an analysis of the multimodal features. The experimental results show that the captioning performance of the proposed network is very competitive with that of conventional networks.
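The core idea described above can be illustrated with a minimal sketch: each modality (frames, motion, audio) is projected into a shared embedding space before being fused and passed to a caption decoder such as an LSTM. The dimensions, function names, and the simple averaging fusion below are hypothetical illustrations, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality feature sizes (illustrative, not from the paper):
# e.g. CNN frame features, 3D-CNN motion features, audio features.
DIMS = {"frames": 2048, "motion": 1024, "audio": 128}
EMBED_DIM = 512  # shared multimodal embedding size (assumed)

# One linear projection per modality into the shared embedding space.
weights = {m: rng.standard_normal((d, EMBED_DIM)) / np.sqrt(d)
           for m, d in DIMS.items()}

def embed_multimodal(features):
    """Project each modality into the shared space, then fuse by averaging.

    A real captioning network would feed the fused vector to an LSTM
    decoder; here we stop at the fused embedding.
    """
    projected = [np.tanh(features[m] @ weights[m]) for m in DIMS]
    return np.mean(projected, axis=0)  # shape: (EMBED_DIM,)

# One video's raw multimodal features (random stand-ins).
video = {m: rng.standard_normal(d) for m, d in DIMS.items()}
fused = embed_multimodal(video)
print(fused.shape)  # (512,)
```

The design point the abstract makes is that because each modality has different characteristics, the choice of how to project and combine them (averaging here, but concatenation or learned attention in practice) strongly affects captioning quality.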
Pages: 31793-31805
Page count: 13