Incorporating Textual Similarity in Video Captioning Schemes

Citations: 0
Authors
Gkountakos, Konstantinos [1 ]
Dimou, Anastasios [1 ]
Papadopoulos, Georgios Th. [1 ]
Daras, Petros [1 ]
Affiliations
[1] Ctr Res & Technol Hellas, Inst Informat Technol, Thessaloniki, Greece
Source
2019 IEEE INTERNATIONAL CONFERENCE ON ENGINEERING, TECHNOLOGY AND INNOVATION (ICE/ITMC) | 2019
Funding
EU Horizon 2020;
Keywords
video captioning; Word2Vec; textual information; encoder-decoder; Recurrent Neural Network (RNN);
DOI
Not available
CLC classification number
TP301 [Theory, Methods];
Subject classification code
081202;
Abstract
The problem of video captioning has been heavily investigated by the research community in recent years, especially since the introduction of Recurrent Neural Networks (RNNs). Such video captioning approaches are usually based on sequence-to-sequence models that aim to exploit the visual information by detecting events and objects, or by matching entities to words. However, the contextual information that can be extracted from the vocabulary has not yet been exploited, except by approaches that make use of parts of speech such as verbs, nouns, and adjectives. The proposed approach is based on the assumption that textually similar captions should represent similar visual content. Specifically, we propose a novel loss function that penalizes or rewards incorrectly or correctly predicted words based on the semantic cluster to which they belong. The proposed method is evaluated on two widely known datasets in the video captioning domain, Microsoft Research - Video to Text (MSR-VTT) and the Microsoft Research Video Description Corpus (MSVD). Finally, the experimental analysis shows that the proposed method outperforms the baseline approach in most cases.
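The abstract describes a loss that scales the penalty for a predicted word according to the semantic cluster it falls in. The following is a minimal illustrative sketch of that idea, not the authors' implementation: it assumes word embeddings (e.g., from Word2Vec) grouped by plain k-means, and the function names, the cluster count, and the reward/penalty weights are all assumptions chosen for illustration.

```python
import numpy as np

def kmeans_clusters(embeddings, k, iters=20, seed=0):
    """Plain k-means over word embeddings; returns a cluster id per word.
    Stands in for the semantic clustering of Word2Vec vectors."""
    rng = np.random.default_rng(seed)
    centers = embeddings[rng.choice(len(embeddings), k, replace=False)]
    for _ in range(iters):
        # Distance of every word vector to every center -> nearest cluster.
        dists = np.linalg.norm(embeddings[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = embeddings[labels == c].mean(axis=0)
    return labels

def cluster_aware_loss(logits, target, labels, reward=0.5, penalty=1.5):
    """Cross-entropy for one word, scaled by semantic-cluster agreement.

    logits: (vocab,) unnormalized scores for the predicted word slot
    target: index of the ground-truth word
    labels: cluster id per vocabulary word (from kmeans_clusters)
    The reward/penalty multipliers are hypothetical hyperparameters.
    """
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    ce = -np.log(probs[target] + 1e-12)
    predicted = int(probs.argmax())
    same_cluster = labels[predicted] == labels[target]
    # A prediction in the right semantic cluster is penalized less.
    return ce * (reward if same_cluster else penalty)
```

Under this sketch, confusing "run" with "jog" (same cluster) costs less than confusing "run" with "table" (different cluster), which matches the abstract's assumption that textually similar captions describe similar visual content.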
Pages: 6