Watch It Twice: Video Captioning with a Refocused Video Encoder

Times Cited: 20
Authors
Shi, Xiangxi [1]
Cai, Jianfei [1,2]
Joty, Shafiq [1]
Gu, Jiuxiang [1]
Affiliations
[1] Nanyang Technol Univ, Singapore, Singapore
[2] Monash Univ, Clayton, Vic, Australia
Source
PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19) | 2019
Funding
National Research Foundation, Singapore
Keywords
video captioning; recurrent video encoding; reinforcement learning; key frame
DOI
10.1145/3343031.3351060
CLC Number
TP39 [Applications of Computers]
Subject Classification Codes
081203; 0835
Abstract
With the rapid growth of video data and the increasing demands of various cross-modal applications such as intelligent video search and assistance for visually impaired people, the video captioning task has recently received considerable attention in the computer vision and natural language processing fields. State-of-the-art video captioning methods focus on encoding temporal information but lack effective ways to remove irrelevant temporal information, and they also neglect spatial details. In particular, the current unidirectional video encoder can be negatively affected by irrelevant temporal information, especially at the beginning and end of a video. In addition, disregarding detailed spatial features may lead to incorrect word choices during decoding. In this paper, we propose a novel recurrent video encoding method and a novel visual spatial feature for the video captioning task. The recurrent encoding module encodes the video twice, using a predicted key frame to avoid the irrelevant temporal information that often occurs at the beginning and end of a video. The novel spatial features represent spatial information from different regions of a video and provide the decoder with more detailed information. Experiments on two benchmark datasets show the superior performance of the proposed method.
Pages: 818-826
Number of Pages: 9
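
The two-pass, key-frame-refocused encoding described in the abstract can be sketched in a few lines of PyTorch. This is a minimal illustrative sketch, not the authors' published architecture: the class name RefocusedVideoEncoder, the GRU backbone, the linear key_scorer head, and the fixed refocus window are all hypothetical choices introduced here.

import torch
import torch.nn as nn

class RefocusedVideoEncoder(nn.Module):
    """Illustrative "watch it twice" encoder (hypothetical sketch,
    not the paper's exact architecture).

    Pass 1: a GRU reads all frame features and scores each time step;
    the highest-scoring step is taken as the predicted key frame.
    Pass 2: the GRU re-reads only a window around that key frame,
    dropping the often-irrelevant beginning and end of the video.
    """

    def __init__(self, feat_dim=2048, hidden_dim=512, window=8):
        super().__init__()
        self.window = window
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.key_scorer = nn.Linear(hidden_dim, 1)  # per-step relevance score

    def forward(self, frames):                        # frames: (B, T, feat_dim)
        # Pass 1: encode the whole clip and score every time step.
        states, _ = self.gru(frames)                  # (B, T, hidden_dim)
        scores = self.key_scorer(states).squeeze(-1)  # (B, T)
        key_idx = scores.argmax(dim=1)                # predicted key frame per clip

        # Pass 2: re-encode a window centred on the predicted key frame.
        B, T, _ = frames.shape
        outputs = []
        for b in range(B):
            lo = max(int(key_idx[b]) - self.window, 0)
            hi = min(int(key_idx[b]) + self.window + 1, T)
            _, h = self.gru(frames[b:b+1, lo:hi])     # refocused second pass
            outputs.append(h[-1])                     # final hidden state (1, hidden_dim)
        return torch.cat(outputs, dim=0)              # (B, hidden_dim)

# Usage: a batch of 2 clips, 16 frames of 2048-d CNN features each.
enc = RefocusedVideoEncoder()
clip_code = enc(torch.randn(2, 16, 2048))
print(clip_code.shape)                                # torch.Size([2, 512])

Note that the argmax key-frame selection above is non-differentiable; consistent with the paper's "reinforcement learning" keyword, the key-frame predictor would be trained with a policy-gradient-style objective, whereas this sketch only shows the inference-time data flow.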