Reconstruction Network for Video Captioning

Cited by: 237
Authors
Wang, Bairui [2 ]
Ma, Lin [1 ]
Zhang, Wei [2 ]
Liu, Wei [1 ]
Affiliations
[1] Tencent AI Lab, Bellevue, WA 98004 USA
[2] Shandong Univ, Sch Control Sci & Engn, Jinan, Shandong, Peoples R China
Source
2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2018
Keywords
DOI
10.1109/CVPR.2018.00795
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, the problem of describing the visual contents of a video sequence with natural language is addressed. Unlike previous video captioning work that mainly exploits the cues of video contents to produce a language description, we propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture, which leverages both the forward (video to sentence) and backward (sentence to video) flows for video captioning. Specifically, the encoder-decoder makes use of the forward flow to produce the sentence description based on the encoded video semantic features. Two types of reconstructors are customized to employ the backward flow and reproduce the video features based on the hidden state sequence generated by the decoder. The generation loss yielded by the encoder-decoder and the reconstruction loss introduced by the reconstructor are jointly used to train the proposed RecNet in an end-to-end fashion. Experimental results on benchmark datasets demonstrate that the proposed reconstructor can boost the encoder-decoder models and lead to significant gains in video caption accuracy.
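The joint training objective described above (generation loss plus reconstruction loss) can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name `joint_loss`, the Euclidean form of the reconstruction term, and the trade-off weight `lam` are all assumptions made for exposition.

```python
def joint_loss(gen_loss, video_feats, recon_feats, lam=0.2):
    """Sketch of RecNet-style joint training objective.

    gen_loss     -- scalar caption-generation loss from the encoder-decoder
    video_feats  -- per-frame feature vectors from the encoder (list of lists)
    recon_feats  -- feature vectors reproduced by the reconstructor
    lam          -- hypothetical weight balancing the two terms
    """
    # Euclidean (sum-of-squares) distance between original and reconstructed
    # frame features, averaged over frames.
    rec_loss = sum(
        sum((v - r) ** 2 for v, r in zip(vf, rf))
        for vf, rf in zip(video_feats, recon_feats)
    ) / len(video_feats)
    # Total loss drawn into end-to-end training.
    return gen_loss + lam * rec_loss
```

For example, with 5 frames of 4-dimensional features that differ by 1.0 in every dimension, the reconstruction term contributes 4.0 per frame on average, so `joint_loss(2.0, feats, recon, lam=0.5)` yields 4.0.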
Pages: 7622-7631
Page count: 10