Refocused Attention: Long Short-Term Rewards Guided Video Captioning

Times Cited: 2
Authors
Dong, Jiarong [1 ,2 ]
Gao, Ke [1 ]
Chen, Xiaokai [1 ,2 ]
Cao, Juan [1 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
Keywords
Video captioning; Hierarchical attention; Reinforcement learning; Reward;
DOI
10.1007/s11063-019-10030-y
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
The adaptive cooperation of the visual model and the language model is essential for video captioning. However, because end-to-end training provides no proper guidance at each time step, over-dependence on the language model often invalidates the attention-based visual model, a problem we call 'Attention Defocus' in this paper. Based on the key observation that the recognition precision of entity words reflects the effectiveness of the visual model, we propose a novel strategy, refocused attention, which optimizes the training and cooperation of the visual and language models by applying targeted guidance at the appropriate time steps. The strategy consists of short-term-reward guided local entity recognition and long-term-reward guided global relation understanding, neither of which requires any external training data. Moreover, a framework with hierarchical visual representations and hierarchical attention is established to fully exploit the strength of the proposed learning strategy. Extensive experiments demonstrate that the guidance strategy together with the optimized structure outperforms state-of-the-art video captioning methods, with relative improvements of 7.7% in BLEU-4 and 5.0% in CIDEr-D on the MSVD dataset, even without multi-modal features.
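The abstract's core idea, mixing a short-term (per-step, entity-level) reward with a long-term (sentence-level) reward to guide caption training, can be sketched as follows. This is a hypothetical illustration, not the authors' code: the entity reward, the toy unigram-overlap sentence reward (a stand-in for CIDEr-D/BLEU), and the mixing weight `lam` are all assumptions introduced for clarity.

```python
# Hedged sketch of long/short-term reward mixing for caption training.
# All function names and the reward definitions are illustrative
# assumptions, not the paper's actual formulation.

def short_term_reward(word, entity_vocab):
    """Local reward: 1.0 when the generated word is an entity word."""
    return 1.0 if word in entity_vocab else 0.0

def long_term_reward(caption, reference):
    """Toy global reward: unigram overlap with the reference caption
    (a simple stand-in for sentence-level metrics like CIDEr-D)."""
    if not caption:
        return 0.0
    ref = set(reference)
    return sum(w in ref for w in caption) / len(caption)

def per_step_returns(caption, reference, entity_vocab, lam=0.5):
    """Each time step receives its own entity reward plus the shared
    sentence-level reward, mixed by lam. In REINFORCE-style training
    these returns would weight the per-step log-probability gradients."""
    g = long_term_reward(caption, reference)
    return [lam * short_term_reward(w, entity_vocab) + (1.0 - lam) * g
            for w in caption]

caption = ["a", "dog", "runs", "on", "grass"]
reference = ["the", "dog", "is", "running", "on", "the", "grass"]
entities = {"dog", "grass"}
print(per_step_returns(caption, reference, entities, lam=0.5))
# Entity words ("dog", "grass") get a higher return than function words,
# which is the "refocusing" effect the abstract describes.
```

In the paper's actual setup the rewards guide an attention-based encoder–decoder via reinforcement learning; this sketch only shows how a per-step signal and a sentence-level signal can be combined into one return per generated word.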
Pages: 935-948
Page count: 14