Refocused Attention: Long Short-Term Rewards Guided Video Captioning

Cited by: 1
Authors
Dong, Jiarong [1 ,2 ]
Gao, Ke [1 ]
Chen, Xiaokai [1 ,2 ]
Cao, Juan [1 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Beijing, Peoples R China
Keywords
Video captioning; Hierarchical attention; Reinforcement learning; Reward
DOI
10.1007/s11063-019-10030-y
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Adaptive cooperation between the visual model and the language model is essential for video captioning. However, because end-to-end training provides no proper guidance at each time step, over-dependence on the language model often invalidates the attention-based visual model, a problem we call 'Attention Defocus' in this paper. Building on the observation that the recognition precision of entity words reflects the effectiveness of the visual model, we propose a novel strategy, refocused attention, which optimizes the training and cooperation of the visual and language models by applying targeted guidance at the appropriate time steps. The strategy consists of short-term-reward guided local entity recognition and long-term-reward guided global relation understanding, neither of which requires external training data. Moreover, we establish a framework with hierarchical visual representations and hierarchical attention to fully exploit the strength of the proposed learning strategy. Extensive experiments demonstrate that this guidance strategy, together with the optimized structure, outperforms state-of-the-art video captioning methods, with relative improvements of 7.7% in BLEU-4 and 5.0% in CIDEr-D on the MSVD dataset, even without multi-modal features.
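The abstract's combination of a per-step (short-term) entity reward with a sequence-level (long-term) reward can be illustrated with a minimal, hypothetical sketch. The scoring functions below are toy stand-ins, not the authors' implementation: the entity reward simply checks membership in a ground-truth entity set, and the global reward is a unigram-overlap placeholder for a sentence-level metric such as CIDEr-D.

```python
def short_term_reward(word, reference_entities):
    """Toy local reward: +1 when a generated word matches a
    ground-truth entity, else 0 (hypothetical scoring rule)."""
    return 1.0 if word in reference_entities else 0.0

def long_term_reward(sentence, reference):
    """Toy global reward: unigram overlap with the reference caption,
    standing in for a sentence-level metric like CIDEr-D."""
    if not sentence:
        return 0.0
    ref = set(reference)
    return sum(1.0 for w in sentence if w in ref) / len(sentence)

def combined_returns(sentence, reference, reference_entities, lam=0.5):
    """Per-step return = local entity reward + lam * global reward,
    as would be fed to a REINFORCE-style policy-gradient update."""
    g = long_term_reward(sentence, reference)
    return [short_term_reward(w, reference_entities) + lam * g
            for w in sentence]

caption = ["a", "man", "rides", "a", "horse"]
reference = ["a", "man", "is", "riding", "a", "horse"]
entities = {"man", "horse"}
returns = combined_returns(caption, reference, entities)
```

Here the entity words "man" and "horse" receive a higher per-step return than function words, which is the mechanism the paper uses to keep the attention-based visual model from being overridden by the language model.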
Pages: 935-948 (14 pages)