Video captioning with global and local text attention

Cited by: 1
Authors
Peng, Yuqing [1 ]
Wang, Chenxi [1 ]
Pei, Yixin [1 ]
Li, Yingjun [1 ]
Affiliations
[1] Hebei Univ Technol, Sch Artificial Intelligence, Tianjin 300401, Peoples R China
Keywords
Video captioning; Global control; Local strengthening; Bidirectional; Aggregation;
DOI
10.1007/s00371-021-02294-0
CLC number
TP31 [Computer software];
Subject classification codes
081202 ; 0835 ;
Abstract
The task of video captioning is to generate a description that matches the video content, which places stringent requirements on both fine-grained video feature extraction and the language processing of the caption text. This paper proposes a new method that applies global control of the text and local strengthening during training, so that the context can be consulted while the text is generated. In addition, more attention is given to important words in the text, such as nouns and predicate verbs, which greatly improves the recognition of objects and yields more accurate prediction of actions in the video. Moreover, the authors adopt 2D and 3D multimodal feature extraction for video features, and better results are achieved through the fine-grained feature capture of global attention and the fusion of bidirectional time flow. The method obtains good results on both the MSR-VTT and MSVD datasets.
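The local-strengthening idea described in the abstract, giving extra attention to nouns and predicate verbs, can be sketched as a POS-based reweighting of token scores before normalization. This is a minimal illustration only; the function name, tag set, and boost factor below are assumptions, not the paper's actual implementation:

```python
import math

def local_strengthening_weights(tokens, pos_tags, boost=2.0):
    """Upweight nouns and verbs, then softmax-normalize the scores.

    tokens   -- list of caption words
    pos_tags -- parallel list of POS tags (e.g. "NOUN", "VERB", "DET")
    boost    -- raw score given to content words (hypothetical value)
    """
    assert len(tokens) == len(pos_tags)
    # Content words (nouns, predicate verbs) get a larger raw score.
    raw = [boost if tag in ("NOUN", "VERB") else 1.0 for tag in pos_tags]
    # Softmax normalization turns raw scores into attention weights.
    total = sum(math.exp(s) for s in raw)
    return [math.exp(s) / total for s in raw]

weights = local_strengthening_weights(
    ["a", "man", "rides", "a", "horse"],
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
)
```

Here "man", "rides", and "horse" receive higher normalized weight than the determiners, which mirrors the paper's stated goal of sharpening object recognition and action prediction.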
Pages: 4267-4278
Number of pages: 12
Related papers
50 records in total
  • [31] Multimodal-enhanced hierarchical attention network for video captioning
    Zhong, Maosheng
    Chen, Youde
    Zhang, Hao
    Xiong, Hao
    Wang, Zhixiang
    MULTIMEDIA SYSTEMS, 2023, 29 (05) : 2469 - 2482
  • [32] Hierarchical attention-based multimodal fusion for video captioning
    Wu, Chunlei
    Wei, Yiwei
    Chu, Xiaoliang
    Weichen, Sun
    Su, Fei
    Wang, Leiquan
    NEUROCOMPUTING, 2018, 315 : 362 - 370
  • [33] Attention-based Densely Connected LSTM for Video Captioning
    Zhu, Yongqing
    Jiang, Shuqiang
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019, : 802 - 810
  • [35] Multimodal architecture for video captioning with memory networks and an attention mechanism
    Li, Wei
    Guo, Dashan
    Fang, Xiangzhong
    PATTERN RECOGNITION LETTERS, 2018, 105 : 23 - 29
  • [36] Syntax-Guided Hierarchical Attention Network for Video Captioning
    Deng, Jincan
    Li, Liang
    Zhang, Beichen
    Wang, Shuhui
    Zha, Zhengjun
    Huang, Qingming
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (02) : 880 - 892
  • [37] Video Captioning using Hierarchical Multi-Attention Model
    Xiao, Huanhou
    Shi, Jinglun
    ICAIP 2018: 2018 THE 2ND INTERNATIONAL CONFERENCE ON ADVANCES IN IMAGE PROCESSING, 2018, : 96 - 101
  • [38] Stacked Multimodal Attention Network for Context-Aware Video Captioning
    Zheng, Yi
    Zhang, Yuejie
    Feng, Rui
    Zhang, Tao
    Fan, Weiguo
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (01) : 31 - 42
  • [39] Relation-aware attention for video captioning via graph learning
    Tu, Yunbin
    Zhou, Chang
    Guo, Junjun
    Li, Huafeng
    Gao, Shengxiang
    Yu, Zhengtao
    PATTERN RECOGNITION, 2023, 136
  • [40] Video Captioning by Learning from Global Sentence and Looking Ahead
    Niu, Tian-Zi
    Chen, Zhen-Duo
    Luo, Xin
    Zhang, Peng-Fei
    Huang, Zi
    Xu, Xin-Shun
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (05)