Global semantic enhancement network for video captioning

Cited by: 10
Authors
Luo, Xuemei [1 ,3 ]
Luo, Xiaotong [1 ]
Wang, Di [1 ]
Liu, Jinhui [1 ]
Wan, Bo [1 ]
Zhao, Lin [2 ,3 ]
Affiliations
[1] Xidian Univ, Key Lab Smart Human Comp Interact & Wearable Techn, Xian 710071, Peoples R China
[2] Nanjing Univ Sci & Technol, Jiangsu Key Lab Image & Video Understanding Social, Nanjing 210094, Peoples R China
[3] Xidian Univ, Key Lab Integrated Serv Networks, Xian 710071, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video captioning; Feature aggregation; Semantic enhancement;
DOI
10.1016/j.patcog.2023.109906
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Video captioning aims to describe the content of a video briefly, accurately, and fluently in natural language, and is a hot research topic in multimedia processing. As a bridge between video and natural language, video captioning is a challenging task that requires a deep understanding of video content and effective use of diverse multimodal video information. Existing video captioning methods usually ignore the relative importance of different frames when aggregating frame-level video features and neglect the global semantic correlations between videos and texts when learning visual representations, which makes the learned representations less effective. To address these problems, we propose a novel framework, the Global Semantic Enhancement Network (GSEN), to generate high-quality captions for videos. Specifically, a feature aggregation module based on a lightweight attention mechanism is designed to aggregate frame-level video features, highlighting the features of informative frames in the video representation. In addition, a global semantic enhancement module is proposed to strengthen semantic correlations between video and language representations so as to generate semantically more accurate captions. Extensive qualitative and quantitative experiments on two public benchmark datasets, MSVD and MSR-VTT, demonstrate that the proposed GSEN outperforms state-of-the-art methods.
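To make the abstract's two modules concrete, below is a minimal sketch, assuming a PyTorch setting: attention-weighted pooling of frame-level features (so a softmax over the time axis lets informative frames dominate the pooled video representation) and a generic contrastive video-text alignment loss as one common way to enforce global semantic correlation. The class and function names, hidden sizes, and the InfoNCE-style loss are illustrative assumptions, not the paper's actual GSEN implementation.

```python
# Hypothetical sketch of the two ideas in the abstract; not GSEN's real code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFramePooling(nn.Module):
    """Lightweight attention pooling over frame-level video features."""

    def __init__(self, feat_dim: int, hidden_dim: int = 256):
        super().__init__()
        # A small scoring MLP keeps the attention mechanism lightweight.
        self.score = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, feat_dim)
        weights = torch.softmax(self.score(frames), dim=1)  # (B, T, 1)
        # Weighted sum emphasizes informative frames in the video feature.
        return (weights * frames).sum(dim=1)  # (B, feat_dim)

def video_text_contrastive_loss(v: torch.Tensor, t: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """One generic way to enforce global video-text semantic correlation:
    pull matched video/caption embeddings together, push mismatched apart."""
    v = F.normalize(v, dim=-1)  # (B, D) video embeddings
    t = F.normalize(t, dim=-1)  # (B, D) caption embeddings
    logits = v @ t.T / temperature
    labels = torch.arange(v.size(0), device=v.device)
    return F.cross_entropy(logits, labels)

# Example: pool 32 frames of 2048-d features for a batch of 4 videos,
# then align the pooled video features with 2048-d caption embeddings.
pool = AttentionFramePooling(feat_dim=2048)
video_feat = pool(torch.randn(4, 32, 2048))  # -> (4, 2048)
loss = video_text_contrastive_loss(video_feat, torch.randn(4, 2048))
```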
Pages: 11