Spotting and Aggregating Salient Regions for Video Captioning

Cited: 18
|
Authors
Wang, Huiyun [1 ]
Xu, Youjiang [1 ]
Han, Yahong [1 ]
Affiliations
[1] Tianjin Univ, Sch Comp Sci & Technol, Tianjin, Peoples R China
Source
PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18) | 2018
Keywords
Video Captioning; Salient Regions; Spatio-Temporal Representation
DOI
10.1145/3240508.3240677
Chinese Library Classification
TP301 [Theory, Methods]
Discipline Code
081202
Abstract
Towards an interpretable video captioning process, we aim to locate the salient regions of video objects in step with the sequentially uttered words. This paper proposes a new framework that automatically spots salient regions in each video frame and simultaneously learns a discriminative spatio-temporal representation for video captioning. First, in a Spot Module, we automatically learn a saliency value for each location and separate the salient regions of the video content into the foreground and the rest into the background, via two operations of 'hard separation' and 'soft separation', respectively. Then, in an Aggregate Module, to aggregate the foreground/background descriptors into a discriminative spatio-temporal representation, we devise a trainable video VLAD process that learns the aggregation parameters. Finally, we utilize an attention mechanism to decode the spatio-temporal representations of different regions into video descriptions. Experiments on two benchmark datasets demonstrate that our method outperforms most state-of-the-art methods on the BLEU@4, METEOR, and CIDEr metrics for video captioning. Qualitative examples further show that our method can sequentially associate the uttered words with the salient regions of video objects.
Pages: 1519 - 1526 (8 pages)
Related Papers
50 records in total
  • [1] Catching the Temporal Regions-of-Interest for Video Captioning
    Yang, Ziwei
    Han, Yahong
    Wang, Zheng
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017 : 146 - 153
  • [2] Sequence in sequence for video captioning
    Wang, Huiyun
    Gao, Chongyang
    Han, Yahong
    PATTERN RECOGNITION LETTERS, 2020, 130 : 327 - 334
  • [3] Video captioning – a survey
    Vaishnavi, J.
    Narmatha, V.
    MULTIMEDIA TOOLS AND APPLICATIONS, 2025, 84 (2) : 947 - 978
  • [4] Video Captioning based on Image Captioning as Subsidiary Content
    Vaishnavi, J.
    Narmatha, V.
    2022 SECOND INTERNATIONAL CONFERENCE ON ADVANCES IN ELECTRICAL, COMPUTING, COMMUNICATION AND SUSTAINABLE TECHNOLOGIES (ICAECT), 2022
  • [5] Rethink video retrieval representation for video captioning
    Tian, Mingkai
    Li, Guorong
    Qi, Yuankai
    Wang, Shuhui
    Sheng, Quan Z.
    Huang, Qingming
    PATTERN RECOGNITION, 2024, 156
  • [6] A Review Of Video Captioning Methods
    Mahajan, Dewarthi
    Bhosale, Sakshi
    Nighot, Yash
    Tayal, Madhuri
    INTERNATIONAL JOURNAL OF NEXT-GENERATION COMPUTING, 2021, 12 (05) : 708 - 715
  • [7] Multirate Multimodal Video Captioning
    Yang, Ziwei
    Xu, Youjiang
    Wang, Huiyun
    Wang, Bo
    Han, Yahong
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017 : 1877 - 1882
  • [8] Video Captioning by Adversarial LSTM
    Yang, Yang
    Zhou, Jie
    Ai, Jiangbo
    Bin, Yi
    Hanjalic, Alan
    Shen, Heng Tao
    Ji, Yanli
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2018, 27 (11) : 5600 - 5611
  • [9] Video Captioning with Semantic Guiding
    Yuan, Jin
    Tian, Chunna
    Zhang, Xiangnan
    Ding, Yuxuan
    Wei, Wei
    2018 IEEE FOURTH INTERNATIONAL CONFERENCE ON MULTIMEDIA BIG DATA (BIGMM), 2018
  • [10] Watch It Twice: Video Captioning with a Refocused Video Encoder
    Shi, Xiangxi
    Cai, Jianfei
    Joty, Shafiq
    Gu, Jiuxiang
    PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019 : 818 - 826