Spotting and Aggregating Salient Regions for Video Captioning

Cited by: 18
Authors
Wang, Huiyun [1 ]
Xu, Youjiang [1 ]
Han, Yahong [1 ]
Affiliations
[1] Tianjin Univ, Sch Comp Sci & Technol, Tianjin, Peoples R China
Source
PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18) | 2018
Keywords
Video Captioning; Salient Regions; Spatio-Temporal Representation;
DOI
10.1145/3240508.3240677
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Subject classification code
081202;
Abstract
Towards an interpretable video captioning process, we aim to locate salient regions of video objects in step with the sequentially uttered words. This paper proposes a new framework that automatically spots salient regions in each video frame and simultaneously learns a discriminative spatio-temporal representation for video captioning. First, in a Spot Module, we automatically learn the saliency value of each location to separate salient regions of the video content as the foreground and the rest as the background, via two operations of 'hard separation' and 'soft separation', respectively. Then, in an Aggregate Module, to aggregate the foreground/background descriptors into a discriminative spatio-temporal representation, we devise a trainable video VLAD process to learn the aggregation parameters. Finally, we utilize the attention mechanism to decode the spatio-temporal representations of different regions into video descriptions. Experiments on two benchmark datasets demonstrate that our method outperforms most state-of-the-art methods in terms of the BLEU@4, METEOR, and CIDEr metrics for video captioning. Qualitative examples also show that our method successfully associates the uttered words with sequentially salient regions of video objects.
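Since the record includes only the abstract, the following is a minimal PyTorch sketch of the kind of trainable VLAD ("video VLAD") aggregation the Aggregate Module describes: region descriptors are softly assigned to learnable cluster centres and their residuals are pooled into one fixed-length spatio-temporal representation. The class name NetVLADAggregator, the cluster count, the feature dimension, and the linear soft-assignment are illustrative assumptions, not the authors' exact design.

```python
# Sketch of a NetVLAD-style trainable aggregation layer (assumed parameterization,
# not the paper's exact Aggregate Module).
import torch
import torch.nn as nn
import torch.nn.functional as F


class NetVLADAggregator(nn.Module):
    def __init__(self, feature_dim: int = 512, num_clusters: int = 32):
        super().__init__()
        # Learnable cluster centres and the soft-assignment projection.
        self.centroids = nn.Parameter(torch.randn(num_clusters, feature_dim) * 0.01)
        self.assign = nn.Linear(feature_dim, num_clusters)

    def forward(self, descriptors: torch.Tensor) -> torch.Tensor:
        # descriptors: (batch, num_regions, feature_dim), e.g. foreground or
        # background region descriptors collected over the frames of a clip.
        soft_assign = F.softmax(self.assign(descriptors), dim=-1)          # (B, N, K)
        # Residual of every descriptor to every centroid: (B, N, K, D).
        residuals = descriptors.unsqueeze(2) - self.centroids.unsqueeze(0).unsqueeze(0)
        # Weight residuals by the soft assignment and sum over regions: (B, K, D).
        vlad = (soft_assign.unsqueeze(-1) * residuals).sum(dim=1)
        # Intra-normalise per cluster, then flatten and L2-normalise.
        vlad = F.normalize(vlad, p=2, dim=-1)
        return F.normalize(vlad.flatten(1), p=2, dim=-1)                    # (B, K*D)


if __name__ == "__main__":
    feats = torch.randn(2, 36, 512)          # 2 clips, 36 region descriptors each
    print(NetVLADAggregator()(feats).shape)  # torch.Size([2, 16384])
```

In a captioning pipeline of this kind, the foreground and background descriptors would be aggregated separately and the resulting vectors passed to the attention-based decoder.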
Pages: 1519 - 1526
Page count: 8
Related papers
50 records in total
  • [31] Exploiting the local temporal information for video captioning
    Wei, Ran
    Mi, Li
    Hu, Yaosi
    Chen, Zhenzhong
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2020, 67 (67)
  • [32] Global semantic enhancement network for video captioning
    Luo, Xuemei
    Luo, Xiaotong
    Wang, Di
    Liu, Jinhui
    Wan, Bo
    Zhao, Lin
    PATTERN RECOGNITION, 2024, 145
  • [33] Image and Video Captioning with Augmented Neural Architectures
    Shetty, Rakshith
    Tavakoli, Hamed R.
    Laaksonen, Jorma
    IEEE MULTIMEDIA, 2018, 25 (02) : 34 - 46
  • [34] UAT: Universal Attention Transformer for Video Captioning
    Im, Heeju
    Choi, Yong-Suk
    SENSORS, 2022, 22 (13)
  • [35] Chained semantic generation network for video captioning
    Mao, L.
    Gao, H.
    Yang, D.
    Zhang, R.
    Guangxue Jingmi Gongcheng/Optics and Precision Engineering, 2022, 30 (24): 3198 - 3209
  • [36] MULTIMODAL SEMANTIC ATTENTION NETWORK FOR VIDEO CAPTIONING
    Sun, Liang
    Li, Bing
    Yuan, Chunfeng
    Zha, Zhengjun
    Hu, Weiming
    2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 1300 - 1305
  • [37] SibNet: Sibling Convolutional Encoder for Video Captioning
    Liu, Sheng
    Ren, Zhou
    Yuan, Junsong
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2021, 43 (09) : 3259 - 3272
  • [38] EvCap: Element-Aware Video Captioning
    Liu, Sheng
    Li, Annan
    Zhao, Yuwei
    Wang, Jiahao
    Wang, Yunhong
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (10) : 9718 - 9731
  • [39] MULTISTREAM HIERARCHICAL BOUNDARY NETWORK FOR VIDEO CAPTIONING
    Nguyen, Thang
    Sah, Shagan
    Ptucha, Raymond
    2017 IEEE WESTERN NEW YORK IMAGE AND SIGNAL PROCESSING WORKSHOP (WNYISPW), 2017
  • [40] Bidirectional transformer with knowledge graph for video captioning
    Zhong, Maosheng
    Chen, Youde
    Zhang, Hao
    Xiong, Hao
    Wang, Zhixiang
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 83 (20) : 58309 - 58328