Spotting and Aggregating Salient Regions for Video Captioning

Cited by: 18
Authors:
Wang, Huiyun [1 ]
Xu, Youjiang [1 ]
Han, Yahong [1 ]
Affiliations:
[1] Tianjin Univ, Sch Comp Sci & Technol, Tianjin, Peoples R China
Source:
PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18) | 2018
Keywords:
Video Captioning; Salient Regions; Spatio-Temporal Representation
DOI:
10.1145/3240508.3240677
CLC (Chinese Library Classification): TP301 [Theory, Methods]
Discipline code: 081202
Abstract:
Towards an interpretable video captioning process, we aim to locate salient regions of video objects in step with the sequentially uttered words. This paper proposes a new framework that automatically spots salient regions in each video frame and simultaneously learns a discriminative spatio-temporal representation for video captioning. First, a Spot Module automatically learns a saliency value for each location, separating salient regions of the video content out as foreground and the rest as background via two operations, 'hard separation' and 'soft separation', respectively. Then, an Aggregate Module combines the foreground/background descriptors into a discriminative spatio-temporal representation through a trainable video VLAD process that learns the aggregation parameters. Finally, we utilize the attention mechanism to decode the spatio-temporal representations of different regions into video descriptions. Experiments on two benchmark datasets demonstrate that our method outperforms most state-of-the-art methods on the BLEU@4, METEOR, and CIDEr metrics for video captioning. Qualitative examples further show that our method successfully aligns the sequentially uttered words with salient regions of video objects.
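The record carries no code, but the abstract's pipeline (per-location saliency scoring, hard/soft foreground-background separation, trainable VLAD aggregation) is concrete enough to sketch. The PyTorch snippet below is a minimal, hypothetical reading of the Spot and Aggregate Modules; all class names, tensor shapes, the 0.5 threshold, and the cluster count are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the Spot + Aggregate modules described in the
# abstract. Names, shapes, and hyperparameters are assumptions, not the
# paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpotModule(nn.Module):
    """Predicts a saliency value per spatial location and splits frame
    features into foreground ('hard separation') and background
    ('soft separation')."""
    def __init__(self, dim):
        super().__init__()
        self.saliency = nn.Conv2d(dim, 1, kernel_size=1)  # 1 score per location

    def forward(self, feat):                    # feat: (B, C, H, W)
        s = torch.sigmoid(self.saliency(feat))  # (B, 1, H, W), values in [0, 1]
        foreground = feat * (s > 0.5).float()   # hard separation: binary mask
        background = feat * (1.0 - s)           # soft separation: weighted rest
        return foreground, background

class VLADAggregate(nn.Module):
    """NetVLAD-style trainable aggregation of local descriptors into a
    fixed-length spatio-temporal representation."""
    def __init__(self, dim, num_clusters=32):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_clusters, dim))
        self.assign = nn.Linear(dim, num_clusters)   # learned soft assignments

    def forward(self, x):                       # x: (B, N, C), N = T*H*W descriptors
        a = F.softmax(self.assign(x), dim=-1)   # (B, N, K) assignment weights
        resid = x.unsqueeze(2) - self.centers   # (B, N, K, C) residuals to centers
        vlad = (a.unsqueeze(-1) * resid).sum(dim=1)  # (B, K, C)
        vlad = F.normalize(vlad, dim=-1)        # intra-normalization per cluster
        return F.normalize(vlad.flatten(1), dim=-1)  # (B, K*C) final descriptor
```

In this reading, the foreground and background streams would each pass through their own VLADAggregate, and the attention-based decoder mentioned in the abstract would attend over the two resulting representations at every word-generation step.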
Pages: 1519-1526
Page count: 8
Related Papers
50 records in total
  • [21] Deep multimodal embedding for video captioning
    Lee, Jin Young
    MULTIMEDIA TOOLS AND APPLICATIONS, 2019, 78 : 31793 - 31805
  • [22] Traffic Scenario Understanding and Video Captioning via Guidance Attention Captioning Network
    Liu, Chunsheng
    Zhang, Xiao
    Chang, Faliang
    Li, Shuang
    Hao, Penghui
    Lu, Yansha
    Wang, Yinhai
    IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2024, 25 (05) : 3615 - 3627
  • [23] Quality Enhancement Based Video Captioning in Video Communication Systems
    Le, The Van
    Lee, Jin Young
    IEEE ACCESS, 2024, 12 : 40989 - 40999
  • [24] A Grey Relational Analysis based Evaluation Metric for Image Captioning and Video Captioning
    Ma, Miao
    Wang, Bolong
    PROCEEDINGS OF 2017 IEEE INTERNATIONAL CONFERENCE ON GREY SYSTEMS AND INTELLIGENT SERVICES (GSIS), 2017, : 76 - 81
  • [25] Learning Video-Text Aligned Representations for Video Captioning
    Shi, Yaya
    Xu, Haiyang
    Yuan, Chunfeng
    Li, Bing
    Hu, Weiming
    Zha, Zheng-Jun
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (02)
  • [26] Improving distinctiveness in video captioning with text-video similarity
    Velda, Vania
    Immanuel, Steve Andreas
    Hendria, Willy Fitra
    Jeong, Cheol
    IMAGE AND VISION COMPUTING, 2023, 136
  • [27] Multiple Videos Captioning Model for Video Storytelling
    Han, Seung-Ho
    Go, Bo-Won
    Choi, Ho-Jin
    2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING (BIGCOMP), 2019, : 355 - 358
  • [28] Video captioning with global and local text attention
    Peng, Yuqing
    Wang, Chenxi
    Pei, Yixin
    Li, Yingjun
    THE VISUAL COMPUTER, 2022, 38 : 4267 - 4278
  • [29] A NOVEL ATTRIBUTE SELECTION MECHANISM FOR VIDEO CAPTIONING
    Xiao, Huanhou
    Shi, Jinglun
    2019 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2019, : 619 - 623
  • [30] Convolutional Reconstruction-to-Sequence for Video Captioning
    Wu, Aming
    Han, Yahong
    Yang, Yi
    Hu, Qinghua
    Wu, Fei
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2020, 30 (11) : 4299 - 4308