Spotting and Aggregating Salient Regions for Video Captioning

Cited by: 18
Authors
Wang, Huiyun [1]
Xu, Youjiang [1]
Han, Yahong [1]
Affiliations
[1] Tianjin Univ, Sch Comp Sci & Technol, Tianjin, Peoples R China
Source
PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18) | 2018
Keywords
Video Captioning; Salient Regions; Spatio-Temporal Representation
DOI
10.1145/3240508.3240677
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Towards an interpretable video captioning process, we aim to locate the salient regions of video objects as the words of the caption are sequentially uttered. This paper proposes a new framework that automatically spots salient regions in each video frame and simultaneously learns a discriminative spatio-temporal representation for video captioning. First, in a Spot Module, we automatically learn the saliency value of each location and separate the salient regions from the video content as the foreground, and the rest as the background, by two operations, 'hard separation' and 'soft separation', respectively. Then, in an Aggregate Module, to aggregate the foreground/background descriptors into a discriminative spatio-temporal representation, we devise a trainable video VLAD process to learn the aggregation parameters. Finally, we use an attention mechanism to decode the spatio-temporal representations of different regions into video descriptions. Experiments on two benchmark datasets demonstrate that our method outperforms most state-of-the-art methods in terms of the BLEU@4, METEOR, and CIDEr metrics for video captioning. Examples also demonstrate that our method can successfully associate the sequentially uttered words with salient regions of video objects.
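The abstract gives no formulas, so the following is only a minimal sketch of how the Spot and Aggregate steps might fit together, not the authors' implementation. It assumes that 'hard separation' thresholds a per-location saliency value into a binary foreground mask, that 'soft separation' uses the saliency value itself as a foreground weight, and that the video VLAD step resembles a soft-assignment VLAD over weighted descriptors. The names `separate`, `vlad`, `threshold`, and `alpha`, the fixed cluster centers, and the toy inputs are all hypothetical; in the paper the saliency values and aggregation parameters are learned.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def separate(logits, mode="soft", threshold=0.5):
    """Return (foreground, background) weights, one per location.

    Assumed reading: 'hard' gives a binary mask, 'soft' keeps the
    saliency value as a weight. Background weight is 1 - foreground.
    """
    saliency = [sigmoid(z) for z in logits]
    if mode == "hard":
        fg = [1.0 if s >= threshold else 0.0 for s in saliency]
    else:
        fg = saliency
    bg = [1.0 - w for w in fg]
    return fg, bg

def vlad(descriptors, weights, centers, alpha=1.0):
    """Weighted soft-assignment VLAD, flattened and L2-normalised.

    Each descriptor adds its residual to every center, scaled by its
    region weight and by a softmax over negative squared distances.
    """
    dim = len(centers[0])
    agg = [[0.0] * dim for _ in centers]
    for x, w in zip(descriptors, weights):
        neg = [-alpha * sum((xi - ci) ** 2 for xi, ci in zip(x, c))
               for c in centers]
        m = max(neg)  # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in neg]
        total = sum(exps)
        for k, c in enumerate(centers):
            a = exps[k] / total
            for d in range(dim):
                agg[k][d] += w * a * (x[d] - c[d])
    flat = [v for row in agg for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat

# Toy example: four locations with 2-D descriptors (hypothetical values).
descriptors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
logits = [3.0, 2.0, -3.0, -2.0]      # stand-in for learned saliency scores
centers = [[1.0, 0.0], [0.0, 1.0]]   # stand-in for learned cluster centers

fg_hard, bg_hard = separate(logits, mode="hard")
fg_soft, bg_soft = separate(logits, mode="soft")
fg_repr = vlad(descriptors, fg_hard, centers)
bg_repr = vlad(descriptors, bg_hard, centers)
```

Under these assumptions, `fg_repr` and `bg_repr` stand in for the foreground and background representations that an attention-based decoder would then weigh when generating each word.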
Pages: 1519-1526
Page count: 8