Spotting and Aggregating Salient Regions for Video Captioning

Cited by: 18

Authors:

Wang, Huiyun [1]

Xu, Youjiang [1]

Han, Yahong [1]

Affiliations:

[1] Tianjin Univ, Sch Comp Sci & Technol, Tianjin, Peoples R China

Source:

PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18) | 2018

Keywords:

Video Captioning; Salient Regions; Spatio-Temporal Representation

DOI:

10.1145/3240508.3240677

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods]

Discipline classification code:

081202

Abstract:

Towards an interpretable video captioning process, we aim to locate the salient regions of video objects in step with the sequentially uttered words. This paper proposes a new framework that automatically spots salient regions in each video frame and simultaneously learns a discriminative spatio-temporal representation for video captioning. First, in a Spot Module, we automatically learn the saliency value of each location and separate the salient regions of the video content as foreground and the rest as background, using two operations, 'hard separation' and 'soft separation', respectively. Then, in an Aggregate Module, we devise a trainable video VLAD process that learns the aggregation parameters and aggregates the foreground/background descriptors into a discriminative spatio-temporal representation. Finally, we use an attention mechanism to decode the spatio-temporal representations of different regions into video descriptions. Experiments on two benchmark datasets demonstrate that our method outperforms most state-of-the-art methods in terms of BLEU@4, METEOR and CIDEr on the video captioning task. Examples also demonstrate that our method can successfully utter words that correspond, in sequence, to salient regions of video objects.
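
The abstract gives no equations, so the following is only a rough sketch of how a saliency-based foreground/background split followed by a VLAD-style aggregation might look. It is not the authors' implementation. Everything specific here is an assumption: the sigmoid saliency score, the 0.5 threshold used for the hard split, the use of the raw saliency as the soft weight, the NetVLAD-like soft assignment, and all names (spot, vlad_aggregate, w_sal, centers, assign_w, assign_b). Parameters are random rather than trained, and the attention decoder is left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spot(descriptors, w_sal, threshold=0.5):
    # Assumed Spot step: one sigmoid saliency value per location.
    saliency = 1.0 / (1.0 + np.exp(-(descriptors @ w_sal)))  # shape (N,)
    hard_fg = (saliency >= threshold).astype(float)  # assumed hard split: 0/1 mask
    soft_fg = saliency                               # assumed soft split: weights in (0, 1)
    return saliency, hard_fg, soft_fg

def vlad_aggregate(descriptors, weights, centers, assign_w, assign_b):
    # NetVLAD-style aggregation (an assumption, not the paper's exact form):
    # soft-assign each descriptor to K centers, scale by its region weight,
    # and sum the weighted residuals per center.
    a = softmax(descriptors @ assign_w + assign_b, axis=1)      # (N, K)
    a = a * weights[:, None]                                    # (N, K)
    residuals = descriptors[:, None, :] - centers[None, :, :]   # (N, K, D)
    vlad = (a[:, :, None] * residuals).sum(axis=0)              # (K, D)
    # L2-normalise each center's row; all-zero rows stay zero.
    norms = np.linalg.norm(vlad, axis=1, keepdims=True)
    return vlad / np.maximum(norms, 1e-12)

# Toy input: 2 frames of 3x3 locations, 8-dim descriptors, 4 centers.
rng = np.random.default_rng(0)
T, H, W, D, K = 2, 3, 3, 8, 4
desc = rng.normal(size=(T * H * W, D))
w_sal = rng.normal(size=D)
centers = rng.normal(size=(K, D))
assign_w = rng.normal(size=(D, K))
assign_b = np.zeros(K)

saliency, hard_fg, soft_fg = spot(desc, w_sal)
fg_vlad = vlad_aggregate(desc, soft_fg, centers, assign_w, assign_b)
bg_vlad = vlad_aggregate(desc, 1.0 - soft_fg, centers, assign_w, assign_b)
fg_vlad_hard = vlad_aggregate(desc, hard_fg, centers, assign_w, assign_b)
print(fg_vlad.shape, bg_vlad.shape, fg_vlad_hard.shape)
```

In this sketch the foreground and background representations share the same centers and assignment parameters and differ only in the per-location weights. Whether the paper shares or separates these parameters cannot be determined from the abstract.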


Pages: 1519-1526

Page count: 8
