Spotting and Aggregating Salient Regions for Video Captioning

Cited by: 18
Authors:
Wang, Huiyun [1 ]
Xu, Youjiang [1 ]
Han, Yahong [1 ]
Affiliations:
[1] Tianjin Univ, Sch Comp Sci & Technol, Tianjin, Peoples R China
Source:
PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18) | 2018
Keywords:
Video Captioning; Salient Regions; Spatio-Temporal Representation
DOI:
10.1145/3240508.3240677
CLC (Chinese Library Classification): TP301 [Theory, Methods]
Discipline code: 081202
Abstract:
Towards an interpretable video captioning process, we aim to locate salient regions of video objects in step with the sequentially uttered words. This paper proposes a new framework that automatically spots salient regions in each video frame and simultaneously learns a discriminative spatio-temporal representation for video captioning. First, a Spot Module automatically learns a saliency value for each location, separating salient regions of the video content out as foreground and the rest as background via two operations, 'hard separation' and 'soft separation', respectively. Then, an Aggregate Module combines the foreground/background descriptors into a discriminative spatio-temporal representation through a trainable video VLAD process that learns the aggregation parameters. Finally, we utilize the attention mechanism to decode the spatio-temporal representations of different regions into video descriptions. Experiments on two benchmark datasets demonstrate that our method outperforms most state-of-the-art methods on the BLEU@4, METEOR, and CIDEr metrics for video captioning. Qualitative examples further show that our method successfully aligns the sequentially uttered words with salient regions of video objects.
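The record carries no code, but the abstract's pipeline (per-location saliency scoring, hard/soft foreground-background separation, trainable VLAD aggregation) is concrete enough to sketch. The PyTorch snippet below is a minimal, hypothetical reading of the Spot and Aggregate Modules; all class names, tensor shapes, the 0.5 threshold, and the cluster count are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the Spot + Aggregate modules described in the
# abstract. Names, shapes, and hyperparameters are assumptions, not the
# paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpotModule(nn.Module):
    """Predicts a saliency value per spatial location and splits frame
    features into foreground ('hard separation') and background
    ('soft separation')."""
    def __init__(self, dim):
        super().__init__()
        self.saliency = nn.Conv2d(dim, 1, kernel_size=1)  # 1 score per location

    def forward(self, feat):                    # feat: (B, C, H, W)
        s = torch.sigmoid(self.saliency(feat))  # (B, 1, H, W), values in [0, 1]
        foreground = feat * (s > 0.5).float()   # hard separation: binary mask
        background = feat * (1.0 - s)           # soft separation: weighted rest
        return foreground, background

class VLADAggregate(nn.Module):
    """NetVLAD-style trainable aggregation of local descriptors into a
    fixed-length spatio-temporal representation."""
    def __init__(self, dim, num_clusters=32):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_clusters, dim))
        self.assign = nn.Linear(dim, num_clusters)   # learned soft assignments

    def forward(self, x):                       # x: (B, N, C), N = T*H*W descriptors
        a = F.softmax(self.assign(x), dim=-1)   # (B, N, K) assignment weights
        resid = x.unsqueeze(2) - self.centers   # (B, N, K, C) residuals to centers
        vlad = (a.unsqueeze(-1) * resid).sum(dim=1)  # (B, K, C)
        vlad = F.normalize(vlad, dim=-1)        # intra-normalization per cluster
        return F.normalize(vlad.flatten(1), dim=-1)  # (B, K*C) final descriptor
```

In this reading, the foreground and background streams would each pass through their own VLADAggregate, and the attention-based decoder mentioned in the abstract would attend over the two resulting representations at every word-generation step.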
Pages: 1519-1526
Page count: 8
Related Papers
50 records in total
  • [21] Deep multimodal embedding for video captioning
    Lee, Jin Young
    MULTIMEDIA TOOLS AND APPLICATIONS, 2019, 78 : 31793 - 31805
  • [22] Traffic Scenario Understanding and Video Captioning via Guidance Attention Captioning Network
    Liu, Chunsheng
    Zhang, Xiao
    Chang, Faliang
    Li, Shuang
    Hao, Penghui
    Lu, Yansha
    Wang, Yinhai
    IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2024, 25 (05) : 3615 - 3627
  • [23] Quality Enhancement Based Video Captioning in Video Communication Systems
    Le, The Van
    Lee, Jin Young
    IEEE ACCESS, 2024, 12 : 40989 - 40999
  • [24] A Grey Relational Analysis based Evaluation Metric for Image Captioning and Video Captioning
    Ma, Miao
    Wang, Bolong
    PROCEEDINGS OF 2017 IEEE INTERNATIONAL CONFERENCE ON GREY SYSTEMS AND INTELLIGENT SERVICES (GSIS), 2017, : 76 - 81
  • [25] Learning Video-Text Aligned Representations for Video Captioning
    Shi, Yaya
    Xu, Haiyang
    Yuan, Chunfeng
    Li, Bing
    Hu, Weiming
    Zha, Zheng-Jun
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (02)
  • [26] Improving distinctiveness in video captioning with text-video similarity
    Velda, Vania
    Immanuel, Steve Andreas
    Hendria, Willy Fitra
    Jeong, Cheol
    IMAGE AND VISION COMPUTING, 2023, 136
  • [27] Multiple Videos Captioning Model for Video Storytelling
    Han, Seung-Ho
    Go, Bo-Won
    Choi, Ho-Jin
    2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING (BIGCOMP), 2019, : 355 - 358
  • [28] Video captioning with global and local text attention
    Peng, Yuqing
    Wang, Chenxi
    Pei, Yixin
    Li, Yingjun
    THE VISUAL COMPUTER, 2022, 38 : 4267 - 4278
  • [29] A NOVEL ATTRIBUTE SELECTION MECHANISM FOR VIDEO CAPTIONING
    Xiao, Huanhou
    Shi, Jinglun
    2019 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2019, : 619 - 623
  • [30] Convolutional Reconstruction-to-Sequence for Video Captioning
    Wu, Aming
    Han, Yahong
    Yang, Yi
    Hu, Qinghua
    Wu, Fei
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2020, 30 (11) : 4299 - 4308