Global semantic enhancement network for video captioning

Cited by: 10
Authors
Luo, Xuemei [1 ,3 ]
Luo, Xiaotong [1 ]
Wang, Di [1 ]
Liu, Jinhui [1 ]
Wan, Bo [1 ]
Zhao, Lin [2 ,3 ]
Affiliations
[1] Xidian Univ, Key Lab Smart Human Comp Interact & Wearable Techn, Xian 710071, Peoples R China
[2] Nanjing Univ Sci & Technol, Jiangsu Key Lab Image & Video Understanding Social, Nanjing 210094, Peoples R China
[3] Xidian Univ, Key Lab Integrated Serv Networks, Xian 710071, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Video captioning; Feature aggregation; Semantic enhancement;
DOI
10.1016/j.patcog.2023.109906
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Video captioning aims to describe the content of a video briefly, accurately, and fluently in natural language, and is a hot research topic in multimedia processing. As a bridge between video and natural language, video captioning is a challenging task that requires a deep understanding of video content and effective use of diverse multimodal video information. Existing video captioning methods usually ignore the relative importance of different frames when aggregating frame-level video features and neglect the global semantic correlations between videos and texts when learning visual representations, which makes the learned representations less effective. To address these problems, we propose a novel framework, the Global Semantic Enhancement Network (GSEN), to generate high-quality captions for videos. Specifically, a feature aggregation module based on a lightweight attention mechanism is designed to aggregate frame-level video features, highlighting the features of informative frames in the video representation. In addition, a global semantic enhancement module is proposed to strengthen semantic correlations between video and language representations so as to generate semantically more accurate captions. Extensive qualitative and quantitative experiments on two public benchmark datasets, MSVD and MSR-VTT, demonstrate that the proposed GSEN outperforms state-of-the-art methods.
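To make the abstract's two modules concrete, below is a minimal sketch, assuming a PyTorch setting: attention-weighted pooling of frame-level features (so a softmax over the time axis lets informative frames dominate the pooled video representation) and a generic contrastive video-text alignment loss as one common way to enforce global semantic correlation. The class and function names, hidden sizes, and the InfoNCE-style loss are illustrative assumptions, not the paper's actual GSEN implementation.

```python
# Hypothetical sketch of the two ideas in the abstract; not GSEN's real code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFramePooling(nn.Module):
    """Lightweight attention pooling over frame-level video features."""

    def __init__(self, feat_dim: int, hidden_dim: int = 256):
        super().__init__()
        # A small scoring MLP keeps the attention mechanism lightweight.
        self.score = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, feat_dim)
        weights = torch.softmax(self.score(frames), dim=1)  # (B, T, 1)
        # Weighted sum emphasizes informative frames in the video feature.
        return (weights * frames).sum(dim=1)  # (B, feat_dim)

def video_text_contrastive_loss(v: torch.Tensor, t: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """One generic way to enforce global video-text semantic correlation:
    pull matched video/caption embeddings together, push mismatched apart."""
    v = F.normalize(v, dim=-1)  # (B, D) video embeddings
    t = F.normalize(t, dim=-1)  # (B, D) caption embeddings
    logits = v @ t.T / temperature
    labels = torch.arange(v.size(0), device=v.device)
    return F.cross_entropy(logits, labels)

# Example: pool 32 frames of 2048-d features for a batch of 4 videos,
# then align the pooled video features with 2048-d caption embeddings.
pool = AttentionFramePooling(feat_dim=2048)
video_feat = pool(torch.randn(4, 32, 2048))  # -> (4, 2048)
loss = video_text_contrastive_loss(video_feat, torch.randn(4, 2048))
```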
Pages: 11