A GLOBAL-LOCAL CONTRASTIVE LEARNING FRAMEWORK FOR VIDEO CAPTIONING

Cited by: 0

Authors:

Huang, Qunyue [1]

Fang, Bin [1]

Ai, Xi [1]

Affiliations:

[1] Chongqing University, College of Computer Science, Chongqing, People's Republic of China

Source:

2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP | 2023

Keywords:

video captioning; contrastive learning; local encoder; global encoder; multimodal encoder

DOI:

10.1109/ICIP49359.2023.10223123

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Subject classification codes:

081104; 0812; 0835; 1405

Abstract:

In this paper, a global-local contrastive learning framework is proposed to leverage global contextual information from different modalities and then effectively fuse them with the supervision of contrastive learning. First, a global-local encoder is proposed to sufficiently explore the salient contextual information from different modalities, which generates the global contextual information. Second, contrastive learning is used to minimize the semantic distance between the paired modalities, which can improve the content matching between videos and the predicted captions. Finally, an attention-based multimodal encoder is presented to effectively fuse different modalities, thereby generating the multimodal representations that include global contextual information from different modalities. Extensive experimental results on benchmark datasets indicate that our proposed method is superior to the state-of-the-art approaches.
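The abstract does not say which contrastive objective is used. As a minimal sketch, assuming a symmetric InfoNCE-style loss over a batch of paired video and caption embeddings, "minimizing the semantic distance between the paired modalities" could look like the following. The function names, the temperature value, and the use of cosine similarity are illustrative assumptions, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; video_embs[i] and text_embs[i] form a pair.

    Hypothetical helper: the loss form and temperature are assumptions.
    """
    assert video_embs and len(video_embs) == len(text_embs)
    v = [l2_normalize(x) for x in video_embs]
    t = [l2_normalize(x) for x in text_embs]
    n = len(v)

    # sim[i][j]: cosine similarity of video i and caption j, scaled by temperature.
    sim = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / len(rows)

    # Average the video-to-text and text-to-video directions.
    text_to_video = [list(col) for col in zip(*sim)]
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(text_to_video))
```

Under these assumptions the loss is small when each video embedding is closest to its own caption embedding and grows when pairs are mismatched, which is the behaviour the abstract attributes to the contrastive supervision.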

Pages: 2410 - 2414

Page count: 5

Related records (first 10 of 50):

[1] Contrastive Learning of Global-Local Video Representations
Ma, Shuang
Zeng, Zhaoyang
McDuff, Daniel
Song, Yale
ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
[2] Video Captioning Using Global-Local Representation
Yan, Liqi
Ma, Siqi
Wang, Qifan
Chen, Yingjie
Zhang, Xiangyu
Savakis, Andreas
Liu, Dongfang
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (10) : 6642 - 6656
[3] Hierarchical Global-Local Temporal Modeling for Video Captioning
Hu, Yaosi
Chen, Zhenzhong
Zha, Zheng-Jun
Wu, Feng
PROCEEDINGS OF THE 27TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA (MM'19), 2019 : 774 - 783
[4] Violent Video Recognition Based on Global-Local Visual and Audio Contrastive Learning
Liu, Zihao
Wu, Xiaoyu
Wang, Shengjin
Shang, Yimeng
IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 476 - 480
[5] Global-Local Combined Semantic Generation Network for Video Captioning
Mao L.
Gao H.
Yang D.
Jisuanji Fuzhu Sheji Yu Tuxingxue Xuebao/Journal of Computer-Aided Design and Computer Graphics, 2023, 35 (09) : 1374 - 1382
[6] Hard Contrastive Learning for Video Captioning
Wu, Lilei
Liu, Jie
2022 IEEE 5TH INTERNATIONAL CONFERENCE ON ELECTRONICS AND COMMUNICATION ENGINEERING, ICECE, 2022 : 202 - 209
[7] GLCM: Global-Local Captioning Model for Remote Sensing Image Captioning
Wang, Qi
Huang, Wei
Zhang, Xueting
Li, Xuelong
IEEE TRANSACTIONS ON CYBERNETICS, 2023, 53 (11) : 6910 - 6922
[8] GL-CLEF: A Global-Local Contrastive Learning Framework for Cross-lingual Spoken Language Understanding
Qin, Libo
Chen, Qiguang
Xie, Tianbao
Li, Qixin
Lou, Jian-Guang
Che, Wanxiang
Kan, Min-Yen
PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022 : 2677 - 2686
[9] ActBERT: Learning Global-Local Video-Text Representations
Zhu, Linchao
Yang, Yi
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020), 2020 : 8743 - 8752
[10] Global-local contrastive multiview representation learning for skeleton-based action
Bian, Cunling
Feng, Wei
Meng, Fanbo
Wang, Song
COMPUTER VISION AND IMAGE UNDERSTANDING, 2023, 229