A GLOBAL-LOCAL CONTRASTIVE LEARNING FRAMEWORK FOR VIDEO CAPTIONING

Authors
Huang, Qunyue [1]
Fang, Bin [1]
Ai, Xi [1]
Affiliations
[1] Chongqing Univ, Coll Comp Sci, Chongqing, Peoples R China
Source
2023 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP | 2023
Keywords
video captioning; contrastive learning; local encoder; global encoder; multimodal encoder
DOI
10.1109/ICIP49359.2023.10223123
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, a global-local contrastive learning framework is proposed to leverage global contextual information from different modalities and then effectively fuse them with the supervision of contrastive learning. First, a global-local encoder is proposed to sufficiently explore the salient contextual information from different modalities, which generates the global contextual information. Second, contrastive learning is used to minimize the semantic distance between the paired modalities, which can improve the content matching between videos and the predicted captions. Finally, an attention-based multimodal encoder is presented to effectively fuse different modalities, thereby generating the multimodal representations that include global contextual information from different modalities. Extensive experimental results on benchmark datasets indicate that our proposed method is superior to the state-of-the-art approaches.
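The abstract does not state which contrastive objective is used. As a minimal sketch only, assuming a symmetric InfoNCE-style loss over paired video and caption embeddings (a common choice, not confirmed as the authors' formulation), the idea of "minimizing the semantic distance between the paired modalities" could look like the following. The function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def log_softmax(row):
    """Numerically stable log-softmax over one row of similarity scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_info_nce(video_emb, text_emb, temperature=0.07):
    """Assumed symmetric InfoNCE loss: row i of video_emb is paired with row i of text_emb.

    Every other row in the batch acts as a negative. The temperature is an
    illustrative default, not a value taken from the paper.
    """
    v = [l2_normalize(x) for x in video_emb]
    t = [l2_normalize(x) for x in text_emb]
    n = len(v)
    # Pairwise cosine similarities, sharpened by the temperature.
    sim = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Video-to-text direction: the matching caption should score highest in each row.
    v2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-video direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2v = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)


if __name__ == "__main__":
    video = [[1.0, 0.0], [0.0, 1.0]]
    matched = [[1.0, 0.0], [0.0, 1.0]]     # captions aligned with their videos
    mismatched = [[0.0, 1.0], [1.0, 0.0]]  # captions swapped between videos
    print(symmetric_info_nce(video, matched))
    print(symmetric_info_nce(video, mismatched))
```

Under these assumptions the loss is small when each video embedding lies closest to its own caption embedding and large when the pairs are swapped, which is the behavior the abstract attributes to the contrastive supervision.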
Pages: 2410-2414
Number of pages: 5