Fine-Grained Cross-Modal Contrast Learning for Video-Text Retrieval

Cited by: 0
Authors
Liu, Hui [1]
Lv, Gang [2]
Gu, Yanhong [1]
Nian, Fudong [1, 3]
Affiliations
[1] Hefei Univ, Sch Adv Mfg Engn, Hefei, Peoples R China
[2] Chizhou Univ, Sch Big Data & Artificial Intelligence, Chizhou, Peoples R China
[3] Anhui Prov Engn Technol Res Ctr Intelligent Vehic, Hefei, Peoples R China
Source
ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT V, ICIC 2024 | 2024 / Vol. 14866
Keywords
video-text retrieval; computer vision; multi-modal
DOI
10.1007/978-981-97-5594-3_25
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Video-text retrieval is a central task in multimodal information retrieval for video-sharing platforms. Existing methods often overlook the intricacy and redundancy of video-text data and rely mainly on single-granularity information. To address this, we propose Fine-grained Cross-modal Contrast Learning (FCCL), an end-to-end framework. FCCL includes a frame enhancement module that reduces data complexity by discerning key features from each video frame. We also introduce a multimodal attention model that accurately identifies the video sub-regions most similar to the text, and a multi-granularity discrepancy analysis model that captures cross-modal similarity at three levels: video-sentence, frame-sentence, and frame-word. Experimental results on the MSR-VTT and MSVD datasets demonstrate FCCL's superiority in video-text retrieval. Code is available at: https://github.com/LHlh917/FCCL.
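The abstract names three similarity levels but this record does not give their formulas. The following is only an illustrative sketch of what scoring at video-sentence, frame-sentence, and frame-word granularity could look like. Mean pooling, max-over-frames matching, the unweighted average, and every function name below are assumptions made for illustration; they are not taken from the paper or from the FCCL repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_pool(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def multi_granularity_similarity(frames, words):
    """Score one video against one caption at three granularities.

    frames: list of per-frame embeddings.
    words:  list of per-word embeddings.
    The pooling choices here are hypothetical stand-ins, not the paper's method.
    """
    video = mean_pool(frames)      # coarse video-level vector (assumed pooling)
    sentence = mean_pool(words)    # coarse sentence-level vector (assumed pooling)

    # Video-sentence: compare the two pooled vectors.
    video_sentence = cosine(video, sentence)
    # Frame-sentence: best-matching single frame for the whole sentence.
    frame_sentence = max(cosine(f, sentence) for f in frames)
    # Frame-word: for each word take its best-matching frame, then average over words.
    frame_word = sum(max(cosine(f, w) for f in frames) for w in words) / len(words)

    return {
        "video_sentence": video_sentence,
        "frame_sentence": frame_sentence,
        "frame_word": frame_word,
        "combined": (video_sentence + frame_sentence + frame_word) / 3.0,
    }
```

In a retrieval setting, a combined score of this kind would be computed for every video-caption pair in a batch and fed to a contrastive loss; how FCCL weights or trains the three levels should be checked against the linked code.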
Pages: 298-310
Page count: 13