Fine-Grained Cross-Modal Contrast Learning for Video-Text Retrieval

Cited by: 0

Authors:

Liu, Hui [1]

Lv, Gang [2]

Gu, Yanhong [1]

Nian, Fudong [1,3]

Affiliations:

[1] Hefei Univ, Sch Adv Mfg Engn, Hefei, Peoples R China

[2] Chizhou Univ, Sch Big Data & Artificial Intelligence, Chizhou, Peoples R China

[3] Anhui Prov Engn Technol Res Ctr Intelligent Vehic, Hefei, Peoples R China

Source:

ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT V, ICIC 2024 | 2024 / Vol. 14866

Keywords:

video-text retrieval; computer vision; multi-modal

DOI:

10.1007/978-981-97-5594-3_25

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Video-sharing platforms emphasize video-text retrieval in multimodal information retrieval. Existing methods often overlook video text intricacies and redundancy, focusing mainly on single-granularity information. To address this, we propose Fine-grained Cross-modal Contrast Learning (FCCL), an end-to-end framework. FCCL includes a frame enhancement module to reduce data complexity by discerning key features from each video frame. Additionally, we introduce a multimodal attention model to identify text-similar video sub-regions accurately. We also introduce a multi-granularity discrepancy analysis model to capture cross-modal similarity across different levels, including video-sentence, frame-sentence, and frame-word perspectives. Experimental results on MSR-VTT and MSVD datasets demonstrate FCCL's superiority in video-text retrieval. Code is available at: https://github.com/LHlh917/FCCL.
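
The three similarity levels named in the abstract (video-sentence, frame-sentence, frame-word) can be illustrated with a small sketch. This is not the authors' implementation: the mean pooling, the cosine scoring, the max-over-frames aggregation, and every function name below are assumptions chosen for illustration only. The record does not specify how FCCL computes or combines these scores.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean of a list of vectors (assumed pooling choice)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def multi_granularity_similarity(frames, words):
    """Return (video-sentence, frame-sentence, frame-word) scores.

    frames: list of frame embeddings; words: list of word embeddings.
    All aggregation rules here are illustrative assumptions.
    """
    video = mean_vector(frames)      # coarse video embedding
    sentence = mean_vector(words)    # coarse sentence embedding

    # Coarsest level: one pooled video vector against one pooled sentence vector.
    video_sentence = cosine(video, sentence)

    # Middle level: each frame against the sentence, averaged over frames.
    frame_sentence = sum(cosine(f, sentence) for f in frames) / len(frames)

    # Finest level: each word matched to its best frame, averaged over words.
    frame_word = sum(max(cosine(f, w) for f in frames) for w in words) / len(words)

    return video_sentence, frame_sentence, frame_word


# Toy example: two orthogonal frame embeddings, one word embedding.
scores = multi_granularity_similarity([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
print(scores)
```

In the toy input, the single word aligns with the first frame only, so the frame-word score is highest, while pooling all frames into one video vector dilutes the match. This is the kind of difference a single-granularity comparison would miss.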


Pages: 298-310

Number of pages: 13

