CMFG: Cross-Model Fine-Grained Feature Interaction for Text-Video Retrieval

Citations: 0
Authors
Zhao, Shengwei [1]
Liu, Yuying [1]
Du, Shaoyi [1,2]
Tian, Zhiqiang [1]
Qu, Ting [3]
Xu, Linhai [1]
Affiliations
[1] Xi An Jiao Tong Univ, Inst Artificial Intelligence & Robot, Xian 710049, Peoples R China
[2] Shunan Acad Artificial Intelligence, Ningbo 315000, Zhejiang, Peoples R China
[3] Jilin Univ, State Key Lab Automot Simulat & Control, Changchun 130022, Peoples R China
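Source
MULTIMEDIA MODELING, MMM 2023, PT II | 2023 / Vol. 13834
Funding
National Natural Science Foundation of China
Keywords
Text-video retrieval; Fine-grained; Cross-model
DOI
10.1007/978-3-031-27818-1_36
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification code
081104; 0812; 0835; 1405
Abstract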
As a fundamental task in the multimodal domain, text-to-video retrieval has received great attention in recent years. Most current research focuses on the interaction between coarse-grained cross-modal features, and the feature granularity of retrieval models has not been fully explored. We therefore introduce the internal region information of videos into cross-modal retrieval and propose a cross-model fine-grained feature retrieval framework. Videos are represented as video-frame-region triple features, texts are represented as sentence-word dual features, and the cross-similarity between visual features and text features is computed through token-wise interaction. The framework effectively extracts detailed information from the video, guides the model to attend to informative video regions and to keywords in the sentence, and reduces the adverse effects of redundant words and interfering frames. On MSRVTT, the most popular retrieval dataset, the framework achieves state-of-the-art results (51.1@1). These experimental results demonstrate the superiority of fine-grained feature interaction.
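The token-wise interaction mentioned in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formula, so the max-then-mean aggregation below (a common late-interaction scheme) is an assumption, as are the function names, the feature dimension, and the random features standing in for encoder outputs.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def token_wise_similarity(text_tokens, visual_tokens):
    """Hypothetical token-wise score between one text and one video.

    text_tokens:   (n_text, dim)   e.g. sentence + word features
    visual_tokens: (n_visual, dim) e.g. video + frame + region features
    Assumed aggregation: each token takes its best match on the other
    side, matches are averaged, and the two directions are averaged.
    """
    t = l2_normalize(text_tokens)
    v = l2_normalize(visual_tokens)
    sim = t @ v.T                           # (n_text, n_visual) cosine matrix
    text_to_video = sim.max(axis=1).mean()  # best visual token per text token
    video_to_text = sim.max(axis=0).mean()  # best text token per visual token
    return 0.5 * (text_to_video + video_to_text)


# Random placeholders for encoder outputs (illustrative only).
rng = np.random.default_rng(0)
dim = 16
sentence = rng.normal(size=(1, dim))     # sentence-level feature
words = rng.normal(size=(5, dim))        # word-level features
video = rng.normal(size=(1, dim))        # video-level feature
frames = rng.normal(size=(4, dim))       # frame-level features
regions = rng.normal(size=(4 * 3, dim))  # 3 region features per frame

text_tokens = np.concatenate([sentence, words])         # dual features
visual_tokens = np.concatenate([video, frames, regions])  # triple features
score = token_wise_similarity(text_tokens, visual_tokens)
```

Because each token keeps only its best match, a redundant word or an interfering frame with no good counterpart contributes a low value rather than dominating a pooled feature, which is one plausible reading of the effect the abstract describes.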
Pages: 435-445
Page count: 11