INTEGRATED MODALITIES AND MULTI-LEVEL GRANULARITY: TOWARDS A UNIFIED VIDEO-TEXT RETRIEVAL FRAMEWORK

Cited by: 0
Authors
Liu, Liu [1]
Wang, Wenzhe [2]
Zhang, Zhijie [1]
Zhang, Mengdan [3]
Peng, Pai [3]
Sun, Xing [3]
Affiliations
[1] Shanghai Jiao Tong Univ, Shanghai, Peoples R China
[2] Zhejiang Univ, Hangzhou, Peoples R China
[3] Tencent, Youtu Lab, Shenzhen, Peoples R China
Source
2021 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO WORKSHOPS (ICMEW) | 2021
Keywords
Video-text retrieval; multi-modal transformer; hierarchical alignment
DOI
10.1109/ICMEW53276.2021.9455971
Chinese Library Classification
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web. Recent research addresses different aspects of this task, such as exploiting multi-modal video cues, hierarchical reasoning, and learning pre-trained models. The implementations of these approaches vary widely, which makes further research difficult. Therefore, in this paper, we provide a unified video-text retrieval framework with the following features: 1) a modular design that makes it easy to modify different structures of deep learning models; 2) training and test pipelines for state-of-the-art (SOTA) models that leverage hierarchical cues and interactions between different levels of granularity and different video modalities; 3) support for various benchmark datasets; 4) demos, together with a well-tested and documented codebase. We hope our unified framework will prove useful and efficient for further research.
Pages: 2