Multi-modal Transformer for Video Retrieval

Cited by: 409
Authors
Gabeur, Valentin [1 ,2 ]
Sun, Chen [2 ]
Alahari, Karteek [1 ]
Schmid, Cordelia [2 ]
Affiliations
[1] Univ Grenoble Alpes, CNRS, Inria, Grenoble INP, LJK, F-38000 Grenoble, France
[2] Google Res, Meylan, France
Source
COMPUTER VISION - ECCV 2020, PT IV | 2020 / Vol. 12349
Keywords
Video; Language; Retrieval; Multi-modal; Cross-modal; Temporality; Transformer; Attention
DOI
10.1007/978-3-030-58548-8_13
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
The task of retrieving video content relevant to natural language queries plays a critical role in effectively handling internet-scale datasets. Most of the existing methods for this caption-to-video retrieval problem do not fully exploit cross-modal cues present in video. Furthermore, they aggregate per-frame visual features with limited or no temporal information. In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others. The transformer architecture is also leveraged to encode and model the temporal information. On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer. This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets. More details are available at http://thoth.inrialpes.fr/research/MMT.
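The core idea in the abstract can be illustrated with a short sketch: features pre-extracted from several video modalities are concatenated into one sequence, tagged with learned modality and temporal embeddings, and encoded jointly so every token can attend to every other, across both modalities and time. The PyTorch code below is a minimal, hypothetical rendering of that idea; the class name, dimensions, and all hyperparameters are assumptions for illustration, not the authors' released implementation (see the project page for the actual code and details).

```python
# Illustrative sketch only, not the paper's implementation.
import torch
import torch.nn as nn

class MultiModalTransformerSketch(nn.Module):
    # All sizes below are assumed for illustration, not values from the paper.
    def __init__(self, d_model=512, n_modalities=3, max_len=512,
                 n_heads=8, n_layers=4):
        super().__init__()
        # One learned vector per modality (e.g. appearance, motion, audio),
        # so the encoder can tell the concatenated streams apart.
        self.modality_emb = nn.Embedding(n_modalities, d_model)
        # Learned temporal position embeddings encode frame/segment order.
        self.temporal_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, features, modality_ids, time_ids):
        # features:     (batch, seq, d_model) per-segment features from all
        #               modalities, concatenated along the sequence axis
        # modality_ids: (batch, seq) source-modality index of each token
        # time_ids:     (batch, seq) temporal index of each token
        x = (features
             + self.modality_emb(modality_ids)
             + self.temporal_emb(time_ids))
        # Self-attention mixes information across modalities and time.
        return self.encoder(x)  # (batch, seq, d_model) contextualized tokens
```

In the full retrieval setup, an aggregated output of this encoder would be matched against a caption embedding (the paper jointly optimizes the language side, e.g. a BERT encoder, together with the multi-modal transformer) under a cross-modal similarity loss; those components are omitted from the sketch.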
Pages: 214-229
Page count: 16