Partially Relevant Video Retrieval

Cited by: 27
Authors
Dong, Jianfeng [1 ]
Chen, Xianke [1 ]
Zhang, Minsong [1 ]
Yang, Xun [2 ]
Chen, Shujie [1 ]
Li, Xirong [3 ]
Wang, Xun [1 ]
Affiliations
[1] Zhejiang Gongshang Univ, Hangzhou, Peoples R China
[2] Univ Sci & Technol China, Hefei, Peoples R China
[3] Renmin Univ, Key Lab DEKE, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Funding
National Key R&D Program of China;
Keywords
Video-Text Retrieval; Partially Relevant; Multiple Instance Learning; Video Representation Learning; TEXT;
DOI
10.1145/3503161.3547976
Chinese Library Classification (CLC)
TP39 [Computer Applications];
Discipline codes
081203 ; 0835 ;
Abstract
Current methods for text-to-video retrieval (T2VR) are trained and tested on video-captioning-oriented datasets such as MSVD, MSR-VTT, and VATEX. A key property of these datasets is that videos are assumed to be temporally pre-trimmed and of short duration, while the provided captions describe the gist of the video content well. Consequently, for a given paired video and caption, the video is supposed to be fully relevant to the caption. In reality, however, as queries are not known a priori, pre-trimmed video clips may not contain sufficient content to fully meet a query. This suggests a gap between the literature and the real world. To fill the gap, we propose in this paper a novel T2VR subtask termed Partially Relevant Video Retrieval (PRVR). An untrimmed video is considered partially relevant w.r.t. a given textual query if it contains a moment relevant to the query. PRVR aims to retrieve such partially relevant videos from a large collection of untrimmed videos. PRVR differs from single-video moment retrieval and video corpus moment retrieval, as the latter two retrieve moments rather than untrimmed videos. We formulate PRVR as a multiple-instance learning (MIL) problem, in which a video is simultaneously viewed as a bag of video clips and a bag of video frames; clips and frames represent video content at different temporal scales. We propose a Multi-Scale Similarity Learning (MS-SL) network that jointly learns clip-scale and frame-scale similarities for PRVR. Extensive experiments on three datasets (TVR, ActivityNet Captions, and Charades-STA) demonstrate the viability of the proposed method. We also show that our method can be used to improve video corpus moment retrieval.
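The MIL view described in the abstract can be sketched in a few lines: a video is scored against a query by the similarity of its best-matching instance at each temporal scale (frames and clips), and the two scales are fused. The abstract does not specify the MS-SL architecture's internals, so the cosine similarity, max-pooling over instances, and the fusion weight `alpha` below are illustrative assumptions, not the paper's exact method:

```python
import numpy as np

def cosine_sim(query, bank):
    """Cosine similarity between one query vector (d,) and a bank (n, d)."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    return b @ q  # shape (n,)

def partial_relevance_score(query_emb, frame_embs, clip_embs, alpha=0.5):
    """MIL-style video-level score: the video counts as relevant if at
    least one instance (a frame or a clip) matches the query, so each
    scale is max-pooled over its bag before fusion."""
    frame_score = cosine_sim(query_emb, frame_embs).max()  # frame-scale bag
    clip_score = cosine_sim(query_emb, clip_embs).max()    # clip-scale bag
    return alpha * clip_score + (1.0 - alpha) * frame_score

# Toy embeddings standing in for learned query/frame/clip representations.
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
frames = rng.standard_normal((16, 8))  # bag of frame embeddings
clips = rng.standard_normal((4, 8))    # bag of clip embeddings
score = partial_relevance_score(q, frames, clips)
```

Ranking a collection of untrimmed videos by this score retrieves those with at least one query-relevant moment, even when most of each video is irrelevant, which is the PRVR setting.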
Pages: 12