Exploiting Instance-level Relationships in Weakly Supervised Text-to-Video Retrieval

Cited by: 0
Authors
Yin, Shukang [1]
Zhao, Sirui [2 ]
Wang, Hao [3 ]
Xu, Tong [1 ]
Chen, Enhong [1 ]
Affiliations
[1] Univ Sci & Technol China, Sch Data Sci, Hefei, Peoples R China
[2] Univ Sci & Technol China, Sch Comp Sci & Technol, Hefei, Peoples R China
[3] Southwest Univ Sci & Technol, Sch Comp Sci & Technol, Mianyang, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Text-to-video retrieval; cross-modal retrieval; weakly supervised; multiple instance learning; IMAGE;
DOI
10.1145/3663571
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Text-to-Video Retrieval is a typical cross-modal retrieval task that has been studied extensively under the conventional supervised setting. Recently, some works have sought to extend the problem to a weakly supervised formulation, which is more consistent with real-life scenarios and more efficient in annotation cost. In this context, a new task called Partially Relevant Video Retrieval (PRVR) has been proposed, which aims to retrieve videos that are partially relevant to a given textual query, i.e., videos containing at least one semantically relevant moment. Formulating the task as a Multiple Instance Learning (MIL) ranking problem, prior works rely on heuristic algorithms such as a simple greedy search strategy and handle each query independently. Although these early explorations achieve decent performance, they may not fully utilize the bag-level label and consider only the local optimum, which can result in suboptimal solutions and inferior final retrieval performance. To address this problem, in this paper we propose to exploit the relationships between instances to boost retrieval performance. Based on this idea, we put forward: (1) a new matching scheme for pairing queries with their related moments in the video; and (2) a new loss function that facilitates cross-modal alignment between two views of an instance. Extensive experiments on three publicly available datasets demonstrate the effectiveness of our solution and verify our hypothesis that modeling instance-level relationships is beneficial in the MIL ranking setting. Our code will be publicly available at https://github.com/xjtupanda/BGM-Net.
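The paper itself details the matching scheme and the alignment loss; as a rough illustration of the two ideas named in the abstract, the sketch below pairs a bag of queries with video moments through a global bipartite assignment (consistent with, but not confirmed by, the BGM-Net name) rather than an independent greedy argmax, and aligns two views of an instance with a standard InfoNCE-style contrastive loss. All function names and the choice of scipy's Hungarian solver are illustrative assumptions, not the authors' actual implementation.

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_queries_to_moments(sim):
    """Globally pair each query with one moment via bipartite matching.

    sim: (num_queries, num_moments) similarity matrix for one video bag.
    Returns (query_idx, moment_idx) arrays of the optimal assignment.
    A greedy baseline would instead take argmax per query, which can let
    two queries claim the same moment and settle for a local optimum;
    the global assignment maximizes total similarity across the bag.
    """
    q_idx, m_idx = linear_sum_assignment(sim, maximize=True)
    return q_idx, m_idx

def alignment_loss(view_a, view_b, temperature=0.07):
    """InfoNCE-style loss aligning two views of the same instances.

    view_a, view_b: (n, d) L2-normalized embeddings; row i of each view
    describes the same instance, so the diagonal holds the positives.
    (Hypothetical stand-in for the paper's cross-modal alignment loss.)
    """
    logits = view_a @ view_b.T / temperature          # (n, n) similarities
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                # cross-entropy on the diagonal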
Pages: 21