Learning to Diversify for Robust Video Moment Retrieval

Citations: 0
Authors
Ge, Huilin [1 ]
Liu, Xiaolei [2 ]
Guo, Zihang [3 ,4 ,5 ]
Qiu, Zhiwen [1 ]
Affiliations
[1] Jiangsu Univ Sci & Technol, Ocean Coll, Zhenjiang 212100, Peoples R China
[2] Elevoc Technol Co Ltd, Shenzhen 518000, Peoples R China
[3] Inner Mongolia Univ, Coll Comp Sci, Hohhot 010031, Peoples R China
[4] Natl & Local Joint Engn Res Ctr Intelligent Inform, Hohhot 010021, Peoples R China
[5] Inner Mongolia Key Lab Multilingual Artificial Int, Hohhot 010021, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Proposals; Semantics; Circuit faults; Feature extraction; Cognition; Visualization; Training; Robustness; Streaming media; Predictive models; Video moment retrieval; cross-modal interaction;
DOI
10.1109/TCSVT.2024.3498599
CLC Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline Codes
0808; 0809;
Abstract
In this paper, we focus on diversifying the Video Moment Retrieval (VMR) model to more scenes. Most existing video moment retrieval methods focus on aligning video moments and queries by capturing the cross-modal relationship, while largely ignoring the cross-instance relationship behind the representation learning. As a result, they are easily misled by inaccurate cross-instance contrastive relationships during training: 1) Existing approaches can hardly identify similar semantic content across different scenes; they incorrectly treat such instances as negative samples (termed faulty negatives), which forces the model to learn features from query-irrelevant scenes. 2) Existing methods perform poorly when locating queries with subtle differences; they neglect to mine hard negative samples that belong to similar scenes but carry different semantic content. In this paper, we propose a novel robust video moment retrieval method that prevents the model from overfitting to query-irrelevant scene features by accurately capturing both the cross-modal and cross-instance relationships. Specifically, we first develop a scene-independent cross-modal reasoning module that filters out redundant scene content and infers the video semantics under the guidance of the query. Then, faulty and hard negative samples are mined from the negatives, and their contributions to the overall contrastive loss are calibrated. We validate our contributions through extensive experiments in cross-scene video moment retrieval settings, where the training and test data come from different scenes. Experimental results show that the proposed robust video moment retrieval model effectively retrieves target moments by capturing the true cross-modal and cross-instance relationships.
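The calibration idea described in the abstract (down-weighting suspected faulty negatives and up-weighting hard negatives in the contrastive loss) can be sketched as a weighted InfoNCE loss. Note this is an illustrative toy, not the paper's actual scheme: the similarity thresholds, weights, and temperature below are assumptions.

```python
import math

def weighted_infonce(pos_sim, neg_sims, tau=0.07,
                     faulty_thresh=0.9, hard_thresh=0.6):
    """Toy calibrated contrastive loss for one anchor.

    pos_sim:  cosine similarity to the matched (positive) instance.
    neg_sims: similarities to the other instances in the batch.
    Negatives above faulty_thresh are treated as faulty (likely the
    same semantics in a different scene) and dropped; those between
    hard_thresh and faulty_thresh are treated as hard negatives and
    emphasized. All values here are illustrative assumptions.
    """
    def weight(s):
        if s >= faulty_thresh:
            return 0.0  # suspected faulty negative: remove from the loss
        if s >= hard_thresh:
            return 2.0  # hard negative: similar scene, different semantics
        return 1.0      # ordinary easy negative

    pos = math.exp(pos_sim / tau)
    neg = sum(weight(s) * math.exp(s / tau) for s in neg_sims)
    return -math.log(pos / (pos + neg))
```

Setting both thresholds above 1.0 recovers plain InfoNCE; with calibration enabled, a near-duplicate negative no longer dominates the denominator, so the anchor is not pushed away from semantically matching content.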
Pages: 2894-2904
Page count: 11