Learning to Diversify for Robust Video Moment Retrieval

Cited by: 0
Authors
Ge, Huilin [1 ]
Liu, Xiaolei [2 ]
Guo, Zihang [3 ,4 ,5 ]
Qiu, Zhiwen [1 ]
Affiliations
[1] Jiangsu Univ Sci & Technol, Ocean Coll, Zhenjiang 212100, Peoples R China
[2] Elevoc Technol Co Ltd, Shenzhen 518000, Peoples R China
[3] Inner Mongolia Univ, Coll Comp Sci, Hohhot 010031, Peoples R China
[4] Natl & Local Joint Engn Res Ctr Intelligent Inform, Hohhot 010021, Peoples R China
[5] Inner Mongolia Key Lab Multilingual Artificial Int, Hohhot 010021, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Proposals; Semantics; Circuit faults; Feature extraction; Cognition; Visualization; Training; Robustness; Streaming media; Predictive models; Video moment retrieval; cross-modal interaction;
DOI
10.1109/TCSVT.2024.3498599
CLC Number
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Discipline Code
0808; 0809;
Abstract
In this paper, we focus on diversifying the Video Moment Retrieval (VMR) model to more scenes. Most existing video moment retrieval methods focus on aligning video moments and queries by capturing the cross-modal relationship, largely ignoring the cross-instance relationship behind the representation learning. As a result, they are easily misled by inaccurate cross-instance contrastive relationships during training: 1) Existing approaches can hardly identify similar semantic content across different scenes. They incorrectly treat such instances as negative samples (termed faulty negatives), which forces the model to learn features from query-irrelevant scenes. 2) Existing methods perform poorly when locating queries with subtle differences, because they neglect to mine hard negative samples that belong to similar scenes but carry different semantic content. In this paper, we propose a novel robust video moment retrieval method that prevents the model from overfitting to query-irrelevant scene features by accurately capturing both the cross-modal and cross-instance relationships. Specifically, we first develop a scene-independent cross-modal reasoning module that filters out redundant scene content and infers the video semantics under the guidance of query information. Then, faulty and hard negative samples are mined from the negative ones, and their contributions to the overall contrastive loss are calibrated. We validate our contributions through extensive experiments in cross-scene video moment retrieval settings, where the training and test data come from different scenes. Experimental results show that the proposed robust video moment retrieval model effectively retrieves target videos by capturing the real cross-modal and cross-instance relationships.
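The negative-sample calibration described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the thresholds, weights, and function name below are hypothetical, chosen only to show the general idea: negatives that are suspiciously similar to the query (likely faulty negatives, i.e., same semantics in a different scene) are suppressed in the contrastive denominator, while moderately similar negatives (likely hard negatives, i.e., similar scene but different semantics) are up-weighted.

```python
import numpy as np

def calibrated_contrastive_loss(sims, pos_idx, faulty_thresh=0.9,
                                hard_thresh=0.7, tau=0.07):
    """InfoNCE-style loss over precomputed query-moment cosine similarities.

    Illustrative sketch (not the paper's exact formulation):
    - negatives with similarity > faulty_thresh are treated as suspected
      faulty negatives and removed from the denominator (weight 0);
    - negatives in (hard_thresh, faulty_thresh] are treated as hard
      negatives and up-weighted (weight 2);
    - all other negatives keep weight 1.
    """
    sims = np.asarray(sims, dtype=float)
    weights = np.ones_like(sims)
    neg = np.ones(sims.shape, dtype=bool)
    neg[pos_idx] = False                                   # positive excluded
    weights[neg & (sims > faulty_thresh)] = 0.0            # faulty: suppress
    weights[neg & (sims > hard_thresh)
                & (sims <= faulty_thresh)] = 2.0           # hard: emphasize
    exp = np.exp(sims / tau) * weights
    # standard InfoNCE ratio with the calibrated denominator
    return -np.log(exp[pos_idx] / (exp[pos_idx] + exp[neg].sum()))

# Example: index 0 is the positive; 0.92 looks like a faulty negative,
# 0.75 like a hard negative, 0.2 like an easy negative.
loss = calibrated_contrastive_loss([0.95, 0.92, 0.75, 0.2], pos_idx=0)
```

Suppressing the faulty negative shrinks the denominator, so the model is no longer pushed away from query-relevant content that merely appears in a different scene, while the up-weighted hard negative sharpens the decision boundary between similar scenes with different semantics.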
Pages: 2894-2904
Number of pages: 11
Related Papers
50 records in total
  • [1] Learning Video Moment Retrieval Without a Single Annotated Video
    Gao, Junyu
    Xu, Changsheng
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (03) : 1646 - 1657
  • [2] Momentum Cross-Modal Contrastive Learning for Video Moment Retrieval
    Han, De
    Cheng, Xing
    Guo, Nan
    Ye, Xiaochun
    Rainer, Benjamin
    Priller, Peter
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (07) : 5977 - 5994
  • [3] Cascaded MPN: Cascaded Moment Proposal Network for Video Corpus Moment Retrieval
    Yoon, Sunjae
    Kim, Dahyun
    Kim, Junyeong
    Yoo, Chang D.
    IEEE ACCESS, 2022, 10 : 64560 - 64568
  • [4] Frame-Wise Cross-Modal Matching for Video Moment Retrieval
    Tang, Haoyu
    Zhu, Jihua
    Liu, Meng
    Gao, Zan
    Cheng, Zhiyong
    IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 1338 - 1349
  • [5] Video Moment Retrieval via Comprehensive Relation-Aware Network
    Sun, Xin
    Gao, Jialin
    Zhu, Yizhe
    Wang, Xuan
    Zhou, Xi
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (09) : 5281 - 5295
  • [6] Cross-Modal Dynamic Networks for Video Moment Retrieval With Text Query
    Wang, Gongmian
    Xu, Xing
    Shen, Fumin
    Lu, Huimin
    Ji, Yanli
    Shen, Heng Tao
    IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 1221 - 1232
  • [7] Video Moment Retrieval with Hierarchical Contrastive Learning
    Zhang, Bolin
    Yang, Chao
    Jiang, Bin
    Zhou, Xiaokang
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022,
  • [8] Video Moment Retrieval With Noisy Labels
    Pan, Wenwen
    Zhao, Zhou
    Huang, Wencan
    Zhang, Zhu
    Fu, Liyong
    Pan, Zhigeng
    Yu, Jun
    Wu, Fei
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (05) : 6779 - 6791
  • [9] Video Corpus Moment Retrieval via Deformable Multigranularity Feature Fusion and Adversarial Training
    Zhang, Xuemei
    Zhao, Peng
    Ji, Jinsheng
    Lu, Xiankai
    Yin, Yilong
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (08) : 6686 - 6698
  • [10] Gazing After Glancing: Edge Information Guided Perception Network for Video Moment Retrieval
    Huang, Zhanghao
    Ji, Yi
    Li, Ying
    Liu, Chunping
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 1535 - 1539