Learning to Diversify for Robust Video Moment Retrieval

Cited by: 0
Authors
Ge, Huilin [1 ]
Liu, Xiaolei [2 ]
Guo, Zihang [3 ,4 ,5 ]
Qiu, Zhiwen [1 ]
Affiliations
[1] Jiangsu Univ Sci & Technol, Ocean Coll, Zhenjiang 212100, Peoples R China
[2] Elevoc Technol Co Ltd, Shenzhen 518000, Peoples R China
[3] Inner Mongolia Univ, Coll Comp Sci, Hohhot 010031, Peoples R China
[4] Natl & Local Joint Engn Res Ctr Intelligent Inform, Hohhot 010021, Peoples R China
[5] Inner Mongolia Key Lab Multilingual Artificial Int, Hohhot 010021, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Proposals; Semantics; Circuit faults; Feature extraction; Cognition; Visualization; Training; Robustness; Streaming media; Predictive models; Video moment retrieval; cross-modal interaction;
DOI
10.1109/TCSVT.2024.3498599
CLC Classification
TM [Electrical Engineering]; TN [Electronics & Communication Technology];
Discipline Codes
0808; 0809;
Abstract
In this paper, we focus on diversifying Video Moment Retrieval (VMR) models so that they generalize to more scenes. Most existing video moment retrieval methods align video moments and queries by capturing the cross-modal relationship, but largely ignore the cross-instance relationship behind the representation learning. As a result, they are easily misled by inaccurate cross-instance contrastive relationships during training: 1) Existing approaches can hardly identify similar semantic content across different scenes; they incorrectly treat such instances as negative samples (termed faulty negatives), which forces the model to learn features from query-irrelevant scenes. 2) Existing methods perform unsatisfactorily when locating queries with subtle differences; they neglect to mine the hard negative samples that belong to similar scenes but carry different semantic content. We propose a novel robust video moment retrieval method that prevents the model from overfitting to query-irrelevant scene features by accurately capturing both the cross-modal and cross-instance relationships. Specifically, we first develop a scene-independent cross-modal reasoning module that filters out redundant scene content and infers the video semantics under the guidance of the query. Then, faulty and hard negative samples are mined from the negatives, and their contributions to the overall contrastive loss are calibrated. We validate our contributions through extensive experiments in cross-scene video moment retrieval settings, where the training and test data come from different scenes. Experimental results show that the proposed robust video moment retrieval model effectively retrieves target videos by capturing the real cross-modal and cross-instance relationships.
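Since this record only summarizes the approach, the following is a minimal, hypothetical PyTorch sketch of the kind of calibrated contrastive objective the abstract describes: in-batch negatives whose cross-modal similarity is suspiciously high are treated as suspected faulty negatives and down-weighted, while moderately similar negatives are treated as hard negatives and emphasized. The threshold-based mining rule, the weight values, and the function name calibrated_contrastive_loss are illustrative assumptions, not the paper's actual formulation.

import torch
import torch.nn.functional as F

def calibrated_contrastive_loss(video_emb, query_emb, temperature=0.07,
                                faulty_thresh=0.9, hard_thresh=0.6,
                                faulty_weight=0.1, hard_weight=2.0):
    # video_emb, query_emb: (B, D) embeddings of paired moments and queries;
    # row i of each tensor forms the i-th positive pair.
    v = F.normalize(video_emb, dim=-1)
    q = F.normalize(query_emb, dim=-1)
    sim = q @ v.t()                                   # (B, B) cosine similarities
    logits = sim / temperature

    b = sim.size(0)
    pos_mask = torch.eye(b, dtype=torch.bool, device=sim.device)
    neg_sim = sim.masked_fill(pos_mask, float('-inf'))

    # Calibrate each negative's contribution: negatives that look too similar
    # to the query are suspected faulty negatives (same semantics, different
    # scene) and are down-weighted; moderately similar ones are hard negatives
    # and are up-weighted. (Assumed thresholding rule, not the paper's.)
    weights = torch.ones_like(sim)
    weights[neg_sim >= faulty_thresh] = faulty_weight
    weights[(neg_sim >= hard_thresh) & (neg_sim < faulty_thresh)] = hard_weight
    weights = weights.masked_fill(pos_mask, 1.0)      # positives keep weight 1

    # Weighted InfoNCE: diagonal entries are positives; each negative's
    # exp-logit in the denominator is re-scaled by its calibration weight.
    exp_logits = weights * torch.exp(logits)
    log_prob = logits.diagonal() - torch.log(exp_logits.sum(dim=1))
    return -log_prob.mean()

A call like calibrated_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256)) returns a scalar loss. In the paper, the mining and calibration are presumably driven by the learned scene-independent cross-modal reasoning rather than by fixed similarity thresholds; the sketch only shows where calibrated weights enter the contrastive loss.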
Pages: 2894-2904
Page count: 11
Related Papers
50 records in total
  • [22] Spatiotemporal contrastive modeling for video moment retrieval
    Wang, Yi; Li, Kun; Chen, Guoliang; Zhang, Yan; Guo, Dan; Wang, Meng
    World Wide Web, 2023, 26(4): 1525-1544
  • [23] MLLM as video narrator: Mitigating modality imbalance in video moment retrieval
    Cai, Weitong; Huang, Jiabo; Gong, Shaogang; Jin, Hailin; Liu, Yang
    Pattern Recognition, 2025, 166
  • [24] Visual Co-Occurrence Alignment Learning for Weakly-Supervised Video Moment Retrieval
    Wang, Zheng; Chen, Jingjing; Jiang, Yu-Gang
    Proceedings of the 29th ACM International Conference on Multimedia (MM 2021), 2021: 1459-1468
  • [25] Semantic Association Network for Video Corpus Moment Retrieval
    Kim, Dahyun; Yoon, Sunjae; Hong, Ji Woo; Yoo, Chang D.
    2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022: 1720-1724
  • [26] Zero-Shot Video Moment Retrieval With Angular Reconstructive Text Embeddings
    Jiang, Xun; Xu, Xing; Zhou, Zailei; Yang, Yang; Shen, Fumin; Shen, Heng Tao
    IEEE Transactions on Multimedia, 2024, 26: 9657-9670
  • [27] Cross-Modality Knowledge Calibration Network for Video Corpus Moment Retrieval
    Chen, Tongbao; Wang, Wenmin; Jiang, Zhe; Li, Ruochen; Wang, Bingshu
    IEEE Transactions on Multimedia, 2024, 26: 3799-3813
  • [28] Reading-Strategy Inspired Visual Representation Learning for Text-to-Video Retrieval
    Dong, Jianfeng; Wang, Yabing; Chen, Xianke; Qu, Xiaoye; Li, Xirong; He, Yuan; Wang, Xun
    IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(8): 5680-5694
  • [29] Modality-Aware Heterogeneous Graph for Joint Video Moment Retrieval and Highlight Detection
    Wang, Ruomei; Feng, Jiawei; Zhang, Fuwei; Luo, Xiaonan; Luo, Yuanmao
    IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(9): 8896-8911
  • [30] Cross-Modal Interaction Network for Video Moment Retrieval
    Ping, Shen; Jiang, Xiao; Tian, Zean; Cao, Ronghui; Chi, Weiming; Yang, Shenghong
    International Journal of Pattern Recognition and Artificial Intelligence, 2023, 37(8)