Learning to Diversify for Robust Video Moment Retrieval

Cited by: 0
Authors
Ge, Huilin [1 ]
Liu, Xiaolei [2 ]
Guo, Zihang [3 ,4 ,5 ]
Qiu, Zhiwen [1 ]
Affiliations
[1] Jiangsu Univ Sci & Technol, Ocean Coll, Zhenjiang 212100, Peoples R China
[2] Elevoc Technol Co Ltd, Shenzhen 518000, Peoples R China
[3] Inner Mongolia Univ, Coll Comp Sci, Hohhot 010031, Peoples R China
[4] Natl & Local Joint Engn Res Ctr Intelligent Inform, Hohhot 010021, Peoples R China
[5] Inner Mongolia Key Lab Multilingual Artificial Int, Hohhot 010021, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Proposals; Semantics; Circuit faults; Feature extraction; Cognition; Visualization; Training; Robustness; Streaming media; Predictive models; Video moment retrieval; cross-modal interaction
DOI
10.1109/TCSVT.2024.3498599
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronics & Communication Technology]
Subject Classification Codes
0808; 0809
Abstract
In this paper, we focus on diversifying the Video Moment Retrieval (VMR) model to handle more scenes. Most existing video moment retrieval methods focus on aligning video moments and queries by capturing the cross-modal relationship, but largely ignore the cross-instance relationship behind representation learning. As a result, they are easily misled by inaccurate cross-instance contrastive relationships during training: 1) Existing approaches can hardly identify similar semantic content across different scenes. They incorrectly treat such instances as negative samples (termed faulty negatives), which forces the model to learn features from query-irrelevant scenes. 2) Existing methods perform unsatisfactorily when locating queries with subtle differences. They neglect to mine hard negative samples that belong to similar scenes but carry different semantic content. In this paper, we propose a novel robust video moment retrieval method that prevents the model from overfitting to query-irrelevant scene features by accurately capturing both the cross-modal and cross-instance relationships. Specifically, we first develop a scene-independent cross-modal reasoning module that filters out redundant scene content and infers the video semantics under the guidance of query information. Then, faulty and hard negative samples are mined from the negative ones, and their contributions to the overall contrastive loss are calibrated. We validate our contributions through extensive experiments on cross-scene video moment retrieval settings, where the training and test data come from different scenes. Experimental results show that the proposed robust video moment retrieval model can effectively retrieve target videos by capturing the real cross-modal and cross-instance relationships.
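
The record does not include the authors' implementation, but the negative-sample mining and calibration described in the abstract can be illustrated with a minimal, hedged sketch. Everything below is an assumption for illustration only: the function name calibrated_contrastive_loss, the similarity thresholds, and the per-negative weights are hypothetical, not the paper's actual method. The sketch shows an InfoNCE-style cross-modal loss in which negatives that are suspiciously similar to the anchor are suppressed as likely faulty negatives, while moderately similar ones are emphasised as hard negatives.

    # Illustrative sketch only (assumed names, thresholds, and weights), not the authors' code.
    import torch
    import torch.nn.functional as F

    def calibrated_contrastive_loss(query_emb, moment_emb, temperature=0.07,
                                    faulty_thresh=0.9, hard_thresh=0.6,
                                    faulty_weight=0.0, hard_weight=2.0):
        """query_emb, moment_emb: (B, D) embeddings of matched query-moment pairs."""
        q = F.normalize(query_emb, dim=-1)
        m = F.normalize(moment_emb, dim=-1)
        sim = q @ m.t()                                  # (B, B) cross-modal similarities
        logits = sim / temperature
        pos_mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)

        # Mine negatives by their raw similarity to the anchor query.
        neg_sim = sim.masked_fill(pos_mask, -1.0)
        faulty = neg_sim > faulty_thresh                 # likely same semantics, different scene
        hard = (neg_sim > hard_thresh) & ~faulty         # similar scene, different semantics

        # Calibrate each negative's contribution to the softmax denominator.
        weights = torch.ones_like(sim)
        weights[faulty] = faulty_weight                  # suppress faulty negatives
        weights[hard] = hard_weight                      # emphasise hard negatives
        weights = weights.masked_fill(pos_mask, 1.0)

        exp_logits = weights * torch.exp(logits)
        log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True) + 1e-8)
        return -log_prob[pos_mask].mean()                # average over positive pairs

In practice the thresholds would be tuned or replaced by whatever mining criterion the paper actually uses, and faulty negatives could be relabelled as positives rather than merely down-weighted.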
Pages: 2894-2904
Number of pages: 11
Related Papers
50 records in total
  • [31] Learning Coarse-to-Fine Graph Neural Networks for Video-Text Retrieval. Wang, Wei; Gao, Junyu; Yang, Xiaoshan; Xu, Changsheng. IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 23: 2386-2397
  • [32] Faster Video Moment Retrieval with Point-Level Supervision. Jiang, Xun; Zhou, Zailei; Xu, Xing; Yang, Yang; Wang, Guoqing; Shen, Heng Tao. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023: 1334-1342
  • [33] Semantic Video Moment Retrieval by Temporal Feature Perturbation and Refinement. Cai, Weitong; Huang, Jiabo; Hu, Jian; Gong, Shaogang; Jin, Hailin; Liu, Yang. 2024 14TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION SYSTEMS, ICPRS, 2024
  • [34] Adversarial Video Moment Retrieval by Jointly Modeling Ranking and Localization. Cao, Da; Zeng, Yawen; Wei, Xiaochi; Nie, Liqiang; Hong, Richang; Qin, Zheng. MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020: 898-906
  • [35] Subtask Prior-Driven Optimized Mechanism on Joint Video Moment Retrieval and Highlight Detection. Zhou, Siyu; Zhang, Fuwei; Wang, Ruomei; Zhou, Fan; Su, Zhuo. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (11): 11271-11285
  • [36] Exploiting Unlabeled Videos for Video-Text Retrieval via Pseudo-Supervised Learning. Lu, Yu; Quan, Ruijie; Zhu, Linchao; Yang, Yi. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33: 6748-6760
  • [37] Boundary-Aware Noise-Resistant Video Moment Retrieval. Yu, Fengzhen; Gu, Xiaodong. ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING-ICANN 2024, PT III, 2024, 15018: 193-206
  • [38] Variational global clue inference for weakly supervised video moment retrieval. Lv, Zezhong; Su, Bing. KNOWLEDGE-BASED SYSTEMS, 2025, 311
  • [39] Video Moment Retrieval With Cross-Modal Neural Architecture Search. Yang, Xun; Wang, Shanshan; Dong, Jian; Dong, Jianfeng; Wang, Meng; Chua, Tat-Seng. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31: 1204-1216
  • [40] Cross Interaction Network for Natural Language Guided Video Moment Retrieval. Yu, Xinli; Malmir, Mohsen; He, Xin; Chen, Jiangning; Wang, Tong; Wu, Yue; Liu, Yue; Liu, Yang. SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021: 1860-1864