Learning to Diversify for Robust Video Moment Retrieval

Cited by: 0
Authors
Ge, Huilin [1 ]
Liu, Xiaolei [2 ]
Guo, Zihang [3 ,4 ,5 ]
Qiu, Zhiwen [1 ]
Affiliations
[1] Jiangsu Univ Sci & Technol, Ocean Coll, Zhenjiang 212100, Peoples R China
[2] Elevoc Technol Co Ltd, Shenzhen 518000, Peoples R China
[3] Inner Mongolia Univ, Coll Comp Sci, Hohhot 010031, Peoples R China
[4] Natl & Local Joint Engn Res Ctr Intelligent Inform, Hohhot 010021, Peoples R China
[5] Inner Mongolia Key Lab Multilingual Artificial Int, Hohhot 010021, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Proposals; Semantics; Circuit faults; Feature extraction; Cognition; Visualization; Training; Robustness; Streaming media; Predictive models; Video moment retrieval; cross-modal interaction
DOI
10.1109/TCSVT.2024.3498599
Chinese Library Classification
TM [Electrical Engineering]; TN [Electronics & Communication Technology]
Subject Classification Codes
0808; 0809
Abstract
In this paper, we focus on diversifying the Video Moment Retrieval (VMR) model to handle more scenes. Most existing video moment retrieval methods focus on aligning video moments and queries by capturing the cross-modal relationship, but largely ignore the cross-instance relationship behind representation learning. As a result, they are easily misled by inaccurate cross-instance contrastive relationships during training: 1) Existing approaches can hardly identify similar semantic content across different scenes. They incorrectly treat such instances as negative samples (termed faulty negatives), which forces the model to learn features from query-irrelevant scenes. 2) Existing methods perform unsatisfactorily when locating queries with subtle differences. They neglect to mine hard negative samples that belong to similar scenes but carry different semantic content. In this paper, we propose a novel robust video moment retrieval method that prevents the model from overfitting to query-irrelevant scene features by accurately capturing both the cross-modal and cross-instance relationships. Specifically, we first develop a scene-independent cross-modal reasoning module that filters out redundant scene content and infers the video semantics under the guidance of query information. Then, faulty and hard negative samples are mined from the negative ones, and their contributions to the overall contrastive loss are calibrated. We validate our contributions through extensive experiments on cross-scene video moment retrieval settings, where the training and test data come from different scenes. Experimental results show that the proposed robust video moment retrieval model can effectively retrieve target videos by capturing the real cross-modal and cross-instance relationships.
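
The record does not include the authors' implementation, but the negative-sample mining and calibration described in the abstract can be illustrated with a minimal, hedged sketch. Everything below is an assumption for illustration only: the function name calibrated_contrastive_loss, the similarity thresholds, and the per-negative weights are hypothetical, not the paper's actual method. The sketch shows an InfoNCE-style cross-modal loss in which negatives that are suspiciously similar to the anchor are suppressed as likely faulty negatives, while moderately similar ones are emphasised as hard negatives.

    # Illustrative sketch only (assumed names, thresholds, and weights), not the authors' code.
    import torch
    import torch.nn.functional as F

    def calibrated_contrastive_loss(query_emb, moment_emb, temperature=0.07,
                                    faulty_thresh=0.9, hard_thresh=0.6,
                                    faulty_weight=0.0, hard_weight=2.0):
        """query_emb, moment_emb: (B, D) embeddings of matched query-moment pairs."""
        q = F.normalize(query_emb, dim=-1)
        m = F.normalize(moment_emb, dim=-1)
        sim = q @ m.t()                                  # (B, B) cross-modal similarities
        logits = sim / temperature
        pos_mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)

        # Mine negatives by their raw similarity to the anchor query.
        neg_sim = sim.masked_fill(pos_mask, -1.0)
        faulty = neg_sim > faulty_thresh                 # likely same semantics, different scene
        hard = (neg_sim > hard_thresh) & ~faulty         # similar scene, different semantics

        # Calibrate each negative's contribution to the softmax denominator.
        weights = torch.ones_like(sim)
        weights[faulty] = faulty_weight                  # suppress faulty negatives
        weights[hard] = hard_weight                      # emphasise hard negatives
        weights = weights.masked_fill(pos_mask, 1.0)

        exp_logits = weights * torch.exp(logits)
        log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True) + 1e-8)
        return -log_prob[pos_mask].mean()                # average over positive pairs

In practice the thresholds would be tuned or replaced by whatever mining criterion the paper actually uses, and faulty negatives could be relabelled as positives rather than merely down-weighted.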
Pages: 2894-2904
Number of pages: 11
Related Papers
50 records in total
  • [31] Learning Coarse-to-Fine Graph Neural Networks for Video-Text Retrieval. Wang, Wei; Gao, Junyu; Yang, Xiaoshan; Xu, Changsheng. IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 23: 2386-2397
  • [32] Faster Video Moment Retrieval with Point-Level Supervision. Jiang, Xun; Zhou, Zailei; Xu, Xing; Yang, Yang; Wang, Guoqing; Shen, Heng Tao. PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023: 1334-1342
  • [33] Semantic Video Moment Retrieval by Temporal Feature Perturbation and Refinement. Cai, Weitong; Huang, Jiabo; Hu, Jian; Gong, Shaogang; Jin, Hailin; Liu, Yang. 2024 14TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION SYSTEMS, ICPRS, 2024
  • [34] Adversarial Video Moment Retrieval by Jointly Modeling Ranking and Localization. Cao, Da; Zeng, Yawen; Wei, Xiaochi; Nie, Liqiang; Hong, Richang; Qin, Zheng. MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020: 898-906
  • [35] Subtask Prior-Driven Optimized Mechanism on Joint Video Moment Retrieval and Highlight Detection. Zhou, Siyu; Zhang, Fuwei; Wang, Ruomei; Zhou, Fan; Su, Zhuo. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (11): 11271-11285
  • [36] Exploiting Unlabeled Videos for Video-Text Retrieval via Pseudo-Supervised Learning. Lu, Yu; Quan, Ruijie; Zhu, Linchao; Yang, Yi. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33: 6748-6760
  • [37] Boundary-Aware Noise-Resistant Video Moment Retrieval. Yu, Fengzhen; Gu, Xiaodong. ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING-ICANN 2024, PT III, 2024, 15018: 193-206
  • [38] Variational global clue inference for weakly supervised video moment retrieval. Lv, Zezhong; Su, Bing. KNOWLEDGE-BASED SYSTEMS, 2025, 311
  • [39] Video Moment Retrieval With Cross-Modal Neural Architecture Search. Yang, Xun; Wang, Shanshan; Dong, Jian; Dong, Jianfeng; Wang, Meng; Chua, Tat-Seng. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31: 1204-1216
  • [40] Cross Interaction Network for Natural Language Guided Video Moment Retrieval. Yu, Xinli; Malmir, Mohsen; He, Xin; Chen, Jiangning; Wang, Tong; Wu, Yue; Liu, Yue; Liu, Yang. SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021: 1860-1864