Learning to Diversify for Robust Video Moment Retrieval

Citations: 0
Authors
Ge, Huilin [1 ]
Liu, Xiaolei [2 ]
Guo, Zihang [3 ,4 ,5 ]
Qiu, Zhiwen [1 ]
Affiliations
[1] Jiangsu Univ Sci & Technol, Ocean Coll, Zhenjiang 212100, Peoples R China
[2] Elevoc Technol Co Ltd, Shenzhen 518000, Peoples R China
[3] Inner Mongolia Univ, Coll Comp Sci, Hohhot 010031, Peoples R China
[4] Natl & Local Joint Engn Res Ctr Intelligent Inform, Hohhot 010021, Peoples R China
[5] Inner Mongolia Key Lab Multilingual Artificial Int, Hohhot 010021, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Proposals; Semantics; Circuit faults; Feature extraction; Cognition; Visualization; Training; Robustness; Streaming media; Predictive models; Video moment retrieval; cross-modal interaction;
DOI
10.1109/TCSVT.2024.3498599
CLC Classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology];
Discipline Codes
0808; 0809;
Abstract
In this paper, we focus on diversifying the Video Moment Retrieval (VMR) model to more scenes. Most existing video moment retrieval methods focus on aligning video moments and queries by capturing the cross-modal relationship, while largely ignoring the cross-instance relationship behind the representation learning. As a result, they are easily misled by inaccurate cross-instance contrastive relationships during training: 1) Existing approaches can hardly identify similar semantic content across different scenes; they incorrectly treat such instances as negative samples (termed faulty negatives), which forces the model to learn features from query-irrelevant scenes. 2) Existing methods perform poorly when locating queries with subtle differences; they neglect to mine hard negative samples that belong to similar scenes but carry different semantic content. In this paper, we propose a novel robust video moment retrieval method that prevents the model from overfitting to query-irrelevant scene features by accurately capturing both the cross-modal and cross-instance relationships. Specifically, we first develop a scene-independent cross-modal reasoning module that filters out redundant scene content and infers the video semantics under the guidance of the query. Then, faulty and hard negative samples are mined from the negatives, and their contributions to the overall contrastive loss are calibrated. We validate our contributions through extensive experiments in cross-scene video moment retrieval settings, where the training and test data come from different scenes. Experimental results show that the proposed robust video moment retrieval model effectively retrieves target moments by capturing the true cross-modal and cross-instance relationships.
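The calibration idea described in the abstract (down-weighting suspected faulty negatives and up-weighting hard negatives in the contrastive loss) can be sketched as a weighted InfoNCE loss. Note this is an illustrative toy, not the paper's actual scheme: the similarity thresholds, weights, and temperature below are assumptions.

```python
import math

def weighted_infonce(pos_sim, neg_sims, tau=0.07,
                     faulty_thresh=0.9, hard_thresh=0.6):
    """Toy calibrated contrastive loss for one anchor.

    pos_sim:  cosine similarity to the matched (positive) instance.
    neg_sims: similarities to the other instances in the batch.
    Negatives above faulty_thresh are treated as faulty (likely the
    same semantics in a different scene) and dropped; those between
    hard_thresh and faulty_thresh are treated as hard negatives and
    emphasized. All values here are illustrative assumptions.
    """
    def weight(s):
        if s >= faulty_thresh:
            return 0.0  # suspected faulty negative: remove from the loss
        if s >= hard_thresh:
            return 2.0  # hard negative: similar scene, different semantics
        return 1.0      # ordinary easy negative

    pos = math.exp(pos_sim / tau)
    neg = sum(weight(s) * math.exp(s / tau) for s in neg_sims)
    return -math.log(pos / (pos + neg))
```

Setting both thresholds above 1.0 recovers plain InfoNCE; with calibration enabled, a near-duplicate negative no longer dominates the denominator, so the anchor is not pushed away from semantically matching content.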
Pages: 2894-2904
Page count: 11