Learning to Diversify for Robust Video Moment Retrieval

Cited by: 0
Authors
Ge, Huilin [1 ]
Liu, Xiaolei [2 ]
Guo, Zihang [3 ,4 ,5 ]
Qiu, Zhiwen [1 ]
Affiliations
[1] Jiangsu Univ Sci & Technol, Ocean Coll, Zhenjiang 212100, Peoples R China
[2] Elevoc Technol Co Ltd, Shenzhen 518000, Peoples R China
[3] Inner Mongolia Univ, Coll Comp Sci, Hohhot 010031, Peoples R China
[4] Natl & Local Joint Engn Res Ctr Intelligent Inform, Hohhot 010021, Peoples R China
[5] Inner Mongolia Key Lab Multilingual Artificial Int, Hohhot 010021, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Proposals; Semantics; Circuit faults; Feature extraction; Cognition; Visualization; Training; Robustness; Streaming media; Predictive models; Video moment retrieval; cross-modal interaction;
DOI
10.1109/TCSVT.2024.3498599
CLC Number
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Discipline Code
0808; 0809;
Abstract
In this paper, we focus on diversifying the Video Moment Retrieval (VMR) model to more scenes. Most existing video moment retrieval methods focus on aligning video moments and queries by capturing the cross-modal relationship, largely ignoring the cross-instance relationship behind the representation learning. As a result, they are easily misled by inaccurate cross-instance contrastive relationships during training: 1) Existing approaches can hardly identify similar semantic content across different scenes. They incorrectly treat such instances as negative samples (termed faulty negatives), which forces the model to learn features from query-irrelevant scenes. 2) Existing methods perform poorly when locating queries with subtle differences, because they neglect to mine hard negative samples that belong to similar scenes but carry different semantic content. In this paper, we propose a novel robust video moment retrieval method that prevents the model from overfitting to query-irrelevant scene features by accurately capturing both the cross-modal and cross-instance relationships. Specifically, we first develop a scene-independent cross-modal reasoning module that filters out redundant scene content and infers the video semantics under the guidance of query information. Then, faulty and hard negative samples are mined from the negative ones, and their contributions to the overall contrastive loss are calibrated. We validate our contributions through extensive experiments in cross-scene video moment retrieval settings, where the training and test data come from different scenes. Experimental results show that the proposed robust video moment retrieval model effectively retrieves target videos by capturing the real cross-modal and cross-instance relationships.
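The negative-sample calibration described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the thresholds, weights, and function name below are hypothetical, chosen only to show the general idea: negatives that are suspiciously similar to the query (likely faulty negatives, i.e., same semantics in a different scene) are suppressed in the contrastive denominator, while moderately similar negatives (likely hard negatives, i.e., similar scene but different semantics) are up-weighted.

```python
import numpy as np

def calibrated_contrastive_loss(sims, pos_idx, faulty_thresh=0.9,
                                hard_thresh=0.7, tau=0.07):
    """InfoNCE-style loss over precomputed query-moment cosine similarities.

    Illustrative sketch (not the paper's exact formulation):
    - negatives with similarity > faulty_thresh are treated as suspected
      faulty negatives and removed from the denominator (weight 0);
    - negatives in (hard_thresh, faulty_thresh] are treated as hard
      negatives and up-weighted (weight 2);
    - all other negatives keep weight 1.
    """
    sims = np.asarray(sims, dtype=float)
    weights = np.ones_like(sims)
    neg = np.ones(sims.shape, dtype=bool)
    neg[pos_idx] = False                                   # positive excluded
    weights[neg & (sims > faulty_thresh)] = 0.0            # faulty: suppress
    weights[neg & (sims > hard_thresh)
                & (sims <= faulty_thresh)] = 2.0           # hard: emphasize
    exp = np.exp(sims / tau) * weights
    # standard InfoNCE ratio with the calibrated denominator
    return -np.log(exp[pos_idx] / (exp[pos_idx] + exp[neg].sum()))

# Example: index 0 is the positive; 0.92 looks like a faulty negative,
# 0.75 like a hard negative, 0.2 like an easy negative.
loss = calibrated_contrastive_loss([0.95, 0.92, 0.75, 0.2], pos_idx=0)
```

Suppressing the faulty negative shrinks the denominator, so the model is no longer pushed away from query-relevant content that merely appears in a different scene, while the up-weighted hard negative sharpens the decision boundary between similar scenes with different semantics.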
Pages: 2894-2904
Number of pages: 11
Related Papers
50 records in total
  • [1] Learning Video Moment Retrieval Without a Single Annotated Video
    Gao, Junyu
    Xu, Changsheng
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (03) : 1646 - 1657
  • [2] Momentum Cross-Modal Contrastive Learning for Video Moment Retrieval
    Han, De
    Cheng, Xing
    Guo, Nan
    Ye, Xiaochun
    Rainer, Benjamin
    Priller, Peter
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (07) : 5977 - 5994
  • [3] Cascaded MPN: Cascaded Moment Proposal Network for Video Corpus Moment Retrieval
    Yoon, Sunjae
    Kim, Dahyun
    Kim, Junyeong
    Yoo, Chang D.
    IEEE ACCESS, 2022, 10 : 64560 - 64568
  • [4] Frame-Wise Cross-Modal Matching for Video Moment Retrieval
    Tang, Haoyu
    Zhu, Jihua
    Liu, Meng
    Gao, Zan
    Cheng, Zhiyong
    IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 1338 - 1349
  • [5] Video Moment Retrieval via Comprehensive Relation-Aware Network
    Sun, Xin
    Gao, Jialin
    Zhu, Yizhe
    Wang, Xuan
    Zhou, Xi
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (09) : 5281 - 5295
  • [6] Cross-Modal Dynamic Networks for Video Moment Retrieval With Text Query
    Wang, Gongmian
    Xu, Xing
    Shen, Fumin
    Lu, Huimin
    Ji, Yanli
    Shen, Heng Tao
    IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 1221 - 1232
  • [7] Video Moment Retrieval with Hierarchical Contrastive Learning
    Zhang, Bolin
    Yang, Chao
    Jiang, Bin
    Zhou, Xiaokang
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022,
  • [8] Video Moment Retrieval With Noisy Labels
    Pan, Wenwen
    Zhao, Zhou
    Huang, Wencan
    Zhang, Zhu
    Fu, Liyong
    Pan, Zhigeng
    Yu, Jun
    Wu, Fei
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (05) : 6779 - 6791
  • [9] Video Corpus Moment Retrieval via Deformable Multigranularity Feature Fusion and Adversarial Training
    Zhang, Xuemei
    Zhao, Peng
    Ji, Jinsheng
    Lu, Xiankai
    Yin, Yilong
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (08) : 6686 - 6698
  • [10] Gazing After Glancing: Edge Information Guided Perception Network for Video Moment Retrieval
    Huang, Zhanghao
    Ji, Yi
    Li, Ying
    Liu, Chunping
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 1535 - 1539