Learning to Diversify for Robust Video Moment Retrieval

Cited by: 0
Authors
Ge, Huilin [1 ]
Liu, Xiaolei [2 ]
Guo, Zihang [3 ,4 ,5 ]
Qiu, Zhiwen [1 ]
Affiliations
[1] Jiangsu Univ Sci & Technol, Ocean Coll, Zhenjiang 212100, Peoples R China
[2] Elevoc Technol Co Ltd, Shenzhen 518000, Peoples R China
[3] Inner Mongolia Univ, Coll Comp Sci, Hohhot 010031, Peoples R China
[4] Natl & Local Joint Engn Res Ctr Intelligent Inform, Hohhot 010021, Peoples R China
[5] Inner Mongolia Key Lab Multilingual Artificial Int, Hohhot 010021, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Proposals; Semantics; Circuit faults; Feature extraction; Cognition; Visualization; Training; Robustness; Streaming media; Predictive models; Video moment retrieval; cross-modal interaction;
DOI
10.1109/TCSVT.2024.3498599
CLC Classification
TM [Electrical Engineering]; TN [Electronics & Communication Technology];
Discipline Codes
0808; 0809;
Abstract
In this paper, we focus on diversifying Video Moment Retrieval (VMR) models so that they generalize to more scenes. Most existing video moment retrieval methods align video moments and queries by capturing the cross-modal relationship, but largely ignore the cross-instance relationship behind the representation learning. As a result, they are easily misled by inaccurate cross-instance contrastive relationships during training: 1) Existing approaches can hardly identify similar semantic content across different scenes; they incorrectly treat such instances as negative samples (termed faulty negatives), which forces the model to learn features from query-irrelevant scenes. 2) Existing methods perform unsatisfactorily when locating queries with subtle differences; they neglect to mine the hard negative samples that belong to similar scenes but carry different semantic content. We propose a novel robust video moment retrieval method that prevents the model from overfitting to query-irrelevant scene features by accurately capturing both the cross-modal and cross-instance relationships. Specifically, we first develop a scene-independent cross-modal reasoning module that filters out redundant scene content and infers the video semantics under the guidance of the query. Then, faulty and hard negative samples are mined from the negatives, and their contributions to the overall contrastive loss are calibrated. We validate our contributions through extensive experiments in cross-scene video moment retrieval settings, where the training and test data come from different scenes. Experimental results show that the proposed robust video moment retrieval model effectively retrieves target videos by capturing the real cross-modal and cross-instance relationships.
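Since this record only summarizes the approach, the following is a minimal, hypothetical PyTorch sketch of the kind of calibrated contrastive objective the abstract describes: in-batch negatives whose cross-modal similarity is suspiciously high are treated as suspected faulty negatives and down-weighted, while moderately similar negatives are treated as hard negatives and emphasized. The threshold-based mining rule, the weight values, and the function name calibrated_contrastive_loss are illustrative assumptions, not the paper's actual formulation.

import torch
import torch.nn.functional as F

def calibrated_contrastive_loss(video_emb, query_emb, temperature=0.07,
                                faulty_thresh=0.9, hard_thresh=0.6,
                                faulty_weight=0.1, hard_weight=2.0):
    # video_emb, query_emb: (B, D) embeddings of paired moments and queries;
    # row i of each tensor forms the i-th positive pair.
    v = F.normalize(video_emb, dim=-1)
    q = F.normalize(query_emb, dim=-1)
    sim = q @ v.t()                                   # (B, B) cosine similarities
    logits = sim / temperature

    b = sim.size(0)
    pos_mask = torch.eye(b, dtype=torch.bool, device=sim.device)
    neg_sim = sim.masked_fill(pos_mask, float('-inf'))

    # Calibrate each negative's contribution: negatives that look too similar
    # to the query are suspected faulty negatives (same semantics, different
    # scene) and are down-weighted; moderately similar ones are hard negatives
    # and are up-weighted. (Assumed thresholding rule, not the paper's.)
    weights = torch.ones_like(sim)
    weights[neg_sim >= faulty_thresh] = faulty_weight
    weights[(neg_sim >= hard_thresh) & (neg_sim < faulty_thresh)] = hard_weight
    weights = weights.masked_fill(pos_mask, 1.0)      # positives keep weight 1

    # Weighted InfoNCE: diagonal entries are positives; each negative's
    # exp-logit in the denominator is re-scaled by its calibration weight.
    exp_logits = weights * torch.exp(logits)
    log_prob = logits.diagonal() - torch.log(exp_logits.sum(dim=1))
    return -log_prob.mean()

A call like calibrated_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256)) returns a scalar loss. In the paper, the mining and calibration are presumably driven by the learned scene-independent cross-modal reasoning rather than by fixed similarity thresholds; the sketch only shows where calibrated weights enter the contrastive loss.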
Pages: 2894-2904
Page count: 11
Related Papers
50 records in total
  • [22] Spatiotemporal contrastive modeling for video moment retrieval
    Wang, Yi; Li, Kun; Chen, Guoliang; Zhang, Yan; Guo, Dan; Wang, Meng
    World Wide Web, 2023, 26(4): 1525-1544
  • [23] MLLM as video narrator: Mitigating modality imbalance in video moment retrieval
    Cai, Weitong; Huang, Jiabo; Gong, Shaogang; Jin, Hailin; Liu, Yang
    Pattern Recognition, 2025, 166
  • [24] Visual Co-Occurrence Alignment Learning for Weakly-Supervised Video Moment Retrieval
    Wang, Zheng; Chen, Jingjing; Jiang, Yu-Gang
    Proceedings of the 29th ACM International Conference on Multimedia (MM 2021), 2021: 1459-1468
  • [25] Semantic Association Network for Video Corpus Moment Retrieval
    Kim, Dahyun; Yoon, Sunjae; Hong, Ji Woo; Yoo, Chang D.
    2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022: 1720-1724
  • [26] Zero-Shot Video Moment Retrieval With Angular Reconstructive Text Embeddings
    Jiang, Xun; Xu, Xing; Zhou, Zailei; Yang, Yang; Shen, Fumin; Shen, Heng Tao
    IEEE Transactions on Multimedia, 2024, 26: 9657-9670
  • [27] Cross-Modality Knowledge Calibration Network for Video Corpus Moment Retrieval
    Chen, Tongbao; Wang, Wenmin; Jiang, Zhe; Li, Ruochen; Wang, Bingshu
    IEEE Transactions on Multimedia, 2024, 26: 3799-3813
  • [28] Reading-Strategy Inspired Visual Representation Learning for Text-to-Video Retrieval
    Dong, Jianfeng; Wang, Yabing; Chen, Xianke; Qu, Xiaoye; Li, Xirong; He, Yuan; Wang, Xun
    IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(8): 5680-5694
  • [29] Modality-Aware Heterogeneous Graph for Joint Video Moment Retrieval and Highlight Detection
    Wang, Ruomei; Feng, Jiawei; Zhang, Fuwei; Luo, Xiaonan; Luo, Yuanmao
    IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(9): 8896-8911
  • [30] Cross-Modal Interaction Network for Video Moment Retrieval
    Ping, Shen; Jiang, Xiao; Tian, Zean; Cao, Ronghui; Chi, Weiming; Yang, Shenghong
    International Journal of Pattern Recognition and Artificial Intelligence, 2023, 37(8)