Time-Frequency Mutual Learning for Moment Retrieval and Highlight Detection

Cited by: 0
Authors
Zhong, Yaokun [1 ]
Liang, Tianming [1 ]
Hu, Jian-Fang [1 ,2 ,3 ]
Affiliations
[1] Sun Yat-sen University, School of Computer Science and Engineering, Guangzhou, China
[2] Guangdong Provincial Key Laboratory of Information Security Technology, Guangzhou, China
[3] Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China
Source
PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024 | 2025, Vol. 15035
Keywords
video moment retrieval; frequency-domain deep learning; deep mutual learning;
DOI
10.1007/978-981-97-8620-6_3
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Moment Retrieval and Highlight Detection (MR/HD) aims to concurrently retrieve relevant moments and predict clip-wise saliency scores according to a given textual query. Previous MR/HD works have overlooked explicit modeling of the static and dynamic visual information described by the language query, which can lead to inaccurate predictions, especially when the queried event involves both static appearances and dynamic motions. In this work, we propose to learn static interaction and dynamic reasoning from the time domain and the frequency domain, respectively, via a novel Time-Frequency Mutual Learning framework (TFML) that mainly consists of a time-domain branch, a frequency-domain branch, and a time-frequency aggregation branch. The time-domain branch learns to attend to the static visual information related to the textual query. The frequency-domain branch introduces the Short-Time Fourier Transform (STFT) for dynamic modeling by attending to the frequency contents within varied segments. The time-frequency aggregation branch integrates the information from the two branches. To promote the mutual complementation of time-domain and frequency-domain information, we further employ a mutual learning strategy in a concise and effective two-way loop, which enables the branches to reason collaboratively and achieve time-frequency-consistent predictions. Extensive experiments on QVHighlights and TVSum demonstrate the effectiveness of the proposed framework compared with state-of-the-art methods.
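As a concrete illustration of the frequency-domain branch described above, the following is a minimal PyTorch sketch of extracting segment-wise frequency content from clip-level features with the Short-Time Fourier Transform. The function name and the n_fft/hop values are illustrative assumptions, not the authors' exact design.

```python
# A minimal sketch (PyTorch) of STFT-based frequency features over clip
# embeddings -- one plausible reading of the frequency-domain branch.
# The function name and the n_fft/hop values are assumptions for
# illustration, not the paper's actual configuration.
import torch

def stft_frequency_features(clip_feats: torch.Tensor,
                            n_fft: int = 8,
                            hop: int = 2) -> torch.Tensor:
    """clip_feats: (B, T, D) clip-level video features.
    Returns (B, T', D * n_freq) magnitude spectra, one row per temporal
    segment, so attention can be applied over the frequency contents
    within varied segments."""
    B, T, D = clip_feats.shape
    # Run the STFT along the time axis, treating each feature
    # dimension as an independent 1-D signal.
    x = clip_feats.permute(0, 2, 1).reshape(B * D, T)
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft, device=x.device),
                      return_complex=True)           # (B*D, n_freq, T')
    mag = spec.abs()                                 # magnitude spectrum
    n_freq, t_out = mag.shape[-2], mag.shape[-1]
    mag = mag.reshape(B, D, n_freq, t_out).permute(0, 3, 1, 2)
    return mag.reshape(B, t_out, D * n_freq)

# Example: 2 videos, 32 clips each, 256-d features per clip.
feats = torch.randn(2, 32, 256)
freq = stft_frequency_features(feats)
print(freq.shape)   # torch.Size([2, 17, 1280]) with n_fft=8, hop=2
```

Enlarging n_fft widens each analyzed segment, trading temporal locality for finer frequency resolution; attending over such segment-wise spectra is one way to capture dynamic motion cues that a purely time-domain view can miss.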
Pages: 34-48
Page count: 15