Time-Frequency Mutual Learning for Moment Retrieval and Highlight Detection

Cited by: 0
Authors
Zhong, Yaokun [1 ]
Liang, Tianming [1 ]
Hu, Jian-Fang [1 ,2 ,3 ]
Affiliations
[1] Sun Yat-sen University, School of Computer Science and Engineering, Guangzhou, China
[2] Guangdong Provincial Key Laboratory of Information Security Technology, Guangzhou, China
[3] Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, China
Source
PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024 | 2025, Vol. 15035
Keywords
video moment retrieval; frequency-domain deep learning; deep mutual learning;
DOI
10.1007/978-981-97-8620-6_3
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Moment Retrieval and Highlight Detection (MR/HD) aims to concurrently retrieve relevant moments and predict clip-wise saliency scores according to a given textual query. Previous MR/HD works have overlooked explicit modeling of the static and dynamic visual information described by the language query, which can lead to inaccurate predictions, especially when the queried event involves both static appearances and dynamic motions. In this work, we propose to learn static interaction and dynamic reasoning from the time domain and the frequency domain, respectively, via a novel Time-Frequency Mutual Learning framework (TFML) that mainly consists of a time-domain branch, a frequency-domain branch, and a time-frequency aggregation branch. The time-domain branch learns to attend to the static visual information related to the textual query. The frequency-domain branch introduces the Short-Time Fourier Transform (STFT) for dynamic modeling by attending to the frequency contents within varied segments. The time-frequency aggregation branch integrates the information from the two branches. To promote the mutual complementation of time-domain and frequency-domain information, we further employ a mutual learning strategy in a concise and effective two-way loop, which enables the branches to reason collaboratively and achieve time-frequency-consistent predictions. Extensive experiments on QVHighlights and TVSum demonstrate the effectiveness of the proposed framework compared with state-of-the-art methods.
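As a concrete illustration of the frequency-domain branch described above, the following is a minimal PyTorch sketch of extracting segment-wise frequency content from clip-level features with the Short-Time Fourier Transform. The function name and the n_fft/hop values are illustrative assumptions, not the authors' exact design.

```python
# A minimal sketch (PyTorch) of STFT-based frequency features over clip
# embeddings -- one plausible reading of the frequency-domain branch.
# The function name and the n_fft/hop values are assumptions for
# illustration, not the paper's actual configuration.
import torch

def stft_frequency_features(clip_feats: torch.Tensor,
                            n_fft: int = 8,
                            hop: int = 2) -> torch.Tensor:
    """clip_feats: (B, T, D) clip-level video features.
    Returns (B, T', D * n_freq) magnitude spectra, one row per temporal
    segment, so attention can be applied over the frequency contents
    within varied segments."""
    B, T, D = clip_feats.shape
    # Run the STFT along the time axis, treating each feature
    # dimension as an independent 1-D signal.
    x = clip_feats.permute(0, 2, 1).reshape(B * D, T)
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft, device=x.device),
                      return_complex=True)           # (B*D, n_freq, T')
    mag = spec.abs()                                 # magnitude spectrum
    n_freq, t_out = mag.shape[-2], mag.shape[-1]
    mag = mag.reshape(B, D, n_freq, t_out).permute(0, 3, 1, 2)
    return mag.reshape(B, t_out, D * n_freq)

# Example: 2 videos, 32 clips each, 256-d features per clip.
feats = torch.randn(2, 32, 256)
freq = stft_frequency_features(feats)
print(freq.shape)   # torch.Size([2, 17, 1280]) with n_fft=8, hop=2
```

Enlarging n_fft widens each analyzed segment, trading temporal locality for finer frequency resolution; attending over such segment-wise spectra is one way to capture dynamic motion cues that a purely time-domain view can miss.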
Pages: 34-48
Page count: 15