Semantic Relevance Learning for Video-Query Based Video Moment Retrieval

Cited by: 4
Authors
Huo, Shuwei [1]
Zhou, Yuan [1]
Wang, Ruolin [1]
Xiang, Wei [2,3]
Kung, Sun-Yuan [4]
Affiliations
[1] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China
[2] La Trobe Univ, Sch Comp Engn & Math Sci, Melbourne, Vic 3086, Australia
[3] James Cook Univ, Coll Sci & Engn, Cairns, Qld 4878, Australia
[4] Princeton Univ, Elect Engn Dept, Princeton, NJ 08540 USA
Keywords
Video moment retrieval; video query; fine-grained feature interaction; semantic relevance measurement; temporal action localization
DOI
10.1109/TMM.2023.3250088
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
The task of video-query based video moment retrieval (VQ-VMR) aims to localize the segment in a reference video that semantically matches a short query video. The task is challenging, and it has become increasingly important with the rapid growth of online video services. For accurate retrieval of the target moment, we propose a new metric to effectively assess the semantic relevance between the query video and segments of the reference video. We also develop a new VQ-VMR framework to discover the intrinsic semantic relevance between a pair of input videos. It comprises two key components: a Fine-grained Feature Interaction (FFI) module and a Semantic Relevance Measurement (SRM) module, which together handle both the spatial and temporal dimensions of videos. First, the FFI module computes the semantic similarity between videos at the local frame level, mainly capturing spatial information. The SRM module then learns the similarity between videos from a global perspective, taking temporal information into account. Extensive experiments on two datasets demonstrate noticeable improvements of the proposed approach over state-of-the-art methods.
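The abstract only outlines the two-stage design, so the following is a minimal illustrative sketch rather than the authors' implementation: it assumes per-frame feature vectors have already been extracted (e.g., by a 3D CNN backbone), uses plain cosine similarity as a stand-in for the fine-grained feature interaction, and replaces the learned SRM module with a simple max-then-mean heuristic. All function names and the aggregation scheme here are hypothetical.

```python
# Illustrative sketch only; not the paper's code.
import numpy as np

def frame_similarity_matrix(query_feats, segment_feats):
    """Fine-grained (frame-level) interaction: cosine similarity between every
    query frame and every reference-segment frame (spatial appearance cues)."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    s = segment_feats / np.linalg.norm(segment_feats, axis=1, keepdims=True)
    return q @ s.T  # shape: (num_query_frames, num_segment_frames)

def semantic_relevance(sim_matrix):
    """Global measurement: collapse frame-level similarities into one
    segment-level relevance score (a heuristic stand-in for a learned SRM)."""
    # Keep each query frame's best-matching segment frame, then average.
    return float(sim_matrix.max(axis=1).mean())

# Toy usage: 8 query frames and 20 candidate-segment frames, 512-d features.
rng = np.random.default_rng(0)
query = rng.standard_normal((8, 512))
segment = rng.standard_normal((20, 512))
score = semantic_relevance(frame_similarity_matrix(query, segment))
print(f"relevance score for this candidate segment: {score:.3f}")
```

In the framework described by the abstract, the heuristic aggregation above would be replaced by a learned temporal model that scores each candidate segment, with the highest-scoring segment returned as the retrieved moment.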
Pages: 9290-9301
Page count: 12