Semantic Relevance Learning for Video-Query Based Video Moment Retrieval

Cited by: 4
Authors
Huo, Shuwei [1]
Zhou, Yuan [1]
Wang, Ruolin [1]
Xiang, Wei [2,3]
Kung, Sun-Yuan [4]
Affiliations
[1] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China
[2] La Trobe Univ, Sch Comp Engn & Math Sci, Melbourne, Vic 3086, Australia
[3] James Cook Univ, Coll Sci & Engn, Cairns, Qld 4878, Australia
[4] Princeton Univ, Elect Engn Dept, Princeton, NJ 08540 USA
Keywords
Video moment retrieval; video query; fine-grained feature interaction; semantic relevance measurement; temporal action localization
DOI
10.1109/TMM.2023.3250088
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
The task of video-query based video moment retrieval (VQ-VMR) aims to localize the segment in a reference video that semantically matches a short query video. The task is challenging, and it has become increasingly important with the rapid growth of online video services. For accurate retrieval of the target moment, we propose a new metric to effectively assess the semantic relevance between the query video and segments of the reference video. We also develop a new VQ-VMR framework to discover the intrinsic semantic relevance between a pair of input videos. It comprises two key components: a Fine-grained Feature Interaction (FFI) module and a Semantic Relevance Measurement (SRM) module, which together handle both the spatial and temporal dimensions of videos. First, the FFI module computes the semantic similarity between videos at the local frame level, mainly capturing spatial information. The SRM module then learns the similarity between videos from a global perspective, taking temporal information into account. Extensive experiments on two datasets demonstrate noticeable improvements of the proposed approach over state-of-the-art methods.
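The abstract only outlines the two-stage design, so the following is a minimal illustrative sketch rather than the authors' implementation: it assumes per-frame feature vectors have already been extracted (e.g., by a 3D CNN backbone), uses plain cosine similarity as a stand-in for the fine-grained feature interaction, and replaces the learned SRM module with a simple max-then-mean heuristic. All function names and the aggregation scheme here are hypothetical.

```python
# Illustrative sketch only; not the paper's code.
import numpy as np

def frame_similarity_matrix(query_feats, segment_feats):
    """Fine-grained (frame-level) interaction: cosine similarity between every
    query frame and every reference-segment frame (spatial appearance cues)."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    s = segment_feats / np.linalg.norm(segment_feats, axis=1, keepdims=True)
    return q @ s.T  # shape: (num_query_frames, num_segment_frames)

def semantic_relevance(sim_matrix):
    """Global measurement: collapse frame-level similarities into one
    segment-level relevance score (a heuristic stand-in for a learned SRM)."""
    # Keep each query frame's best-matching segment frame, then average.
    return float(sim_matrix.max(axis=1).mean())

# Toy usage: 8 query frames and 20 candidate-segment frames, 512-d features.
rng = np.random.default_rng(0)
query = rng.standard_normal((8, 512))
segment = rng.standard_normal((20, 512))
score = semantic_relevance(frame_similarity_matrix(query, segment))
print(f"relevance score for this candidate segment: {score:.3f}")
```

In the framework described by the abstract, the heuristic aggregation above would be replaced by a learned temporal model that scores each candidate segment, with the highest-scoring segment returned as the retrieved moment.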
Pages: 9290-9301
Page count: 12