Spatiotemporal contrastive modeling for video moment retrieval

Cited by: 2
Authors
Wang, Yi [1 ,2 ]
Li, Kun [1 ,2 ]
Chen, Guoliang [1 ,2 ]
Zhang, Yan [1 ,2 ]
Guo, Dan [1 ,2 ]
Wang, Meng [1 ,2 ]
Affiliations
[1] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei 230601, Anhui, Peoples R China
[2] Hefei Univ Technol, Sch Artificial Intelligence, Hefei 230601, Anhui, Peoples R China
Source
WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS | 2023, Vol. 26, Issue 04
Funding
National Natural Science Foundation of China;
Keywords
Video moment retrieval; Spatiotemporal modeling; Contrastive learning; Language query; Temporal localization; ACTION RECOGNITION;
DOI
10.1007/s11280-022-01105-3
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
With the rapid development of social networks, video data has been growing explosively. As an important social medium, video has attracted considerable attention for its spatiotemporal characteristics in recommendation systems and video understanding. In this paper, we discuss the video moment retrieval (VMR) task, which locates moments in a video according to a textual query. Existing methods follow two pipelines: 1) proposal-free approaches mainly modify the multi-modal interaction strategy; 2) proposal-based methods are dedicated to designing advanced proposal generation paradigms. Recently, contrastive representation learning has been successfully applied to video understanding. From a new perspective, we propose a new VMR framework, named spatiotemporal contrastive network (STCNet), which learns discriminative boundary features for video grounding via contrastive learning. Specifically, we propose a boundary matching sampling module for dense negative-sample sampling. The contrastive learning refines the feature representations during training without any additional cost at inference. On three public datasets, Charades-STA, ActivityNet Captions, and TACoS, the proposed method achieves competitive performance.
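To make the contrastive objective concrete: the abstract describes pulling a query-aligned feature toward the ground-truth moment while pushing it away from densely sampled negative moments. The Python sketch below shows a generic InfoNCE-style loss of that kind; it is an illustration only, not the authors' STCNet implementation, and the function name, tensor shapes, and temperature value are assumptions made for the example.

import torch
import torch.nn.functional as F

def moment_contrastive_loss(query_feat, pos_feat, neg_feats, temperature=0.1):
    # query_feat: (D,)   query/sentence embedding
    # pos_feat:   (D,)   feature of the ground-truth moment
    # neg_feats:  (K, D) features of K densely sampled negative moments
    # NOTE: hypothetical shapes and names; not the paper's actual code.
    q = F.normalize(query_feat, dim=-1)
    pos = F.normalize(pos_feat, dim=-1)
    negs = F.normalize(neg_feats, dim=-1)
    pos_logit = (q * pos).sum(-1, keepdim=True) / temperature   # shape (1,)
    neg_logits = (negs @ q) / temperature                       # shape (K,)
    logits = torch.cat([pos_logit, neg_logits], dim=0)          # shape (K+1,)
    # The positive pair sits at index 0, so the classification target is 0.
    target = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(logits.unsqueeze(0), target)

# Toy usage: one query against 1 positive and 16 negative moments.
q = torch.randn(256)
pos = torch.randn(256)
negs = torch.randn(16, 256)
print(moment_contrastive_loss(q, pos, negs))

Because such a loss touches only the training objective, it adds no computation at inference time, which matches the abstract's claim that the refinement is free at test time.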
Pages: 1525-1544
Number of pages: 20