Spatiotemporal contrastive modeling for video moment retrieval

Cited by: 2
Authors
Wang, Yi [1 ,2 ]
Li, Kun [1 ,2 ]
Chen, Guoliang [1 ,2 ]
Zhang, Yan [1 ,2 ]
Guo, Dan [1 ,2 ]
Wang, Meng [1 ,2 ]
Affiliations
[1] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei 230601, Anhui, Peoples R China
[2] Hefei Univ Technol, Sch Artificial Intelligence, Hefei 230601, Anhui, Peoples R China
Source
WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS | 2023, Vol. 26, Issue 04
Funding
National Natural Science Foundation of China;
Keywords
Video moment retrieval; Spatiotemporal modeling; Contrastive learning; Language query; Temporal localization; ACTION RECOGNITION;
DOI
10.1007/s11280-022-01105-3
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
With the rapid development of social networks, video data has been growing explosively. As an important social medium, video has attracted considerable attention for its spatiotemporal characteristics in recommendation systems and video understanding. In this paper, we discuss the video moment retrieval (VMR) task, which locates moments in a video according to a textual query. Existing methods follow two pipelines: 1) proposal-free approaches mainly modify the multi-modal interaction strategy; 2) proposal-based methods are dedicated to designing advanced proposal generation paradigms. Recently, contrastive representation learning has been successfully applied to video understanding. From a new perspective, we propose a new VMR framework, named spatiotemporal contrastive network (STCNet), which learns discriminative boundary features for video grounding via contrastive learning. Specifically, we propose a boundary matching sampling module for dense negative-sample sampling. The contrastive learning refines the feature representations during training without any additional cost at inference. On three public datasets, Charades-STA, ActivityNet Captions, and TACoS, the proposed method achieves competitive performance.
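To make the contrastive objective concrete: the abstract describes pulling a query-aligned feature toward the ground-truth moment while pushing it away from densely sampled negative moments. The Python sketch below shows a generic InfoNCE-style loss of that kind; it is an illustration only, not the authors' STCNet implementation, and the function name, tensor shapes, and temperature value are assumptions made for the example.

import torch
import torch.nn.functional as F

def moment_contrastive_loss(query_feat, pos_feat, neg_feats, temperature=0.1):
    # query_feat: (D,)   query/sentence embedding
    # pos_feat:   (D,)   feature of the ground-truth moment
    # neg_feats:  (K, D) features of K densely sampled negative moments
    # NOTE: hypothetical shapes and names; not the paper's actual code.
    q = F.normalize(query_feat, dim=-1)
    pos = F.normalize(pos_feat, dim=-1)
    negs = F.normalize(neg_feats, dim=-1)
    pos_logit = (q * pos).sum(-1, keepdim=True) / temperature   # shape (1,)
    neg_logits = (negs @ q) / temperature                       # shape (K,)
    logits = torch.cat([pos_logit, neg_logits], dim=0)          # shape (K+1,)
    # The positive pair sits at index 0, so the classification target is 0.
    target = torch.zeros(1, dtype=torch.long)
    return F.cross_entropy(logits.unsqueeze(0), target)

# Toy usage: one query against 1 positive and 16 negative moments.
q = torch.randn(256)
pos = torch.randn(256)
negs = torch.randn(16, 256)
print(moment_contrastive_loss(q, pos, negs))

Because such a loss touches only the training objective, it adds no computation at inference time, which matches the abstract's claim that the refinement is free at test time.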
Pages: 1525-1544
Number of pages: 20