Video Moment Retrieval from Text Queries via Single Frame Annotation

Cited by: 19
Authors
Cui, Ran [1 ]
Qian, Tianwen [2 ]
Peng, Pai [3 ]
Daskalaki, Elena [1 ]
Chen, Jingjing [2 ]
Guo, Xiaowei [3 ]
Sun, Huyang [3 ]
Jiang, Yu-Gang [2 ]
Affiliations
[1] Australian Natl Univ, Canberra, ACT, Australia
[2] Fudan Univ, Shanghai, Peoples R China
[3] Bilibili, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 45TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '22) | 2022
Keywords
video moment retrieval; contrastive learning; cross-modal learning; language
DOI
10.1145/3477495.3532078
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Video moment retrieval aims at finding the start and end timestamps of a moment (a part of a video) described by a given natural language query. Fully supervised methods need complete temporal boundary annotations to achieve promising results, which is costly because the annotator must watch the whole moment. Weakly supervised methods rely only on the paired video and query, but their performance is relatively poor. In this paper, we look more closely into the annotation process and propose a new paradigm called "glance annotation". This paradigm requires the timestamp of only a single random frame, which we refer to as a "glance", within the temporal boundary of the fully supervised counterpart. We argue that this is beneficial because, compared with weak supervision, it adds only trivial annotation cost yet offers far more performance potential. Under the glance annotation setting, we propose a contrastive-learning-based method named Video moment retrieval via Glance Annotation (ViGA). ViGA cuts the input video into clips and contrasts the clips with the queries, assigning glance-guided Gaussian-distributed weights to all clips. Our extensive experiments indicate that ViGA outperforms state-of-the-art weakly supervised methods by a large margin, and is even comparable to fully supervised methods in some cases.
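As a rough illustration of the glance-guided weighting idea described in the abstract, the PyTorch sketch below assigns each clip a Gaussian weight centered on the clip containing the annotated glance frame, then averages per-clip InfoNCE-style losses with those weights. This is a minimal sketch under stated assumptions: the function names, the single-video softmax over clips, and the hyperparameters (sigma, temperature) are illustrative choices of ours, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def glance_gaussian_weights(num_clips: int, glance_idx: int,
                            sigma: float = 1.0) -> torch.Tensor:
    """Weight each clip by a Gaussian centered on the glance clip.

    Clips near the annotated glance frame are treated as more reliable
    positives for the query; distant clips are down-weighted.
    """
    positions = torch.arange(num_clips, dtype=torch.float32)
    weights = torch.exp(-((positions - glance_idx) ** 2) / (2 * sigma ** 2))
    return weights / weights.sum()  # normalize to a distribution over clips

def weighted_clip_query_loss(clip_emb: torch.Tensor,   # (num_clips, d)
                             query_emb: torch.Tensor,  # (d,)
                             weights: torch.Tensor,    # (num_clips,)
                             temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style objective (illustrative): softmax over the clips of
    one video, with per-clip log-probabilities averaged by the Gaussian
    weights so clips near the glance dominate the loss."""
    logits = clip_emb @ query_emb / temperature   # clip-query similarities
    log_probs = F.log_softmax(logits, dim=0)      # distribution over clips
    return -(weights * log_probs).sum()

# Toy usage: 8 clips, glance annotation falls in clip 3.
clips = F.normalize(torch.randn(8, 256), dim=1)
query = F.normalize(torch.randn(256), dim=0)
w = glance_gaussian_weights(num_clips=8, glance_idx=3, sigma=1.5)
loss = weighted_clip_query_loss(clips, query, w)
```

The design point this sketch captures is that a single glance timestamp, unlike a weak video-level pairing, gives a soft temporal prior: supervision strength decays smoothly with distance from the glance rather than being uniform over the whole video.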
Pages: 1033-1043
Page count: 11