Zero-Shot Temporal Action Detection via Vision-Language Prompting

Cited: 18
Authors
Nag, Sauradip [1,2]
Zhu, Xiatian [1,3]
Song, Yi-Zhe [1,2]
Xiang, Tao [1,2]
Affiliations
[1] Univ Surrey, CVSSP, Guildford, Surrey, England
[2] IFlyTek Surrey Joint Res Ctr Artificial Intelligence, London, England
[3] Univ Surrey, Surrey Inst People Ctr Artificial Intelligence, Guildford, Surrey, England
Source
COMPUTER VISION - ECCV 2022, PT III | 2022 / Vol. 13663
Keywords
Zero-shot transfer; Temporal action localization; Language supervision; Task adaptation; Detection; Dense prediction;
DOI
10.1007/978-3-031-20062-5_39
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Existing temporal action detection (TAD) methods rely on large training sets with segment-level annotations and can only recognize previously seen classes at inference time. Collecting and annotating a large training set for every class of interest is costly and hence unscalable. Zero-shot TAD (ZS-TAD) removes this obstacle by enabling a pre-trained model to recognize unseen action classes. It is, however, considerably more challenging and far less studied. Inspired by the success of vision-language (ViL) models such as CLIP in zero-shot image classification, we aim to tackle the more complex TAD task. An intuitive approach is to integrate an off-the-shelf proposal detector with CLIP-style classification; however, its sequential design, localization (e.g., proposal generation) followed by classification, is prone to localization error propagation. To overcome this problem, we propose a novel zero-Shot Temporal Action detection model via Vision-LanguagE prompting (STALE). This design eliminates the dependence of classification on localization, breaking the route for error propagation between the two. We further introduce an interaction mechanism between the classification and localization branches for improved optimization. Extensive experiments on standard ZS-TAD video benchmarks show that STALE significantly outperforms state-of-the-art alternatives. Moreover, our model also yields superior results on supervised TAD compared with recent strong competitors. A PyTorch implementation of STALE is available at https://github.com/sauradip/STALE.
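To make the CLIP-style, proposal-free design described in the abstract more concrete, the following is a minimal PyTorch sketch, assuming CLIP-like text embeddings of action-class prompts and per-snippet video features as inputs; every name in it (ZeroShotSnippetClassifier, visual_proj, mask_head, the prompt template) is a hypothetical illustration, not the STALE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZeroShotSnippetClassifier(nn.Module):
    """Hypothetical sketch: CLIP-style zero-shot snippet classification
    with a parallel class-agnostic foreground mask (not the STALE code)."""

    def __init__(self, feat_dim=512, embed_dim=512):
        super().__init__()
        # Project video snippet features into the shared ViL embedding space.
        self.visual_proj = nn.Linear(feat_dim, embed_dim)
        # Class-agnostic actionness head, run in parallel with classification
        # rather than generating proposals first.
        self.mask_head = nn.Conv1d(feat_dim, 1, kernel_size=3, padding=1)
        # Learnable temperature, as in CLIP-style similarity scoring.
        self.logit_scale = nn.Parameter(torch.tensor(100.0).log())

    def forward(self, snippet_feats, class_text_embeds):
        # snippet_feats: (B, T, feat_dim) per-snippet video features.
        # class_text_embeds: (C, embed_dim) text embeddings of prompts such as
        # "a video of a person doing {class}" (hypothetical template); the C
        # classes can include ones never seen during training.
        v = F.normalize(self.visual_proj(snippet_feats), dim=-1)  # (B, T, D)
        t = F.normalize(class_text_embeds, dim=-1)                # (C, D)
        cls_logits = self.logit_scale.exp() * v @ t.t()           # (B, T, C)
        fg_mask = torch.sigmoid(
            self.mask_head(snippet_feats.transpose(1, 2))         # (B, 1, T)
        ).transpose(1, 2)                                         # (B, T, 1)
        return cls_logits, fg_mask
```

In this sketch the snippet-level classification against class-name prompts and the class-agnostic foreground mask are produced in parallel, so classification never waits on proposals; this is the property the abstract credits with removing localization error propagation.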
Pages: 681-697
Number of pages: 17