StochasticFormer: Stochastic Modeling for Weakly Supervised Temporal Action Localization

Cited by: 5
Authors
Shi, Haichao [1 ]
Zhang, Xiao-Yu [1 ]
Li, Changsheng [2 ]
Affiliations
[1] Chinese Acad Sci, Inst Informat Engn, Beijing 100193, Peoples R China
[2] Beijing Inst Technol, Beijing 100081, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Location awareness; Stochastic processes; Feature extraction; Videos; Transformers; Training; Annotations; Temporal action localization; action recognition; stochastic process;
DOI
10.1109/TIP.2023.3244411
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Weakly supervised temporal action localization (WS-TAL) aims to identify the time intervals of actions of interest in untrimmed videos given only video-level supervision. Most existing WS-TAL methods suffer from two common challenges, under-localization and over-localization, which cause severe performance deterioration. To address these issues, this paper proposes a transformer-structured stochastic process modeling framework, StochasticFormer, which exploits fine-grained interactions among intermediate predictions to refine localization. StochasticFormer is built on a standard attention-based pipeline that derives preliminary frame/snippet-level predictions. A pseudo localization module then generates variable-length pseudo action instances with corresponding pseudo labels. Using these pseudo "action instance - action category" pairs as fine-grained supervision, a stochastic modeler learns the underlying interactions among the intermediate predictions via an encoder-decoder network. The encoder comprises a deterministic path and a latent path that capture local and global information, respectively, which the decoder then integrates to produce reliable predictions. The framework is optimized with three carefully designed losses: a video-level classification loss, a frame-level semantic coherence loss, and an ELBO loss. Extensive experiments on two benchmarks, THUMOS14 and ActivityNet1.2, demonstrate the efficacy of StochasticFormer compared with state-of-the-art methods.
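The encoder-decoder stochastic modeler described in the abstract, with a deterministic path for local information and a latent path for global information trained under an ELBO loss, resembles a neural-process-style design. Below is a minimal PyTorch sketch of that idea under stated assumptions: all module names (StochasticModeler, elbo_loss), dimensions, the mean-pooled global latent, and the concatenation-based fusion are hypothetical illustrations, not the authors' implementation.

```python
# Minimal sketch of an encoder-decoder stochastic modeler with a
# deterministic (local) path and a latent (global) path, trained with an
# ELBO-style loss on pseudo snippet labels. Names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticModeler(nn.Module):
    def __init__(self, dim=256, latent_dim=64, num_classes=20):
        super().__init__()
        # Deterministic path: per-snippet local encoding.
        self.det_path = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        # Latent path: a video-level Gaussian over a global latent z.
        self.latent_path = nn.Linear(dim, 2 * latent_dim)
        # Decoder: fuse local deterministic features with the global z.
        self.decoder = nn.Sequential(
            nn.Linear(dim + latent_dim, dim), nn.ReLU(),
            nn.Linear(dim, num_classes))

    def forward(self, feats):
        # feats: (B, T, dim) snippet features from the attention pipeline.
        det = self.det_path(feats)                   # local information
        stats = self.latent_path(feats.mean(dim=1))  # global information
        mu, logvar = stats.chunk(2, dim=-1)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        z = z.unsqueeze(1).expand(-1, feats.size(1), -1)
        logits = self.decoder(torch.cat([det, z], dim=-1))  # (B, T, C)
        return logits, mu, logvar

def elbo_loss(logits, pseudo_labels, mu, logvar):
    # Reconstruction term: fit the pseudo "action instance - action
    # category" labels; KL term: pull the latent toward a standard normal.
    recon = F.cross_entropy(logits.flatten(0, 1), pseudo_labels.flatten())
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```

A full objective in this spirit would sum this ELBO term with the video-level classification loss and the frame-level semantic coherence loss named in the abstract; the loss weights and the exact form of the other two terms are not specified in this record.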
Pages: 1379-1389
Number of pages: 11