A Multitemporal Scale and Spatial-Temporal Transformer Network for Temporal Action Localization

Cited by: 4
Authors
Gao, Zan [1 ,2 ]
Cui, Xinglei [1 ]
Zhuo, Tao [1 ]
Cheng, Zhiyong [1 ]
Liu, An-An [3 ]
Wang, Meng [4 ]
Chen, Shengyong [2 ]
Affiliations
[1] Qilu Univ Technol, Shandong Artificial Intelligence Inst, Shandong Acad Sci, Jinan 250014, Peoples R China
[2] Tianjin Univ Technol, Key Lab Comp Vis & Syst, Minist Educ, Tianjin 300384, Peoples R China
[3] Tianjin Univ, Sch Elect & Informat Engn, Tianjin 300072, Peoples R China
[4] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei 230009, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Transformers; Semantics; Feature extraction; Proposals; Location awareness; Convolution; Task analysis; Frame-level self-attention (FSA); multiple temporal scales; refined feature pyramids (RFPs); spatial-temporal transformer (STT); temporal action localization (TAL); ACTION RECOGNITION; GRANULARITY;
DOI
10.1109/THMS.2023.3266037
CLC number
TP18 [Artificial intelligence theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Temporal action localization, which aims to localize and classify actions in untrimmed videos, plays an important role in video analysis. Previous methods often predict actions on a feature space of a single temporal scale. However, the temporal features of a low-level scale lack sufficient semantics for action classification, while a high-level scale cannot provide rich detail about the action boundaries. In addition, the long-range dependencies among video frames are often ignored. To address these issues, a novel multitemporal-scale spatial-temporal transformer (MSST) network is proposed for temporal action localization, which predicts actions on a feature space of multiple temporal scales. Specifically, we first use refined feature pyramids of different scales to pass semantics from high-level scales down to low-level scales. Second, to cover the long temporal extent of the entire video, we use a spatial-temporal transformer encoder to capture the long-range dependencies among video frames. The refined features with long-range dependencies are then fed into a classifier for coarse action prediction. Finally, to further improve prediction accuracy, we propose a frame-level self-attention module that refines the classification and boundaries of each action instance. Most importantly, these three modules are jointly explored in a unified framework, and MSST is anchor-free and end-to-end. Extensive experiments show that the proposed method outperforms state-of-the-art approaches on the THUMOS14 dataset and achieves comparable performance on the ActivityNet1.3 dataset. Compared with A2Net (TIP20, Avg{0.3:0.7}), Sub-Action (CSVT2022, Avg{0.1:0.5}), and AFSD (CVPR21, Avg{0.3:0.7}) on THUMOS14, the proposed method achieves improvements of 12.6%, 17.4%, and 2.2%, respectively.
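The following is a minimal, illustrative PyTorch sketch of the pipeline outlined in the abstract: a strided temporal pyramid over clip features, a top-down pass that injects high-level semantics into finer scales, a transformer encoder that models long-range temporal dependencies, and a per-frame classifier. All class and parameter names (e.g., TemporalPyramidTransformer, feat_dim) are hypothetical; the sketch omits the spatial attention and the frame-level self-attention refinement stage of MSST and is not the authors' implementation.

# Illustrative sketch only, not the authors' code; names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalPyramidTransformer(nn.Module):
    """Toy multi-scale temporal pyramid with a transformer encoder on top."""

    def __init__(self, feat_dim=512, num_classes=20, num_levels=3):
        super().__init__()
        # Strided 1-D convolutions build progressively coarser temporal scales.
        self.down = nn.ModuleList(
            [nn.Conv1d(feat_dim, feat_dim, kernel_size=3, stride=2, padding=1)
             for _ in range(num_levels - 1)]
        )
        # Lateral 1x1 convolutions used when passing high-level semantics back down.
        self.lateral = nn.ModuleList(
            [nn.Conv1d(feat_dim, feat_dim, kernel_size=1) for _ in range(num_levels)]
        )
        # Transformer encoder captures long-range dependencies along the time axis.
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=8, dim_feedforward=2 * feat_dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Per-frame classifier (num_classes action classes + background).
        self.cls_head = nn.Linear(feat_dim, num_classes + 1)

    def forward(self, x):
        # x: (batch, time, feat_dim) snippet features from a video backbone.
        feats = [x.transpose(1, 2)]                      # (B, C, T)
        for conv in self.down:
            feats.append(conv(feats[-1]))                # coarser temporal scales
        # Top-down pass: upsample coarse features and fuse them into finer levels.
        refined = self.lateral[-1](feats[-1])
        for lvl in range(len(feats) - 2, -1, -1):
            up = F.interpolate(refined, size=feats[lvl].shape[-1],
                               mode="linear", align_corners=False)
            refined = self.lateral[lvl](feats[lvl]) + up
        seq = refined.transpose(1, 2)                    # (B, T, C) for the encoder
        seq = self.encoder(seq)
        return self.cls_head(seq)                        # (B, T, num_classes + 1)


if __name__ == "__main__":
    model = TemporalPyramidTransformer()
    clip = torch.randn(2, 128, 512)                      # 2 videos, 128 snippets
    print(model(clip).shape)                             # torch.Size([2, 128, 21])

Unlike this toy example, the method described in the abstract predicts actions on multiple temporal scales and refines both class scores and boundaries with frame-level self-attention.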
Pages: 569-580
Page count: 12
References
60 in total
  • [1] ViViT: A Video Vision Transformer
    Arnab, Anurag
    Dehghani, Mostafa
    Heigold, Georg
    Sun, Chen
    Lucic, Mario
    Schmid, Cordelia
    [J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 6816 - 6826
  • [2] Ba Lei Jimmy, 2016, arXiv
  • [3] Soft-NMS - Improving Object Detection With One Line of Code
    Bodla, Navaneeth
    Singh, Bharat
    Chellappa, Rama
    Davis, Larry S.
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 5562 - 5570
  • [4] Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
    Carreira, Joao
    Zisserman, Andrew
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 4724 - 4733
  • [5] Rethinking the Faster R-CNN Architecture for Temporal Action Localization
    Chao, Yu-Wei
    Vijayanarasimhan, Sudheendra
    Seybold, Bryan
    Ross, David A.
    Deng, Jia
    Sukthankar, Rahul
    [J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 1130 - 1139
  • [6] Improving Human Action Recognition Using Fusion of Depth Camera and Inertial Sensors
    Chen, Chen
    Jafari, Roozbeh
    Kehtarnavaz, Nasser
    [J]. IEEE TRANSACTIONS ON HUMAN-MACHINE SYSTEMS, 2015, 45 (01) : 51 - 61
  • [7] Chen G, 2022, AAAI CONF ARTIF INTE, P248
  • [8] Dosovitskiy A., 2021, P INT C LEARN REPR, P1, DOI 10.48550/ARXIV.2010.11929
  • [9] TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals
    Gao, Jiyang
    Yang, Zhenheng
    Sun, Chen
    Chen, Kan
    Nevatia, Ram
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 3648 - 3656
  • [10] Gao Jiyang, 2017, BMVC