AGPN: Action Granularity Pyramid Network for Video Action Recognition

Cited by: 22
Authors
Chen, Yatong [1]
Ge, Hongwei [1]
Liu, Yuxuan [1]
Cai, Xinye [1]
Sun, Liang [1]
Affiliations
[1] Dalian University of Technology, School of Computer Science and Technology, Dalian 116024, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Video action recognition; pyramid network; multi-scale; multi-granularity; REPRESENTATIONS;
DOI
10.1109/TCSVT.2023.3235522
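Chinese Library Classification (CLC) number
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Discipline classification code
0808; 0809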
Abstract
Video action recognition is a fundamental task for video understanding. Action recognition in complex spatio-temporal contexts generally requires fusing action information at multiple granularities. However, existing works do not consider spatio-temporal information modeling and fusion from the perspective of action granularity. To address this problem, this paper proposes an Action Granularity Pyramid Network (AGPN) for action recognition, which can be flexibly integrated into 2D backbone networks. The core module is the Action Granularity Pyramid Module (AGPM), a hierarchical pyramid structure with residual connections that fuses multi-granularity spatio-temporal action information. From the top to the bottom level of the pyramid, the receptive field decreases and the action granularity becomes finer. To enrich the temporal information of the inputs, a Multiple Frame Rate Module (MFM) is proposed to mix different frame rates at a fine-grained, pixel-wise level. Moreover, a Spatio-temporal Anchor Module (SAM) is employed to fix spatio-temporal feature anchors and thereby improve the effectiveness of feature extraction. We conduct extensive experiments on three large-scale action recognition datasets: Something-Something V1, Something-Something V2, and Kinetics-400. The results demonstrate that the proposed AGPN outperforms state-of-the-art methods for video action recognition.
Pages: 3912-3923
Page count: 12
Related papers
59 in total
  • [1] Bertasius G, 2021, PR MACH LEARN RES, V139
  • [2] Bottou Leon, 2012, Neural Networks: Tricks of the Trade. Second Edition: LNCS 7700, P421, DOI 10.1007/978-3-642-35289-8_25
  • [3] Club Ideas and Exertions: Aggregating Local Predictions for Action Recognition
    Cao, Congqi
    Li, Jiakang
    Xi, Runping
    Zhang, Yanning
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2021, 31 (06) : 2247 - 2259
  • [4] Cross-Modality Compensation Convolutional Neural Networks for RGB-D Action Recognition
    Cheng, Jun
    Ren, Ziliang
    Zhang, Qieshi
    Gao, Xiangyang
    Hao, Fusheng
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (03) : 1498 - 1509
  • [5] Deng J, 2009, PROC CVPR IEEE, P248, DOI 10.1109/CVPRW.2009.5206848
  • [6] Donahue J, 2015, PROC CVPR IEEE, P2625, DOI 10.1109/CVPR.2015.7298878
  • [7] Dosovitskiy A, 2021, Arxiv, DOI arXiv:2010.11929
  • [8] Learning Spatiotemporal Features with 3D Convolutional Networks
    Du Tran
    Bourdev, Lubomir
    Fergus, Rob
    Torresani, Lorenzo
    Paluri, Manohar
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, : 4489 - 4497
  • [9] X3D: Expanding Architectures for Efficient Video Recognition
    Feichtenhofer, Christoph
    [J]. 2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2020, : 200 - 210
  • [10] SlowFast Networks for Video Recognition
    Feichtenhofer, Christoph
    Fan, Haoqi
    Malik, Jitendra
    He, Kaiming
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 6201 - 6210