Spatial-Temporal Interleaved Network for Efficient Action Recognition

被引:1
|
作者
Jiang, Shengqin [1 ,2 ,3 ]
Zhang, Haokui [4 ]
Qi, Yuankai [5 ]
Liu, Qingshan [6 ]
机构
[1] Nanjing Univ Informat Sci & Technol, Sch Comp Sci, Nanjing 210044, Peoples R China
[2] Nanjing Univ Informat Sci & Technol, Minist Educ, Engn Res Ctr Digital Forens, Nanjing 210044, Peoples R China
[3] Nanjing Univ Informat Sci & Technol, Jiangsu Collaborat Innovat Ctr Atmospher Environm, Nanjing 210044, Peoples R China
[4] Northwestern Polytech Univ, Sch Comp Sci, Xian 710129, Peoples R China
[5] Macquarie Univ, Sch Comp, Sydney, NSW 2109, Australia
[6] Nanjing Univ Posts & Telecommun, Sch Comp Sci, Nanjing 210023, Peoples R China
基金
中国博士后科学基金; 中国国家自然科学基金;
关键词
Convolution; Three-dimensional displays; Kernel; Computational modeling; Videos; Transformers; Solid modeling; 3D convolution; action recognition; feature interaction; spatial-temporal features;
D O I
10.1109/TII.2024.3450021
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
The decomposition of 3D convolution will considerably reduce the computing complexity of 3D convolutional neural networks, yet simple stacking restricts the performance of neural networks. To this end, we propose a spatial-temporal interleaved network for efficient action recognition. By deeply analyzing this task, it revisits the structure of 3D neural networks in action recognition from the following perspectives. To enhance the learning of robust spatial-temporal features, we initially propose an interleaved feature interaction module to comprehensively explore cross-layer features and capture the most discriminative information among them. With regards to being lightweight, a boosted parallel pseudo-3D module is introduced with the goal of circumventing a substantial number of computations from the lower to middle levels while enhancing temporal and spatial features in parallel at high levels. Furthermore, we exploit a spatial-temporal differential attention mechanism to suppress redundant features in different dimensions while reaping the benefits of nearly negligible parameters. Lastly, extensive experiments on four action recognition benchmarks are given to show the advantages and efficiency of our proposed method. Specifically, our method attains a 15.2% improvement in Top-1 accuracy compared to our baseline, a stack of full 3D convolutional layers, on the Something-Something V1 dataset while utilizing only 18.2% of the parameters.
引用
收藏
页码:178 / 187
页数:10
相关论文
共 50 条
  • [1] Grouped Spatial-Temporal Aggregation for Efficient Action Recognition
    Luo, Chenxu
    Yuille, Alan
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 5511 - 5520
  • [2] Spatial-Temporal Convolutional Attention Network for Action Recognition
    Luo, Huilan
    Chen, Han
    Computer Engineering and Applications, 2023, 59 (09): : 150 - 158
  • [3] Spatial-temporal saliency action mask attention network for action recognition
    Jiang, Min
    Pan, Na
    Kong, Jun
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2020, 71
  • [4] Multi-Branch Spatial-Temporal Network for Action Recognition
    Wang, Yingying
    Li, Wei
    Tao, Ran
    IEEE SIGNAL PROCESSING LETTERS, 2019, 26 (10) : 1556 - 1560
  • [5] Recurrent Spatial-Temporal Attention Network for Action Recognition in Videos
    Du, Wenbin
    Wang, Yali
    Qiao, Yu
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2018, 27 (03) : 1347 - 1360
  • [6] Action Recognition Using a Spatial-Temporal Network for Wild Felines
    Feng, Liqi
    Zhao, Yaqin
    Sun, Yichao
    Zhao, Wenxuan
    Tang, Jiaxi
    ANIMALS, 2021, 11 (02): : 1 - 18
  • [7] Efficient Gait Recognition via Spatial-Temporal Decoupled Network
    Tang, Peisen
    Su, Han
    Gao, Ruixuan
    Zhao, Wensheng
    Tang, Chaoying
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [8] Spatial-Temporal Attention for Action Recognition
    Sun, Dengdi
    Wu, Hanqing
    Ding, Zhuanlian
    Luo, Bin
    Tang, Jin
    ADVANCES IN MULTIMEDIA INFORMATION PROCESSING, PT I, 2018, 11164 : 854 - 864
  • [9] Spatial-Temporal Transformer Network for Continuous Action Recognition in Industrial Assembly
    Huang, Jianfeng
    Liu, Xiang
    Hu, Huan
    Tang, Shanghua
    Li, Chenyang
    Zhao, Shaoan
    Lin, Yimin
    Wang, Kai
    Liu, Zhaoxiang
    Lian, Shiguo
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT X, ICIC 2024, 2024, 14871 : 114 - 130
  • [10] A Channel-Wise Spatial-Temporal Aggregation Network for Action Recognition
    Wang, Huafeng
    Xia, Tao
    Li, Hanlin
    Gu, Xianfeng
    Lv, Weifeng
    Wang, Yuehai
    MATHEMATICS, 2021, 9 (24)