Spatial-Temporal Interleaved Network for Efficient Action Recognition

Cited by: 1
Authors
Jiang, Shengqin [1 ,2 ,3 ]
Zhang, Haokui [4 ]
Qi, Yuankai [5 ]
Liu, Qingshan [6 ]
Affiliations
[1] Nanjing Univ Informat Sci & Technol, Sch Comp Sci, Nanjing 210044, Peoples R China
[2] Nanjing Univ Informat Sci & Technol, Minist Educ, Engn Res Ctr Digital Forens, Nanjing 210044, Peoples R China
[3] Nanjing Univ Informat Sci & Technol, Jiangsu Collaborat Innovat Ctr Atmospher Environm, Nanjing 210044, Peoples R China
[4] Northwestern Polytech Univ, Sch Comp Sci, Xian 710129, Peoples R China
[5] Macquarie Univ, Sch Comp, Sydney, NSW 2109, Australia
[6] Nanjing Univ Posts & Telecommun, Sch Comp Sci, Nanjing 210023, Peoples R China
Funding
China Postdoctoral Science Foundation; National Natural Science Foundation of China;
Keywords
Convolution; Three-dimensional displays; Kernel; Computational modeling; Videos; Transformers; Solid modeling; 3D convolution; action recognition; feature interaction; spatial-temporal features;
DOI
10.1109/TII.2024.3450021
CLC number
TP [Automation technology, computer technology];
Subject classification code
0812 ;
Abstract
The decomposition of 3D convolution considerably reduces the computational complexity of 3D convolutional neural networks, yet simply stacking the decomposed layers limits network performance. To this end, we propose a spatial-temporal interleaved network for efficient action recognition, which revisits the structure of 3D neural networks for this task from the following perspectives. To enhance the learning of robust spatial-temporal features, we first propose an interleaved feature interaction module that comprehensively explores cross-layer features and captures the most discriminative information among them. For a lightweight design, a boosted parallel pseudo-3D module is introduced to avoid a substantial amount of computation at the lower and middle levels while enhancing temporal and spatial features in parallel at the high levels. Furthermore, we exploit a spatial-temporal differential attention mechanism that suppresses redundant features along different dimensions at the cost of nearly negligible parameters. Finally, extensive experiments on four action recognition benchmarks demonstrate the advantages and efficiency of the proposed method. Specifically, our method attains a 15.2% improvement in Top-1 accuracy over our baseline, a stack of full 3D convolutional layers, on the Something-Something V1 dataset while using only 18.2% of the parameters.
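The 3D-convolution decomposition the abstract builds on is the standard pseudo-3D (a.k.a. (2+1)D) factorization: a full k×k×k convolution is replaced by a 1×k×k spatial convolution followed by a k×1×1 temporal one. The minimal PyTorch sketch below illustrates only this generic factorization and its parameter savings; it is not the authors' boosted parallel pseudo-3D module, and the class and channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class Pseudo3DBlock(nn.Module):
    """Generic (2+1)D factorization of a 3x3x3 convolution:
    a 1x3x3 spatial conv followed by a 3x1x1 temporal conv."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1))
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.temporal(self.spatial(x))


def n_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())


full = nn.Conv3d(64, 64, kernel_size=3, padding=1)   # full 3D baseline
p3d = Pseudo3DBlock(64, 64)                          # factorized version

# full: 64*64*27 + 64 = 110656 params
# factorized: (64*64*9 + 64) + (64*64*3 + 64) = 36928 + 12352 = 49280 params
print(n_params(full), n_params(p3d))

# Both preserve the (N, C, T, H, W) shape of a video clip.
x = torch.randn(1, 64, 8, 32, 32)
assert p3d(x).shape == full(x).shape
```

Parameters drop from 110656 to 49280 here, roughly a 2.2x reduction for this layer, which is the kind of saving that makes the decomposition attractive despite the performance loss of naive stacking that the paper addresses.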
Pages: 178-187
Number of pages: 10
Related Papers
50 records in total
  • [11] Spatial-Temporal Exclusive Capsule Network for Open Set Action Recognition
    Feng, Yangbo
    Gao, Junyu
    Yang, Shicai
    Xu, Changsheng
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 9464 - 9478
  • [12] Recurrent attention network using spatial-temporal relations for action recognition
    Zhang, Mingxing
    Yang, Yang
    Ji, Yanli
    Xie, Ning
    Shen, Fumin
    SIGNAL PROCESSING, 2018, 145 : 137 - 145
  • [13] Spatial-temporal channel-wise attention network for action recognition
    Chen, Lin
    Liu, Yungang
    Man, Yongchao
    MULTIMEDIA TOOLS AND APPLICATIONS, 2021, 80 : 21789 - 21808
  • [14] Spatial-temporal pyramid based Convolutional Neural Network for action recognition
    Zheng, Zhenxing
    An, Gaoyun
    Wu, Dapeng
    Ruan, Qiuqi
    NEUROCOMPUTING, 2019, 358 : 446 - 455
  • [15] Spatial-temporal channel-wise attention network for action recognition
    Chen, Lin
    Liu, Yungang
    Man, Yongchao
    MULTIMEDIA TOOLS AND APPLICATIONS, 2021, 80 (14) : 21789 - 21808
  • [16] Joint spatial-temporal attention for action recognition
    Yu, Tingzhao
    Guo, Chaoxu
    Wang, Lingfeng
    Gu, Huxiang
    Xiang, Shiming
    Pan, Chunhong
    PATTERN RECOGNITION LETTERS, 2018, 112 : 226 - 233
  • [17] Spatial-Temporal Neural Networks for Action Recognition
    Jing, Chao
    Wei, Ping
    Sun, Hongbin
    Zheng, Nanning
    ARTIFICIAL INTELLIGENCE APPLICATIONS AND INNOVATIONS, AIAI 2018, 2018, 519 : 619 - 627
  • [18] Spatial-temporal pooling for action recognition in videos
    Wang, Jiaming
    Shao, Zhenfeng
    Huang, Xiao
    Lu, Tao
    Zhang, Ruiqian
    Lv, Xianwei
    NEUROCOMPUTING, 2021, 451 : 265 - 278
  • [19] Spatial-temporal interaction module for action recognition
    Luo, Hui-Lan
    Chen, Han
    Cheung, Yiu-Ming
    Yu, Yawei
    JOURNAL OF ELECTRONIC IMAGING, 2022, 31 (04)
  • [20] Efficient Video Transformers via Spatial-temporal Token Merging for Action Recognition
    Feng, Zhanzhou
    Xu, Jiaming
    Ma, Lei
    Zhang, Shiliang
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (04)