Spatial-Temporal Interleaved Network for Efficient Action Recognition

Cited by: 2
Authors
Jiang, Shengqin [1 ,2 ,3 ]
Zhang, Haokui [4 ]
Qi, Yuankai [5 ]
Liu, Qingshan [6 ]
Affiliations
[1] Nanjing Univ Informat Sci & Technol, Sch Comp Sci, Nanjing 210044, Peoples R China
[2] Nanjing Univ Informat Sci & Technol, Minist Educ, Engn Res Ctr Digital Forens, Nanjing 210044, Peoples R China
[3] Nanjing Univ Informat Sci & Technol, Jiangsu Collaborat Innovat Ctr Atmospher Environm, Nanjing 210044, Peoples R China
[4] Northwestern Polytech Univ, Sch Comp Sci, Xian 710129, Peoples R China
[5] Macquarie Univ, Sch Comp, Sydney, NSW 2109, Australia
[6] Nanjing Univ Posts & Telecommun, Sch Comp Sci, Nanjing 210023, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation;
Keywords
Convolution; Three-dimensional displays; Kernel; Computational modeling; Videos; Transformers; Solid modeling; 3D convolution; action recognition; feature interaction; spatial-temporal features;
DOI
10.1109/TII.2024.3450021
CLC number
TP [Automation Technology, Computer Technology];
Discipline code
0812;
Abstract
Decomposing 3D convolutions considerably reduces the computational complexity of 3D convolutional neural networks, yet simply stacking the decomposed layers limits network performance. To this end, we propose a spatial-temporal interleaved network for efficient action recognition. Through an in-depth analysis of this task, we revisit the structure of 3D neural networks for action recognition from the following perspectives. First, to strengthen the learning of robust spatial-temporal features, we propose an interleaved feature interaction module that comprehensively explores cross-layer features and captures the most discriminative information among them. Second, to keep the model lightweight, we introduce a boosted parallel pseudo-3D module that avoids a substantial amount of computation at the lower and middle levels while enhancing temporal and spatial features in parallel at the high levels. Third, we exploit a spatial-temporal differential attention mechanism that suppresses redundant features across dimensions at a nearly negligible parameter cost. Finally, extensive experiments on four action recognition benchmarks demonstrate the advantages and efficiency of the proposed method. In particular, our method attains a 15.2% improvement in Top-1 accuracy over our baseline, a stack of full 3D convolutional layers, on the Something-Something V1 dataset while using only 18.2% of the parameters.
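The parameter savings from the pseudo-3D decomposition that the abstract builds on can be illustrated with a quick count; this is a generic sketch of factorizing a 3x3x3 kernel into a 1x3x3 spatial convolution plus a 3x1x1 temporal convolution, with placeholder channel sizes, not the paper's actual configuration.

```python
# Illustrative parameter counts: full 3D convolution vs. a pseudo-3D
# factorization (2D spatial conv followed by 1D temporal conv).
# Channel sizes and kernel shapes are assumptions for illustration only.

def conv3d_params(c_in, c_out, kt, kh, kw, bias=False):
    """Weight count of one 3D convolution layer: c_in * c_out * kt * kh * kw."""
    n = c_in * c_out * kt * kh * kw
    return n + (c_out if bias else 0)

c_in, c_out = 64, 64

full = conv3d_params(c_in, c_out, 3, 3, 3)       # full 3x3x3 kernel
spatial = conv3d_params(c_in, c_out, 1, 3, 3)    # 1x3x3 spatial conv
temporal = conv3d_params(c_out, c_out, 3, 1, 1)  # 3x1x1 temporal conv
pseudo = spatial + temporal

# The factorized pair uses (9 + 3) / 27 = 4/9 of the full kernel's weights.
print(full, pseudo, pseudo / full)
```

The ratio is independent of the channel widths, which is why the factorization pays off uniformly across a stack of layers; the paper's reported 18.2% parameter figure reflects its full architecture, not this single-layer count.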
Pages: 178-187
Number of pages: 10