Spatial-Temporal Interleaved Network for Efficient Action Recognition

Cited by: 2
Authors
Jiang, Shengqin [1 ,2 ,3 ]
Zhang, Haokui [4 ]
Qi, Yuankai [5 ]
Liu, Qingshan [6 ]
Affiliations
[1] Nanjing Univ Informat Sci & Technol, Sch Comp Sci, Nanjing 210044, Peoples R China
[2] Nanjing Univ Informat Sci & Technol, Minist Educ, Engn Res Ctr Digital Forens, Nanjing 210044, Peoples R China
[3] Nanjing Univ Informat Sci & Technol, Jiangsu Collaborat Innovat Ctr Atmospher Environm, Nanjing 210044, Peoples R China
[4] Northwestern Polytech Univ, Sch Comp Sci, Xian 710129, Peoples R China
[5] Macquarie Univ, Sch Comp, Sydney, NSW 2109, Australia
[6] Nanjing Univ Posts & Telecommun, Sch Comp Sci, Nanjing 210023, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation;
Keywords
Convolution; Three-dimensional displays; Kernel; Computational modeling; Videos; Transformers; Solid modeling; 3D convolution; action recognition; feature interaction; spatial-temporal features;
DOI
10.1109/TII.2024.3450021
CLC Classification Code
TP [Automation Technology; Computer Technology];
Subject Classification Code
0812;
Abstract
Decomposing 3D convolutions considerably reduces the computational complexity of 3D convolutional neural networks, yet simply stacking the decomposed layers limits network performance. To this end, we propose a spatial-temporal interleaved network for efficient action recognition, which revisits the structure of 3D neural networks for this task from the following perspectives. First, to learn robust spatial-temporal features, we propose an interleaved feature interaction module that comprehensively explores cross-layer features and captures the most discriminative information among them. Second, toward a lightweight design, we introduce a boosted parallel pseudo-3D module that avoids a substantial amount of computation at the lower and middle levels while enhancing temporal and spatial features in parallel at the high levels. Third, we exploit a spatial-temporal differential attention mechanism that suppresses redundant features along different dimensions at the cost of nearly negligible additional parameters. Finally, extensive experiments on four action recognition benchmarks demonstrate the effectiveness and efficiency of the proposed method. In particular, on the Something-Something V1 dataset, our method attains a 15.2% improvement in Top-1 accuracy over our baseline, a stack of full 3D convolutional layers, while using only 18.2% of its parameters.
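The parameter savings from pseudo-3D decomposition can be sketched as below. This is an illustrative PyTorch example of the general P3D-style factorization (reference [24]), not the authors' STIN implementation; the class and layer names are assumptions. It replaces a full 3x3x3 convolution with a 1x3x3 spatial convolution followed by a 3x1x1 temporal convolution.

```python
import torch
import torch.nn as nn

class Pseudo3DConv(nn.Module):
    """Factorizes a full 3x3x3 conv into a 1x3x3 spatial conv followed by
    a 3x1x1 temporal conv, shrinking per-layer weights from 27*C_in*C_out
    to (9 + 3)*C_in*C_out (illustrative sketch, not the paper's module)."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1), bias=False)
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0), bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, frames, height, width)
        return self.temporal(self.spatial(x))

x = torch.randn(1, 16, 8, 56, 56)                    # N, C, T, H, W
full = nn.Conv3d(16, 16, kernel_size=3, padding=1, bias=False)
p3d = Pseudo3DConv(16, 16)
n_full = sum(p.numel() for p in full.parameters())   # 16*16*27 = 6912
n_p3d = sum(p.numel() for p in p3d.parameters())     # 16*16*(9+3) = 3072
print(p3d(x).shape, n_p3d / n_full)
```

Both variants preserve the input's temporal and spatial resolution thanks to the padding, while the factorized version uses 12/27 of the full kernel's parameters; the abstract's parallel variant would instead apply the two branches side by side and merge them.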
Pages: 178-187
Page count: 10
Related Papers
39 in total
  • [21] Mahdisoltani, Farzaneh, 2018, arXiv
  • [22] Meng, Y., 2021, Proc. Int. Conf. Learn. Represent., p. 1
  • [23] Qing, Z. W., 2024, IEEE Trans. Multimedia, 26: 218-230, DOI 10.1109/TMM.2023.3263288
  • [24] Qiu, Zhaofan; Yao, Ting; Mei, Tao. Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks. 2017 IEEE International Conference on Computer Vision (ICCV), 2017: 5534-5542
  • [25] Shi, Wuzhen; Li, Dan; Wen, Yang; Yang, Wu. Occlusion-Aware Graph Neural Networks for Skeleton Action Recognition. IEEE Transactions on Industrial Informatics, 2023, 19(10): 10288-10298
  • [26] Soomro, Khurram, 2012, CoRR
  • [27] Stergiou, A., 2019, IEEE International Conference on Image Processing (ICIP), p. 1830, DOI 10.1109/ICIP.2019.8803153
  • [28] Sun, Zehua; Ke, Qiuhong; Rahmani, Hossein; Bennamoun, Mohammed; Wang, Gang; Liu, Jun. Human Action Recognition From Various Data Modalities: A Review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(3): 3200-3225
  • [29] Tran, Du; Wang, Heng; Torresani, Lorenzo; Ray, Jamie; LeCun, Yann; Paluri, Manohar. A Closer Look at Spatiotemporal Convolutions for Action Recognition. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018: 6450-6459
  • [30] Wang, Limin; Li, Wei; Li, Wen; Van Gool, Luc. Appearance-and-Relation Networks for Video Classification. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018: 1430-1439