Multi-receptive field spatiotemporal network for action recognition

Authors
Mu Nie
Sen Yang
Zhenhua Wang
Baochang Zhang
Huimin Lu
Wankou Yang
Affiliations
[1] Southeast University,School of Cyber Science and Engineering
[2] Southeast University,School of Automation
[3] Zhejiang University of Technology,College of Computer Science and Technology
[4] Beihang University,School of Automation Science and Electrical Engineering
[5] Kyushu Institute of Technology,Department of Mechanical and Control Engineering
Source
International Journal of Machine Learning and Cybernetics | 2023, Volume 14
Keywords
Action recognition; Spatiotemporal; Multi-receptive field; Visual tempo;
DOI
Not available
Abstract
Despite the great progress that deep neural networks have brought to action recognition, visual tempo, i.e., the dynamic and temporal scale variation of actions, is often overlooked in the feature learning process of existing methods. Existing models usually understand spatiotemporal scenes with temporal and spatial convolutions whose receptive fields are fixed in both dimensions, so they cannot cope with variations in visual tempo. To address these issues, we propose a multi-receptive-field spatiotemporal (MRF-ST) network that effectively models spatial and temporal information over different receptive fields. In the proposed network, dilated convolution is used to obtain different receptive fields, and dynamic weighting across dilation rates is designed based on the attention mechanism. The MRF-ST network can thus directly capture various tempos within a single network layer without any additional cost, and it improves recognition accuracy by learning the diverse visual tempos of different actions. Extensive evaluations show that MRF-ST achieves state-of-the-art results on three popular action recognition benchmarks: UCF-101, HMDB-51, and Diving-48. Further analysis indicates that MRF-ST significantly improves performance in scenes with large variance in visual tempo.
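The abstract describes parallel dilated convolutions whose outputs are fused with attention-based dynamic weights, one weight per dilation rate. The paper's exact architecture is not given in this record, so the following is a minimal NumPy sketch of that general idea only: the function names, the toy linear attention head, and the dilation rates (1, 2, 4) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def dilated_conv1d(x, kernel, dilation):
    """Temporal convolution with 'same' zero padding and a given dilation
    rate. x: (T,) feature sequence; kernel: (K,) filter. Larger dilation
    widens the temporal receptive field without extra parameters."""
    K = len(kernel)
    pad = dilation * (K - 1) // 2
    xp = np.pad(x, pad)
    out = np.zeros(len(x))
    for t in range(len(x)):
        for k in range(K):
            out[t] += kernel[k] * xp[t + k * dilation]
    return out

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mrf_temporal_block(x, kernels, dilations, attn_w):
    """Run parallel dilated branches and fuse them with attention weights
    computed from a global summary of the input -- a stand-in for the
    dynamic weighting over dilation rates described in the abstract."""
    branches = np.stack([dilated_conv1d(x, k, d)
                         for k, d in zip(kernels, dilations)])  # (B, T)
    summary = x.mean()            # global average pooling over time
    weights = softmax(attn_w * summary)  # one weight per dilation rate
    return (weights[:, None] * branches).sum(axis=0), weights

x = rng.standard_normal(16)                       # toy feature sequence
kernels = [rng.standard_normal(3) for _ in range(3)]
attn_w = rng.standard_normal(3)                   # hypothetical attention head
y, w = mrf_temporal_block(x, kernels, [1, 2, 4], attn_w)
print(y.shape, w)  # fused (16,) output; weights sum to 1
```

Because all branches run in parallel within one block and only their fusion weights depend on the input, slow and fast tempos can be handled in the same layer, which matches the abstract's claim of capturing various tempos without additional cost.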
Pages: 2439–2453 (14 pages)