Human action recognition with transformer based on convolutional features

Cited: 4
Authors
Shi, Chengcheng [1 ]
Liu, Shuxin [1 ]
Affiliations
[1] Shanghai Dianji Univ, Sch Elect Engn, 300 Shuihua Rd,Pudong New Area, Shanghai 201306, Peoples R China
Source
INTELLIGENT DECISION TECHNOLOGIES-NETHERLANDS | 2024, Vol. 18, No. 2
Keywords
Human action recognition; convolutional features; pose estimation; transformer; NETWORK;
DOI
10.3233/IDT-240159
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
As a key research direction in computer vision, human action recognition has broad practical value in video surveillance, human-computer interaction, sports analysis, and healthcare. However, the diversity and complexity of human actions pose many challenges, such as handling complex actions, distinguishing similar actions, coping with viewpoint changes, and overcoming occlusion. To address these challenges, this paper proposes an innovative framework for human action recognition that combines a state-of-the-art pose estimation algorithm, a pre-trained CNN model, and a Vision Transformer into an efficient system. First, the pose estimation algorithm accurately extracts human pose information from real RGB image frames. A pre-trained CNN model then extracts features from the pose information. Finally, the Vision Transformer fuses and classifies the extracted features. Experiments on two benchmark datasets, UCF50 and UCF101, demonstrate the effectiveness and efficiency of the proposed framework. Quantitative and qualitative experiments further explore the framework's applicability and limitations in different scenarios, providing valuable insights and inspiration for future research.
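The abstract describes a three-stage pipeline: pose estimation, convolutional feature extraction, and transformer-based fusion/classification. The following is a minimal, runnable sketch of how such stages compose; every component here is a toy plain-Python stand-in (not the authors' models, datasets, or APIs), included only to illustrate the data flow.

```python
# Toy sketch of the described pipeline: pose estimation -> feature
# extraction -> classification. All functions are illustrative stand-ins.
import math
import random

def estimate_pose(frame):
    """Toy 'pose estimator': pick the 4 brightest pixels of an 8x8 frame
    as 'joints'. A real system would use a pose algorithm such as
    part-affinity-field estimation on RGB frames."""
    brightest = sorted(range(len(frame)), key=lambda i: frame[i], reverse=True)
    return [(i % 8, i // 8) for i in brightest[:4]]  # (x, y) coordinates

def extract_features(joints):
    """Toy 'CNN' features: simple statistics over joint coordinates.
    A real system would run a pre-trained CNN on pose-encoded images."""
    xs = [x for x, _ in joints]
    ys = [y for _, y in joints]
    return [sum(xs) / len(xs), sum(ys) / len(ys),
            max(xs) - min(xs), max(ys) - min(ys)]

def classify(features, class_prototypes):
    """Toy 'transformer head': softmax over dot products with per-class
    prototype vectors, returning the best-scoring class label."""
    labels = list(class_prototypes)
    scores = [sum(f * p for f, p in zip(features, class_prototypes[lab]))
              for lab in labels]
    exps = [math.exp(s - max(scores)) for s in scores]
    probs = [e / sum(exps) for e in exps]
    return labels[probs.index(max(probs))]

random.seed(0)
frame = [random.random() for _ in range(64)]        # fake 8x8 grayscale frame
prototypes = {"walk": [1, 0, 1, 0], "jump": [0, 1, 0, 1]}  # made-up classes
action = classify(extract_features(estimate_pose(frame)), prototypes)
print(action)  # one of the toy labels "walk" / "jump"
```

The point of the sketch is the composition: each stage consumes the previous stage's output, so any pose estimator, feature extractor, or classifier with compatible interfaces could be swapped in.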
Pages: 881-896
Number of pages: 16
References
49 records
[1] Angelini F, Fu Z, Long Y, Shao L, Naqvi SM. 2D Pose-Based Real-Time Human Action Recognition With Occlusion-Handling. IEEE Transactions on Multimedia, 2020, 22(6): 1433-1446.
[2] Ben-Younes H, Zablocki E, Perez P, Cord M. Driving behavior explanation with multi-level fusion. Pattern Recognition, 2022, 123.
[3] Cao Z, Simon T, Wei S-E, Sheikh Y. Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017: 1302-1310.
[4] Cho S. IEEE Winter Conference on Applications of Computer Vision, 2020: 624. DOI 10.1109/WACV45572.2020.9093639.
[5] Deng J. Proc. CVPR IEEE, 2009: 248. DOI 10.1109/CVPRW.2009.5206848.
[6] Devlin J. arXiv, 2018.
[7] Dosovitskiy A. arXiv, 2021. DOI 10.48550/arXiv.2010.11929.
[8] Gowda SN. AAAI Conference on Artificial Intelligence, 2021, 35: 1451.
[9] He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016: 770-778.
[10] Hendrycks D. arXiv, 2020. DOI 10.48550/arXiv.1912.02781.