Spatiotemporal decoupling attention transformer for 3D skeleton-based driver action recognition

Cited: 0
Authors
Xu, Zhuoyan [1 ]
Xu, Jingke [1 ,2 ,3 ]
Affiliations
[1] Shenyang Jianzhu Univ, Sch Comp Sci & Engn, Shenyang 110168, Liaoning, Peoples R China
[2] Liaoning Prov Big Data Management & Anal Lab Urban, Shenyang 110168, Liaoning, Peoples R China
[3] Natl Special Comp Engn Technol Res Ctr, Shenyang Branch, Shenyang 110168, Peoples R China
Keywords
In-vehicle scenarios; Autonomous driving; Driver action recognition; Action recognition; Skeleton-based
DOI
10.1007/s40747-025-01811-1
CLC number
TP18 [Artificial Intelligence Theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Driver action recognition is crucial for in-vehicle safety. We argue that the following factors limit related research. First, spatial constraints and obstructions inside the vehicle restrict the range of motion, producing similar action patterns and making it difficult to capture the full body posture. Second, in skeleton-based action recognition, self-attention typically establishes joint dependencies within a single frame, ignoring both the influence of body spatial structure on the dependency weights and inter-frame dependencies; likewise, ordinary convolution in the temporal stream captures only frame-level temporal features, ignoring motion-pattern features at a higher semantic level. We propose a novel spatiotemporal decoupling attention transformer (SDA-TR). Its SDA module uses a spatiotemporal decoupling strategy that factorizes the weight computation according to body structure and directly establishes joint dependencies across multiple frames. Its TFA module aggregates sub-action-level and frame-level temporal features to improve the recognition accuracy of similar actions. On the driver action recognition dataset Drive&Act, using driver upper-body skeletons, SDA-TR achieves state-of-the-art performance. SDA-TR also achieves 92.2%/95.8% accuracy under the CS/CV benchmarks of NTU RGB+D 60 and 88.6%/89.8% accuracy under the CS/CSet benchmarks of NTU RGB+D 120, on par with other state-of-the-art methods. These results demonstrate the scalability and generalization of our method for action recognition.
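To make the cross-frame attention idea in the abstract concrete, below is a minimal PyTorch sketch, not the authors' code: joints from a short window of consecutive frames are flattened into one token sequence, so a single attention map can hold joint dependencies between frames rather than within one frame only. The class name, the `window` and `heads` parameters, and the grouping scheme are illustrative assumptions; the paper's SDA module additionally decouples the weight computation by body structure, which this sketch omits.

```python
import torch
import torch.nn as nn

class SpatioTemporalAttention(nn.Module):
    """Toy cross-frame joint attention (illustrative, not the paper's SDA)."""

    def __init__(self, dim: int, heads: int = 8, window: int = 3):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.window = window
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch B, frames T, joints V, channels C); T must divide by window.
        B, T, V, C = x.shape
        W, H = self.window, self.heads
        # Flatten each window of W frames into one sequence of W*V joint
        # tokens, so one attention map links joints across frames directly.
        x = x.reshape(B * T // W, W * V, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split channels into heads: (N, H, W*V, C // H).
        q, k, v = (t.view(-1, W * V, H, C // H).transpose(1, 2) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * (C // H) ** -0.5
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(-1, W * V, C)
        return self.proj(out).view(B, T, V, C)
```

For example, `SpatioTemporalAttention(dim=64)(torch.randn(2, 6, 25, 64))` (batch 2, 6 frames, 25 joints, 64 channels) returns a tensor of the same shape, with each output joint feature attending to all joints in its 3-frame window.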
Pages: 12