Spatiotemporal decoupling attention transformer for 3D skeleton-based driver action recognition

Times Cited: 0
Authors
Xu, Zhuoyan [1 ]
Xu, Jingke [1 ,2 ,3 ]
Affiliations
[1] Shenyang Jianzhu Univ, Sch Comp Sci & Engn, Shenyang 110168, Liaoning, Peoples R China
[2] Liaoning Prov Big Data Management & Anal Lab Urban, Shenyang 110168, Liaoning, Peoples R China
[3] Natl Special Comp Engn Technol Res Ctr, Shenyang Branch, Shenyang 110168, Peoples R China
Keywords
In-vehicle scenarios; Autonomous driving; Driver action recognition; Action recognition; Skeleton-based
DOI
10.1007/s40747-025-01811-1
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Driver action recognition is crucial for in-vehicle safety. We argue that the following factors limit the related research. First, spatial constraints and obstructions in the vehicle restrict the range of motion, resulting in similar action patterns and making it difficult to capture the full body posture. Second, in skeleton-based action recognition, establishing joint dependencies via self-attention is typically limited to a single frame, ignoring the effect of body spatial structure on dependency weights and on inter-frame relations. Common convolution in the temporal stream focuses only on frame-level temporal features, ignoring motion-pattern features at a higher semantic level. Our work proposes a novel spatiotemporal decoupling attention transformer (SDA-TR). The SDA module uses a spatiotemporal decoupling strategy to decouple the weight computation according to body structure and directly establish joint dependencies across multiple frames. The TFA module aggregates sub-action-level and frame-level temporal features to improve recognition accuracy on similar actions. On the driver action recognition dataset Drive&Act, using driver upper-body skeletons, SDA-TR achieves state-of-the-art performance. SDA-TR also achieves 92.2%/95.8% accuracy under the CS/CV benchmarks of NTU RGB+D 60 and 88.6%/89.8% accuracy under the CS/CSet benchmarks of NTU RGB+D 120, on par with other state-of-the-art methods. Our method demonstrates strong scalability and generalization for action recognition.
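The core idea behind the decoupling strategy described above can be illustrated in a minimal sketch: instead of one full attention over all joint-frame pairs, spatial attention runs among joints within each frame and temporal attention runs for each joint across frames. This is a generic NumPy illustration of decoupled spatiotemporal attention, not the paper's exact SDA module; the body-structure-aware weighting and the TFA aggregation are omitted, and all shapes and function names here are assumptions for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention over the second-to-last axis
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def decoupled_st_attention(x):
    """x: (T frames, J joints, C channels).

    Spatial pass: joints attend to each other within every frame.
    Temporal pass: each joint attends to itself across all frames.
    The two passes are summed, so no attention weight is ever
    computed over the full T*J x T*J joint-frame grid.
    """
    # spatial attention, batched over frames: (T, J, C)
    xs = attention(x, x, x)
    # temporal attention, batched over joints: reshape to (J, T, C)
    xt = np.swapaxes(x, 0, 1)
    xt = attention(xt, xt, xt)
    xt = np.swapaxes(xt, 0, 1)  # back to (T, J, C)
    return xs + xt

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 17, 8))  # e.g. 4 frames, 17 upper-body joints
y = decoupled_st_attention(x)
print(y.shape)  # (4, 17, 8)
```

The decoupling reduces the attention cost from O((TJ)^2) to O(T*J^2 + J*T^2), which is one reason this factorization is popular for skeleton sequences.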
Pages: 12
Cited References
50 entries total
[21]   Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [J].
Liu, Ze ;
Lin, Yutong ;
Cao, Yue ;
Hu, Han ;
Wei, Yixuan ;
Zhang, Zheng ;
Lin, Stephen ;
Guo, Baining .
2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, :9992-10002
[22]   Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition [J].
Liu, Ziyu ;
Zhang, Hongwen ;
Chen, Zhenghao ;
Wang, Zhiyong ;
Ouyang, Wanli .
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2020, :140-149
[23]  
Machin M, 2018, IEEE WIREL COMMUN, P332, DOI 10.1109/WCNCW.2018.8369029
[24]  
Manchanda TS, 2021, 2021 9 INT C REL INF, P1, DOI 10.1109/ICRITO51393.2021.9596413
[25]   Drive&Act: A Multi-modal Dataset for Fine-grained Driver Behavior Recognition in Autonomous Vehicles [J].
Martin, Manuel ;
Roitberg, Alina ;
Haurilet, Monica ;
Horne, Matthias ;
Reiss, Simon ;
Voit, Michael ;
Stiefelhagen, Rainer .
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, :2801-2810
[26]   Head, Eye, and Hand Patterns for Driver Activity Recognition [J].
Ohn-Bar, Eshed ;
Martin, Sujitha ;
Tawari, Ashish ;
Trivedi, Mohan .
2014 22ND INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2014, :660-665
[27]   Skeleton-based action recognition via spatial and temporal transformer networks [J].
Plizzari, Chiara ;
Cannici, Marco ;
Matteucci, Matteo .
COMPUTER VISION AND IMAGE UNDERSTANDING, 2021, 208 (208-209)
[28]  
Qiu HL, 2022, Arxiv, DOI arXiv:2201.02849
[29]   NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis [J].
Shahroudy, Amir ;
Liu, Jun ;
Ng, Tian-Tsong ;
Wang, Gang .
2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016, :1010-1019
[30]  
Shaotran E, 2021, 2021 IEEE INT C AUT, P1, DOI 10.1109/ICAS49788.2021.9551186