Motion Guided Attention Learning for Self-Supervised 3D Human Action Recognition

Cited by: 22
Authors
Yang, Yang [1 ]
Liu, Guangjun [1 ]
Gao, Xuehao [1 ]
Affiliations
[1] Jiaotong Univ, Autocontrol Inst, Sch Elect & Informat Engn, Xian 710049, Peoples R China
Keywords
Skeleton; Task analysis; Semantics; Generators; Costs; Representation learning; Recurrent neural networks; 3D human action recognition; self-supervised learning; prior knowledge; motion attention; FUSION;
DOI
10.1109/TCSVT.2022.3194350
Chinese Library Classification
TM [Electrical engineering]; TN [Electronic technology, communication technology];
Discipline classification codes
0808; 0809;
Abstract
3D human action recognition has received increasing attention due to its potential applications in video surveillance equipment. To guarantee satisfactory performance, previous studies are mainly based on supervised methods, which incur substantial manual annotation costs. In addition, general deep networks for video sequences suffer from heavy computational costs and thus cannot satisfy the basic requirements of embedded systems. In this paper, a novel Motion Guided Attention Learning (MG-AL) framework is proposed, which formulates action representation learning as a self-supervised motion attention prediction problem. Specifically, MG-AL is a lightweight network. A set of simple motion priors (e.g., intra-joint variance, inter-frame deviation, and cross-joint covariance), which minimizes additional parameters and computational overhead, is regarded as a supervisory signal to guide the attention generation. The encoder is trained by predicting multiple self-attention tasks to capture action-specific feature representations. Extensive evaluations are performed on three challenging benchmark datasets (NTU-RGB+D 60, NTU-RGB+D 120, and NW-UCLA). The proposed method achieves superior performance compared to state-of-the-art methods while having a very low computational cost.
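The motion priors named in the abstract (intra-joint variance, inter-frame deviation, cross-joint covariance) can be illustrated with simple array statistics over a skeleton sequence. The sketch below is not the authors' implementation; the tensor shape `(T, J, C)` (frames, joints, coordinates) and the normalization into per-joint attention weights are assumptions for illustration only.

```python
import numpy as np

# Hypothetical skeleton sequence: T frames, J joints, C=3 coordinates.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 25, 3))  # (T, J, C)

# Intra-joint variance: how much each joint's position varies over time.
intra_joint_var = x.var(axis=0)                # (J, C)

# Inter-frame deviation: per-joint displacement between consecutive frames.
inter_frame_dev = np.abs(np.diff(x, axis=0))   # (T-1, J, C)

# Cross-joint covariance: co-movement statistics across joint coordinates.
flat = x.reshape(x.shape[0], -1)               # (T, J*C)
cross_joint_cov = np.cov(flat, rowvar=False)   # (J*C, J*C)

# One plausible supervisory signal: normalize per-joint motion energy
# into attention weights that a lightweight encoder could be trained to predict.
energy = inter_frame_dev.sum(axis=(0, 2))      # (J,)
attention_target = energy / energy.sum()
print(attention_target.shape)  # (25,)
```

Such hand-crafted targets cost only a few array operations per sequence, which is consistent with the paper's claim of minimal additional parameters and computational overhead.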
Pages: 8623-8634
Page count: 12