Spatiotemporal Self-Attention Mechanism Driven by 3D Pose to Guide RGB Cues for Daily Living Human Activity Recognition

Cited by: 2
Authors
Basly, Hend [1 ]
Zayene, Mohamed Amine [1 ]
Sayadi, Fatma Ezahra [1 ]
Affiliations
[1] Natl Engn Sch Sousse ENISO, NOCCS Lab Networked Objects Control & Commun Syst, BP 264, Erriadh 4023, Sousse, Tunisia
Funding
UK Research & Innovation;
Keywords
Transformer; Self-attention mechanism; Daily living activity recognition; Bilinear pooling attention; NETWORK;
DOI
10.1007/s10846-023-01926-y
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The field of human activity recognition is evolving at a rapid pace. Over the last two decades, many approaches have been proposed to recognize human activities in generic videos, but they remain limited for daily-living videos, whose characteristics make them considerably harder to handle. Such videos present several challenges to overcome: camera-view variations, temporal-information representation, low inter-class variation between similar actions, fine-grained action representation, and high intra-class variation. Recognizing an action generally requires extracting spatial and temporal information from the video. To extract temporal information, many works based on the LSTM network have been published; although they have proven their potential in this field, they fail to model long-range temporal dependencies in very long video sequences. We therefore turn to Transformer networks and propose a new pose-guided self-attention mechanism combined with a 3D convolutional neural network (3D CNN) through a Bilinear Pooling Attention (BPA) module, which allows the spatio-temporal skeleton features to recalibrate the RGB features for Daily Living Activity (DLA) recognition. In addition, most commonly used datasets are static and do not exhibit strong motion variation over time; we therefore evaluate on the large-scale NTU RGB+D dataset, since it contains RGB-D human actions that evolve much more over time. Experimental results demonstrate that our spatio-temporal self-attention mechanism combined with a 3D CNN through the BPA module (ST-SA-BPA) outperforms state-of-the-art methods.
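The recalibration idea described in the abstract — skeleton features gating the RGB stream through bilinear pooling — can be sketched for a single pair of feature vectors as follows. This is an illustrative assumption, not the paper's exact formulation: the function name, feature shapes, random stand-in weights, and the sigmoid gate are all hypothetical choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def bilinear_pooling_attention(rgb_feat, pose_feat, w_attn):
    """Hypothetical sketch of a Bilinear Pooling Attention (BPA) step.

    The outer product captures pairwise pose-RGB interactions, a
    (normally learned) projection maps them to one attention weight
    per RGB channel, and a sigmoid gate rescales the RGB stream.
    """
    # Bilinear pooling: all pairwise pose-RGB interactions, flattened.
    bilinear = np.outer(pose_feat, rgb_feat).ravel()   # shape (C_p * C_r,)
    # Project the interactions to per-channel attention logits.
    logits = w_attn @ bilinear                         # shape (C_r,)
    attn = 1.0 / (1.0 + np.exp(-logits))               # sigmoid gate in (0, 1)
    # Channel-wise recalibration of the RGB features.
    return rgb_feat * attn, attn

c_rgb, c_pose = 8, 6
rgb = rng.standard_normal(c_rgb)                       # stand-in RGB features
pose = rng.standard_normal(c_pose)                     # stand-in skeleton features
w = 0.1 * rng.standard_normal((c_rgb, c_pose * c_rgb)) # stand-in for learned weights

recalibrated, attn = bilinear_pooling_attention(rgb, pose, w)
```

In a full model these vectors would be per-frame feature maps from the 3D CNN and the pose branch, and `w_attn` would be trained end-to-end; the sketch only shows the gating arithmetic.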
Pages: 14
Related Papers
64 in total
  • [1] Araei S., 2021, 26 INT COMP C COMP S, P1
  • [2] Baradel F., 2018, BMVC 2018 29 BRIT MA, P1
  • [3] Baradel F., Wolf C., Mille J., Taylor G.W. Glimpse Clouds: Human Activity Recognition from Unstructured Feature Points. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018: 469-478
  • [4] Baradel F., Wolf C., Mille J. Human Action Recognition: Pose-based Attention Draws Focus to Hands. IEEE International Conference on Computer Vision Workshops (ICCVW), 2017: 604-613
  • [5] Basly H., 2020, Image and Signal Processing, ICISP 2020, LNCS 12119, P271, DOI 10.1007/978-3-030-51935-3_29
  • [6] Basly H., Ouarda W., Sayadi F.E., Ouni B., Alimi A.M. LAHAR-CNN: Human Activity Recognition from One Image Using Convolutional Neural Network Learning Approach. International Journal of Biometrics, 2021, 13(4): 385-408
  • [7] Basly H., Ouarda W., Sayadi F.E., Ouni B., Alimi A.M. DTR-HAR: Deep Temporal Residual Representation for Human Activity Recognition. The Visual Computer, 2022, 38(3): 993-1013
  • [8] Carreira J., Zisserman A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017: 4724-4733
  • [9] Chen K., Yao L., Zhang D., Wang X., Chang X., Nie F. A Semisupervised Recurrent Convolutional Attention Model for Human Activity Recognition. IEEE Transactions on Neural Networks and Learning Systems, 2020, 31(5): 1747-1756
  • [10] Cheron G., Laptev I., Schmid C. P-CNN: Pose-based CNN Features for Action Recognition. IEEE International Conference on Computer Vision (ICCV), 2015: 3218-3226