Multi-Pose Human Action Recognition using deep learning ConvLSTM

Cited by: 0
Authors
Sharma, Vijeta [1 ]
Shandilya, Utkarsh [2 ]
Mishra, Deepti [1 ]
Affiliations
[1] Norwegian Univ Sci & Technol, Dept Comp Sci IDI, Educ Technol Lab, N-2815 Gjovik, Norway
[2] Cent Univ Haryana, Dept Comp Sci & Engn, Jant Pali 123031, Haryana, India
Source
2024 12TH EUROPEAN WORKSHOP ON VISUAL INFORMATION PROCESSING, EUVIP 2024 | 2024
Keywords
Human Action Recognition; ConvLSTM; Robotics; multi-pose recognition; deep learning
DOI
10.1109/EUVIP61797.2024.10772902
Chinese Library Classification (CLC) number
TP39 [Computer Applications]
Discipline classification codes
081203; 0835
Abstract
Multi-pose human action recognition (HAR) is a pivotal task in computer vision, with applications spanning surveillance, human-robot interaction (HRI), and behaviour analysis. When HAR methods are deployed on robotic platforms, traditional approaches often struggle with variations in pose, occlusions, and dynamic backgrounds. In this work, we propose a deep Convolutional Long Short-Term Memory (ConvLSTM) architecture for multi-pose human action recognition, evaluated on RVD24, a dynamic, HRI-centric dataset developed in our lab environment. The proposed model comprises four stacked ConvLSTM layers with LeakyReLU activations and batch normalisation, followed by a fully connected layer and two dense layers; a final softmax layer predicts the 24 action categories. The central challenge is recognising human actions precisely enough to support further interaction with robots when deployed on a Robot Operating System (ROS) based platform. The proposed deep ConvLSTM method leverages the spatial hierarchies captured by convolutional layers and the temporal dependencies handled by LSTM layers, making it adept at recognising complex actions across varying poses and environments. We evaluated this model on the RVD24 dataset with 24 action categories, demonstrating its robustness with 82.12% accuracy. The model achieved notable improvements in recognition accuracy compared to state-of-the-art ConvLSTM models on benchmark action recognition datasets. Our findings suggest that the proposed deep ConvLSTM-based framework is highly effective for multi-pose human action recognition, offering a reliable solution for real-world robotics applications.
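
The layer stack described in the abstract maps naturally onto a Keras model. Below is a minimal sketch of such an architecture; the filter counts, kernel sizes, dense-layer widths, and input clip dimensions are illustrative assumptions (the record does not specify them), and only the overall structure (four stacked ConvLSTM blocks with LeakyReLU and batch normalisation, a fully connected head, and a 24-way softmax over the RVD24 action categories) follows the abstract.

# Minimal sketch of the abstract's architecture, not the authors' exact model.
# Filter counts, kernel sizes, dense widths, and input shape are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 24                   # action categories in RVD24
SEQ_LEN, H, W, C = 20, 64, 64, 3   # assumed clip length and frame size

model = models.Sequential([
    # 4 stacked ConvLSTM blocks, each followed by LeakyReLU and BatchNorm
    layers.ConvLSTM2D(32, (3, 3), padding="same", return_sequences=True,
                      input_shape=(SEQ_LEN, H, W, C)),
    layers.LeakyReLU(),
    layers.BatchNormalization(),
    layers.ConvLSTM2D(32, (3, 3), padding="same", return_sequences=True),
    layers.LeakyReLU(),
    layers.BatchNormalization(),
    layers.ConvLSTM2D(64, (3, 3), padding="same", return_sequences=True),
    layers.LeakyReLU(),
    layers.BatchNormalization(),
    # last block collapses the time dimension (return_sequences=False)
    layers.ConvLSTM2D(64, (3, 3), padding="same", return_sequences=False),
    layers.LeakyReLU(),
    layers.BatchNormalization(),
    # fully connected head: flatten, two dense layers, 24-way softmax
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()

The sketch keeps return_sequences=True on the first three ConvLSTM blocks so each block receives the full spatio-temporal feature sequence, and lets the fourth block collapse the time axis before the dense head; whether the authors pool spatially between blocks is not stated in the record.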
Pages: 6