Video-Based Human Activity Recognition Using Deep Learning Approaches

Cited by: 30
Authors
Surek, Guilherme Augusto Silva [1 ]
Seman, Laio Oriel [2 ]
Stefenon, Stefano Frizzo [3 ,4 ]
Mariani, Viviana Cocco [5 ,6 ]
Coelho, Leandro dos Santos [1 ,5 ]
Affiliations
[1] Pontif Catholic Univ Parana PUCPR, Ind & Syst Engn Grad Program PPGEPS, BR-80215901 Curitiba, Brazil
[2] Univ Vale Itajai, Grad Program Appl Comp Sci, BR-88302901 Itajai, Brazil
[3] Fdn Bruno Kessler, Digital Ind Ctr, I-38123 Trento, Italy
[4] Univ Udine, Dept Math Comp Sci & Phys, I-33100 Udine, Italy
[5] Fed Univ Parana UFPR, Dept Elect Engn, BR-81530000 Curitiba, Brazil
[6] Pontif Catholic Univ Parana, Mech Engn Grad Program PPGEM, BR-80215901 Curitiba, Brazil
Keywords
convolutional neural network; deep learning; self-DIstillation with NO labels (DINO); video human action recognition; vision transformer architecture; NETWORK;
DOI
10.3390/s23146384
CLC Number
O65 [Analytical Chemistry];
Subject Classification Codes
070302 ; 081704 ;
Abstract
Because of its capacity to gather rich, high-level information about human activity from wearable or stationary sensors, human activity recognition substantially impacts people's day-to-day lives. In video, multiple people and objects may act simultaneously, dispersed across the frame in various locations. Visual reasoning for action recognition therefore requires modeling the spatial interactions among many entities. The main aim of this paper is to evaluate and map the current state of human action recognition in red, green, and blue (RGB) videos using deep learning models. A residual network (ResNet) and a vision transformer (ViT) architecture are evaluated with a semi-supervised learning approach, and DINO (self-DIstillation with NO labels) is used to enhance the potential of both. The evaluated benchmark is the human motion database (HMDB51), which aims to capture the richness and complexity of human actions. The video classification results obtained with the proposed ViT are promising with respect to performance metrics and to results from the recent literature. A bi-dimensional ViT combined with long short-term memory (LSTM) demonstrated strong performance in human action recognition on the HMDB51 dataset, achieving accuracies of 96.7 ± 0.35% in the training phase and 41.0 ± 0.27% in the test phase (mean ± standard deviation).
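The abstract describes a two-stage pipeline: a 2D ViT encodes each video frame into a feature vector, and an LSTM aggregates the per-frame features over time before a final classification into one of HMDB51's 51 action classes. The sketch below illustrates only that data flow, not the paper's actual model: the random-projection `encode_frame` stands in for the DINO-pretrained ViT, and all weights, dimensions, and function names are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 64    # per-frame feature size (placeholder; the paper's ViT width differs)
HIDDEN_DIM = 32   # LSTM hidden size (placeholder)
NUM_CLASSES = 51  # HMDB51 defines 51 action classes

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode_frame(frame, W):
    """Stand-in for a pretrained 2D ViT encoder: flatten the frame and
    project it to an embedding. A real DINO-pretrained ViT replaces this."""
    return np.tanh(frame.reshape(-1) @ W)

def lstm_step(x, h, c, p):
    """One LSTM cell step over the sequence of frame embeddings."""
    z = np.concatenate([x, h])
    i = sigmoid(z @ p["Wi"])   # input gate
    f = sigmoid(z @ p["Wf"])   # forget gate
    o = sigmoid(z @ p["Wo"])   # output gate
    g = np.tanh(z @ p["Wg"])   # candidate cell state
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def classify_video(frames, W_enc, lstm_params, W_out):
    """Encode each frame, fold the sequence through the LSTM,
    then classify from the final hidden state."""
    h, c = np.zeros(HIDDEN_DIM), np.zeros(HIDDEN_DIM)
    for frame in frames:
        h, c = lstm_step(encode_frame(frame, W_enc), h, c, lstm_params)
    return int(np.argmax(h @ W_out))

# Toy video: 16 RGB frames of 8x8 pixels, with random (untrained) weights.
frames = rng.normal(size=(16, 8, 8, 3))
W_enc = rng.normal(size=(8 * 8 * 3, EMBED_DIM)) * 0.1
zdim = EMBED_DIM + HIDDEN_DIM
lstm_params = {k: rng.normal(size=(zdim, HIDDEN_DIM)) * 0.1
               for k in ("Wi", "Wf", "Wo", "Wg")}
W_out = rng.normal(size=(HIDDEN_DIM, NUM_CLASSES)) * 0.1

pred = classify_video(frames, W_enc, lstm_params, W_out)
```

With untrained weights the predicted class index is arbitrary; the point is only the shape of the computation: per-frame spatial encoding followed by temporal aggregation.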
Pages: 15
References
80 in total
[1] Anguita D., 2013, 21st European Symposium on Artificial Neural Networks, Vol. 3, p. 437
[2] Babiker Mohanad, 2017, 2017 IEEE 4 INT C SM, p. 1
[3] Beddiar Djamila Romaissa, Nini Brahim, Sabokrou Mohammad, Hadid Abdenour. Vision-based human activity recognition: a survey. Multimedia Tools and Applications, 2020, 79(41-42): 30509-30555
[4] Borre Andressa, Seman Laio Oriel, Camponogara Eduardo, Stefenon Stefano Frizzo, Mariani Viviana Cocco, Coelho Leandro dos Santos. Machine Fault Detection Using a Hybrid CNN-LSTM Attention-Based Model. Sensors, 2023, 23(9)
[5] Branco Nathielle Waldrigues, Matos Cavalca Mariana Santos, Stefenon Stefano Frizzo, Quietinho Leithardt Valderi Reis. Wavelet LSTM for Fault Forecasting in Electrical Power Grids. Sensors, 2022, 22(21)
[6] Caron M., 2021, Proceedings of the IEEE/CVF International Conference on Computer Vision, DOI 10.48550/arXiv.2104.14294
[7] Carreira J., 2018, arXiv
[8] Carreira J., 2019, arXiv:1907.06987
[9] Carreira Joao, Zisserman Andrew. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017: 4724-4733
[10] Cherian A., 2017, Proceedings of the Conference on Computer Vision and Pattern Recognition, p. 1631