Towards efficient video-based action recognition: context-aware memory attention network

Cited by: 2
Authors
Koh, Thean Chun [1 ]
Yeo, Chai Kiat [1 ]
Jing, Xuan [1 ,2 ]
Sivadas, Sunil [2 ]
Affiliations
[1] Nanyang Technol Univ, Sch Comp Sci & Engn, 50 Nanyang Ave, Singapore 639798, Singapore
[2] NCS Pte Ltd, Ang Mo Kio St 62, Singapore 569141, Singapore
Source
SN APPLIED SCIENCES | 2023, Vol. 5, No. 12
Keywords
Action recognition; Deep learning; Convolutional neural network; Attention; BIDIRECTIONAL LSTM; CLASSIFICATION;
DOI
10.1007/s42452-023-05568-5
Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Subject Classification Codes
07 ; 0710 ; 09 ;
Abstract
Given the prevalence of surveillance cameras in daily life, human action recognition from videos has significant practical applications. A persistent challenge in this field is to develop more efficient models capable of real-time recognition with high accuracy for widespread deployment. In this paper, we introduce a novel human action recognition model, the Context-Aware Memory Attention Network (CAMA-Net), which eliminates the computationally intensive optical flow extraction and 3D convolution. By removing these components, CAMA-Net achieves superior computational efficiency compared to many existing approaches. A pivotal component of CAMA-Net is the Context-Aware Memory Attention Module, an attention module that computes relevance scores between key-value pairs obtained from the 2D ResNet backbone, thereby establishing correspondences between video frames. To validate our method, we conduct experiments on four well-known action recognition datasets: ActivityNet, Diving48, HMDB51 and UCF101. The experimental results demonstrate the effectiveness of the proposed model, which surpasses existing 2D-CNN based baseline models.
Article Highlights
  • Recent human action recognition models are not yet ready for practical applications due to their high computation needs.
  • We propose a 2D CNN-based human action recognition method that reduces the computation load.
  • The proposed method achieves competitive performance compared to most SOTA 2D CNN-based methods on public datasets.
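The core mechanism described in the abstract — scoring the relevance between key-value pairs derived from 2D backbone features to relate frames across time — can be sketched as scaled dot-product attention over per-frame feature vectors. The function name, tensor shapes, and plain dot-product scoring below are illustrative assumptions for a minimal sketch, not the paper's exact formulation.

```python
import numpy as np

def context_aware_attention(query, keys, values):
    """Hypothetical sketch of memory attention over video frames.

    query:  (d,)   feature vector of the current frame
    keys:   (T, d) features of T memory (past) frames
    values: (T, d) value projections of the same frames
    Returns a (d,) context vector aggregating relevant frames.
    """
    d = query.shape[0]
    scores = keys @ query / np.sqrt(d)   # relevance score per memory frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over the T frames
    return weights @ values              # relevance-weighted aggregation

# Toy example: 4 memory frames with 8-dimensional features.
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
ctx = context_aware_attention(q, K, V)
print(ctx.shape)  # (8,)
```

In practice the query, key, and value vectors would come from learned projections of 2D ResNet feature maps, and the computation would be batched over spatial positions; this sketch only shows the relevance-scoring and aggregation step.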
Pages: 12