Shots segmentation-based optimized dual-stream framework for robust human activity recognition in surveillance video

Cited by: 11
Authors
Hussain, Altaf [1 ]
Khan, Samee Ullah [1 ]
Khan, Noman [1 ]
Ullah, Waseem [1 ]
Alkhayyat, Ahmed [2 ]
Alharbi, Meshal [3 ]
Baik, Sung Wook [1 ]
Affiliations
[1] Sejong Univ, Seoul 143747, South Korea
[2] Islamic Univ, Najaf 54001, Iraq
[3] Prince Sattam Bin Abdulaziz Univ, Coll Comp Engn & Sci, Dept Comp Sci, Alkharj 11942, Saudi Arabia
Funding
National Research Foundation of Singapore;
Keywords
Activity Recognition; Video Classification; Surveillance System; Lowlight Image Enhancement; Dual Stream Network; Transformer Network; Convolutional Neural Network; ATTENTION; LSTM; FEATURES; NETWORK;
DOI
10.1016/j.aej.2023.11.017
CLC number
T [Industrial Technology];
Subject classification code
08 ;
Abstract
Nowadays, to control crime, surveillance cameras are installed in most public places to ensure urban safety and security. However, automating Human Activity Recognition (HAR) with computer vision techniques faces several challenges, such as low lighting, complex spatiotemporal features, cluttered backgrounds, and inefficient utilization of surveillance system resources. Existing HAR approaches design straightforward networks that analyze either spatial or motion patterns, resulting in limited performance, while dual-stream methods are based entirely on Convolutional Neural Networks (CNNs), which are inadequate for learning long-range temporal information. To overcome these challenges, this paper proposes an optimized dual-stream framework for HAR that consists of three main steps. First, a shot segmentation module is introduced to utilize surveillance system resources efficiently: it enhances the lowlight video stream and then detects salient video frames that contain humans. This module is trained on our own challenging Lowlight Human Surveillance Dataset (LHSD), which contains both normal data and data at different levels of low lighting, to recognize humans in complex, uncertain environments. Next, to learn HAR from both contextual and motion information, a dual-stream approach is used for feature extraction. The first stream freezes the learned weights of a Vision Transformer (ViT) B-16 backbone to select discriminative contextual information. In the second stream, the ViT features are fused with the intermediate encoder layers of the FlowNet2 optical-flow model to extract a robust motion feature vector.
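The paper's shot segmentation module is a trained network, not specified in this record. As a minimal sketch of the generic ideas it names (lowlight enhancement followed by salient-frame selection), the toy code below uses gamma correction to brighten dark frames and simple frame differencing as a stand-in for the learned human detector; the `gamma` and `motion_thresh` values are illustrative assumptions, not values from the paper.

```python
import numpy as np

def enhance_lowlight(frame, gamma=0.5):
    """Brighten a lowlight frame via gamma correction (pixel values in [0, 255]).
    gamma < 1 lifts dark regions; the exact enhancement in the paper is learned."""
    normalized = frame.astype(np.float64) / 255.0
    return (np.power(normalized, gamma) * 255.0).astype(np.uint8)

def is_salient(prev_frame, frame, motion_thresh=10.0):
    """Flag a frame as salient when its mean absolute difference from the
    previous frame exceeds a threshold -- a crude stand-in for the paper's
    trained human-detection module."""
    diff = np.abs(frame.astype(np.float64) - prev_frame.astype(np.float64))
    return diff.mean() > motion_thresh

# Toy usage: a uniformly dark frame becomes brighter after enhancement.
dark = np.full((4, 4), 40, dtype=np.uint8)
bright = enhance_lowlight(dark, gamma=0.5)
```

Only frames flagged as salient would then be passed to the downstream feature-extraction streams, which is how such a filter saves surveillance-system resources.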
Finally, a two-stream Parallel Bidirectional Long Short-Term Memory (PBiLSTM) is proposed for sequence learning to capture the global semantics of activities, followed by Dual Stream Multi-Head Attention (DSMHA) with a late fusion strategy to refine the large feature vector for accurate HAR. To assess the strength of the proposed framework, extensive experiments are conducted on real-world surveillance scenarios and various benchmark HAR datasets, achieving accuracies of 78.6285%, 96.0151%, and 98.875% on HMDB51, UCF101, and YouTube Action, respectively. The results show that the proposed strategy outperforms State-of-the-Art (SOTA) methods. The proposed framework delivers superior HAR performance, providing accurate and reliable recognition of human activities in surveillance systems.
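The PBiLSTM and DSMHA architectures are not detailed in this record. As a minimal numpy sketch of the two generic building blocks the abstract names, the code below implements scaled dot-product multi-head self-attention (with identity Q/K/V projections for brevity, an assumption) over a sequence of features, and late fusion as a simple average of the two streams' per-class scores; the head count, dimensions, and example logits are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads=4):
    """Scaled dot-product self-attention split across heads.
    x: (seq_len, d_model) with d_model divisible by num_heads.
    Identity projections stand in for the learned Q/K/V weights."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        q = k = v = x[:, h * d_head:(h + 1) * d_head]
        scores = softmax(q @ k.T / np.sqrt(d_head))  # (seq_len, seq_len)
        heads.append(scores @ v)                     # (seq_len, d_head)
    return np.concatenate(heads, axis=1)             # (seq_len, d_model)

def late_fusion(spatial_logits, motion_logits):
    """Average the per-class scores of the two streams, then pick a class."""
    fused = (spatial_logits + motion_logits) / 2.0
    return int(np.argmax(fused))

rng = np.random.default_rng(0)
seq = rng.standard_normal((8, 16))       # 8 timesteps of 16-dim features
attended = multi_head_attention(seq)     # shape preserved: (8, 16)
pred = late_fusion(np.array([0.1, 0.7, 0.2]), np.array([0.2, 0.5, 0.3]))
```

In the paper's pipeline, attention of this kind would weight the PBiLSTM's sequence outputs before the two streams' predictions are fused late, rather than concatenating raw features early.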
Pages: 632-647
Page count: 16