A Hybrid Transformer Framework for Efficient Activity Recognition Using Consumer Electronics

Cited by: 9
Authors
Hussain, Altaf [1 ]
Khan, Samee Ullah [2 ]
Khan, Noman [1 ]
Bhatt, Mohammed Wasim [3 ]
Farouk, Ahmed [4 ]
Bhola, Jyoti [5 ]
Baik, Sung Wook [1 ]
Affiliations
[1] Sejong Univ, Seoul 143747, South Korea
[2] Kyungpook Natl Univ, Sch Elect Engn, Daegu 41566, South Korea
[3] Model Inst Engn & Technol, Dept Comp Sci & Engn, Jammu 181122, India
[4] South Valley Univ, Fac Comp & Artificial Intelligence, Dept Comp Sci, Hurghada 83523, Egypt
[5] Chitkara Univ, Inst Engn & Technol, Rajpura 140401, India
Funding
National Research Foundation of Singapore;
Keywords
Feature extraction; Consumer electronics; Human activity recognition; Computational modeling; Transformers; Visualization; Computer architecture; Human action recognition; wireless visual sensor networks; consumer electronics; video classification; surveillance system; transformer network; neural networks; features; LSTM
DOI
10.1109/TCE.2024.3373824
CLC Number
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Discipline Code
0808; 0809;
Abstract
Within research on wireless visual sensor networks, human activity recognition (HAR) using consumer electronics is an emerging area in both academia and industry, with a diverse range of applications. However, implementing vision-based HAR on consumer electronic devices is highly challenging because of their limited computational capabilities: mainstream approaches that rely on computationally complex contextual networks and variants of recurrent neural networks to learn long-range spatiotemporal dependencies achieve only limited performance on such hardware. To address these challenges, this paper presents an efficient framework for robust HAR on consumer electronics devices, divided into two main stages. In the first stage, convolutional features from the multiply_17 layer of a lightweight MobileNetV3 are used to balance computational complexity while extracting the most salient contextual features ($7 \times 7 \times 576 \times 30$) from each video. In the second stage, a sequential residual transformer network (SRTN), built in a residual fashion, effectively learns long-range temporal dependencies across multiple video frames. The temporal multi-head self-attention module and residual strategy of the SRTN allow the proposed method to discard non-relevant features and to optimize the spatiotemporal feature vector for efficient HAR. Evaluated on three challenging HAR datasets, the proposed model achieves high accuracies of 76.1428%, 96.6399%, and 97.3130% on HMDB51, UCF101, and UCF50, respectively, outperforming state-of-the-art HAR methods.
Pages: 6800-6807
Page count: 8
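The record stops at the abstract, so as a concrete illustration only, below is a minimal PyTorch sketch of the two-stage pipeline it describes. Everything beyond the 30-frame $7 \times 7 \times 576$ feature volume is an assumption: torchvision's MobileNetV3-Small feature trunk stands in for the Keras multiply_17 layer, and the block depth, head count, feed-forward width, pooling, and all class/function names (SRTNBlock, SRTNClassifier) are hypothetical guesses, not the authors' published SRTN.

```python
import torch
import torch.nn as nn
import torchvision

FRAMES, C, H, W = 30, 576, 7, 7  # per-video feature volume quoted in the abstract

# Frame-level backbone: the feature trunk of torchvision's MobileNetV3-Small
# ends in a 576-channel 7x7 map for 224x224 input, matching the volume above.
# The paper's 'multiply_17' is a Keras layer name with no exact torchvision
# counterpart, so this backbone is an approximation.
backbone = torchvision.models.mobilenet_v3_small(weights="DEFAULT").features
backbone.eval()

class SRTNBlock(nn.Module):
    """One residual transformer encoder block over the temporal axis; a
    guess at the paper's 'sequential residual transformer network' unit."""
    def __init__(self, d_model=576, n_heads=8, ff_mult=4, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ff_mult * d_model),
            nn.GELU(),
            nn.Linear(ff_mult * d_model, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                    # x: (batch, FRAMES, d_model)
        a, _ = self.attn(x, x, x)            # temporal multi-head self-attention
        x = self.norm1(x + a)                # residual connection around attention
        return self.norm2(x + self.ffn(x))   # residual around the feed-forward

class SRTNClassifier(nn.Module):
    """Pools each frame's 7x7 grid, runs SRTN blocks over the 30-frame
    sequence, and classifies the clip (depth and head count are guesses)."""
    def __init__(self, num_classes=51, depth=2):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # (C,7,7) -> (C,1,1) per frame
        self.blocks = nn.Sequential(*[SRTNBlock() for _ in range(depth)])
        self.head = nn.Linear(C, num_classes)

    def forward(self, feats):                # feats: (batch, FRAMES, C, H, W)
        b, t, c, h, w = feats.shape
        x = self.pool(feats.reshape(b * t, c, h, w)).reshape(b, t, c)
        x = self.blocks(x)                   # long-range temporal modelling
        return self.head(x.mean(dim=1))      # average over frames -> class logits

# Usage on one randomly generated clip of 30 RGB frames:
with torch.no_grad():
    frames = torch.randn(FRAMES, 3, 224, 224)                    # sampled frames
    feats = backbone(frames)                                     # (30, 576, 7, 7)
    logits = SRTNClassifier(num_classes=51)(feats.unsqueeze(0))  # (1, 51) for HMDB51
```

Pooling each frame to a single 576-dimensional token before the transformer keeps the attention cost at 30x30 per head, which is consistent with the abstract's emphasis on efficiency for consumer devices, though the authors' actual tokenization may differ.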