EMO-MoviNet: Enhancing Action Recognition in Videos with EvoNorm, Mish Activation, and Optimal Frame Selection for Efficient Mobile Deployment

Cited by: 1
Authors
Hussain, Tarique [1 ]
Memon, Zulfiqar Ali [1 ]
Qureshi, Rizwan [1 ]
Alam, Tanvir [2 ]
Affiliations
[1] National University of Computer and Emerging Sciences, FAST School of Computing, Karachi Campus, Karachi 75030, Pakistan
[2] Hamad Bin Khalifa University, College of Science and Engineering, Doha, Qatar
Keywords
mobile networks; video classification; action recognition; deep learning
DOI
10.3390/s23198106
Abstract
The primary goal of this study is to develop a deep neural network for action recognition that improves accuracy while minimizing computational cost. To this end, we propose a modified EMO-MoviNet-A2* architecture that integrates Evolving Normalization (EvoNorm), the Mish activation function, and optimal frame selection to improve the accuracy and efficiency of action recognition in videos. The asterisk indicates that the model also incorporates the stream-buffer concept. The Mobile Video Network (MoviNet) is a family of memory-efficient architectures discovered through Neural Architecture Search (NAS) that balances accuracy and efficiency by integrating spatial, temporal, and spatio-temporal operations. We implement the MoviNet model, pre-trained on the Kinetics dataset, on the UCF101 and HMDB51 datasets. On UCF101 we observed a generalization gap: the model performed better on the training set than on the test set. To address this, we replaced batch normalization with EvoNorm, which unifies normalization and activation in a single function. Key-frame selection was another area that required improvement, so we developed a novel technique called Optimal Frame Selection (OFS) that identifies key-frames within videos more effectively than random or dense frame-sampling methods. Combining OFS with the Mish nonlinearity yielded a 0.8-1% accuracy improvement in our 20-class UCF101 experiment. On UCF101, the EMO-MoviNet-A2* model consumes 86% fewer FLOPs and approximately 90% fewer parameters, at the cost of 1-2% accuracy. On HMDB51, it achieves 5-7% higher accuracy while requiring seven times fewer FLOPs and ten times fewer parameters than the reference model, Motion-Augmented RGB Stream (MARS).
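For readers unfamiliar with the building blocks named in the abstract, the sketch below illustrates the Mish activation, mish(x) = x * tanh(softplus(x)), and one member of the EvoNorm family, EvoNorm-S0, which fuses normalization and activation in a single expression. This is a minimal PyTorch illustration of the published definitions, not the authors' implementation: the abstract does not state which EvoNorm variant (B- or S-series) the paper uses, the group count is an illustrative assumption, and the example uses 4D image tensors for brevity where the video model would use the 3D analogue.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mish(x):
    # Mish (Misra, 2019): x * tanh(softplus(x))
    return x * torch.tanh(F.softplus(x))

class EvoNormS0(nn.Module):
    """EvoNorm-S0 (Liu et al., 2020): fuses normalization and activation
    as y = x * sigmoid(v * x) / group_std(x) * gamma + beta."""
    def __init__(self, channels, groups=8, eps=1e-5):
        super().__init__()
        self.groups, self.eps = groups, eps
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.v = nn.Parameter(torch.ones(1, channels, 1, 1))

    def group_std(self, x):
        # Standard deviation over each channel group, broadcast back to x's shape.
        n, c, h, w = x.shape
        g = x.view(n, self.groups, c // self.groups, h, w)
        std = torch.sqrt(g.var(dim=(2, 3, 4), keepdim=True) + self.eps)
        return std.expand_as(g).reshape(n, c, h, w)

    def forward(self, x):
        return x * torch.sigmoid(self.v * x) / self.group_std(x) * self.gamma + self.beta

# Example: apply both to a dummy feature map.
feat = torch.randn(2, 16, 32, 32)
print(EvoNormS0(16)(mish(feat)).shape)  # torch.Size([2, 16, 32, 32])
```

Because EvoNorm-S0 relies on per-sample group statistics rather than batch statistics, it behaves identically at training and inference time, which is one reason the family is attractive as a drop-in replacement for batch normalization.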
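The abstract describes Optimal Frame Selection (OFS) only at a high level. Purely as a hypothetical illustration of the general idea of picking informative key-frames rather than sampling randomly or densely, the sketch below scores frames by inter-frame difference and keeps the top-k. The scoring heuristic and the `select_key_frames` helper are assumptions made for this example; they are not the paper's published OFS algorithm.

```python
import numpy as np

def select_key_frames(frames, k=16):
    """Hypothetical key-frame selection sketch (NOT the paper's OFS):
    score each frame by mean absolute difference from its predecessor,
    then keep the k highest-scoring frames in temporal order.

    frames: ndarray of shape (T, H, W, C)
    """
    frames = frames.astype(np.float32)
    # Frame 0 has no predecessor; give it a zero score.
    diffs = np.abs(frames[1:] - frames[:-1]).mean(axis=(1, 2, 3))
    scores = np.concatenate([[0.0], diffs])
    # Top-k indices by score, re-sorted to preserve temporal order.
    keep = np.sort(np.argsort(scores)[-k:])
    return frames[keep], keep

# Example: 64 random "frames" of 112x112 RGB, reduced to 16 key-frames.
video = np.random.rand(64, 112, 112, 3)
clip, idx = select_key_frames(video, k=16)
print(clip.shape, idx[:5])
```

Any such selection scheme trades a small preprocessing cost for a shorter, more informative clip, which is consistent with the paper's goal of cutting FLOPs without sacrificing accuracy.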
Pages: 18
References (60 in total)
[1] Ali, Saad; Shah, Mubarak. Human Action Recognition in Videos Using Kinematic Features and Multiple Instance Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010, 32(2): 288-303.
[2] Brattoli, Biagio; Tighe, Joseph; Zhdanov, Fedor; Perona, Pietro; Chalupka, Krzysztof. Rethinking Zero-shot Video Classification: End-to-end Training for Realistic Applications. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020: 4612-4622.
[3] Brezeale, Darin; Cook, Diane J. Automatic Video Classification: A Survey of the Literature. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 2008, 38(3): 416-430.
[4] Carreira, Joao; Zisserman, Andrew. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017: 4724-4733.
[5] Cheng, Ke; Zhang, Yifan; He, Xiangyu; Cheng, Jian; Lu, Hanqing. Extremely Lightweight Skeleton-Based Action Recognition With ShiftGCN++. IEEE Transactions on Image Processing, 2021, 30: 7333-7348.
[6] Crasto, Nieves; Weinzaepfel, Philippe; Alahari, Karteek; Schmid, Cordelia. MARS: Motion-Augmented RGB Stream for Action Recognition. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019: 7874-7883.
[7] Feichtenhofer, Christoph. X3D: Expanding Architectures for Efficient Video Recognition. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020: 200-210.
[8] Howard, Andrew G. et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861, 2017.
[9] Hafiz, Abdul Mueed; Bhat, Ghulam Mohiuddin. A Survey on Instance Segmentation: State of the Art. International Journal of Multimedia Information Retrieval, 2020, 9(3): 171-189.
[10] Hara, Kensho; Kataoka, Hirokatsu; Satoh, Yutaka. Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018: 6546-6555.