Convolutional Neural Networks or Vision Transformers: Who Will Win the Race for Action Recognitions in Visual Data?

Cited by: 64
Authors
Moutik, Oumaima [1 ]
Sekkat, Hiba [1 ]
Tigani, Smail [1 ]
Chehri, Abdellah [2 ]
Saadane, Rachid [3 ]
Tchakoucht, Taha Ait [1 ]
Paul, Anand [4 ]
Affiliations
[1] Euro Mediterranean Univ, Euromed Res Ctr, Engn Unit, Fes 30030, Morocco
[2] Royal Mil Coll Canada, Dept Math & Comp Sci, Kingston, ON K7K 7B4, Canada
[3] Hassania Sch Publ Works, SIRC LaGeS, Casablanca 8108, Morocco
[4] Kyungpook Natl Univ, Sch Comp Sci & Engn, Daegu 41566, South Korea
Keywords
convolutional neural networks; vision transformers; recurrent neural networks; conversational systems; action recognition; natural language understanding; computer vision; representation; attention
DOI
10.3390/s23020734
Abstract
Understanding actions in videos remains a significant challenge in computer vision and has been the subject of extensive research over the last decades. Convolutional neural networks (CNNs) are a central component of this topic and have played a crucial role in the rise of deep learning. Inspired by the human visual system, CNNs have been applied to visual data and have solved a wide range of computer vision and video/image analysis tasks, including action recognition (AR). More recently, following the success of the transformer in natural language processing (NLP), transformers have begun to set new trends in vision tasks, prompting a debate over whether Vision Transformer (ViT) models will replace CNNs for action recognition in video clips. This paper examines this trending question in detail: it studies CNNs and transformers for action recognition separately and then presents a comparative study of their accuracy-complexity trade-off. Finally, based on the outcome of the performance analysis, we discuss whether CNNs or Vision Transformers will win the race.
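The accuracy-complexity trade-off mentioned above hinges in part on how the two architectures scale: a convolution's cost grows linearly with the spatial size of the feature map, while self-attention's cost grows quadratically with the number of tokens. The following is a minimal illustrative sketch of per-layer multiply-accumulate (MAC) counts; the layer shapes (a 14x14, 768-channel feature map vs. 196 tokens of width 768, roughly the patch grid a ViT-Base-style model produces for a 224x224 image) are assumptions for illustration, not figures from the paper.

```python
# Rough per-layer multiply-accumulate (MAC) counts for a convolution
# vs. a self-attention layer. Shapes are illustrative assumptions.

def conv2d_macs(h, w, c_in, c_out, k):
    """MACs for one k x k convolution over an h x w feature map:
    cost is linear in the spatial size h * w."""
    return h * w * c_in * c_out * k * k

def self_attention_macs(n, d):
    """MACs for one self-attention layer over n tokens of width d:
    Q/K/V/output projections (4 * n * d^2) plus the two n x n
    attention matrix products (2 * n^2 * d) -- quadratic in n."""
    return 4 * n * d * d + 2 * n * n * d

# 14x14 feature map, 768 channels vs. 196 tokens of width 768.
conv = conv2d_macs(14, 14, 768, 768, 3)
attn = self_attention_macs(196, 768)
print(f"3x3 conv layer:       {conv:,} MACs")
print(f"self-attention layer: {attn:,} MACs")
```

At this resolution the two are within the same order of magnitude, which is why the comparison in the paper comes down to measured accuracy per unit of compute rather than asymptotics alone; at higher token counts (e.g., dense video frames) the quadratic attention term dominates.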
Pages: 21