Convolutional Neural Networks or Vision Transformers: Who Will Win the Race for Action Recognitions in Visual Data?

Cited by: 64
Authors
Moutik, Oumaima [1]
Sekkat, Hiba [1]
Tigani, Smail [1]
Chehri, Abdellah [2]
Saadane, Rachid [3]
Tchakoucht, Taha Ait [1]
Paul, Anand [4]
Affiliations
[1] Euro Mediterranean Univ, Euromed Res Ctr, Engn Unit, Fes 30030, Morocco
[2] Royal Mil Coll Canada, Dept Math & Comp Sci, Kingston, ON K7K 7B4, Canada
[3] Hassania Sch Publ Works, SIRC LaGeS, Casablanca 8108, Morocco
[4] Kyungpook Natl Univ, Sch Comp Sci & Engn, Daegu 41566, South Korea
Keywords
convolutional neural networks; vision transformers; recurrent neural networks; conversational systems; action recognition; natural language understanding; action recognitions; COMPUTER VISION; REPRESENTATION; ATTENTION;
DOI
10.3390/s23020734
Chinese Library Classification number
O65 [Analytical Chemistry];
Discipline classification code
070302 ; 081704 ;
Abstract
Understanding actions in videos remains a significant challenge in computer vision and has been the subject of extensive research over the past decades. Convolutional neural networks (CNNs) are a major component of this field and have played a crucial role in the prominence of deep learning. Inspired by the human visual system, CNNs have been applied to visual data and have addressed many challenges in computer vision and video/image analysis, including action recognition (AR). Recently, however, following the success of the transformer in natural language processing (NLP), transformers have begun to set new trends in vision tasks, prompting debate over whether Vision Transformer (ViT) models will replace CNNs for action recognition in video clips. This paper examines this topic in detail: it studies CNNs and Transformers for action recognition separately and then compares them in terms of the accuracy-complexity trade-off. Finally, based on the outcome of the performance analysis, it discusses whether CNNs or Vision Transformers will win the race.
Pages: 21