A survey of the vision transformers and their CNN-transformer based variants

Cited by: 0
Authors
Asifullah Khan
Zunaira Rauf
Anabia Sohail
Abdul Rehman Khan
Hifsa Asif
Aqsa Asif
Umair Farooq
Affiliations
[1] Pakistan Institute of Engineering & Applied Sciences, Pattern Recognition Lab, Department of Computer & Information Sciences
[2] PIEAS Artificial Intelligence Center (PAIC), Center for Mathematical Sciences
[3] Pakistan Institute of Engineering & Applied Sciences, Department of Electrical Engineering and Computer Science
[4] Pakistan Institute of Engineering & Applied Sciences
[5] Khalifa University of Science and Technology
[6] Air University
Source
Artificial Intelligence Review | 2023 / Vol. 56
Keywords
Auto encoder; Channel boosting; Computer vision; Convolutional neural networks; Deep learning; Hybrid vision transformers; Image processing; Self-attention; Transformer
DOI
Not available
Abstract
Vision transformers have become popular as a possible substitute for convolutional neural networks (CNNs) in a variety of computer vision applications. With their ability to capture global relationships in images, these transformers offer large learning capacity. However, they may suffer from limited generalization, as they do not tend to model the local correlations present in images. Recently, hybridization of the convolution operation and the self-attention mechanism has emerged in vision transformers to exploit both local and global image representations. These hybrid vision transformers, also referred to as CNN-Transformer architectures, have demonstrated remarkable results in vision applications. Given the rapidly growing number of hybrid vision transformers, it has become necessary to provide a taxonomy and explanation of these hybrid architectures. This survey presents a taxonomy of recent vision transformer architectures and, more specifically, of hybrid vision transformers. It also discusses the key features of these architectures, such as the attention mechanisms, positional embeddings, multi-scale processing, and convolution. In contrast to previous survey papers, which focus primarily on individual vision transformer architectures or on CNNs, this survey uniquely emphasizes the emerging trend of hybrid vision transformers. By showcasing the potential of hybrid vision transformers to deliver exceptional performance across a range of computer vision tasks, this survey sheds light on the future directions of this rapidly evolving class of architectures.
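To make the hybrid CNN-Transformer idea concrete, the following minimal PyTorch sketch pairs a small convolutional stem, which models local correlations, with a standard transformer encoder that applies global self-attention over the resulting patch tokens. This is an illustrative assumption, not the architecture of any specific surveyed model; all layer names and sizes are hypothetical.

import torch
import torch.nn as nn

class HybridCNNTransformer(nn.Module):
    """Convolutional stem for local features + transformer encoder for global context."""

    def __init__(self, in_channels=3, embed_dim=96, num_heads=4, depth=2, num_classes=1000):
        super().__init__()
        # Convolutional stem: two stride-2 convolutions (4x spatial downsampling)
        # capture local correlations before tokenization.
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, embed_dim // 2, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(embed_dim // 2),
            nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim // 2, embed_dim, kernel_size=3, stride=2, padding=1),
        )
        # Transformer encoder: multi-head self-attention models global relationships
        # among the patch tokens. Positional embeddings are omitted for brevity.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.stem(x)                       # (B, C, H/4, W/4) local feature map
        tokens = x.flatten(2).transpose(1, 2)  # (B, N, C) sequence of patch tokens
        tokens = self.encoder(tokens)          # global self-attention over tokens
        return self.head(tokens.mean(dim=1))   # mean-pool tokens, then classify

# Example: logits = HybridCNNTransformer()(torch.randn(1, 3, 224, 224))  # -> (1, 1000)

Many of the hybrid designs covered in the survey differ mainly in where and how often convolution and self-attention are interleaved; the sketch above shows only the simplest arrangement, a convolutional stem followed by a transformer encoder.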
Pages: 2917-2970
Page count: 53