A survey of the vision transformers and their CNN-transformer based variants

Cited by: 0
Authors
Asifullah Khan
Zunaira Rauf
Anabia Sohail
Abdul Rehman Khan
Hifsa Asif
Aqsa Asif
Umair Farooq
Affiliations
[1] Pattern Recognition Lab, Department of Computer & Information Sciences, Pakistan Institute of Engineering & Applied Sciences
[2] PIEAS Artificial Intelligence Center (PAIC), Center for Mathematical Sciences
[3] Department of Electrical Engineering and Computer Science, Pakistan Institute of Engineering & Applied Sciences
[4] Pakistan Institute of Engineering & Applied Sciences
[5] Khalifa University of Science and Technology
[6] Air University
Source
Artificial Intelligence Review | 2023, Vol. 56
Keywords
Auto encoder; Channel boosting; Computer vision; Convolutional neural networks; Deep learning; Hybrid vision transformers; Image processing; Self-attention; Transformer
DOI: not available
Abstract
Vision transformers have become popular as a possible substitute for convolutional neural networks (CNNs) in a variety of computer vision applications. With their ability to capture global relationships in images, these transformers offer large learning capacity. However, they may suffer from limited generalization, as they do not tend to model the local correlations present in images. Recently, hybrid vision transformers that combine the convolution operation with the self-attention mechanism have emerged to exploit both local and global image representations. These hybrid vision transformers, also referred to as CNN-Transformer architectures, have demonstrated remarkable results on vision tasks. Given the rapidly growing number of hybrid vision transformers, it has become necessary to provide a taxonomy and explanation of these hybrid architectures. This survey presents a taxonomy of recent vision transformer architectures and, more specifically, of hybrid vision transformers. It also discusses key features of these architectures, such as the attention mechanisms, positional embeddings, multi-scale processing, and convolution. In contrast to previous surveys, which focus primarily on individual vision transformer architectures or on CNNs, this survey uniquely emphasizes the emerging trend of hybrid vision transformers. By showcasing their potential to deliver exceptional performance across a range of computer vision tasks, it sheds light on the future directions of this rapidly evolving class of architectures.
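The core idea behind the hybrid blocks the abstract describes — a convolution modelling local correlations followed by self-attention modelling global relationships — can be sketched in a few lines of NumPy. This is a minimal, hypothetical illustration: the token count, embedding width, depthwise 3-tap convolution, and single attention head are assumptions for clarity, not the design of any specific architecture in the survey.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_local(tokens, kernel):
    # tokens: (n, d); kernel: (3, d). A depthwise 3-tap convolution along the
    # token sequence: each output mixes only its immediate neighbours,
    # capturing *local* correlations (the CNN part of the hybrid).
    n, d = tokens.shape
    padded = np.pad(tokens, ((1, 1), (0, 0)))
    out = np.zeros_like(tokens)
    for i in range(n):
        out[i] = (padded[i:i + 3] * kernel).sum(axis=0)
    return out

def self_attention(tokens, wq, wk, wv):
    # Single-head self-attention: every token attends to every other token,
    # capturing *global* relationships (the transformer part of the hybrid).
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(k.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ v

d = 8
tokens = rng.standard_normal((16, d))               # 16 patch embeddings
local = conv_local(tokens, rng.standard_normal((3, d)))
out = self_attention(local, *(rng.standard_normal((d, d)) for _ in range(3)))
print(out.shape)  # (16, 8)
```

In real CNN-Transformer architectures, the convolutional stage is typically a patch-embedding stem or a depthwise convolution inside each block, and the attention stage uses multiple heads with learned positional embeddings; the sketch only makes the local-then-global division of labour concrete.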
Pages: 2917-2970 (53 pages)