ViTs as backbones: Leveraging vision transformers for feature extraction

Cited: 1
Authors
Elharrouss, Omar [1 ]
Himeur, Yassine [2 ]
Mahmood, Yasir [1 ]
Alrabaee, Saed [1 ]
Ouamane, Abdelmalik [3 ]
Bensaali, Faycal [4 ]
Bechqito, Yassine [1 ]
Chouchane, Ammar [3 ,5 ]
Affiliations
[1] Coll Informat Technol, Dept Comp Sci & Software Engn, Al Ain, U Arab Emirates
[2] Univ Dubai, Coll Engn & Informat Technol, Dubai, U Arab Emirates
[3] Univ Biskra, Dept Elect Engn, LI3C Lab, Biskra, Algeria
[4] Qatar Univ, Dept Elect Engn, Doha, Qatar
[5] Univ Ctr Barika, Amdoukal Rd, Barika 05001, Algeria
Keywords
Vision transformers; Transformers; Deep learning; Computer vision; Attention; Efficient; Networks; Images
DOI
10.1016/j.inffus.2025.102951
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The emergence of Vision Transformers (ViTs) has marked a significant shift in the field of computer vision, presenting new methodologies that challenge traditional convolutional neural networks (CNNs). This review offers a thorough exploration of ViTs, unpacking their foundational principles, including the self-attention mechanism and multi-head attention, while examining their diverse applications. We delve into the core mechanics of ViTs, such as image patching, positional encoding, and the datasets that underpin their training. By categorizing and comparing ViTs, CNNs, and hybrid models, we shed light on their respective strengths and limitations, offering a nuanced perspective on their roles in advancing computer vision. A critical evaluation of notable ViT architectures, including DeiT, DeepViT, and the Swin Transformer, highlights their efficacy in feature extraction and domain-specific tasks. The review extends its scope to illustrate the versatility of ViTs in applications like image classification, medical imaging, object detection, and visual question answering, supported by case studies on benchmark datasets such as ImageNet and COCO. While ViTs demonstrate remarkable potential, they are not without challenges, including high computational demands, extensive data requirements, and generalization difficulties. To address these limitations, we propose future research directions aimed at improving scalability, efficiency, and adaptability, especially in resource-constrained settings. By providing a comprehensive overview and actionable insights, this review serves as an essential guide for researchers and practitioners navigating the evolving field of vision-based deep learning.
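The core mechanics named in the abstract (image patching, a linear patch embedding, positional encoding, and scaled dot-product self-attention) can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the paper's implementation: the 32x32 image size, 8x8 patch size, 64-dimensional embedding, and random weight matrices are all assumptions chosen for compactness.

```python
import numpy as np

def image_to_patches(img, patch):
    """Split an HxWxC image into N flattened patches, N = (H/patch)*(W/patch)."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    rows, cols = H // patch, W // patch
    # Regroup pixels so each (patch x patch x C) block becomes one row vector.
    blocks = img.reshape(rows, patch, cols, patch, C).transpose(0, 2, 1, 3, 4)
    return blocks.reshape(rows * cols, patch * patch * C)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a token sequence."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (N, N) pairwise patch affinities
    return softmax(scores) @ V                # attention-weighted mix of values

rng = np.random.default_rng(0)
img = rng.normal(size=(32, 32, 3))            # toy 32x32 RGB "image"
patches = image_to_patches(img, patch=8)      # 16 patches, 192 values each
d_model = 64
W_embed = rng.normal(size=(patches.shape[1], d_model)) * 0.02
tokens = patches @ W_embed                    # linear patch embedding
tokens = tokens + rng.normal(size=tokens.shape) * 0.02  # stand-in for a learned positional encoding
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(3))
out = self_attention(tokens, Wq, Wk, Wv)
print(patches.shape, out.shape)               # (16, 192) (16, 64)
```

A full ViT repeats this attention block (with multiple heads, residual connections, and MLP layers) and prepends a learnable class token, but the shape bookkeeping above is the essential departure from CNNs: the image becomes a sequence, and every patch can attend to every other patch in a single layer.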
Pages: 48