A Survey of Visual Transformers

Cited by: 220
Authors
Liu, Yang [1 ,2 ]
Zhang, Yao [1 ,2 ]
Wang, Yixin [3 ]
Hou, Feng [1 ,2 ]
Yuan, Jin [4 ]
Tian, Jiang [5 ]
Zhang, Yang [5 ]
Shi, Zhongchao [5 ]
Fan, Jianping [5 ]
He, Zhiqiang [1 ,6 ,7 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Beijing 100000, Peoples R China
[2] Univ Chinese Acad Sci, Sch Comp Sci & Technol, Beijing 100000, Peoples R China
[3] Stanford Univ, Sch Engn, Palo Alto, CA 94305 USA
[4] Southeast Univ, Sch Comp Sci & Engn, Nanjing 214135, Peoples R China
[5] AI Lab, Lenovo Res, Beijing 100000, Peoples R China
[6] Univ Chinese Acad Sci, Beijing 100000, Peoples R China
[7] Lenovo Ltd, Beijing 100000, Peoples R China
Keywords
Classification; computer vision (CV); detection; point clouds; segmentation; self-supervision; visual-linguistic pretraining; visual Transformer; BOTTOM-UP; TOP-DOWN; DEEP; ATTENTION
DOI
10.1109/TNNLS.2022.3227717
CLC number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Transformer, an attention-based encoder-decoder model, has already revolutionized the field of natural language processing (NLP). Inspired by such significant achievements, some pioneering works have recently employed Transformer-like architectures in the computer vision (CV) field, demonstrating their effectiveness on three fundamental CV tasks (classification, detection, and segmentation) as well as multiple sensory data streams (images, point clouds, and vision-language data). Owing to their competitive modeling capabilities, visual Transformers have achieved impressive performance improvements over modern convolutional neural networks (CNNs) on multiple benchmarks. In this survey, we have comprehensively reviewed over one hundred different visual Transformers according to three fundamental CV tasks and different data stream types, and we propose a taxonomy that organizes the representative methods by their motivations, structures, and application scenarios. Because of their differences in training settings and dedicated vision tasks, we have also evaluated and compared these existing visual Transformers under different configurations. Furthermore, we reveal a series of essential but unexploited aspects that may empower such visual Transformers to stand out from numerous architectures, e.g., slack high-level semantic embeddings to bridge the gap between the visual Transformers and the sequential ones. Finally, three promising research directions are suggested for future investigation. We will continue to update the latest articles and their released source codes at https://github.com/liuyang-ict/awesome-visual-transformers.
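To make the core operation behind the surveyed models concrete, the following is a minimal, illustrative sketch of single-head scaled dot-product self-attention over flattened image patches, the mechanism at the heart of visual Transformers such as ViT. It is a sketch only, written in NumPy under assumed ViT-Base-like shapes (a 224x224x3 input, 16x16 patches); the function names (patchify, self_attention) and all hyperparameters are hypothetical and not taken from any paper covered by the survey.

```python
# A minimal sketch of patch-based self-attention, assuming NumPy and
# ViT-Base-style shapes. Illustrative only; not the survey's code.
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    p = patch_size
    patches = image.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)
    return patches  # shape: (num_patches, patch_dim)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])       # (N, N) patch-to-patch affinities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v  # each patch aggregates information from all patches

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))        # dummy input image
tokens = patchify(image, 16)                      # (196, 768), as in ViT-Base
d_model, d_head = tokens.shape[-1], 64
w_q, w_k, w_v = (0.02 * rng.standard_normal((d_model, d_head)) for _ in range(3))
out = self_attention(tokens, w_q, w_k, w_v)
print(out.shape)                                  # (196, 64)
```

Unlike a CNN's local convolution, every patch here attends to every other patch in a single step, which is the global-receptive-field property the abstract contrasts with CNNs.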
Pages: 7478-7498
Number of pages: 21