PVT v2: Improved baselines with Pyramid Vision Transformer

Cited by: 918
Authors
Wang, Wenhai [1 ,2 ]
Xie, Enze [3 ]
Li, Xiang [4 ]
Fan, Deng-Ping [5 ]
Song, Kaitao [4 ]
Liang, Ding [6 ]
Lu, Tong [2 ]
Luo, Ping [3 ]
Shao, Ling [7 ]
Affiliations
[1] Shanghai AI Lab, Shanghai 200232, Peoples R China
[2] Nanjing Univ, Dept Comp Sci & Technol, Nanjing 210023, Peoples R China
[3] Univ Hong Kong, Dept Comp Sci, Hong Kong 999077, Peoples R China
[4] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Nanjing 210014, Peoples R China
[5] Swiss Fed Inst Technol, Comp Vis Lab, CH-8092 Zurich, Switzerland
[6] SenseTime, Beijing 100080, Peoples R China
[7] Incept Inst Artificial Intelligence, Abu Dhabi, U Arab Emirates
Funding
National Natural Science Foundation of China;
Keywords
transformers; dense prediction; image classification; object detection; semantic segmentation;
DOI
10.1007/s41095-022-0274-8
Chinese Library Classification
TP31 [Computer Software];
Discipline Codes
081202; 0835;
Abstract
Transformers have recently led to encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) with three designs: (i) a linear complexity attention layer, (ii) an overlapping patch embedding, and (iii) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linear and delivers significant improvements on fundamental vision tasks such as classification, detection, and segmentation. In particular, PVT v2 achieves comparable or better performance than recent work such as the Swin Transformer. We hope this work will facilitate state-of-the-art transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
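The first of the three designs, the linear complexity attention layer, can be illustrated with a small sketch: instead of attending over all n tokens (an n x n score matrix), keys and values are average-pooled to a fixed-size grid, so the score matrix is n x p^2 for a constant p. The NumPy snippet below is a hedged, single-head illustration only — the function name `pooled_attention` and the 2 x 2 pooling grid are assumptions for brevity; the paper's actual layer uses learned Q/K/V projections, multi-head attention, and a larger fixed pooling size, all omitted here.

```python
import numpy as np

def pooled_attention(x, h, w, pool=2):
    """Single-head attention with average-pooled keys/values (illustrative).

    x    : (n, d) token sequence flattened from an h*w feature map (n = h*w)
    pool : side length of the fixed output grid for pooled keys/values
    """
    n, d = x.shape
    assert n == h * w and h % pool == 0 and w % pool == 0
    grid = x.reshape(h, w, d)
    # Average-pool the feature map down to a pool x pool grid of tokens,
    # so the number of keys/values is a constant pool*pool, not n.
    kv = grid.reshape(pool, h // pool, pool, w // pool, d).mean(axis=(1, 3))
    kv = kv.reshape(pool * pool, d)
    # Score matrix is (n, pool*pool): linear in n instead of quadratic.
    scores = x @ kv.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ kv  # (n, d): each token is a convex mix of pooled tokens
```

Because the pooled key/value count is constant, the attention cost drops from O(n^2 d) to O(n p^2 d), i.e., linear in the number of input tokens — the property the abstract refers to.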
Pages: 415-424
Page count: 10
References
41 in total
[1]
Chen C.-F.-R., 2021, Proc. IEEE/CVF Int. Conf. on Computer Vision, p. 357
[2]
Chen K., 2019, arXiv:1906.07155
[3] DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs [J].
Chen, Liang-Chieh; Papandreou, George; Kokkinos, Iasonas; Murphy, Kevin; Yuille, Alan L.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40 (04): 834-848
[4]
Chu X., 2021, Proc. 35th Conf. on Neural Information Processing Systems
[5]
Deng J., 2009, Proc. IEEE CVPR, p. 248, DOI 10.1109/CVPRW.2009.5206848
[6]
Dong B., 2021, arXiv preprint
[7]
Dosovitskiy A., 2021, ICLR, DOI 10.48550/arXiv.2010.11929
[8]
Glorot X., 2010, Proc. 13th Int. Conf. on Artificial Intelligence and Statistics, p. 249
[9]
Graham B., 2021, Proc. IEEE/CVF Int. Conf. on Computer Vision, p. 12259
[10]
Han K., 2021, Advances in Neural Information Processing Systems, DOI 10.48550/arXiv.2103.00112