Conformer: Local Features Coupling Global Representations for Visual Recognition

Cited by: 492

Authors:

Peng, Zhiliang [1]

Huang, Wei [1]

Gu, Shanzhi [3]

Xie, Lingxi [2]

Wang, Yaowei [3]

Jiao, Jianbin [1]

Ye, Qixiang [1,3]

Affiliations:

[1] Univ Chinese Acad Sci, Beijing, Peoples R China

[2] Huawei Inc, Shenzhen, Peoples R China

[3] Peng Cheng Lab, Shenzhen, Peoples R China

Source:

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) | 2021

Funding:

National Natural Science Foundation of China;

Keywords:

SCALE;

DOI:

10.1109/ICCV48922.2021.00042

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Within a Convolutional Neural Network (CNN), the convolution operations are good at extracting local features but have difficulty capturing global representations. Within a visual transformer, the cascaded self-attention modules can capture long-distance feature dependencies but unfortunately deteriorate local feature details. In this paper, we propose a hybrid network structure, termed Conformer, to take advantage of convolutional operations and self-attention mechanisms for enhanced representation learning. Conformer is rooted in the Feature Coupling Unit (FCU), which fuses local features and global representations under different resolutions in an interactive fashion. Conformer adopts a concurrent structure so that local features and global representations are retained to the maximum extent. Experiments show that Conformer, under comparable parameter complexity, outperforms the visual transformer (DeiT-B) by 2.3% on ImageNet. On MSCOCO, it outperforms ResNet-101 by 3.7% and 3.6% mAP for object detection and instance segmentation, respectively, demonstrating its potential as a general backbone network. Code is available at github.com/pengzhiliang/Conformer.
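
Illustrative note: the abstract says the FCU fuses local features and global representations "under different resolutions in an interactive fashion" but gives no implementation detail. The sketch below is not the authors' code. It is a minimal, single-channel, pure-Python illustration of one way such a two-way exchange could look, assuming average pooling to bring a feature map down to token resolution, nearest-neighbour upsampling to bring tokens back up, and fusion by addition. All function names (`map_to_tokens`, `tokens_to_map`, `couple`) are hypothetical.

```python
# Illustrative sketch only; the names and the pooling, upsampling and addition
# choices are assumptions, not details taken from the paper.

def map_to_tokens(feature_map, stride):
    """Average-pool an H x W map into a flat list of (H/stride)*(W/stride) tokens.

    Assumes H and W are divisible by stride.
    """
    h, w = len(feature_map), len(feature_map[0])
    tokens = []
    for i in range(0, h, stride):
        for j in range(0, w, stride):
            window = [feature_map[y][x]
                      for y in range(i, i + stride)
                      for x in range(j, j + stride)]
            tokens.append(sum(window) / len(window))
    return tokens


def tokens_to_map(tokens, grid_w, stride):
    """Nearest-neighbour upsample a flat token list back to a 2-D map."""
    grid_h = len(tokens) // grid_w
    return [[tokens[(y // stride) * grid_w + (x // stride)]
             for x in range(grid_w * stride)]
            for y in range(grid_h * stride)]


def couple(feature_map, tokens, stride):
    """One two-way exchange: each branch receives the other's input by addition."""
    grid_w = len(feature_map[0]) // stride
    # Local -> global: pooled map values are added to the incoming tokens.
    pooled = map_to_tokens(feature_map, stride)
    new_tokens = [t + p for t, p in zip(tokens, pooled)]
    # Global -> local: the incoming tokens are upsampled and added to the map.
    upsampled = tokens_to_map(tokens, grid_w, stride)
    new_map = [[a + b for a, b in zip(row_f, row_u)]
               for row_f, row_u in zip(feature_map, upsampled)]
    return new_map, new_tokens
```

For example, `couple(fm, [10, 20, 30, 40], 2)` on a 4 x 4 map `fm` returns a 4 x 4 map and four tokens, so both branches keep their own resolution while receiving information from the other.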

Citation

Pages: 357-366

Page count: 10
