Conformer: Local Features Coupling Global Representations for Visual Recognition

Cited by: 492

Authors:

Peng, Zhiliang [1]

Huang, Wei [1]

Gu, Shanzhi [3]

Xie, Lingxi [2]

Wang, Yaowei [3]

Jiao, Jianbin [1]

Ye, Qixiang [1,3]

Affiliations:

[1] Univ Chinese Acad Sci, Beijing, Peoples R China

[2] Huawei Inc, Shenzhen, Peoples R China

[3] Peng Cheng Lab, Shenzhen, Peoples R China

Source:

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) | 2021

Funding:

National Natural Science Foundation of China;

Keywords:

SCALE;

DOI:

10.1109/ICCV48922.2021.00042

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Within a Convolutional Neural Network (CNN), convolution operations are good at extracting local features but have difficulty capturing global representations. Within a visual transformer, the cascaded self-attention modules can capture long-distance feature dependencies but unfortunately deteriorate local feature details. In this paper, we propose a hybrid network structure, termed Conformer, to take advantage of convolutional operations and self-attention mechanisms for enhanced representation learning. Conformer is rooted in the Feature Coupling Unit (FCU), which fuses local features and global representations under different resolutions in an interactive fashion. Conformer adopts a concurrent structure so that local features and global representations are retained to the maximum extent. Experiments show that Conformer, under comparable parameter complexity, outperforms the visual transformer (DeiT-B) by 2.3% on ImageNet. On MSCOCO, it outperforms ResNet-101 by 3.7% and 3.6% mAP for object detection and instance segmentation, respectively, demonstrating its great potential as a general backbone network. Code is available at github.com/pengzhiliang/Conformer.


Pages: 357-366

Page count: 10
