Conformer: Local Features Coupling Global Representations for Visual Recognition

Cited by: 492
Authors
Peng, Zhiliang [1]
Huang, Wei [1]
Gu, Shanzhi [3]
Xie, Lingxi [2]
Wang, Yaowei [3]
Jiao, Jianbin [1]
Ye, Qixiang [1, 3]
Affiliations
[1] Univ Chinese Acad Sci, Beijing, Peoples R China
[2] Huawei Inc, Shenzhen, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
Source
2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) | 2021
Funding
National Natural Science Foundation of China;
Keywords
SCALE;
DOI
10.1109/ICCV48922.2021.00042
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Within a Convolutional Neural Network (CNN), convolution operations are good at extracting local features but have difficulty capturing global representations. Within a visual transformer, cascaded self-attention modules can capture long-distance feature dependencies but tend to degrade local feature details. In this paper, we propose a hybrid network structure, termed Conformer, that takes advantage of both convolution operations and self-attention mechanisms for enhanced representation learning. Conformer is built on the Feature Coupling Unit (FCU), which fuses local features and global representations at different resolutions in an interactive fashion. Conformer adopts a concurrent structure so that local features and global representations are retained to the maximum extent. Experiments show that Conformer, under comparable parameter complexity, outperforms the visual transformer (DeiT-B) by 2.3% on ImageNet. On MSCOCO, it outperforms ResNet-101 by 3.7% and 3.6% mAP for object detection and instance segmentation, respectively, demonstrating its potential as a general backbone network. Code is available at github.com/pengzhiliang/Conformer.
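
The abstract does not spell out how the FCU aligns the two branches, so the following is only a minimal sketch of the general idea of exchanging features between a convolutional feature map and a sequence of transformer tokens that live at different resolutions. All function names are hypothetical, and the design choices are assumptions rather than the authors' implementation: average pooling to go from feature map to tokens, nearest-neighbour upsampling to go back, element-wise addition as the fusion, equal channel and token dimensions, and a feature-map size divisible by the stride.

```python
# Hypothetical sketch of two-way feature coupling between a C x H x W feature
# map (nested lists) and N tokens of dimension C. Not the authors' FCU: the
# pooling, upsampling, and additive fusion used here are assumptions.

def feature_map_to_tokens(fmap, stride):
    """Average-pool each stride x stride window and flatten to a token list."""
    channels = len(fmap)
    height, width = len(fmap[0]), len(fmap[0][0])
    tokens = []
    for top in range(0, height, stride):
        for left in range(0, width, stride):
            token = []
            for c in range(channels):
                window = [fmap[c][y][x]
                          for y in range(top, top + stride)
                          for x in range(left, left + stride)]
                token.append(sum(window) / len(window))
            tokens.append(token)
    return tokens


def tokens_to_feature_map(tokens, grid_h, grid_w, stride):
    """Reshape tokens to a grid and nearest-neighbour upsample by stride."""
    channels = len(tokens[0])
    return [[[tokens[(y // stride) * grid_w + (x // stride)][c]
              for x in range(grid_w * stride)]
             for y in range(grid_h * stride)]
            for c in range(channels)]


def couple(fmap, tokens, stride):
    """One exchange step: each branch adds the other's aligned features."""
    height, width = len(fmap[0]), len(fmap[0][0])
    grid_h, grid_w = height // stride, width // stride
    from_cnn = feature_map_to_tokens(fmap, stride)
    from_tokens = tokens_to_feature_map(tokens, grid_h, grid_w, stride)
    new_tokens = [[t + f for t, f in zip(tok, add)]
                  for tok, add in zip(tokens, from_cnn)]
    new_fmap = [[[fmap[c][y][x] + from_tokens[c][y][x] for x in range(width)]
                 for y in range(height)]
                for c in range(len(fmap))]
    return new_fmap, new_tokens
```

A real model would typically use learned projections and normalization on each path; the sketch keeps only the resolution alignment so that the two-way exchange is easy to follow.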
Pages: 357-366
Page count: 10