Conformer: Local Features Coupling Global Representations for Visual Recognition

Cited by: 492
Authors
Peng, Zhiliang [1]
Huang, Wei [1]
Gu, Shanzhi [3]
Xie, Lingxi [2]
Wang, Yaowei [3]
Jiao, Jianbin [1]
Ye, Qixiang [1, 3]
Affiliations
[1] Univ Chinese Acad Sci, Beijing, Peoples R China
[2] Huawei Inc, Shenzhen, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
Source
2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021) | 2021
Funding
National Natural Science Foundation of China;
Keywords
SCALE;
DOI
10.1109/ICCV48922.2021.00042
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Within a Convolutional Neural Network (CNN), convolution operations are good at extracting local features but have difficulty capturing global representations. Within a visual transformer, the cascaded self-attention modules can capture long-distance feature dependencies but unfortunately deteriorate local feature details. In this paper, we propose a hybrid network structure, termed Conformer, to take advantage of convolutional operations and self-attention mechanisms for enhanced representation learning. Conformer is rooted in the Feature Coupling Unit (FCU), which fuses local features and global representations under different resolutions in an interactive fashion. Conformer adopts a concurrent structure so that local features and global representations are retained to the maximum extent. Experiments show that Conformer, under comparable parameter complexity, outperforms the visual transformer (DeiT-B) by 2.3% on ImageNet. On MSCOCO, it outperforms ResNet-101 by 3.7% and 3.6% mAPs for object detection and instance segmentation, respectively, demonstrating its great potential as a general backbone network. Code is available at github.com/pengzhiliang/Conformer.
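The coupling idea in the abstract, where a CNN feature map and a set of transformer tokens exchange information across different resolutions, can be illustrated with a rough NumPy sketch. This is not the authors' implementation (see the repository linked above for that). It assumes average pooling and nearest-neighbour upsampling for spatial alignment and a plain channel projection standing in for a 1x1 convolution, and it omits normalization layers and the class token. All function names, weight names, and shapes here are hypothetical.

```python
import numpy as np


def cnn_to_tokens(feat, w_proj, patch):
    """Local -> global: pool a (C, H, W) map to the patch grid, project to D dims.

    Returns an array of shape (N, D), where N = (H // patch) * (W // patch).
    H and W are assumed to be multiples of `patch`.
    """
    c, h, w = feat.shape
    gh, gw = h // patch, w // patch
    # Average-pool non-overlapping patch x patch windows -> (C, gh, gw).
    pooled = feat.reshape(c, gh, patch, gw, patch).mean(axis=(2, 4))
    # A 1x1 convolution is a per-position linear map over channels.
    return pooled.reshape(c, gh * gw).T @ w_proj  # (N, C) @ (C, D) -> (N, D)


def tokens_to_cnn(tokens, w_proj, grid, patch):
    """Global -> local: project (N, D) tokens to C channels, upsample to (C, H, W)."""
    gh, gw = grid
    fmap = (tokens @ w_proj).T.reshape(-1, gh, gw)  # (C, gh, gw)
    # Nearest-neighbour upsampling back to the feature-map resolution.
    return fmap.repeat(patch, axis=1).repeat(patch, axis=2)


def coupling_step(feat, tokens, w_c2t, w_t2c, patch):
    """One bidirectional exchange: each branch adds the other's aligned features."""
    _, h, w = feat.shape
    grid = (h // patch, w // patch)
    new_tokens = tokens + cnn_to_tokens(feat, w_c2t, patch)
    new_feat = feat + tokens_to_cnn(new_tokens, w_t2c, grid, patch)
    return new_feat, new_tokens
```

Both branches keep their own shapes after the exchange, which is the point of a concurrent design: the feature map stays at (C, H, W) and the tokens stay at (N, D), so each branch can continue with its own blocks.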
Pages: 357-366
Page count: 10