Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition

Cited by: 45
Authors
Hou, Qibin [1 ]
Lu, Cheng-Ze [1 ]
Cheng, Ming-Ming [1 ]
Feng, Jiashi [2 ]
Affiliations
[1] Nankai Univ, Sch Comp Sci, Tianjin 300192, Peoples R China
[2] ByteDance, Singapore 048583, Singapore
Keywords
Convolution; Transformers; Visualization; Kernel; Task analysis; Modulation; Convolutional codes; Convolutional neural networks; vision transformer; convolutional modulation; large-kernel convolution;
DOI
10.1109/TPAMI.2024.3401450
CLC number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Vision Transformers have recently been the most popular network architectures in visual recognition due to their strong ability to encode global information. However, their high computational cost when processing high-resolution images limits their application in downstream tasks. In this paper, we take a deep look at the internal structure of self-attention and present a simple Transformer-style convolutional neural network (ConvNet) for visual recognition. By comparing the design principles of recent ConvNets and Vision Transformers, we propose to simplify self-attention with a convolutional modulation operation. We show that this simple approach can better exploit the large kernels (>= 7 x 7) nested in convolutional layers, and we observe a consistent performance improvement as the kernel size is gradually increased from 5 x 5 to 21 x 21. We build a family of hierarchical ConvNets using the proposed convolutional modulation, termed Conv2Former. Our network is simple and easy to follow. Experiments show that Conv2Former outperforms popular ConvNets and vision Transformers, such as Swin Transformer and ConvNeXt, on ImageNet classification, COCO object detection, and ADE20K semantic segmentation.
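The abstract describes replacing the self-attention similarity matrix with the output of a large-kernel depthwise convolution that modulates a linearly projected value branch via an element-wise (Hadamard) product. The PyTorch sketch below illustrates this convolutional-modulation idea; the exact layer ordering, the 11 x 11 kernel size, and the normalization/residual placement are assumptions for illustration, not the paper's verified implementation.

    import torch
    import torch.nn as nn

    class ConvMod(nn.Module):
        """Sketch of a convolutional modulation block: a large-kernel depthwise
        convolution produces spatial weights that modulate (element-wise multiply)
        a 1x1-projected value branch, standing in for softmax attention."""
        def __init__(self, dim, kernel_size=11):
            super().__init__()
            self.norm = nn.LayerNorm(dim)  # channels-last LayerNorm (assumed placement)
            self.a = nn.Sequential(        # "attention" branch with large depthwise kernel
                nn.Conv2d(dim, dim, 1),
                nn.GELU(),
                nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim),
            )
            self.v = nn.Conv2d(dim, dim, 1)     # value branch
            self.proj = nn.Conv2d(dim, dim, 1)  # output projection

        def forward(self, x):  # x: (B, C, H, W)
            shortcut = x
            x = self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
            x = self.a(x) * self.v(x)           # Hadamard-product modulation
            return shortcut + self.proj(x)      # residual connection (assumed)

    # Minimal usage check:
    # block = ConvMod(dim=64)
    # out = block(torch.randn(1, 64, 56, 56))  # out.shape == (1, 64, 56, 56)

Because the modulation weights come from a convolution rather than a pairwise similarity matrix, the cost grows linearly with the number of pixels, which is why larger kernels (up to 21 x 21) remain affordable at high resolution.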
Pages: 8274-8283
Page count: 10