Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition

Cited by: 45

Authors:

Hou, Qibin [1]

Lu, Cheng-Ze [1]

Cheng, Ming-Ming [1]

Feng, Jiashi [2]

Affiliations:

[1] Nankai Univ, Sch Comp Sci, Tianjin 300192, Peoples R China

[2] ByteDance, Singapore 048583, Singapore

Source:

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE | 2024, Vol. 46, No. 12

Keywords:

Convolution; Transformers; Visualization; Kernel; Task analysis; Modulation; Convolutional codes; Convolutional neural networks; vision transformer; convolutional modulation; large-kernel convolution;

DOI:

10.1109/TPAMI.2024.3401450

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Vision Transformers have recently been the most popular network architecture in visual recognition due to their strong ability to encode global information. However, their high computational cost when processing high-resolution images limits their application in downstream tasks. In this paper, we take a deep look at the internal structure of self-attention and present a simple Transformer-style convolutional neural network (ConvNet) for visual recognition. By comparing the design principles of recent ConvNets and Vision Transformers, we propose to simplify self-attention by leveraging a convolutional modulation operation. We show that such a simple approach can better take advantage of the large kernels (>= 7 x 7) nested in convolutional layers, and we observe a consistent performance improvement when gradually increasing the kernel size from 5 x 5 to 21 x 21. We build a family of hierarchical ConvNets using the proposed convolutional modulation, termed Conv2Former. Our network is simple and easy to follow. Experiments show that Conv2Former outperforms existing popular ConvNets and Vision Transformers, such as Swin Transformer and ConvNeXt, on ImageNet classification, COCO object detection, and ADE20K semantic segmentation.
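
The abstract names a convolutional modulation operation but does not spell it out. The sketch below shows one plausible reading, stated here as an assumption rather than the authors' implementation: a large-kernel depthwise convolution produces per-pixel weights A, a pointwise (1 x 1) projection produces values V, and the output is their element-wise product A * V. All function names, shapes, and weights are hypothetical and chosen only for illustration; pure Python lists are used so the example has no dependencies.

```python
# Illustrative sketch only: names, shapes, and weights are assumptions,
# not taken from the paper's code.

def pointwise(x, w):
    """1x1 convolution: mix channels at each pixel.
    x is [C_in][H][W]; w is [C_out][C_in]."""
    c_in, h, wd = len(x), len(x[0]), len(x[0][0])
    return [[[sum(w[o][c] * x[c][i][j] for c in range(c_in))
              for j in range(wd)]
             for i in range(h)]
            for o in range(len(w))]

def depthwise(x, kernels):
    """Depthwise k x k cross-correlation with zero padding (output keeps H x W).
    kernels is [C][k][k], one kernel per channel, k odd."""
    out = []
    for ch, ker in zip(x, kernels):
        h, wd, k = len(ch), len(ch[0]), len(ker)
        p = k // 2
        plane = [[0.0] * wd for _ in range(h)]
        for i in range(h):
            for j in range(wd):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - p, j + dj - p
                        if 0 <= ii < h and 0 <= jj < wd:
                            acc += ker[di][dj] * ch[ii][jj]
                plane[i][j] = acc
        out.append(plane)
    return out

def conv_modulation(x, w1, w2, kernels):
    """Assumed form: A = DWConv(W1 x), V = W2 x, output = A * V (element-wise)."""
    a = depthwise(pointwise(x, w1), kernels)  # spatial weights from a large kernel
    v = pointwise(x, w2)                      # per-pixel values
    return [[[a[c][i][j] * v[c][i][j] for j in range(len(a[c][i]))]
             for i in range(len(a[c]))]
            for c in range(len(a))]

# Tiny usage example: one channel, 3 x 3 input of ones, 3 x 3 kernel of ones.
x = [[[1.0] * 3 for _ in range(3)]]
out = conv_modulation(x, [[1.0]], [[2.0]], [[[1.0] * 3 for _ in range(3)]])
```

Unlike self-attention, whose weights come from pairwise query-key similarities over all positions, the weights in this sketch come from a fixed-size convolution window, so the cost grows linearly with the number of pixels. In a real network the kernel would be larger (the abstract reports gains up to 21 x 21) and the loops would be replaced by a framework's depthwise convolution.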


Pages: 8274-8283

Page count: 10
