CycleMLP: A MLP-Like Architecture for Dense Visual Predictions

Cited by: 22
Authors
Chen, Shoufa [1]
Xie, Enze [1]
Ge, Chongjian [1]
Chen, Runjian [1]
Liang, Ding [2]
Luo, Ping [1,3]
Affiliations
[1] Univ Hong Kong, Dept Comp Sci, Hong Kong, Peoples R China
[2] SenseTime Res, Beijing 100195, Peoples R China
[3] Shanghai AI Lab, Shanghai 200232, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
Multilayer perceptron; MLP; dense visual prediction; deep neural network;
DOI
10.1109/TPAMI.2023.3303397
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
This article presents a simple yet effective multilayer perceptron (MLP) architecture, namely CycleMLP, which is a versatile neural backbone network capable of solving various dense visual prediction tasks such as object detection, segmentation, and human pose estimation. Compared to recent advanced MLP architectures such as MLP-Mixer (Tolstikhin et al. 2021), ResMLP (Touvron et al. 2021), and gMLP (Liu et al. 2021), which are sensitive to image size and therefore infeasible for dense prediction tasks, CycleMLP has two appealing advantages: 1) CycleMLP can cope with various spatial sizes of images; 2) CycleMLP achieves linear computational complexity with respect to the image size by using local windows, whereas previous MLPs have O(N^2) complexity due to their full connections in space. In addition, the relationship between convolution, multi-head self-attention in Transformers, and CycleMLP is discussed through an intuitive theoretical analysis. We build a family of models that surpass state-of-the-art MLP and Transformer models, e.g., Swin Transformer (Liu et al. 2021), while using fewer parameters and FLOPs. CycleMLP expands the applicability of MLP-like models, making them versatile backbone networks that achieve competitive results on dense prediction tasks. For example, CycleMLP-Tiny outperforms Swin-Tiny by 1.3% mIoU on the ADE20K dataset with fewer FLOPs. Moreover, CycleMLP also shows excellent zero-shot robustness on the ImageNet-C dataset.
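The abstract's claim of linear complexity via local windows comes from the paper's Cycle Fully-Connected (Cycle FC) operator, in which each channel samples its input from a spatial location offset along a stair-like path before a 1x1 channel projection is applied. Below is a minimal, illustrative PyTorch sketch of that idea, written under assumptions: the class name, the `step` and `axis` parameters, and the shift-based gathering are simplifications for exposition, not the authors' official implementation (which, per the paper, is realized more efficiently, e.g., via deformable sampling).

```python
import torch
import torch.nn as nn


class CycleFC(nn.Module):
    """Minimal sketch of the Cycle FC idea behind CycleMLP (illustrative only).

    For the c-th input channel, features are taken from a spatial position
    shifted along one axis by an offset that cycles with the channel index,
    then mixed by a 1x1 channel projection. Cost stays linear in H*W.
    """

    def __init__(self, dim: int, step: int = 3, axis: int = 2):
        super().__init__()
        self.step = step              # pseudo-kernel size (assumed name: S_H or S_W)
        self.axis = axis              # 2 -> shift along H, 3 -> shift along W
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        B, C, H, W = x.shape
        shifted = torch.empty_like(x)
        for c in range(C):
            # cyclic offset in {-(step//2), ..., +(step//2)}
            offset = (c % self.step) - self.step // 2
            # x[:, c] has shape (B, H, W); roll it along the chosen spatial axis
            shifted[:, c] = torch.roll(x[:, c], shifts=offset, dims=self.axis - 1)
        # channel mixing at every spatial position (1x1 projection)
        out = self.proj(shifted.permute(0, 2, 3, 1))   # (B, H, W, C)
        return out.permute(0, 3, 1, 2)
```

In the paper, several such offset projections (along H and along W) are combined with a plain channel MLP, so the receptive field grows beyond a single pixel while the cost remains O(HW), in contrast to the O((HW)^2) spatial mixing of Mixer-style MLPs.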
Pages: 14284-14300
Number of pages: 17
Related Papers
126 items in total
  • [1] Arnab A, Dehghani M, Heigold G, Sun C, Lucic M, Schmid C. ViViT: A Video Vision Transformer. 2021 IEEE/CVF International Conference on Computer Vision (ICCV 2021), 2021: 6816-6826.
  • [2] Ba J L, 2016, arXiv.
  • [3] Bertasius G, 2021, Proceedings of Machine Learning Research, Vol. 139.
  • [4] Brown T B, 2020, Advances in Neural Information Processing Systems, Vol. 33.
  • [5] Carion N, 2020, Computer Vision - ECCV 2020, 16th European Conference Proceedings, Lecture Notes in Computer Science (LNCS 12346): 213, DOI 10.1007/978-3-030-58452-8_13.
  • [6] Cazenavette G, 2021, arXiv, DOI 10.48550/arXiv.2105.14110.
  • [7] Chen K, 2019, arXiv:1906.07155.
  • [8] Chen L C, 2017, arXiv:1706.05587.
  • [9] Chen L-C, Zhu Y, Papandreou G, Schroff F, Adam H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. Computer Vision - ECCV 2018, Pt VII, 2018, 11211: 833-851.
  • [10] Choe J, 2022, arXiv:2111.11187.