CoAtFormer: Vision Transformer with Composite Attention

Authors
Chang, Zhiyong [1 ]
Yin, Mingjun [2 ]
Wang, Yan [3 ]
Affiliations
[1] Peking University, Beijing, China
[2] University of Melbourne, Melbourne, VIC, Australia
[3] Zuoyebang, Beijing, China
Source
Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI 2024), 2024
Abstract
The transformer has recently gained significant attention and achieved state-of-the-art performance in various computer vision applications, including image classification, instance segmentation, and object detection. However, the self-attention mechanism underlying the transformer incurs quadratic computational cost with respect to image size, limiting its widespread adoption in state-of-the-art vision backbones. In this paper, we introduce an efficient and effective attention module we call Composite Attention. It features parallel branches, enabling the modeling of various global dependencies: in each composite attention module, one branch employs a dynamic channel attention module to capture global channel dependencies, while the other utilizes an efficient spatial attention module to extract long-range spatial interactions. In addition, we effectively blend the composite attention module with convolutions, and accordingly develop a simple hierarchical vision backbone, dubbed CoAtFormer, by repeating the basic building block over multiple stages. Extensive experiments show that CoAtFormer achieves state-of-the-art results on a variety of tasks. Without any pre-training or extra data, CoAtFormer-Tiny, CoAtFormer-Small, and CoAtFormer-Base achieve 84.4%, 85.3%, and 85.9% top-1 accuracy on ImageNet-1K with 24M, 37M, and 73M parameters, respectively. CoAtFormer also consistently outperforms prior work in other vision tasks such as object detection, instance segmentation, and semantic segmentation. When further pre-trained on the larger ImageNet-22K dataset, it achieves 88.7% top-1 accuracy on ImageNet-1K.
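The abstract describes the module's structure (two parallel global branches plus a convolutional blend) but gives no implementation details. Below is a minimal PyTorch sketch of that parallel-branch idea, assuming a squeeze-and-excitation-style gate for the dynamic channel branch, single-head attention over downsampled keys/values for the efficient spatial branch, a depthwise convolution for the convolutional blend, and additive fusion. All module names, internals, and the fusion scheme are illustrative assumptions, not the paper's actual architecture.

```python
# Illustrative sketch of a parallel-branch "composite attention" block.
# Every design choice below (SE-style channel gate, downsampled-K/V spatial
# attention, depthwise-conv blend, additive fusion) is an assumption.
import torch
import torch.nn as nn

class ChannelBranch(nn.Module):
    """Dynamic channel attention: global pooling produces per-channel gates (assumed design)."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                 # global spatial context per channel
            nn.Conv2d(dim, dim // reduction, 1),
            nn.GELU(),
            nn.Conv2d(dim // reduction, dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):                            # x: (B, C, H, W)
        return x * self.gate(x)

class SpatialBranch(nn.Module):
    """Efficient spatial attention: keys/values are spatially downsampled,
    reducing the quadratic cost of full self-attention (assumed design)."""
    def __init__(self, dim, sr_ratio=4):
        super().__init__()
        self.q = nn.Conv2d(dim, dim, 1)
        self.kv = nn.Conv2d(dim, 2 * dim, 1)
        self.sr = nn.AvgPool2d(sr_ratio)             # shrink the K/V spatial grid
        self.scale = dim ** -0.5

    def forward(self, x):
        B, C, H, W = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)             # (B, HW, C)
        kv = self.kv(self.sr(x)).flatten(2)                  # (B, 2C, hw)
        k, v = kv.chunk(2, dim=1)                            # (B, C, hw) each
        attn = (q @ k * self.scale).softmax(dim=-1)          # (B, HW, hw)
        out = (attn @ v.transpose(1, 2)).transpose(1, 2)     # (B, C, HW)
        return out.reshape(B, C, H, W)

class CompositeAttention(nn.Module):
    """Parallel channel and spatial branches blended with a depthwise
    convolution; fusing the branches by addition is an assumption."""
    def __init__(self, dim):
        super().__init__()
        self.channel = ChannelBranch(dim)
        self.spatial = SpatialBranch(dim)
        self.local = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)

    def forward(self, x):
        return x + self.channel(x) + self.spatial(x) + self.local(x)

x = torch.randn(1, 64, 56, 56)
print(CompositeAttention(64)(x).shape)               # torch.Size([1, 64, 56, 56])
```

A hierarchical backbone in the spirit of the abstract would stack such blocks over several stages with spatial downsampling between them; the block sizes and stage depths for the Tiny/Small/Base variants are not given here.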
Pages: 614-622 (9 pages)