Visual attention network

Cited by: 403
Authors
Guo, Meng-Hao [1]
Lu, Cheng-Ze [2]
Liu, Zheng-Ning [3]
Cheng, Ming-Ming [2]
Hu, Shi-Min [1]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci, Beijing, Peoples R China
[2] Nankai Univ, Tianjin, Peoples R China
[3] Fitten Tech, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
vision backbone; deep learning; ConvNets; attention; REPRESENTATION;
DOI
10.1007/s41095-023-0364-2
Chinese Library Classification number
TP31 [Computer software];
Discipline classification code
081202 ; 0835 ;
Abstract
While originally designed for natural language processing tasks, the self-attention mechanism has recently taken various computer vision areas by storm. However, the 2D nature of images brings three challenges for applying self-attention in computer vision: (1) treating images as 1D sequences neglects their 2D structures; (2) the quadratic complexity is too expensive for high-resolution images; (3) it only captures spatial adaptability but ignores channel adaptability. In this paper, we propose a novel linear attention named large kernel attention (LKA) to enable self-adaptive and long-range correlations in self-attention while avoiding its shortcomings. Furthermore, we present a neural network based on LKA, namely Visual Attention Network (VAN). While extremely simple, VAN achieves results comparable to similarly sized convolutional neural networks (CNNs) and vision transformers (ViTs) in various tasks, including image classification, object detection, semantic segmentation, panoptic segmentation, and pose estimation. For example, VAN-B6 achieves 87.8% accuracy on the ImageNet benchmark and sets new state-of-the-art performance (58.2 PQ) for panoptic segmentation. In addition, VAN-B2 surpasses Swin-T by 4 mIoU (50.1 vs. 46.1) for semantic segmentation on the ADE20K benchmark and by 2.6 AP (48.8 vs. 46.2) for object detection on the COCO dataset. It provides a novel method and a simple yet strong baseline for the community.
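To make challenge (2) in the abstract concrete, the minimal sketch below counts the token-to-token scores that one full self-attention map would contain when an image is split into non-overlapping patches. The function name `attention_pairs` and the 16-pixel patch size are illustrative assumptions and are not taken from the paper; the sketch describes generic patch-based self-attention, not the LKA design.

```python
def attention_pairs(height: int, width: int, patch: int = 16) -> int:
    """Count the scores in one full self-attention map.

    Assumes the image is cut into non-overlapping patch x patch tokens
    (a hypothetical ViT-style tokenization, not the VAN architecture).
    """
    tokens = (height // patch) * (width // patch)
    # Every token attends to every token, so the map has tokens^2 entries.
    return tokens * tokens


if __name__ == "__main__":
    for side in (224, 448, 896):
        print(side, attention_pairs(side, side))
```

Under these assumptions, doubling the image side length quadruples the token count and multiplies the number of scores by 16, which is the growth the abstract describes as too expensive for high-resolution images.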
Pages: 733-752
Page count: 20