TRS: Transformers for Remote Sensing Scene Classification

Cited by: 95
Authors
Zhang, Jianrong [1 ,2 ]
Zhao, Hongwei [1 ,2 ]
Li, Jiao [3 ]
Affiliations
[1] Jilin Univ, Coll Comp Sci & Technol, Changchun 130012, Peoples R China
[2] Jilin Univ, Minist Educ, Key Lab Symbol Computat & Knowledge Engn, Changchun 130012, Peoples R China
[3] Jilin Univ, Dept Jilin Univ Lib, Changchun 130012, Peoples R China
Keywords
transformers; deep convolutional neural networks; multi-head self-attention; remote sensing scene classification; CONVOLUTIONAL NEURAL-NETWORKS; FEATURES; ATTENTION; MODEL; SCALE
DOI
10.3390/rs13204143
Chinese Library Classification
X [Environmental Science, Safety Science]
Discipline Classification Code
08; 0830
Abstract
Remote sensing scene classification remains challenging due to the complexity and variety of scenes. With the development of attention-based methods, Convolutional Neural Networks (CNNs) have achieved competitive performance in remote sensing scene classification tasks. As an important attention-based model, the Transformer has achieved great success in natural language processing and has recently been applied to computer vision tasks. However, most existing methods divide the original image into multiple patches and encode the patches as the input of the Transformer, which limits the model's ability to learn the overall features of the image. In this paper, we propose a new remote sensing scene classification method, Remote Sensing Transformer (TRS), a powerful "pure CNNs -> Convolution + Transformer -> pure Transformers" structure. First, we integrate self-attention into ResNet in a novel way, replacing the 3 x 3 spatial convolutions in the bottleneck with our proposed Multi-Head Self-Attention layer. Then we connect multiple pure Transformer encoders to further improve representation learning, relying entirely on attention. Finally, we use a linear classifier for classification. We train our model on four public remote sensing scene datasets: UC-Merced, AID, NWPU-RESISC45, and OPTIMAL-31. The experimental results show that TRS outperforms the state-of-the-art methods and achieves higher accuracy.
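The three-stage design the abstract outlines (a convolutional backbone, bottlenecks whose 3 x 3 convolutions are swapped for multi-head self-attention, then a stack of pure Transformer encoders feeding a linear classifier) can be illustrated with a short PyTorch sketch. The module names, layer sizes, and hyperparameters below are illustrative assumptions for exposition only, not the authors' implementation:

# Minimal PyTorch sketch of the "pure CNNs -> Convolution + Transformer ->
# pure Transformers" pattern described in the abstract. All names and
# hyperparameters here are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class MHSABottleneck(nn.Module):
    """ResNet-style bottleneck whose 3x3 spatial convolution is replaced
    by multi-head self-attention over the flattened feature map."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.reduce = nn.Conv2d(channels, channels // 4, kernel_size=1)
        self.attn = nn.MultiheadAttention(channels // 4, num_heads,
                                          batch_first=True)
        self.expand = nn.Conv2d(channels // 4, channels, kernel_size=1)
        self.norm = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        y = self.reduce(x)                      # 1x1 conv: reduce channels
        seq = y.flatten(2).transpose(1, 2)      # (B, H*W, C/4) token sequence
        seq, _ = self.attn(seq, seq, seq)       # self-attention replaces 3x3 conv
        y = seq.transpose(1, 2).reshape(b, c // 4, h, w)
        y = self.expand(y)                      # 1x1 conv: restore channels
        return torch.relu(self.norm(y) + x)     # residual connection

class TRSSketch(nn.Module):
    """CNN stem -> attention bottleneck -> pure Transformer encoders ->
    linear classifier, mirroring the three stages in the abstract."""
    def __init__(self, num_classes: int, channels: int = 256,
                 num_encoders: int = 2):
        super().__init__()
        self.stem = nn.Sequential(              # stand-in for ResNet stages
            nn.Conv2d(3, channels, kernel_size=7, stride=4, padding=3),
            nn.BatchNorm2d(channels), nn.ReLU())
        self.bottleneck = MHSABottleneck(channels)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=4, batch_first=True)
        self.encoders = nn.TransformerEncoder(encoder_layer, num_encoders)
        self.classifier = nn.Linear(channels, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.bottleneck(self.stem(x))       # convolution + attention stage
        tokens = y.flatten(2).transpose(1, 2)   # (B, H*W, C) for the encoders
        tokens = self.encoders(tokens)          # pure Transformer stage
        return self.classifier(tokens.mean(1))  # pool tokens, then classify

if __name__ == "__main__":
    model = TRSSketch(num_classes=45)           # e.g. NWPU-RESISC45 classes
    logits = model(torch.randn(2, 3, 64, 64))   # toy input batch
    print(logits.shape)                         # torch.Size([2, 45])

Unlike patch-based Vision Transformers, this pattern keeps convolutional feature extraction intact and only hands off to attention inside the bottleneck and in the encoder stack, which is the design choice the abstract motivates.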
Pages: 24