DSViT: Dynamically Scalable Vision Transformer for Remote Sensing Image Segmentation and Classification

Cited by: 8

Authors:

Wang, Falin [1]

Ji, Jian [1]

Wang, Yuan [1]

Affiliations:

[1] Xidian University, School of Computer Science and Technology, Xi'an 710071, People's Republic of China

Source:

IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING | 2023 / Vol. 16

Funding:

National Natural Science Foundation of China;

Keywords:

Transformers; Remote sensing; Feature extraction; Computational modeling; Convolutional neural networks; Convolution; Task analysis; CNN; classification; remote sensing image; semantic segmentation; transformer;

DOI:

10.1109/JSTARS.2023.3285259

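Chinese Library Classification (CLC) number:

TM [Electrical engineering]; TN [Electronic technology, communication technology];

Discipline classification code: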

0808 ; 0809 ;

Abstract:

The relationship between foreground targets and the background in remote sensing images is highly complex, and vision tasks on such images must contend with complex targets and unbalanced categories. These problems leave existing modeling methods with room for improvement. This article therefore proposes a dynamically scalable attention model that combines convolutional features and Transformer features. The model dynamically selects its depth according to the size of the input image, which alleviates both the insufficient global information extraction of a purely convolutional model and the computational overhead of a pure Transformer model. We validated the model on two public remote sensing image classification datasets and two remote sensing image segmentation datasets. On the University of California (UC) Merced classification dataset, the accuracy and mean pixel accuracy (mPA) of the proposed method reached 96.16% and 93.44%, respectively. Compared with recent work, this is a net improvement of 5.0% and 4.82% over the pyramid vision transformer (PVT) model. On the Potsdam segmentation dataset, the accuracy and F1 of the transformer and CNN hybrid neural network (TCHNN) model are 91.5% and 92.86%, respectively; the proposed method improves on these by 0.64% and 1.0%, respectively. The method also achieved the best results on the other two datasets.


Pages: 5441-5452

Page count: 12

References: 61 in total

