Multi-granularity vision transformer via semantic token for hyperspectral image classification

Cited by: 12

Authors:

Li, Bin [1]

Ouyang, Er [1]

Hu, Wenjing [1]

Zhang, Guoyun [1]

Zhao, Lin [1]

Wu, Jianhui [1]

Affiliation:

[1] Hunan Inst Sci & Technol, Sch Informat Sci & Engn, Yueyang 414000, Peoples R China

Source:

INTERNATIONAL JOURNAL OF REMOTE SENSING | 2022 / Vol. 43 / No. 17

Keywords:

Hyperspectral image classification; convolutional neural networks; transformer; word embedding; long-distance dependence;

DOI:

10.1080/01431161.2022.2142078

Chinese Library Classification (CLC) number:

TP7 [Remote sensing technology];

Subject classification codes:

081102 ; 0816 ; 081602 ; 083002 ; 1404 ;

Abstract:

The superior local context modelling capability of convolutional neural networks (CNNs) in representing features allows greatly enhanced performance in hyperspectral image (HSI) classification tasks by CNN-based methods. However, most of these methods suffer from a restricted receptive field and poor performance in the continuous data domain. To address these issues, we propose a multi-granularity vision transformer via semantic token (MSTViT) for HSI classification, which differs from the existing transformer view by modelling the HSI classification tasks as word embedding problems. Specifically, the MSTViT model extracts multi-level semantic features by a ladder feature extractor and applies a multi-granularity patch embedding module to embed these features simultaneously as different-scale tokens. Moreover, different-granularity tokens are fed to the vision transformer to capture the long-distance dependencies among the different tokens. A depth-wise separable convolution multi-layer perceptron is used to assist the attention mechanism for further excavation of the deep information of HSI. Finally, the performance of HSI classification is improved by fusing the coarse- and fine-granularity representations to generate stronger features. Experimental results on four standard datasets verify the marked improvement of the MSTViT over state-of-the-art CNN and transformer structures. The code of this work is available at https://github.com/zhaolin6/MSTViT for the sake of reproducibility.
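
The token flow described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository linked in the abstract for that). It assumes a hypothetical 8 x 8 feature map with 16 channels, patch sizes of 2 and 4, random weights in place of learned projections, a single attention head applied jointly to both token sets, and mean pooling followed by concatenation as the fusion step. All function and variable names are illustrative.

```python
import numpy as np

def patch_tokens(feat, patch):
    """Split an (H, W, C) feature map into non-overlapping patch x patch tokens."""
    h, w, c = feat.shape
    assert h % patch == 0 and w % patch == 0, "patch size must divide H and W"
    t = feat.reshape(h // patch, patch, w // patch, patch, c)
    t = t.transpose(0, 2, 1, 3, 4)            # group the pixels of each patch
    return t.reshape(-1, patch * patch * c)   # one flattened row per patch

def embed(tokens, dim, rng):
    """Project tokens to a shared width (random weights stand in for learned ones)."""
    w = rng.standard_normal((tokens.shape[1], dim)) / np.sqrt(tokens.shape[1])
    return tokens @ w

def self_attention(x):
    """Single-head scaled dot-product self-attention without learned projections."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 16))           # hypothetical extracted features

fine = embed(patch_tokens(feat, 2), 32, rng)     # 16 fine-granularity tokens
coarse = embed(patch_tokens(feat, 4), 32, rng)   # 4 coarse-granularity tokens

# Attend over both granularities at once, so every token can see every other.
tokens = np.concatenate([fine, coarse], axis=0)
attended, weights = self_attention(tokens)

# Assumed fusion: pool each granularity, then concatenate the two summaries.
n_fine = fine.shape[0]
fused = np.concatenate([attended[:n_fine].mean(axis=0),
                        attended[n_fine:].mean(axis=0)])
print(fine.shape, coarse.shape, fused.shape)
```

The smaller patch size yields more tokens that each cover less of the feature map, while the larger one yields fewer tokens with wider spatial context. Because attention is computed over the concatenated set, each token is weighted against every other token regardless of spatial distance, which is the long-distance dependency the abstract refers to.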


Pages: 6538-6560

Page count: 23

References: 32 in total

[1]

Ba J. L., 2016, Advances in Neural Information Processing Systems (NeurIPS), P1

[2] 3-D Deep Learning Approach for Remote Sensing Image Classification [J].

Ben Hamida, Amina ;

Benoit, Alexandre ;

Lambert, Patrick ;

Ben Amar, Chokri .

IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2018, 56 (08) :4420-4434

[3] CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [J].

Chen, Chun-Fu ;

Fan, Quanfu ;

Panda, Rameswar .

2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, :347-356

[4] Deep Feature Extraction and Classification of Hyperspectral Images Based on Convolutional Neural Networks [J].

Chen, Yushi ;

Jiang, Hanlu ;

Li, Chunyang ;

Jia, Xiuping ;

Ghamisi, Pedram .

IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2016, 54 (10) :6232-6251

[5] GEOLOGICAL MAPPING USING LANDSAT THEMATIC MAPPER IMAGERY IN ALMERIA PROVINCE, SOUTHEAST SPAIN [J].

CROSTA, AP ;

MOORE, JM .

INTERNATIONAL JOURNAL OF REMOTE SENSING, 1989, 10 (03) :505-514

[6]

Dehghani M., 2019, P INT C LEARN REPR

[7]

Devlin J, 2019, 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, P4171

[8]

Dosovitskiy A, 2020, ARXIV

[9] APPLICATIONS OF NOAA-AVHRR 1 KM DATA FOR ENVIRONMENTAL MONITORING [J].

EHRLICH, D ;

ESTES, JE ;

SINGH, A .

INTERNATIONAL JOURNAL OF REMOTE SENSING, 1994, 15 (01) :145-161

[10]

Gao K., 2021, Academic J. Comput. Inf. Sci., V4, P11
