AST: Adaptive Self-supervised Transformer for optical remote sensing representation

Cited by: 21

Authors:

He, Qibin [1,2,3,4]

Sun, Xian [1,2,3,4]

Yan, Zhiyuan [1,2]

Wang, Bing [1,2,3,4]

Zhu, Zicong [1,2,3,4]

Diao, Wenhui [1,2,3,4]

Yang, Michael Ying [5]

Affiliations:

[1] Chinese Acad Sci, Aerosp Informat Res Inst, Beijing 100190, Peoples R China

[2] Chinese Acad Sci, Aerosp Informat Res Inst, Key Lab Network Informat Syst Technol NIST, Beijing 100190, Peoples R China

[3] Univ Chinese Acad Sci, Beijing 100190, Peoples R China

[4] Univ Chinese Acad Sci, Sch Elect Elect & Commun Engn, Beijing 100190, Peoples R China

[5] Univ Twente, Fac Geoinformat Sci & Earth Observ ITC, Enschede, Netherlands

Source:

ISPRS JOURNAL OF PHOTOGRAMMETRY AND REMOTE SENSING | 2023 / Vol. 200

Funding:

National Natural Science Foundation of China

Keywords:

Cross-scale transformer; Interpretation; Masked image modeling; Optical remote sensing; Representation learning; CONVOLUTIONAL NEURAL-NETWORKS; SEMANTIC SEGMENTATION; SCENE CLASSIFICATION; OBJECTS; IMAGES;

DOI:

10.1016/j.isprsjprs.2023.04.003

Chinese Library Classification (CLC) number:

P9 [Physical Geography]

Subject classification codes:

0705 ; 070501 ;

Abstract:

Due to the variation in spatial resolution and the diversity of object scales, the interpretation of optical remote sensing images is extremely challenging. Deep learning has become the mainstream solution to interpret such complex scenes. However, the explosion of deep learning model architectures has resulted in the need for hundreds of millions of remote sensing images for which labels are very costly or often unavailable publicly. This paper provides an in-depth analysis of the main reasons for this data thirst, i.e., (i) limited representational power for model learning, and (ii) underutilization of unlabeled remote sensing data. To overcome the above difficulties, we present a scalable and adaptive self-supervised Transformer (AST) for optical remote sensing image interpretation. By performing masked image modeling in pre-training, the proposed AST releases the rich supervision signals in massive unlabeled remote sensing data and learns useful multi-scale semantics. Specifically, a cross-scale Transformer architecture is designed to collaboratively learn global dependencies and local details by introducing a pyramid structure, to facilitate multi-granular feature interactions and generate scale-invariant representations. Furthermore, a masking token strategy relying on correlation mapping is proposed to achieve adaptive masking of partial patches without affecting key structures, which enhances the understanding of visually important regions. Extensive experiments on various optical remote sensing interpretation tasks show that AST has good generalization capability and competitiveness.
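The abstract describes adaptive masking that hides some patches while leaving key structures visible. The following is only a minimal illustrative sketch of that general idea, not the authors' method: the abstract does not specify how the correlation map is computed or how key patches are chosen. Here it is assumed (hypothetically) that a patch's importance is its mean absolute correlation with all other patches, that the top-scoring fraction `keep_top` is protected from masking, and that the masked patches are drawn uniformly from the rest. The names `patchify`, `adaptive_mask`, `mask_ratio`, and `keep_top` are all invented for this example.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W) image into flattened, non-overlapping patches."""
    h, w = image.shape
    gh, gw = h // patch, w // patch
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch)
    # Reorder to (grid_row, grid_col, row_in_patch, col_in_patch).
    return x.transpose(0, 2, 1, 3).reshape(gh * gw, patch * patch)


def adaptive_mask(patches, mask_ratio=0.5, keep_top=0.1, seed=0):
    """Return (mask, protected): mask[i] is True if patch i is masked.

    Assumption for illustration only: importance is the mean absolute
    correlation of a patch with every other patch, and the highest-scoring
    patches are treated as "key structures" that are never masked.
    """
    rng = np.random.default_rng(seed)
    n = patches.shape[0]

    # Centre and L2-normalise each patch so dot products are correlations.
    z = patches - patches.mean(axis=1, keepdims=True)
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-8)
    corr = z @ z.T
    np.fill_diagonal(corr, 0.0)
    score = np.abs(corr).sum(axis=1) / max(n - 1, 1)

    # Protect the top-scoring patches (hypothetical "key structures").
    n_keep = int(round(keep_top * n))
    protected = np.argsort(score)[::-1][:n_keep]

    # Mask a random subset of the remaining patches.
    candidates = np.setdiff1d(np.arange(n), protected)
    n_mask = min(int(round(mask_ratio * n)), len(candidates))
    masked = rng.choice(candidates, size=n_mask, replace=False)

    mask = np.zeros(n, dtype=bool)
    mask[masked] = True
    return mask, protected


if __name__ == "__main__":
    image = np.random.default_rng(1).random((32, 32))
    patches = patchify(image, 8)
    mask, protected = adaptive_mask(patches, mask_ratio=0.5, keep_top=0.25)
    print("masked patches:", np.flatnonzero(mask))
    print("protected patches:", np.sort(protected))
```

In a masked-image-modeling setup, the patches where `mask` is True would be replaced by a mask token and the model would be trained to reconstruct them; how AST scores and selects patches in practice is described in the paper itself rather than in this record.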

Pages: 41-54

Number of pages: 14

