Multi-label remote sensing classification with self-supervised gated multi-modal transformers

Cited: 1
Authors
Liu, Na [1 ]
Yuan, Ye [1 ]
Wu, Guodong [2 ]
Zhang, Sai [2 ]
Leng, Jie [2 ]
Wan, Lihong [2 ]
Affiliations
[1] Univ Shanghai Sci & Technol, Inst Machine Intelligence, Shanghai, Peoples R China
[2] Origin Dynam Intelligent Robot Co Ltd, Zhengzhou, Peoples R China
Keywords
self-supervised learning; pre-training; vision transformer; multi-modal; gated units; BENCHMARK-ARCHIVE; LARGE-SCALE; BIGEARTHNET;
DOI
10.3389/fncom.2024.1404623
CLC number
Q [Biological Sciences];
Discipline code
07; 0710; 09;
Abstract
Introduction: Following the great success of Transformers in machine learning, they are gradually attracting widespread interest in remote sensing (RS). However, research in RS has been hampered by the lack of large labeled datasets and by the inconsistency of data modalities arising from the diversity of RS platforms. With the rise of self-supervised learning (SSL) algorithms in recent years, RS researchers have begun to apply the "pre-training and fine-tuning" paradigm to RS. Nevertheless, there is little research on multi-modal data fusion in the RS field: most existing work either uses only one modality or simply concatenates multiple modalities.
Method: To develop a more efficient multi-modal fusion scheme, we propose a multi-modal fusion mechanism based on gated unit control (MGSViT). We pretrain a ViT model on the BigEarthNet dataset by combining two commonly used SSL algorithms, and propose intra-modal and inter-modal gated fusion units for feature learning that combine multispectral (MS) and synthetic aperture radar (SAR) data. Our method effectively combines data from different modalities to extract key feature information.
Results and discussion: After fine-tuning and comparison experiments, our method outperforms state-of-the-art algorithms on all downstream classification tasks, verifying the validity of the proposed approach.
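The record does not specify the form of the gated fusion units. As a rough illustration only, a gated multimodal unit in the style of Arevalo et al. (reference [3] below) can be sketched as follows; all dimensions, weight initializations, and function names here are hypothetical, not taken from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_gmu(d_ms, d_sar, d_out, rng):
    """Build a gated multimodal unit with random (untrained) weights."""
    W_ms = rng.normal(scale=0.1, size=(d_out, d_ms))         # MS projection
    W_sar = rng.normal(scale=0.1, size=(d_out, d_sar))       # SAR projection
    W_z = rng.normal(scale=0.1, size=(d_out, d_ms + d_sar))  # gate weights

    def fuse(x_ms, x_sar):
        h_ms = np.tanh(W_ms @ x_ms)    # modality-specific feature
        h_sar = np.tanh(W_sar @ x_sar)
        # per-dimension gate in (0, 1), conditioned on both modalities
        z = sigmoid(W_z @ np.concatenate([x_ms, x_sar]))
        return z * h_ms + (1.0 - z) * h_sar  # gated convex combination

    return fuse

rng = np.random.default_rng(0)
fuse = make_gmu(d_ms=12, d_sar=4, d_out=8, rng=rng)
fused = fuse(rng.normal(size=12), rng.normal(size=4))
print(fused.shape)  # (8,)
```

The gate `z` lets the model weight each fused feature dimension toward whichever modality is more informative, rather than concatenating the raw features.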
Pages: 15
References
38 in total
  • [1] Abnar S., 2021, arXiv
  • [2] Alabdulmohsin Ibrahim M, 2022, Advances in Neural Information Processing Systems, V35, P22300
  • [3] Arevalo J, 2017, Arxiv, DOI arXiv:1702.01992
  • [4] Carion N., 2020, LNCS, V12346, P213, DOI 10.1007/978-3-030-58452-8_13
  • [5] Caron M., Touvron H., Misra I., Jegou H., Mairal J., Bojanowski P., Joulin A., Emerging Properties in Self-Supervised Vision Transformers, 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, P9630-9640
  • [6] Chen X., Xie S., He K., An Empirical Study of Training Self-Supervised Vision Transformers, 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, P9620-9629
  • [7] Cong Yezhen, 2022, ADV NEURAL INFORM PR
  • [8] de Lima R. P., Marfurt K., Convolutional Neural Network for Remote-Sensing Scene Classification: Transfer Learning Analysis, REMOTE SENSING, 2020, V12(01)
  • [9] Dosovitskiy A, 2021, Arxiv, DOI arXiv:2010.11929
  • [10] Fuller A, 2022, Arxiv, DOI arXiv:2209.14969