Modality Fusion Vision Transformer for Hyperspectral and LiDAR Data Collaborative Classification

Cited: 55
Authors
Yang, Bin [1 ]
Wang, Xuan [2 ]
Xing, Ying [2 ,3 ]
Cheng, Chen [4 ]
Jiang, Weiwei [5 ,6 ]
Feng, Quanlong [7 ]
Affiliations
[1] China Unicom Res Inst, Graph Neural Network & Artificial Intelligence Team, Beijing 100032, Peoples R China
[2] Beijing Univ Posts & Telecommun, Sch Artificial Intelligence, Beijing 100876, Peoples R China
[3] Yunnan Univ, Yunnan Key Lab Software Engn, Kunming 650500, Peoples R China
[4] China Unicom Res Inst, Network Technol Res Ctr, Beijing 100032, Peoples R China
[5] Beijing Univ Posts & Telecommun, Sch Informat & Commun Engn, Beijing 100876, Peoples R China
[6] Anhui Univ, Key Lab Universal Wireless Commun, Minist Educ, Anhui Prov Key Lab Multimodal Cognit Computat, Hefei 230039, Peoples R China
[7] China Agr Univ, Geog Informat Engn, Beijing 100083, Peoples R China
Keywords
Feature extraction; Laser radar; Transformers; Hyperspectral imaging; Data mining; Data models; Vectors; Cross-attention (CA); hyperspectral image (HSI); light detection and ranging (LiDAR); modality fusion; vision transformer (ViT); EXTINCTION PROFILES;
DOI
10.1109/JSTARS.2024.3415729
Chinese Library Classification (CLC)
TM (Electrical Engineering); TN (Electronic and Communication Technology)
Subject Classification Codes
0808; 0809
Abstract
In recent years, collaborative classification of multimodal data, e.g., hyperspectral image (HSI) and light detection and ranging (LiDAR), has been widely used to improve remote sensing image classification accuracy. However, existing HSI and LiDAR fusion approaches suffer from limitations. Fusing the heterogeneous features of HSI and LiDAR is challenging, leading to incomplete utilization of information for category representation. In addition, during the extraction of spatial features from HSI, the spectral and spatial information are often disjointed, which makes it difficult to fully exploit the rich spectral information in hyperspectral data. To address these issues, we propose a multimodal data fusion framework specifically designed for HSI and LiDAR fusion classification, called the modality fusion vision transformer. At the core of the model is a stackable modality fusion block, which mainly consists of multimodal cross-attention modules and spectral self-attention modules. The novel multimodal cross-attention module for feature fusion addresses the insufficient fusion of heterogeneous HSI and LiDAR features for category representation; compared with other cross-attention methods, it reduces the alignment requirements between modal feature spaces during cross-modal fusion. The spectral self-attention module preserves spatial features while exploiting the rich spectral information and participating in the extraction of spatial features from HSI. Ultimately, we achieve overall classification accuracies of 99.91%, 99.59%, and 96.98% on three benchmark datasets, respectively, surpassing all state-of-the-art methods and demonstrating the stability and effectiveness of our model.
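The abstract describes a cross-attention module in which one modality's features attend to the other's without requiring the two feature spaces to share a dimensionality. The sketch below is a minimal, hypothetical single-head illustration of that general idea (queries from HSI tokens, keys/values from LiDAR tokens, learned projections bridging the differing feature widths); it is not the authors' implementation, and all shapes and names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(hsi_tokens, lidar_tokens, d_k=16, seed=0):
    """Illustrative single-head cross-attention: HSI tokens query
    LiDAR tokens. Projection matrices are random stand-ins for
    learned weights; d_k is the shared attention dimension."""
    rng = np.random.default_rng(seed)
    d_h = hsi_tokens.shape[-1]    # HSI feature width
    d_l = lidar_tokens.shape[-1]  # LiDAR feature width (may differ)
    W_q = rng.standard_normal((d_h, d_k)) / np.sqrt(d_h)
    W_k = rng.standard_normal((d_l, d_k)) / np.sqrt(d_l)
    W_v = rng.standard_normal((d_l, d_h)) / np.sqrt(d_l)
    Q = hsi_tokens @ W_q          # (n, d_k)
    K = lidar_tokens @ W_k        # (m, d_k)
    V = lidar_tokens @ W_v        # (m, d_h)
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (n, m), rows sum to 1
    # Residual connection keeps the original HSI features.
    return hsi_tokens + attn @ V

# Toy inputs: 9 HSI patch tokens (32-d) and 9 LiDAR tokens (8-d).
hsi = np.random.default_rng(1).standard_normal((9, 32))
lidar = np.random.default_rng(2).standard_normal((9, 8))
fused = cross_attention_fuse(hsi, lidar)
```

Because the key/value projections map LiDAR features into the attention space independently of the HSI width, the two modalities need not live in aligned feature spaces before fusion, which loosely mirrors the relaxed alignment requirement the abstract claims.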
Pages: 17052-17065 (14 pages)