Pixel Representation Augmented through Cross-Attention for High-Resolution Remote Sensing Imagery Segmentation

Cited by: 3
Authors
Luo, Yiyun [1 ,2 ]
Wang, Jinnian [1 ,2 ]
Yang, Xiankun [1 ,2 ]
Yu, Zhenyu [1 ,2 ]
Tan, Zixuan [1 ,2 ]
Affiliations
[1] Guangzhou Univ, Sch Geog & Remote Sensing, Guangzhou 510006, Peoples R China
[2] Guangzhou Univ, Ctr Remote Sensing Big Data Intelligence Applicat, Guangzhou 510006, Peoples R China
Funding
National Key Research and Development Program of China
Keywords
land cover classification; transformer; cross-attention; object embedding queries; semantic segmentation; network
DOI
10.3390/rs14215415
Chinese Library Classification (CLC)
X [Environmental Science, Safety Science]
Discipline Classification Code
08; 0830
Abstract
Segmentation methods developed for natural imagery have been transferred to land cover classification in remote sensing imagery with excellent performance. However, two key issues have been overlooked in the transfer process: (1) some objects are easily overwhelmed by complex backgrounds; (2) interclass information for hard-to-distinguish classes is not fully utilized. The attention mechanism in the transformer can model long-range dependencies within each sample for per-pixel context extraction, and this per-pixel context can aggregate category information. We therefore propose a semantic segmentation method based on pixel representation augmentation. In our method, a simplified feature pyramid decodes the hierarchical pixel features from the backbone; category representations are then decoded into learnable category object embedding queries through cross-attention in the transformer decoder. Finally, the pixel representation is augmented by an additional cross-attention in the transformer encoder under the supervision of auxiliary segmentation heads. Extensive experiments on the aerial image dataset Potsdam and the satellite image dataset Gaofen Image Dataset with 15 categories (GID-15) demonstrate that the cross-attention is effective: our method achieves a mean intersection over union (mIoU) of 86.2% on the Potsdam test set and 62.5% on the GID-15 validation set. In addition, it reaches an inference speed of 76 frames per second (FPS) on the Potsdam test set, higher than all the state-of-the-art models we tested on the same device.
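To make the two cross-attention stages described in the abstract concrete, the following is a minimal PyTorch-style sketch: learnable category object embedding queries first attend to flattened pixel features (decoder-side cross-attention), and the pixel features are then augmented by attending back to the resulting category representations under an auxiliary classification head. All module and variable names (CategoryQueryDecoder, query_to_pixel, pixel_to_query, aux_head) are hypothetical illustrations of the idea, not the authors' released implementation.

import torch
import torch.nn as nn

# Minimal sketch (not the authors' code): learnable category queries attend to
# pixel features via cross-attention, then the pixel features are augmented by
# a second cross-attention that attends back to the category representations.
class CategoryQueryDecoder(nn.Module):
    def __init__(self, num_classes: int, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # One learnable embedding query per land-cover category.
        self.category_queries = nn.Parameter(torch.randn(num_classes, dim))
        # Categories attend to pixels (query = categories, key/value = pixels).
        self.query_to_pixel = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Pixels attend to category representations (query = pixels, key/value = categories).
        self.pixel_to_query = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Auxiliary head supervising the augmented pixel representation.
        self.aux_head = nn.Linear(dim, num_classes)

    def forward(self, pixel_feats: torch.Tensor):
        # pixel_feats: (B, H*W, dim), flattened output of the feature pyramid.
        b = pixel_feats.size(0)
        queries = self.category_queries.unsqueeze(0).expand(b, -1, -1)
        # Decode category representations from the pixel features.
        cat_repr, _ = self.query_to_pixel(queries, pixel_feats, pixel_feats)
        # Augment the pixel representations with category context.
        augmented, _ = self.pixel_to_query(pixel_feats, cat_repr, cat_repr)
        aux_logits = self.aux_head(augmented)  # (B, H*W, num_classes)
        return augmented, aux_logits

# Usage example (hypothetical shapes):
#   feats = torch.randn(2, 64 * 64, 256)
#   module = CategoryQueryDecoder(num_classes=15)
#   augmented, aux_logits = module(feats)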
Pages: 20