Building extraction from remote sensing images via the multiscale information fusion method under the Transformer architecture

Cited: 0
Authors
Liu, Yi [1 ]
Zhang, Yinjie [1 ]
Ao, Yang [1 ]
Jiang, Dalong [1 ]
Zhang, Zhaorui [1 ]
Affiliation
[1] School of Geodesy and Geomatics, Wuhan University, Wuhan
Funding
National Natural Science Foundation of China
Keywords
building extraction; class-scale attention; deep learning; image feature pyramid; remote sensing images; Transformer
DOI
10.11834/jrs.20233017
Abstract
As deep learning develops, researchers are paying increasing attention to its application to building extraction from remote sensing images. Much work has gone into multiscale feature fusion, which boosts performance at the feature inference stage, and into multiscale output fusion, with the aims of balancing accuracy against efficiency and of improving both fine detail and overall quality. However, current multiscale feature fusion methods consider only features at adjacent scales, which is insufficient for cross-scale fusion, and multiscale output fusion is limited to a unary correlation that accounts for the scale element alone. To address these problems, we propose a feature fusion method and a result fusion module that improve the accuracy of building extraction from remote sensing images: a Triple Feature Pyramid Network (Tri-FPN) and a Class-Scale Attention Module (CSA-Module), built on Segformer. The network comprises three components: feature extraction, feature fusion, and a classification head. For feature extraction, the Segformer structure is adopted to obtain multiscale features. Segformer applies self-attention to produce feature maps at several scales; to enlarge the receptive field adaptively, it shrinks the key and value tensors with a strided convolution inside the self-attention computation, which reduces the computational cost considerably. The feature fusion component fuses the multiscale features produced by different stages of the extraction network. Tri-FPN consists of three chained feature pyramid passes in the sequence top-down, bottom-up, then top-down again, which enlarges the scale-receptive field. Its basic fusion blocks are a 3×3 convolution with element-wise feature addition and a 1×1 convolution with channel concatenation; this design helps preserve spatial diversity and intra-class feature consistency. The classification head assigns a predicted label to each pixel. First, the feature map passes through a 1×1 convolution to yield a coarse result. Second, the feature map is reduced along the channel dimension by another 1×1 convolution. Third, the reduced feature map is concatenated with the coarse result and upsampled by a factor of two. Fourth, the mixed feature is segmented by a 5×5 convolution. In parallel, a height × width × class attention map, which captures class information, scale diversity, and spatial detail, is computed from the mixed feature by a 3×3 convolution block. Finally, the coarse result and the mixed-feature result are fused under the attention map. A series of experiments was conducted on the WHU Building and INRIA datasets. On the WHU Building dataset, precision reaches 95.42%, recall 96.25%, and Intersection over Union (IoU) 91.53%. On the INRIA dataset, precision, recall, and IoU reach 89.33%, 91.10%, and 81.7%, respectively, and the gains in recall and IoU over the backbone exceed 1%. These results demonstrate that the proposed method has strong feature fusion and segmentation abilities. Tri-FPN effectively improves building extraction accuracy and overall efficiency, especially along building boundaries and in holes within building regions, verifying the validity of multiscale feature fusion. By jointly attending to class, scale, and space, the CSA-Module improves accuracy considerably at a negligible parameter cost.
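To make the triple fusion sequence concrete, the following is a minimal PyTorch sketch, not the authors' released implementation. It assumes every pyramid level has already been projected to a common channel width, and the pairing of the two basic blocks with the passes (element-wise addition plus 3×3 convolution for the top-down passes, channel concatenation plus 1×1 convolution for the bottom-up pass) is one plausible reading of the description above; `FuseAdd`, `FuseConcat`, and `TriFPN` are hypothetical names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseAdd(nn.Module):
    """Basic fusion block: element-wise addition followed by a 3x3 convolution."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, target, source):
        # Resize the incoming feature to the target's spatial size, then add.
        source = F.interpolate(source, size=target.shape[2:],
                               mode="bilinear", align_corners=False)
        return self.conv(target + source)

class FuseConcat(nn.Module):
    """Basic fusion block: channel concatenation followed by a 1x1 convolution."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, target, source):
        source = F.interpolate(source, size=target.shape[2:],
                               mode="bilinear", align_corners=False)
        return self.conv(torch.cat([target, source], dim=1))

class TriFPN(nn.Module):
    """Three chained pyramid passes: top-down, bottom-up, top-down."""
    def __init__(self, channels, num_levels=4):
        super().__init__()
        self.td1 = nn.ModuleList([FuseAdd(channels) for _ in range(num_levels - 1)])
        self.bu = nn.ModuleList([FuseConcat(channels) for _ in range(num_levels - 1)])
        self.td2 = nn.ModuleList([FuseAdd(channels) for _ in range(num_levels - 1)])

    def forward(self, feats):
        # feats[0] is the finest (largest) level, feats[-1] the coarsest.
        f = list(feats)
        for i in range(len(f) - 2, -1, -1):   # top-down: coarse info flows to fine levels
            f[i] = self.td1[i](f[i], f[i + 1])
        for i in range(1, len(f)):            # bottom-up: fine info flows to coarse levels
            f[i] = self.bu[i - 1](f[i], f[i - 1])
        for i in range(len(f) - 2, -1, -1):   # second top-down pass widens the scale reach
            f[i] = self.td2[i](f[i], f[i + 1])
        return f
```

Chaining the three passes lets each output level draw on information from every other level, which is exactly the cross-scale reach that a single nearest-neighbor pyramid pass lacks.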
By adopting Tri-FPN and the CSA-Module, the network also shows a markedly better ability to recover small buildings and fine details in remote sensing images. © 2024 Science Press. All rights reserved.
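The five-step classification head and its class-scale attention admit an equally short sketch. This is a hedged reconstruction from the abstract alone: the sigmoid gating and the convex combination `attn * fine + (1 - attn) * coarse` are assumptions about how the coarse and mixed-feature results are "fused under the attention map", and the channel widths `in_ch` and `mid_ch` are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CSAHead(nn.Module):
    """Classification head with a height x width x class attention map (sketch)."""
    def __init__(self, in_ch, mid_ch, num_classes):
        super().__init__()
        self.coarse = nn.Conv2d(in_ch, num_classes, kernel_size=1)  # step 1: coarse result
        self.shrink = nn.Conv2d(in_ch, mid_ch, kernel_size=1)       # step 2: channel reduction
        self.refine = nn.Conv2d(mid_ch + num_classes, num_classes,
                                kernel_size=5, padding=2)           # step 4: 5x5 segmentation
        self.attn = nn.Conv2d(mid_ch + num_classes, num_classes,
                              kernel_size=3, padding=1)             # class-scale attention map

    def forward(self, x):
        coarse = self.coarse(x)
        # Step 3: concatenate the reduced feature with the coarse result, upsample by 2x.
        mixed = torch.cat([self.shrink(x), coarse], dim=1)
        mixed = F.interpolate(mixed, scale_factor=2,
                              mode="bilinear", align_corners=False)
        fine = self.refine(mixed)                  # mixed-feature result
        attn = torch.sigmoid(self.attn(mixed))     # H x W x class attention (assumed gating)
        coarse_up = F.interpolate(coarse, scale_factor=2,
                                  mode="bilinear", align_corners=False)
        # Fuse the two results under the attention map (assumed convex combination).
        return attn * fine + (1.0 - attn) * coarse_up

# Example usage with hypothetical sizes: two classes (building / background).
# head = CSAHead(in_ch=256, mid_ch=64, num_classes=2)
# logits = head(torch.randn(1, 256, 128, 128))     # -> shape (1, 2, 256, 256)
```

Because the attention map comes from a single 3×3 convolution over the mixed feature, the extra parameter count is tiny, which matches the abstract's claim that the CSA-Module improves accuracy at negligible parameter cost.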
Pages: 3173-3183
Page count: 10