Extracting Building Footprint From Remote Sensing Images by an Enhanced Vision Transformer Network

Cited by: 3
Authors
Zhang, Hua [1 ]
Dou, Hu [1 ]
Miao, Zelang [2 ]
Zheng, Nanshan [1 ]
Hao, Ming [1 ]
Shi, Wenzhong [3 ]
Affiliations
[1] China Univ Min & Technol, Sch Environm & Spatial Informat, Xuzhou 221116, Peoples R China
[2] Cent South Univ, Sch Geosci & Infophys, Changsha 410083, Peoples R China
[3] Hong Kong Polytech Univ, Dept Land Surveying & Geoinformat, Hong Kong, Peoples R China
Source
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING | 2024, Vol. 62
Keywords
Feature extraction; Transformers; Buildings; Data mining; Computer architecture; Semantic segmentation; Remote sensing; Boundary refinement; building footprint extraction; vision transformer (ViT); MULTISCALE;
DOI
10.1109/TGRS.2024.3421651
Chinese Library Classification
P3 [Geophysics]; P59 [Geochemistry]
Discipline Codes
0708; 070902
Abstract
Automatic extraction of building footprints from images is one of the primary means of obtaining building footprint data. However, owing to the varied appearances, scales, and intricate structures of buildings, this task remains challenging. Recently, the vision transformer (ViT) has shown significant promise in semantic segmentation thanks to its efficiency in capturing long-range dependencies. This article employs the ViT for extracting building footprints. However, the ViT often suffers from two limitations: high computational cost and insufficient preservation of local detail during feature extraction. To address these challenges, a network based on an enhanced ViT (EViT) is proposed. In this network, one convolutional neural network (CNN)-based branch is introduced to extract comprehensive spatial details. Another branch, consisting of several multiscale enhanced ViT (EV) blocks, is developed to capture global dependencies. Subsequently, a multiscale and enhanced boundary feature extraction block is developed to fuse the global dependencies and local details and to enhance boundary features, thereby yielding multiscale global-local contextual information with enhanced boundary features. Specifically, we present a window-based cascaded multihead self-attention (W-CMSA) mechanism with linear complexity in the window size, which not only reduces computational cost but also enhances attention diversity. The EViT has been comprehensively evaluated against other state-of-the-art (SOTA) approaches on three benchmark datasets. The results show that EViT delivers promising performance in extracting building footprints and surpasses SOTA approaches, achieving 82.45%, 91.76%, and 77.14% IoU on the SpaceNet, WHU, and Massachusetts datasets, respectively. The implementation of EViT is available at https://github.com/dh609/EViT.
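The abstract does not spell out the internal equations of W-CMSA (the cascading between heads is described only in the paper itself). As a rough illustration of the window-attention idea it builds on — attention computed independently within each local window, so cost grows with the number of windows rather than quadratically with image size — here is a minimal NumPy sketch of plain window-based multi-head self-attention; all function names, shapes, and the random projection weights are illustrative assumptions, not the authors' code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(x, ws):
    # x: (H, W, C) feature map -> (num_windows, ws*ws, C) window tokens
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)

def window_self_attention(x, ws, num_heads, rng):
    """Plain window-based MHSA: attention is computed independently
    inside each ws x ws window, so the quadratic attention cost applies
    only to ws*ws tokens per window, not to the whole H*W image."""
    H, W, C = x.shape
    d = C // num_heads
    # random projections stand in for learned Q/K/V weights
    Wq, Wk, Wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))
    win = window_partition(x, ws)                 # (nW, ws*ws, C)
    q, k, v = win @ Wq, win @ Wk, win @ Wv

    def split_heads(t):                           # -> (nW, heads, tokens, d)
        nW, T, _ = t.shape
        return t.reshape(nW, T, num_heads, d).transpose(0, 2, 1, 3)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    attn = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(d))
    out = (attn @ v).transpose(0, 2, 1, 3).reshape(-1, ws * ws, C)
    return out  # per-window outputs; merging back inverts window_partition

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 16))            # toy 8x8 map, 16 channels
out = window_self_attention(feat, ws=4, num_heads=4, rng=rng)
print(out.shape)                                  # 4 windows of 16 tokens each
```

The cascaded variant (W-CMSA) additionally chains the heads so each head refines the previous one's output, which is what the abstract credits for the improved attention diversity; that chaining is omitted here.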
Pages: 14