TCNet: Multiscale Fusion of Transformer and CNN for Semantic Segmentation of Remote Sensing Images

Times Cited: 12
Authors
Xiang, Xuyang [1 ]
Gong, Wenping [1 ]
Li, Shuailong [1 ]
Chen, Jun [2 ]
Ren, Tianhe [1 ]
Affiliations
[1] China Univ Geosci, Fac Engn, Wuhan 430074, Peoples R China
[2] China Univ Geosci, Sch Automat, Wuhan 430074, Peoples R China
Keywords
Convolutional Neural Network (CNN); feature fusion; remote sensing images; semantic segmentation; Transformer
DOI
10.1109/JSTARS.2024.3349625
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology]
Discipline Codes
0808; 0809
Abstract
Semantic segmentation of remote sensing images plays a critical role in areas such as urban change detection, environmental protection, and geohazard identification. Convolutional Neural Networks (CNNs) have been extensively employed for semantic segmentation over the past few years; however, owing to the locality of the convolution operation, CNNs struggle to extract the global context of remote sensing images, which is vital for semantic segmentation. The recently developed Transformer, by contrast, offers powerful global modeling capabilities. This article proposes a network called TCNet, which adopts a parallel-in-branch architecture combining a Transformer and a CNN. TCNet thus takes advantage of both: global context and low-level spatial details can be captured with a much shallower network. In addition, a novel fusion technique called Interactive Self-attention is proposed to fuse the multilevel features extracted from the two branches. To bridge the semantic gap between regions, a skip connection module called Windowed Self-attention Gating is further developed and added to the progressive upsampling network. Experiments on three public datasets (i.e., the Bijie Landslide Dataset, the WHU Building Dataset, and the Massachusetts Buildings Dataset) show that TCNet outperforms state-of-the-art models, achieving IoU values of 75.34% (ranked first among 10 compared models), 91.16% (ranked first among 13 compared models), and 76.21% (ranked first among 13 compared models), respectively.
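To make the parallel-in-branch idea in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the paper's Interactive Self-attention and Windowed Self-attention Gating modules are not specified in the abstract, so the fusion below stands in with standard cross multi-head attention, and all class and parameter names (InteractiveFusion, ParallelBranchSeg, dim, patch) are hypothetical.

```python
# Hypothetical sketch of a parallel-in-branch CNN + Transformer segmenter.
# The real TCNet modules (Interactive Self-attention, Windowed Self-attention
# Gating) are not reproduced here; cross multi-head attention is a stand-in.
import torch
import torch.nn as nn

class InteractiveFusion(nn.Module):
    """Fuse the two branches by letting each attend to the other (illustrative)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.cnn_to_tr = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.tr_to_cnn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, f_cnn, f_tr):
        # f_cnn, f_tr: (B, N, C) token sequences from the two branches
        a, _ = self.cnn_to_tr(f_cnn, f_tr, f_tr)   # CNN tokens query Transformer tokens
        b, _ = self.tr_to_cnn(f_tr, f_cnn, f_cnn)  # Transformer tokens query CNN tokens
        return self.proj(torch.cat([a, b], dim=-1))

class ParallelBranchSeg(nn.Module):
    """Toy parallel-in-branch model: both branches see the input, then fuse."""
    def __init__(self, in_ch=3, dim=64, num_classes=2, patch=8):
        super().__init__()
        # CNN branch: strides chosen so its total downsampling equals `patch`,
        # making both branches emit the same number of tokens.
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, dim, 3, stride=patch // 2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.patch_embed = nn.Conv2d(in_ch, dim, patch, stride=patch)  # ViT-style patchify
        enc = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc, num_layers=2)
        self.fuse = InteractiveFusion(dim)
        self.head = nn.Conv2d(dim, num_classes, 1)
        self.patch = patch

    def forward(self, x):
        B, _, H, W = x.shape
        h, w = H // self.patch, W // self.patch
        f_cnn = self.cnn(x).flatten(2).transpose(1, 2)                     # (B, h*w, C)
        f_tr = self.transformer(self.patch_embed(x).flatten(2).transpose(1, 2))
        fused = self.fuse(f_cnn, f_tr).transpose(1, 2).reshape(B, -1, h, w)
        logits = self.head(fused)
        # Upsample coarse logits back to pixel resolution.
        return nn.functional.interpolate(logits, size=(H, W),
                                         mode="bilinear", align_corners=False)

if __name__ == "__main__":
    net = ParallelBranchSeg()
    out = net(torch.randn(1, 3, 64, 64))
    print(out.shape)  # torch.Size([1, 2, 64, 64])
```

The property this sketch preserves from the abstract is that both branches process the input in parallel and exchange information through attention before a shallow decoder upsamples to pixel-level logits; TCNet's actual fusion and gated skip connections are more elaborate than this stand-in.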
Pages: 3123-3136
Page count: 14