Implicit Ray Transformers for Multiview Remote Sensing Image Segmentation

Cited by: 10
Authors
Qi, Zipeng [1 ,2 ,3 ,4 ]
Chen, Hao [1 ,2 ,3 ,4 ]
Liu, Chenyang [1 ,2 ,3 ,4 ]
Shi, Zhenwei [1 ,2 ,3 ,4 ]
Zou, Zhengxia [4 ,5 ]
Affiliations
[1] Beihang Univ, Image Proc Ctr, Sch Astronaut, Beijing 100191, Peoples R China
[2] Beihang Univ, Beijing Key Lab Digital Media, Beijing 100191, Peoples R China
[3] Beihang Univ, State Key Lab Virtual Real Technol & Syst, Beijing 100191, Peoples R China
[4] Shanghai Artificial Intelligence Lab, Shanghai 200232, Peoples R China
[5] Beihang Univ, Sch Astronaut, Dept Guidance Nav & Control, Beijing 100191, Peoples R China
Source
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING | 2023 / Vol. 61
Funding
National Natural Science Foundation of China; Beijing Natural Science Foundation
Keywords
Three-dimensional displays; Feature extraction; Semantics; Task analysis; Remote sensing; Transformers; Annotations; Implicit neural representation (INR); remote sensing (RS); semantic segmentation; transformer; SCENES;
DOI
10.1109/TGRS.2023.3285659
Chinese Library Classification (CLC)
P3 [Geophysics]; P59 [Geochemistry]
Discipline classification codes
0708 ; 070902 ;
Abstract
The mainstream convolutional neural network (CNN)-based remote sensing (RS) image semantic segmentation approaches typically rely on massively labeled training data. Such a paradigm struggles with the problem of RS multiview scene segmentation with limited labeled views due to the lack of consideration of 3-D information within the scene. In this article, we propose "implicit ray transformer (IRT)" based on implicit neural representation (INR) for RS scene semantic segmentation with sparse labels (5% of the images being labeled). We explore a new way of introducing the multiview 3-D structure priors to the task for accurate and view-consistent semantic segmentation. The proposed method includes a two-stage learning process. In the first stage, we optimize a neural field to encode the color and 3-D structure of the RS scene based on multiview images. In the second stage, we design a ray transformer to leverage the relations between the neural field 3-D features and 2-D texture features for learning better semantic representations. Different from previous methods that only consider 3-D priors or 2-D features, we incorporate additional 2-D texture information and 3-D priors by broadcasting CNN features to different point features along the sampled ray. To verify the effectiveness of the proposed method, we construct a challenging dataset containing six synthetic sub-datasets collected from the Carla platform and three real sub-datasets from Google Maps. Experiments show that the proposed method outperforms the CNN-based methods and the state-of-the-art INR-based segmentation methods in quantitative and qualitative metrics. The ablation study shows that under a limited number of fully annotated images, the combination of the 3-D structure priors and 2-D texture can significantly improve the performance and effectively complete missing semantic information in novel views. 
Experiments also demonstrate that the proposed method yields geometry-consistent segmentation results under illumination and viewpoint changes. Our data and code will be made public.
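The abstract describes broadcasting a 2-D CNN texture feature to the 3-D point features sampled along each ray before feeding them to the ray transformer. The following is a minimal sketch of that broadcast-and-concatenate step only; it is not code from the paper, and all names and feature dimensions are hypothetical placeholders.

```python
import numpy as np

# Hypothetical sizes: samples per ray and feature widths are illustrative only.
num_points = 64   # 3-D points sampled along one ray
dim_3d = 32       # neural-field (3-D structure) feature size per point
dim_2d = 16       # CNN (2-D texture) feature size for the ray's pixel

point_feats = np.random.randn(num_points, dim_3d)  # per-point 3-D features
pixel_feat = np.random.randn(dim_2d)               # one 2-D feature per ray

# Broadcast: repeat the pixel's CNN feature for every sampled point along
# the ray, then concatenate it with each point's 3-D feature. The resulting
# tokens combine 3-D priors with 2-D texture, as the abstract describes,
# and could then be processed by a transformer over the ray.
broadcast = np.repeat(pixel_feat[None, :], num_points, axis=0)
ray_tokens = np.concatenate([point_feats, broadcast], axis=1)

print(ray_tokens.shape)  # (64, 48)
```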
Pages: 15