Cross-modal Unsupervised Domain Adaptation for 3D Semantic Segmentation via Bidirectional Fusion-then-Distillation

Cited by: 1

Authors:

Wu, Yao [1]

Xing, Mingwei [2]

Zhang, Yachao [3]

Xie, Yuan [4,5]

Fan, Jianping [6]

Shi, Zhongchao [6]

Qu, Yanyun [2]

Affiliations:

[1] Xiamen Univ, Sch Informat, Xiamen, Peoples R China

[2] Xiamen Univ, Inst Artificial Intelligence, Xiamen, Peoples R China

[3] Tsinghua Univ, Shenzhen, Peoples R China

[4] East China Normal Univ, Shanghai, Peoples R China

[5] East China Normal Univ, Chongqing Inst, Chongqing, Peoples R China

[6] Lenovo Res, Beijing, Peoples R China

Source:

PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023

Funding:

China Postdoctoral Science Foundation; National Natural Science Foundation of China;

Keywords:

3D semantic segmentation; Unsupervised domain adaptation;

DOI:

10.1145/3581783.3612013

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Cross-modal Unsupervised Domain Adaptation (UDA) has become a research hotspot because it reduces the laborious annotation of target-domain samples. Existing methods only make the two modalities mutually mimic each other's outputs within each domain, which encourages the class probability distributions to agree in the different domains. However, these methods ignore the complementarity offered by fused-modality representations in cross-modal learning. In this paper, we propose a cross-modal UDA method for 3D semantic segmentation via Bidirectional Fusion-then-Distillation, named BFtD-xMUDA, which explores cross-modal fusion in UDA and realizes distribution consistency between the outputs of the two domains, not only between the 2D image and the 3D point cloud but also between 2D/3D and the fusion. Our method contains three key components: Model-agnostic Feature Fusion Module (MFFM), Bidirectional Distillation (B-Distill), and Cross-modal Debiased Pseudo-Labeling (xDPL). MFFM generates cross-modal fusion features to establish a latent space that enforces maximum correlation and complementarity between the two heterogeneous modalities. B-Distill exploits bidirectional knowledge distillation, comprising cross-modality and cross-domain fusion distillation, to achieve domain-modality alignment. xDPL models the uncertainty of pseudo-labels through a self-training scheme. Extensive experimental results demonstrate that our method outperforms state-of-the-art competitors in several adaptation scenarios.
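As a rough illustration of the output-mimicking objective that the abstract attributes to existing methods, the minimal sketch below computes a symmetric per-point KL divergence between the class probabilities predicted from a 2D branch and a 3D branch. The function names, the choice of a symmetric KL divergence, and the toy logits are all assumptions made for illustration; this record does not specify the actual losses used by BFtD-xMUDA, and a real training loop would typically also stop gradients through the target side, which is omitted here.

```python
import math

def softmax(logits):
    """Convert one point's raw class scores into probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cross_modal_mimicry_loss(logits_2d, logits_3d):
    """Hypothetical helper: mean over points of KL(p3d || p2d) + KL(p2d || p3d)."""
    total = 0.0
    for l2d, l3d in zip(logits_2d, logits_3d):
        p2d, p3d = softmax(l2d), softmax(l3d)
        total += kl_divergence(p3d, p2d) + kl_divergence(p2d, p3d)
    return total / len(logits_2d)

# Toy logits for two points and three classes (made-up values).
logits_2d = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
logits_3d = [[1.8, 0.7, 0.2], [0.1, 0.4, 2.0]]

loss = cross_modal_mimicry_loss(logits_2d, logits_3d)
agree = cross_modal_mimicry_loss(logits_2d, logits_2d)
print(f"disagreeing branches: {loss:.4f}, identical branches: {agree:.4f}")
```

The loss is zero when both branches predict the same distribution for every point and grows as their predictions diverge, which is the sense in which the two modalities are pushed to agree.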


Pages: 490-498

Page count: 9

