Spatial-aware multi-modal contrastive learning for RGB-D salient object detection and beyond

Cited by: 0
Authors
Chen, Hao [1,2]
Chen, Zichao [1 ]
Wu, Yongliang [1 ]
Chen, Hongzhuo [1 ]
Affiliations
[1] Southeast Univ, Sch Comp Sci & Engn, Nanjing 210096, Peoples R China
[2] Southeast Univ, Key Lab New Generat Artificial Intelligence Technol, Minist Educ, Nanjing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
RGB-D; Contrastive learning; Salient object detection; Semantic segmentation; Network
DOI
10.1016/j.inffus.2025.103362
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
RGB-D dense prediction is widely studied in instrumentation and measurement systems. Prior deep learning-based methods mainly focus on designing cross-modal fusion patterns and typically resort to ImageNet pre-trained weights as initialization to sidestep the scarcity of labeled multi-modal data. However, the domain gap between ImageNet and downstream datasets, together with the modality gap, often leads to biased feature learning and insufficient cross-modal fusion. Instead, we overcome the data insufficiency and bridge the domain and modality gaps by designing a self-supervised multi-modal learning method. We fully exploit the naturally paired spatial information in RGB-D data by crafting a Spatial-aware Multi-modal Multi-scale Contrastive Learning (termed "MMCL") framework. By optimizing the proposed multi-scale cross-modal contrastive (dis)similarity loss, our pre-training method extracts multi-scale modality-specific cues and heterogeneous cross-modal correlations, thereby facilitating downstream multi-modal fusion. Comprehensive experiments on RGB-D salient object detection and semantic segmentation demonstrate the efficacy of our multi-modal self-supervised learning framework and the customized multi-scale contrastive loss for multi-modal dense prediction.
Pages: 10
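The abstract describes the pre-training objective only at a high level, but multi-scale cross-modal contrastive losses of this kind are typically built on InfoNCE (van den Oord et al., 2019), treating each paired RGB/depth sample as a positive and the rest of the batch as negatives. The PyTorch sketch below is a minimal illustration under those assumptions; the function names (crossmodal_nce, multiscale_crossmodal_nce), the global-average pooling, the temperature of 0.07, and the symmetric two-direction loss are hypothetical choices, not the authors' implementation.

```python
# Minimal sketch of an InfoNCE-style cross-modal contrastive loss applied
# across feature scales (hypothetical illustration, not the paper's code).
import torch
import torch.nn.functional as F

def crossmodal_nce(rgb_feat, depth_feat, temperature=0.07):
    """InfoNCE between pooled RGB and depth embeddings of shape (B, C).

    The i-th RGB/depth pair is the positive; all other batch items
    serve as negatives.
    """
    rgb = F.normalize(rgb_feat, dim=1)
    dep = F.normalize(depth_feat, dim=1)
    logits = rgb @ dep.t() / temperature                  # (B, B) similarities
    targets = torch.arange(rgb.size(0), device=rgb.device)
    # Symmetric loss: RGB-to-depth and depth-to-RGB retrieval.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def multiscale_crossmodal_nce(rgb_feats, depth_feats, temperature=0.07):
    """Average the cross-modal loss over a pyramid of (B, C, H, W) maps."""
    losses = []
    for r, d in zip(rgb_feats, depth_feats):
        r = r.mean(dim=(2, 3))   # global average pooling to (B, C)
        d = d.mean(dim=(2, 3))
        losses.append(crossmodal_nce(r, d, temperature))
    return torch.stack(losses).mean()
```

In such a scheme, the per-scale terms encourage cross-modal agreement at multiple receptive fields, which is one plausible way to realize the "multi-scale modality-specific cues and heterogeneous cross-modal correlations" the abstract refers to.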