Spatial-aware multi-modal contrastive learning for RGB-D salient object detection and beyond

Cited by: 0
Authors
Chen, Hao [1,2]
Chen, Zichao [1 ]
Wu, Yongliang [1 ]
Chen, Hongzhuo [1 ]
Affiliations
[1] Southeast Univ, Sch Comp Sci & Engn, Nanjing 210096, Peoples R China
[2] Southeast Univ, Key Lab New Generat Artificial Intelligence Technol, Minist Educ, Nanjing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
RGB-D; Contrastive learning; Salient object detection; Semantic segmentation; Network
DOI
10.1016/j.inffus.2025.103362
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
RGB-D dense prediction is widely studied in instrumentation and measurement systems. Prior deep learning-based methods mainly focus on designing cross-modal fusion patterns and typically resort to ImageNet pre-trained weights as initialization to sidestep the scarcity of labeled multi-modal data. However, the domain gap between ImageNet and downstream datasets, together with the modality gap, often leads to biased feature learning and insufficient cross-modal fusion. Instead, we overcome the data insufficiency and bridge the domain and modality gaps by designing a self-supervised multi-modal learning method. We fully exploit the naturally paired spatial information in RGB-D data by crafting a Spatial-aware Multi-modal Multi-scale Contrastive Learning (termed "MMCL") framework. By optimizing the proposed multi-scale cross-modal contrastive (dis)similarity loss, our pre-training method extracts multi-scale modality-specific cues and heterogeneous cross-modal correlations, thereby facilitating downstream multi-modal fusion. Comprehensive experiments on RGB-D salient object detection and semantic segmentation demonstrate the efficacy of our multi-modal self-supervised learning framework and the customized multi-scale contrastive loss for multi-modal dense prediction.
Pages: 10
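The abstract describes the pre-training objective only at a high level, but multi-scale cross-modal contrastive losses of this kind are typically built on InfoNCE (van den Oord et al., 2019), treating each paired RGB/depth sample as a positive and the rest of the batch as negatives. The PyTorch sketch below is a minimal illustration under those assumptions; the function names (crossmodal_nce, multiscale_crossmodal_nce), the global-average pooling, the temperature of 0.07, and the symmetric two-direction loss are hypothetical choices, not the authors' implementation.

```python
# Minimal sketch of an InfoNCE-style cross-modal contrastive loss applied
# across feature scales (hypothetical illustration, not the paper's code).
import torch
import torch.nn.functional as F

def crossmodal_nce(rgb_feat, depth_feat, temperature=0.07):
    """InfoNCE between pooled RGB and depth embeddings of shape (B, C).

    The i-th RGB/depth pair is the positive; all other batch items
    serve as negatives.
    """
    rgb = F.normalize(rgb_feat, dim=1)
    dep = F.normalize(depth_feat, dim=1)
    logits = rgb @ dep.t() / temperature                  # (B, B) similarities
    targets = torch.arange(rgb.size(0), device=rgb.device)
    # Symmetric loss: RGB-to-depth and depth-to-RGB retrieval.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def multiscale_crossmodal_nce(rgb_feats, depth_feats, temperature=0.07):
    """Average the cross-modal loss over a pyramid of (B, C, H, W) maps."""
    losses = []
    for r, d in zip(rgb_feats, depth_feats):
        r = r.mean(dim=(2, 3))   # global average pooling to (B, C)
        d = d.mean(dim=(2, 3))
        losses.append(crossmodal_nce(r, d, temperature))
    return torch.stack(losses).mean()
```

In such a scheme, the per-scale terms encourage cross-modal agreement at multiple receptive fields, which is one plausible way to realize the "multi-scale modality-specific cues and heterogeneous cross-modal correlations" the abstract refers to.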