Spatial-aware multi-modal contrastive learning for RGB-D salient object detection and beyond

Times Cited: 0
Authors
Chen, Hao [1 ,2 ]
Chen, Zichao [1 ]
Wu, Yongliang [1 ]
Chen, Hongzhuo [1 ]
Affiliations
[1] Southeast Univ, Sch Comp Sci & Engn, Nanjing 210096, Peoples R China
[2] Southeast Univ, Key Lab New Generat Artificial Intelligence Techno, Minist Educ, Nanjing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
RGB-D; Contrastive learning; Salient object detection; Semantic segmentation; NETWORK;
DOI
10.1016/j.inffus.2025.103362
CLC Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
RGB-D dense prediction is widely studied in instrumentation and measurement systems. Prior deep learning-based methods focus mainly on designing cross-modal fusion patterns and typically rely on ImageNet pre-trained weights for initialization to sidestep the scarcity of labeled multi-modal data. However, the domain gap between ImageNet and downstream datasets, together with the modality gap, leads to biased feature learning and insufficient cross-modal fusion. Instead, we overcome the data insufficiency and bridge the domain and modality gaps with a self-supervised multi-modal learning method. We fully exploit the naturally paired spatial information in RGB-D data by crafting a Spatial-aware Multi-modal Multi-scale Contrastive Learning (termed "MMCL") framework. By optimizing the proposed multi-scale cross-modal contrastive (dis)similarity loss, our pre-training method extracts multi-scale modality-specific cues and heterogeneous cross-modal correlations, thereby facilitating downstream multi-modal fusion. Comprehensive experiments on RGB-D salient object detection and semantic segmentation demonstrate the efficacy of our multi-modal self-supervised learning framework and the customized multi-scale contrastive loss for multi-modal dense prediction.
Pages: 10