Spatial-aware multi-modal contrastive learning for RGB-D salient object detection and beyond

Times Cited: 0
Authors
Chen, Hao [1 ,2 ]
Chen, Zichao [1 ]
Wu, Yongliang [1 ]
Chen, Hongzhuo [1 ]
Affiliations
[1] Southeast Univ, Sch Comp Sci & Engn, Nanjing 210096, Peoples R China
[2] Southeast Univ, Key Lab New Generat Artificial Intelligence Techno, Minist Educ, Nanjing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
RGB-D; Contrastive learning; Salient object detection; Semantic segmentation; NETWORK;
DOI
10.1016/j.inffus.2025.103362
CLC Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
RGB-D dense prediction is widely studied in instrumentation and measurement systems. Prior deep learning-based methods focus mainly on designing cross-modal fusion patterns and typically rely on ImageNet pre-trained weights for initialization to sidestep the scarcity of labeled multi-modal data. However, the domain gap between ImageNet and downstream datasets, together with the modality gap, leads to biased feature learning and insufficient cross-modal fusion. Instead, we overcome the data insufficiency and bridge the domain and modality gaps with a self-supervised multi-modal learning method. We fully exploit the naturally paired spatial information in RGB-D data by crafting a Spatial-aware Multi-modal Multi-scale Contrastive Learning (termed "MMCL") framework. By optimizing the proposed multi-scale cross-modal contrastive (dis)similarity loss, our pre-training method extracts multi-scale modality-specific cues and heterogeneous cross-modal correlations, thereby facilitating downstream multi-modal fusion. Comprehensive experiments on RGB-D salient object detection and semantic segmentation demonstrate the efficacy of our multi-modal self-supervised learning framework and the customized multi-scale contrastive loss for multi-modal dense prediction.
Pages: 10