Spatial-aware multi-modal contrastive learning for RGB-D salient object detection and beyond

Cited by: 0
Authors
Chen, Hao [1 ,2 ]
Chen, Zichao [1 ]
Wu, Yongliang [1 ]
Chen, Hongzhuo [1 ]
Affiliations
[1] Southeast Univ, Sch Comp Sci & Engn, Nanjing 210096, Peoples R China
[2] Southeast Univ, Key Lab New Generat Artificial Intelligence Techno, Minist Educ, Nanjing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
RGB-D; Contrastive learning; Salient object detection; Semantic segmentation; Network;
DOI
10.1016/j.inffus.2025.103362
CLC Number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
RGB-D dense prediction is widely studied in various instrumentation and measurement systems. Prior deep learning-based methods mainly focus on the design of cross-modal fusion patterns and typically resort to ImageNet pre-trained weights as initialization to sidestep the scarcity of labeled multi-modal data. However, the domain gap between ImageNet and downstream datasets, together with the modality gap, often leads to biased feature learning and insufficient cross-modal fusion. Instead, we overcome the data insufficiency and bridge the domain and modality gaps by designing a self-supervised multi-modal learning method. We fully exploit the naturally paired spatial information in RGB-D data by crafting a Spatial-aware Multi-modal Multi-scale Contrastive Learning (MMCL) framework. By optimizing the proposed multi-scale cross-modal contrastive (dis)similarity loss, our pre-training method extracts multi-scale modality-specific cues and heterogeneous cross-modal correlations, thereby facilitating downstream multi-modal fusion. Comprehensive experiments on RGB-D salient object detection and semantic segmentation demonstrate the efficacy of our multi-modal self-supervised learning framework and the customized multi-scale contrastive loss for multi-modal dense prediction.
Pages: 10
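To make the cross-modal objective concrete, the following is a minimal PyTorch sketch of a spatially aligned, multi-scale InfoNCE-style contrastive loss in the spirit of the abstract. It is an illustrative reconstruction under stated assumptions, not the paper's actual implementation: the function names (cross_modal_info_nce, multiscale_mmcl_loss), the temperature of 0.07, the per-location positive pairing, and the scale weighting are all hypothetical.

import torch
import torch.nn.functional as F

def cross_modal_info_nce(rgb_feat, depth_feat, temperature=0.07):
    # InfoNCE over spatial locations: for each RGB feature vector, the
    # depth feature at the same (batch, h, w) position is the positive;
    # every other position in the batch acts as a negative.
    # rgb_feat, depth_feat: (B, C, H, W) projected feature maps.
    b, c, h, w = rgb_feat.shape
    q = F.normalize(rgb_feat.permute(0, 2, 3, 1).reshape(-1, c), dim=1)
    k = F.normalize(depth_feat.permute(0, 2, 3, 1).reshape(-1, c), dim=1)
    logits = q @ k.t() / temperature                    # (N, N), N = B*H*W
    labels = torch.arange(q.size(0), device=q.device)   # diagonal = positives
    # Symmetric loss: RGB -> depth and depth -> RGB directions.
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

def multiscale_mmcl_loss(rgb_feats, depth_feats, weights=None):
    # Weighted average of the cross-modal loss over a feature pyramid,
    # one (rgb, depth) pair of feature maps per scale.
    weights = weights or [1.0] * len(rgb_feats)
    total = sum(w * cross_modal_info_nce(r, d)
                for w, r, d in zip(weights, rgb_feats, depth_feats))
    return total / sum(weights)

The "spatial-aware" aspect is modeled here by letting spatial correspondence define the positive pairs. In practice one would sample or pool locations so the (B*H*W) x (B*H*W) similarity matrix stays tractable, and the per-scale feature maps would come from modality-specific projection heads rather than being used raw.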