Spatial-aware multi-modal contrastive learning for RGB-D salient object detection and beyond

Cited by: 0
Authors
Chen, Hao [1 ,2 ]
Chen, Zichao [1 ]
Wu, Yongliang [1 ]
Chen, Hongzhuo [1 ]
Affiliations
[1] Southeast Univ, Sch Comp Sci & Engn, Nanjing 210096, Peoples R China
[2] Southeast Univ, Key Lab New Generat Artificial Intelligence Techno, Minist Educ, Nanjing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
RGB-D; Contrastive learning; Salient object detection; Semantic segmentation; Network;
DOI
10.1016/j.inffus.2025.103362
CLC Number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
RGB-D dense prediction is widely studied in various instrumentation and measurement systems. Prior deep learning-based methods mainly focus on the design of cross-modal fusion patterns and typically resort to ImageNet pre-trained weights as initialization to sidestep the scarcity of labeled multi-modal data. However, the domain gap between ImageNet and downstream datasets, together with the modality gap, often leads to biased feature learning and insufficient cross-modal fusion. Instead, we overcome the data insufficiency and bridge the domain and modality gaps by designing a self-supervised multi-modal learning method. We fully exploit the naturally paired spatial information in RGB-D data by crafting a Spatial-aware Multi-modal Multi-scale Contrastive Learning (MMCL) framework. By optimizing the proposed multi-scale cross-modal contrastive (dis)similarity loss, our pre-training method extracts multi-scale modality-specific cues and heterogeneous cross-modal correlations, thereby facilitating downstream multi-modal fusion. Comprehensive experiments on RGB-D salient object detection and semantic segmentation demonstrate the efficacy of our multi-modal self-supervised learning framework and the customized multi-scale contrastive loss for multi-modal dense prediction.
Pages: 10
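To make the cross-modal objective concrete, the following is a minimal PyTorch sketch of a spatially aligned, multi-scale InfoNCE-style contrastive loss in the spirit of the abstract. It is an illustrative reconstruction under stated assumptions, not the paper's actual implementation: the function names (cross_modal_info_nce, multiscale_mmcl_loss), the temperature of 0.07, the per-location positive pairing, and the scale weighting are all hypothetical.

import torch
import torch.nn.functional as F

def cross_modal_info_nce(rgb_feat, depth_feat, temperature=0.07):
    # InfoNCE over spatial locations: for each RGB feature vector, the
    # depth feature at the same (batch, h, w) position is the positive;
    # every other position in the batch acts as a negative.
    # rgb_feat, depth_feat: (B, C, H, W) projected feature maps.
    b, c, h, w = rgb_feat.shape
    q = F.normalize(rgb_feat.permute(0, 2, 3, 1).reshape(-1, c), dim=1)
    k = F.normalize(depth_feat.permute(0, 2, 3, 1).reshape(-1, c), dim=1)
    logits = q @ k.t() / temperature                    # (N, N), N = B*H*W
    labels = torch.arange(q.size(0), device=q.device)   # diagonal = positives
    # Symmetric loss: RGB -> depth and depth -> RGB directions.
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

def multiscale_mmcl_loss(rgb_feats, depth_feats, weights=None):
    # Weighted average of the cross-modal loss over a feature pyramid,
    # one (rgb, depth) pair of feature maps per scale.
    weights = weights or [1.0] * len(rgb_feats)
    total = sum(w * cross_modal_info_nce(r, d)
                for w, r, d in zip(weights, rgb_feats, depth_feats))
    return total / sum(weights)

The "spatial-aware" aspect is modeled here by letting spatial correspondence define the positive pairs. In practice one would sample or pool locations so the (B*H*W) x (B*H*W) similarity matrix stays tractable, and the per-scale feature maps would come from modality-specific projection heads rather than being used raw.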