Masked Visual Pre-training for RGB-D and RGB-T Salient Object Detection

Times Cited: 0
Authors
Qi, Yanyu [1 ,2 ]
Guo, Ruohao [5 ]
Li, Zhenbo [1 ,2 ,3 ,4 ]
Niu, Dantong [6 ]
Qu, Liao [7 ]
Affiliations
[1] China Agr Univ, Coll Informat & Elect Engn, Beijing, Peoples R China
[2] Natl Innovat Ctr Digital Fishery, Beijing, Peoples R China
[3] Minist Agr & Rural Affairs, Key Lab Smart Farming Technol Aquat Anim & Livest, Beijing, Peoples R China
[4] Beijing Engn & Technol Res Ctr Internet Things Ag, Beijing, Peoples R China
[5] Peking Univ, Beijing, Peoples R China
[6] Univ Calif Berkeley, Berkeley, CA USA
[7] Carnegie Mellon Univ, Pittsburgh, PA USA
Source
PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024 | 2025, Vol. 15035
Keywords
Saliency detection; Multi-modal fusion; Self-supervised learning; Transformer; Deep learning
DOI
10.1007/978-981-97-8620-6_4
CLC Number
TP18 [Theory of Artificial Intelligence]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent advances in deep learning boost the performance of RGB-D salient object detection (SOD) at the expense of using weights pre-trained on large-scale labeled RGB datasets. To escape labor-intensive labeling, RGB-D self-supervised learning based on mutual prediction has been proposed to pre-train networks for the RGB-D SOD task. However, its two-stream approach is cumbersome and far from optimal when transferred to the downstream task. In this paper, we present a neat and effective masked self-supervised pre-training scheme for the RGB-D SOD task. Specifically, we develop a single-stream encoder-decoder framework, with an encoder that operates only on the sampled RGB-D patches and a joint decoder that reconstructs the original RGB images and depth maps simultaneously. This self-supervised pre-training bootstraps our model to learn uni-modal representations and cross-modal synergies, thereby providing a strong initialization for the downstream task. Moreover, we design a mutually exclusive spatial (MES) sampling strategy that samples RGB and depth patches sharing no spatial intersection, which allows the encoder to establish richer cross-modal relationships across more spatial locations. Extensive experiments on six benchmarks show that our approach surpasses previous self-supervised learning methods by large margins and performs favorably against most state-of-the-art (SOTA) models pre-trained on ImageNet. In addition, our model exhibits high robustness on degraded images and transferable generalization on RGB-T benchmarks.
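The MES strategy amounts to drawing two disjoint random subsets of the ViT patch grid, one per modality. A minimal PyTorch sketch follows; the function name mes_sample, the 25% keep ratio, and the equal split between RGB and depth are illustrative assumptions, not the paper's exact configuration.

import torch

def mes_sample(num_patches, keep_ratio=0.25, generator=None):
    # Mutually exclusive spatial (MES) sampling sketch: draw two disjoint
    # sets of patch indices so that RGB and depth never share a spatial
    # location. The keep_ratio and the 50/50 modality split are assumptions,
    # not the paper's reported settings.
    n_keep = int(num_patches * keep_ratio)
    assert 2 * n_keep <= num_patches, "keep_ratio too large for a disjoint split"
    perm = torch.randperm(num_patches, generator=generator)
    rgb_idx = perm[:n_keep]              # visible RGB patches
    depth_idx = perm[n_keep:2 * n_keep]  # visible depth patches, disjoint by construction
    return rgb_idx, depth_idx

# Example: a 14x14 ViT patch grid (196 patches)
rgb_idx, depth_idx = mes_sample(196)
assert not set(rgb_idx.tolist()) & set(depth_idx.tolist())  # no spatial overlap

Because both index sets are carved from a single permutation, disjointness holds by construction, so the encoder sees each retained spatial location through exactly one modality.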
Pages: 49-66
Number of Pages: 18