Masked Visual Pre-training for RGB-D and RGB-T Salient Object Detection

Times Cited: 0
Authors
Qi, Yanyu [1 ,2 ]
Guo, Ruohao [5 ]
Li, Zhenbo [1 ,2 ,3 ,4 ]
Niu, Dantong [6 ]
Qu, Liao [7 ]
Affiliations
[1] China Agr Univ, Coll Informat & Elect Engn, Beijing, Peoples R China
[2] Natl Innovat Ctr Digital Fishery, Beijing, Peoples R China
[3] Minist Agr & Rural Affairs, Key Lab Smart Farming Technol Aquat Anim & Livest, Beijing, Peoples R China
[4] Beijing Engn & Technol Res Ctr Internet Things Ag, Beijing, Peoples R China
[5] Peking Univ, Beijing, Peoples R China
[6] Univ Calif Berkeley, Berkeley, CA USA
[7] Carnegie Mellon Univ, Pittsburgh, PA USA
Source
PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024 | 2025, Vol. 15035
Keywords
Saliency detection; Multi-modal fusion; Self-supervised learning; Transformer; Deep learning
DOI
10.1007/978-981-97-8620-6_4
CLC Number
TP18 [Theory of Artificial Intelligence]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent advances in deep learning boost the performance of RGB-D salient object detection (SOD) at the expense of using weights pre-trained on large-scale labeled RGB datasets. To escape labor-intensive labeling, RGB-D self-supervised learning based on mutual prediction has been proposed to pre-train networks for the RGB-D SOD task. However, its two-stream approach is cumbersome and far from optimal when transferred to the downstream task. In this paper, we present a neat and effective masked self-supervised pre-training scheme for the RGB-D SOD task. Specifically, we develop a single-stream encoder-decoder framework, with an encoder that operates only on the sampled RGB-D patches and a joint decoder that reconstructs the original RGB images and depth maps simultaneously. This self-supervised pre-training bootstraps our model to learn uni-modal representations and cross-modal synergies, thereby providing a strong initialization for the downstream task. Moreover, we design a mutually exclusive spatial (MES) sampling strategy that samples RGB and depth patches sharing no spatial intersection, which allows the encoder to establish richer cross-modal relationships across more spatial locations. Extensive experiments on six benchmarks show that our approach surpasses previous self-supervised learning methods by large margins and performs favorably against most state-of-the-art (SOTA) models pre-trained on ImageNet. In addition, our model exhibits high robustness on degraded images and transferable generalization on RGB-T benchmarks.
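The MES strategy amounts to drawing two disjoint random subsets of the ViT patch grid, one per modality. A minimal PyTorch sketch follows; the function name mes_sample, the 25% keep ratio, and the equal split between RGB and depth are illustrative assumptions, not the paper's exact configuration.

import torch

def mes_sample(num_patches, keep_ratio=0.25, generator=None):
    # Mutually exclusive spatial (MES) sampling sketch: draw two disjoint
    # sets of patch indices so that RGB and depth never share a spatial
    # location. The keep_ratio and the 50/50 modality split are assumptions,
    # not the paper's reported settings.
    n_keep = int(num_patches * keep_ratio)
    assert 2 * n_keep <= num_patches, "keep_ratio too large for a disjoint split"
    perm = torch.randperm(num_patches, generator=generator)
    rgb_idx = perm[:n_keep]              # visible RGB patches
    depth_idx = perm[n_keep:2 * n_keep]  # visible depth patches, disjoint by construction
    return rgb_idx, depth_idx

# Example: a 14x14 ViT patch grid (196 patches)
rgb_idx, depth_idx = mes_sample(196)
assert not set(rgb_idx.tolist()) & set(depth_idx.tolist())  # no spatial overlap

Because both index sets are carved from a single permutation, disjointness holds by construction, so the encoder sees each retained spatial location through exactly one modality.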
Pages: 49-66
Number of Pages: 18