Rethinking RGB-D Salient Object Detection: Models, Data Sets, and Large-Scale Benchmarks

Cited by: 473
Authors
Fan, Deng-Ping [1 ,2 ]
Lin, Zheng [1 ]
Zhang, Zhao [1 ]
Zhu, Menglong [3 ]
Cheng, Ming-Ming [1 ]
Affiliations
[1] Nankai Univ, Coll Comp Sci, Tianjin 300350, Peoples R China
[2] Inception Inst Artificial Intelligence IIAI, Abu Dhabi, U Arab Emirates
[3] Google AI, Mountain View, CA 94043 USA
Keywords
Benchmark; RGB-D; saliency; salient object detection (SOD); Salient Person (SIP) data set; fusion; network; contrast
DOI
10.1109/TNNLS.2020.2996406
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
The use of RGB-D information for salient object detection (SOD) has been extensively explored in recent years. However, relatively few efforts have been devoted to modeling SOD in real-world human activity scenes with RGB-D. In this article, we fill the gap by making the following contributions to RGB-D SOD: 1) we carefully collect a new Salient Person (SIP) data set of ~1K high-resolution images that cover diverse real-world scenes from various viewpoints, poses, occlusions, illuminations, and backgrounds; 2) we conduct a large-scale (and, so far, the most comprehensive) benchmark of contemporary methods, which has long been missing in the field and can serve as a baseline for future research; we systematically summarize 32 popular models and evaluate 18 of them on seven data sets containing a total of about 97k images; and 3) we propose a simple general architecture, called the deep depth-depurator network (D³Net). It consists of a depth depurator unit (DDU) and a three-stream feature learning module (FLM), which perform low-quality depth map filtering and cross-modal feature learning, respectively. These components form a nested structure and are elaborately designed to be learned jointly. D³Net exceeds the performance of all prior contenders across all five metrics under consideration, thus serving as a strong model to advance research in this field. We also demonstrate that D³Net can efficiently extract salient object masks from real scenes, enabling an effective background-changing application at 65 frames/s on a single GPU. All the saliency maps, our new SIP data set, the D³Net model, and the evaluation tools are publicly available at https://github.com/DengPingFan/D3NetBenchmark.
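The gating idea the abstract describes (the DDU filters out unreliable depth maps before the three-stream FLM fuses RGB, depth, and RGB-D features) can be caricatured in a few lines of NumPy. This is only an illustrative sketch of the concept: `depth_quality_score`, `depurate`, the threshold, and the averaging fusion are hypothetical stand-ins, not the paper's learned components.

```python
import numpy as np

def depth_quality_score(depth):
    """Crude hand-crafted proxy for depth-map quality: fraction of valid
    (non-zero) pixels, scaled by relative contrast. The actual DDU in the
    paper learns to judge depth quality; this heuristic is an assumption."""
    valid = depth > 0
    coverage = float(valid.mean())
    if not valid.any():
        return 0.0
    contrast = float(depth[valid].std() / (depth[valid].mean() + 1e-8))
    return coverage * min(contrast, 1.0)

def depurate(rgb_feat, depth_feat, rgbd_feat, depth, thresh=0.2):
    """Gate between streams: if the depth map looks unreliable, fall back
    to the RGB stream alone; otherwise fuse all three streams (a simple
    average here, where the real FLM uses learned cross-modal fusion)."""
    if depth_quality_score(depth) < thresh:
        return rgb_feat
    return (rgb_feat + depth_feat + rgbd_feat) / 3.0
```

With a well-covered, high-contrast depth map the sketch fuses all three streams; with an all-invalid (all-zero) depth map it degrades gracefully to the RGB stream, which is the behavior the nested DDU/FLM design is meant to provide.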
Pages: 2075-2089
Page count: 15