Rethinking RGB-D Salient Object Detection: Models, Data Sets, and Large-Scale Benchmarks

Cited by: 422
Authors
Fan, Deng-Ping [1 ,2 ]
Lin, Zheng [1 ]
Zhang, Zhao [1 ]
Zhu, Menglong [3 ]
Cheng, Ming-Ming [1 ]
Affiliations
[1] Nankai Univ, Coll Comp Sci, Tianjin 300350, Peoples R China
[2] Inception Inst Artificial Intelligence IIAI, Abu Dhabi, U Arab Emirates
[3] Google AI, Mountain View, CA 94043 USA
Keywords
Benchmark; RGB-D; saliency; salient object detection (SOD); Salient Person (SIP) data set; fusion; network; contrast
DOI
10.1109/TNNLS.2020.2996406
CLC classification
TP18 [Theory of artificial intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
The use of RGB-D information for salient object detection (SOD) has been extensively explored in recent years. However, relatively few efforts have been put toward modeling SOD in real-world human activity scenes with RGB-D. In this article, we fill the gap by making the following contributions to RGB-D SOD: 1) we carefully collect a new Salient Person (SIP) data set that consists of ~1K high-resolution images covering diverse real-world scenes with varied viewpoints, poses, occlusions, illuminations, and backgrounds; 2) we conduct a large-scale (and, so far, the most comprehensive) benchmark comparing contemporary methods, which has long been missing in the field and can serve as a baseline for future research; we systematically summarize 32 popular models and evaluate 18 of these 32 models on seven data sets containing a total of about 97k images; and 3) we propose a simple general architecture, called the deep depth-depurator network (D3Net). It consists of a depth depurator unit (DDU) and a three-stream feature learning module (FLM), which perform low-quality depth map filtering and cross-modal feature learning, respectively. These components form a nested structure and are elaborately designed to be learned jointly. D3Net exceeds the performance of all prior contenders across all five metrics under consideration, thus serving as a strong model to advance research in this field. We also demonstrate that D3Net can efficiently extract salient object masks from real scenes, enabling an effective background-changing application at 65 frames/s on a single GPU. All the saliency maps, our new SIP data set, the D3Net model, and the evaluation tools are publicly available at https://github.com/DengPingFan/D3NetBenchmark.
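The gating role of the depth depurator unit, choosing whether to trust the depth modality before cross-modal fusion, can be illustrated with a minimal sketch. This is not the authors' implementation: the entropy-based quality proxy, the threshold `tau`, and the function names are hypothetical stand-ins for the learned DDU described in the paper.

```python
import numpy as np

def depth_quality_score(depth):
    """Hypothetical proxy for depth-map quality: informative depth maps
    tend to have a spread-out (non-degenerate) value histogram, so we
    use normalized histogram entropy as a crude quality score in [0, 1]."""
    hist, _ = np.histogram(depth, bins=16, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    entropy = -np.sum(p * np.log2(p))
    return entropy / 4.0  # log2(16) = 4 is the maximum attainable entropy

def gated_prediction(rgb_pred, rgbd_pred, depth, tau=0.5):
    """Mimic the depurator gate: keep the fused RGB-D stream's output
    when the depth map looks informative, otherwise fall back to the
    RGB-only stream, so a degenerate depth map cannot hurt the result."""
    if depth_quality_score(depth) >= tau:
        return rgbd_pred  # depth passes the quality check: use fusion
    return rgb_pred       # low-quality depth: discard the depth modality
```

In the actual D3Net, this selection is learned jointly with the three feature-learning streams rather than hard-coded; the sketch only conveys why filtering low-quality depth maps before fusion is useful.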
Pages: 2075-2089
Page count: 15