Joint Task-Recursive Learning for RGB-D Scene Understanding

Cited by: 15
Authors
Zhang, Zhenyu [1, 2]
Cui, Zhen [1, 2]
Xu, Chunyan [1, 2]
Jie, Zequn [3]
Li, Xiang [1, 2]
Yang, Jian [1, 2]
Affiliations
[1] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, PCA Lab, Minist Educ, Key Lab Intelligent Percept & Syst High Dimens In, Nanjing 210094, Peoples R China
[2] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Jiangsu Key Lab Image & Video Understanding Socia, Nanjing 210094, Peoples R China
[3] Tencent AI Lab, Nanjing 210094, Peoples R China
Keywords
Task analysis; Estimation; Semantics; Image segmentation; Learning systems; Fuses; Cameras; Depth estimation; surface normal estimation; semantic segmentation; recursive learning; RGB-D scene understanding
DOI
10.1109/TPAMI.2019.2926728
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
RGB-D scene understanding with a monocular camera is an emerging and challenging topic with many potential applications. In this paper, we propose a novel Task-Recursive Learning (TRL) framework that jointly and recurrently performs three representative tasks: depth estimation, surface normal prediction, and semantic segmentation. TRL recursively refines the predictions through a series of task-level interactions, where each cross-task interaction is abstracted as one network block at one time stage. In each stage, we serialize the tasks into a sequence and then recursively perform their interactions. To adaptively enhance counterpart patterns, we encapsulate the interactions in a Task-Attentional Module (TAM) so that the tasks mutually boost one another. Across stages, the historical experience from previous task states is selectively propagated to later stages by a Feature-Selection unit (FS-Unit), which exploits complementary information across tasks. The sequence of task-level interactions also evolves along a coarse-to-fine scale space so that the required details can be refined progressively. Finally, the task-abstracted sequence problem of multi-task prediction is framed as a recursive network. Extensive experiments on the NYU-Depth v2 and SUN RGB-D datasets demonstrate that our method recursively refines the results of the three tasks and achieves state-of-the-art performance.
Pages: 2608-2623
Number of pages: 16