Laplacian Pyramid Neural Network for Dense Continuous-Value Regression for Complex Scenes

Cited by: 17
Authors
Chen, Xuejin [1 ]
Chen, Xiaotian [1 ]
Zhang, Yiteng [1 ]
Fu, Xueyang [1 ]
Zha, Zheng-Jun [1 ]
Affiliations
[1] Univ Sci & Technol China, Natl Engn Lab Brain Inspired Intelligence Technol, Hefei 230026, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Estimation; Task analysis; Laplace equations; Semantics; Image reconstruction; Buildings; Satellites; Deep neural network; dense continuous-value regression (DCR); depth estimation; height estimation; Laplacian pyramid;
DOI
10.1109/TNNLS.2020.3026669
CLC classification
TP18 [Theory of artificial intelligence];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Many computer vision tasks, such as monocular depth estimation and height estimation from a satellite orthophoto, share a common underlying goal: regressing dense continuous values for the pixels of a single image. We define them as dense continuous-value regression (DCR) tasks. Recent approaches based on deep convolutional neural networks significantly improve the performance of DCR tasks, particularly in pixelwise regression accuracy. However, it remains challenging to simultaneously preserve the global structure and fine object details in complex scenes. In this article, we take advantage of the efficiency of the Laplacian pyramid in representing multiscale content to reconstruct high-quality signals for complex scenes. We design a Laplacian pyramid neural network (LAPNet), which consists of a Laplacian pyramid decoder (LPD) for signal reconstruction and an adaptive dense feature fusion (ADFF) module to fuse features from the input image. More specifically, we build an LPD to effectively express both global and local scene structures. In our LPD, the upper and lower levels, respectively, represent scene layouts and shape details. We introduce a residual refinement module to progressively complement high-frequency details for signal prediction at each level. To recover the signal at each individual level in the pyramid, an ADFF module is proposed to adaptively fuse multiscale image features for accurate prediction. We conduct comprehensive experiments to evaluate a number of variants of our model on three important DCR tasks, i.e., monocular depth estimation, single-image height estimation, and density map estimation for crowd counting. Experiments demonstrate that our method achieves new state-of-the-art performance in both qualitative and quantitative evaluation on NYU-D V2 and KITTI for monocular depth estimation, the challenging Urban Semantic 3D (US3D) dataset for satellite height estimation, and four challenging benchmarks for crowd counting. These results demonstrate that the proposed LAPNet is a universal and effective architecture for DCR problems.
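To illustrate the decomposition the abstract builds on, here is a minimal NumPy sketch of a Laplacian pyramid, in the spirit of Burt and Adelson's compact image code: each level stores a high-frequency band (shape details) and the final level stores a low-frequency residual (scene layout), and summing the levels coarse-to-fine recovers the signal exactly. This is not the authors' LAPNet code; the `downsample`/`upsample` helpers are simplified stand-ins (block averaging and nearest-neighbour expansion) for the Gaussian filtering used in practice.

```python
import numpy as np

def downsample(x):
    # Average 2x2 blocks: a simple stand-in for blur + subsample.
    h, w = x.shape
    return x[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x, shape):
    # Nearest-neighbour expansion, trimmed back to the target shape.
    up = np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)
    return up[:shape[0], :shape[1]]

def build_laplacian_pyramid(img, levels):
    """Decompose img into `levels` band-pass maps plus a coarse residual."""
    pyramid, current = [], img
    for _ in range(levels):
        coarse = downsample(current)
        # High-frequency band: what the coarser level cannot represent.
        pyramid.append(current - upsample(coarse, current.shape))
        current = coarse
    pyramid.append(current)  # low-frequency residual (global scene layout)
    return pyramid

def reconstruct(pyramid):
    """Collapse the pyramid coarse-to-fine, adding detail at each level."""
    signal = pyramid[-1]
    for band in reversed(pyramid[:-1]):
        signal = upsample(signal, band.shape) + band
    return signal

depth = np.random.rand(64, 64)          # stand-in for a predicted depth map
pyr = build_laplacian_pyramid(depth, levels=3)
recon = reconstruct(pyr)
print(np.allclose(recon, depth))        # True: reconstruction is exact by construction
```

The exactness of this round trip is what makes the pyramid attractive as a decoder target: LAPNet's LPD predicts the coarse layout first and lets the residual refinement module supply the band-pass details level by level, mirroring the `reconstruct` loop above.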
Pages: 5034-5046
Page count: 13