EffiScene: Efficient Per-Pixel Rigidity Inference for Unsupervised Joint Learning of Optical Flow, Depth, Camera Pose and Motion Segmentation

Citations: 29
Authors
Jiao, Yang [1,2,3]
Tran, Trac D. [2]
Shi, Guangming [1,3]
Affiliations
[1] Xidian Univ, Xian, Peoples R China
[2] Johns Hopkins Univ, Baltimore, MD 21218 USA
[3] Xidian Guangzhou Inst Technol, Guangzhou, Peoples R China
Source
2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021 | 2021
DOI
10.1109/CVPR46437.2021.00549
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
This paper addresses the challenging problem of unsupervised scene flow estimation by jointly learning four low-level vision sub-tasks: optical flow F, stereo depth D, camera pose P, and motion segmentation S. Our key insight is that the rigidity of the scene shares the same inherent geometrical structure as object movements and scene depth. Hence, the rigidity in S can be inferred by jointly coupling F, D, and P, yielding more robust estimation. To this end, we propose a novel scene flow framework named EffiScene with efficient joint rigidity learning, going beyond existing pipelines that rely on independent auxiliary structures. In EffiScene, we first estimate optical flow and depth at a coarse level and then compute the camera pose with the Perspective-n-Point (PnP) method. To jointly learn local rigidity, we design a novel Rigidity From Motion (RfM) layer with three principal components: (i) correlation extraction, (ii) boundary learning, and (iii) outlier exclusion. Final outputs are fused at finer levels based on the rigid map M_R from RfM. To train EffiScene efficiently, two new losses L_bnd and L_unc are designed to prevent trivial solutions and to regularize flow boundary discontinuities. Extensive experiments on the KITTI scene flow benchmark show that our method is effective and significantly improves on state-of-the-art approaches across all sub-tasks, i.e., optical flow (5.19 -> 4.20), depth estimation (3.78 -> 3.46), visual odometry (0.012 -> 0.011), and motion segmentation (0.57 -> 0.62).
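The coarse pose step the abstract describes (flow and depth first, then camera pose via PnP) can be illustrated with a minimal sketch: back-project pixels to 3D using the depth map, displace them with the optical flow to obtain 2D correspondences in the next frame, and solve a RANSAC-based PnP problem. The function name, the synthetic inputs, and the use of OpenCV's solvePnPRansac are illustrative assumptions, not the authors' implementation; the RANSAC inlier mask is only a crude proxy for the learned rigidity reasoning in the RfM layer.

```python
# A minimal sketch (not the authors' implementation) of pose-from-flow-and-depth:
# back-project pixels to 3D with depth, shift them by the optical flow to get
# 2D correspondences in the next frame, then solve RANSAC PnP.
import cv2
import numpy as np

def pose_from_flow_and_depth(depth, flow, K):
    """Recover camera pose (R, t) from a depth map (H, W), a flow field
    (H, W, 2), and intrinsics K (3, 3). Hypothetical helper for illustration."""
    h, w = depth.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([xs, ys, np.ones_like(xs)], -1).reshape(-1, 3).astype(np.float64)

    # Back-project frame-t pixels into 3D camera coordinates: X = z * K^-1 * p.
    pts3d = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)

    # Optical flow supplies the matching 2D locations in frame t+1.
    pts2d = pix[:, :2] + flow.reshape(-1, 2)

    # RANSAC PnP; the inlier mask is a crude stand-in for the rigid region,
    # echoing the outlier-exclusion component of the RfM layer.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float32), pts2d.astype(np.float32),
        K.astype(np.float64), None, reprojectionError=1.0)
    R, _ = cv2.Rodrigues(rvec)
    rigid = np.zeros(h * w, dtype=bool)
    if inliers is not None:
        rigid[inliers.ravel()] = True
    return R, tvec.ravel(), rigid.reshape(h, w)

if __name__ == "__main__":
    # Synthetic check: a non-planar depth map and a pure x-translation of the
    # scene by 0.1 units relative to the camera, which induces the flow
    # u = fx * 0.1 / z at every pixel.
    h, w, fx = 48, 64, 100.0
    K = np.array([[fx, 0.0, w / 2], [0.0, fx, h / 2], [0.0, 0.0, 1.0]])
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    depth = 4.0 + 2.0 * ys / h + xs / w
    flow = np.zeros((h, w, 2))
    flow[..., 0] = fx * 0.1 / depth
    R, t, rigid = pose_from_flow_and_depth(depth, flow, K)
    print("t ~", np.round(t, 3), "inlier ratio:", rigid.mean())  # t ~ [0.1 0 0]
```

In EffiScene itself, the rigid map M_R is produced by the learned RfM layer rather than read off RANSAC inliers, and flow, depth, and pose are then fused and refined jointly at finer levels.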
Pages: 5534-5543
Page count: 10