3D Hierarchical Refinement and Augmentation for Unsupervised Learning of Depth and Pose From Monocular Video

Cited by: 16
Authors
Wang, Guangming [1]
Zhong, Jiquan [1]
Zhao, Shijie [2]
Wu, Wenhua [1]
Liu, Zhe [3]
Wang, Hesheng [1]
Affiliations
[1] Shanghai Jiao Tong Univ, Shanghai Engn Res Ctr Intelligent Control & Manage, Key Lab Syst Control & Informat Proc, Key Lab Marine Intelligent Equipment, Dept Automat, Shanghai 200240, Peoples R China
[2] Shanghai Jiao Tong Univ, Dept Engn Mech, Shanghai 200240, Peoples R China
[3] Shanghai Jiao Tong Univ, AI Inst, MOE Key Lab Artificial Intelligence, Shanghai 200240, Peoples R China
Keywords
Monocular depth estimation; visual odometry; unsupervised learning; pose refinement; 3D augmentation; VIEW SYNTHESIS; REMOVAL;
DOI
10.1109/TCSVT.2022.3215587
Chinese Library Classification
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification codes
0808; 0809
Abstract
Depth and ego-motion estimation is essential for the localization and navigation of autonomous robots and autonomous driving. Recent studies have made it possible to learn per-pixel depth and ego-motion from unlabeled monocular video. This paper proposes a novel unsupervised training framework with 3D hierarchical refinement and augmentation based on explicit 3D geometry. In this framework, depth and pose estimation are hierarchically and mutually coupled so that the estimated pose is refined layer by layer. An intermediate view image is synthesized by warping the pixels of an image with the estimated depth and the coarse pose. The residual pose transformation is then estimated from this new view image and the image of the adjacent frame, and is used to refine the coarse pose. The iterative refinement is implemented in a differentiable manner, so the whole framework can be optimized as a single unit. Meanwhile, a new image augmentation method is proposed for pose estimation: it synthesizes a new view image, augmenting the pose in 3D space while producing a new augmented 2D image. Experiments on KITTI demonstrate that the depth estimation achieves state-of-the-art performance and even surpasses recent approaches that utilize other auxiliary tasks. The visual odometry outperforms all recent unsupervised monocular learning-based methods and achieves performance competitive with the geometry-based method ORB-SLAM2 with back-end optimization. The source code will be released soon at: https://github.com/IRMVLab/HRANet.
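The warping and refinement steps described in the abstract can be illustrated with a minimal geometric sketch. This is not the authors' implementation: it assumes a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, represents poses as 4x4 homogeneous matrices, and assumes the refined pose is obtained by left-multiplying the coarse pose by the residual pose. The function names `reproject` and `refine_pose` are introduced here for illustration only.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def translation(tx: float, ty: float, tz: float) -> Matrix:
    """4x4 homogeneous pose with identity rotation (illustrative helper)."""
    return [
        [1.0, 0.0, 0.0, tx],
        [0.0, 1.0, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def reproject(u: float, v: float, depth: float,
              fx: float, fy: float, cx: float, cy: float,
              pose: Matrix) -> Tuple[float, float]:
    """Warp one pixel into another view using its depth and a relative pose."""
    # Back-project the pixel to a 3D point in the source camera frame.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    # Move the point into the target camera frame.
    p = matmul(pose, [[x], [y], [depth], [1.0]])
    X, Y, Z = p[0][0], p[1][0], p[2][0]
    # Project back onto the image plane of the target camera.
    return fx * X / Z + cx, fy * Y / Z + cy


def refine_pose(coarse: Matrix, residual: Matrix) -> Matrix:
    """Compose a coarse pose with a residual pose (assumed composition order)."""
    return matmul(residual, coarse)


# Usage: warp with the coarse pose, then again with the refined pose.
coarse = translation(1.0, 0.0, 0.0)
residual = translation(0.5, 0.0, 0.0)
refined = refine_pose(coarse, residual)
pixel_coarse = reproject(50.0, 40.0, 10.0, 100.0, 100.0, 50.0, 40.0, coarse)
pixel_refined = reproject(50.0, 40.0, 10.0, 100.0, 100.0, 50.0, 40.0, refined)
```

In a learned pipeline the same reprojection would be applied to every pixel with differentiable sampling, and the residual pose would come from a network rather than being fixed by hand as it is here.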
Pages: 1776-1786
Number of pages: 11