Instance-Aware Multi-Object Self-Supervision for Monocular Depth Prediction

Cited by: 5
Authors
Boulahbal, Houssem Eddine [1, 2, 3]
Voicila, Adrian [4]
Comport, Andrew I. [5]
Affiliations
[1] Cote dAzur Univ, Renault Software Factory, 2600 Rte Cretes, F-06560 Valbonne, France
[2] Cote dAzur Univ, CNRS I3S, 2600 Rte Cretes, F-06560 Valbonne, France
[3] 2000 Route Lucioles BP 121, F-06903 Sophia Antipolis, France
[4] Renault Software Factory, 2600 Rte Cretes, F-06560 Valbonne, France
[5] Cote dAzur Univ, CNRS I3S, 2000 Route Lucioles BP 121, F-06903 Sophia Antipolis, France
Keywords
Depth prediction; motion prediction; multi-object detection
DOI
10.1109/LRA.2022.3194951
Chinese Library Classification (CLC) number
TP24 [Robotics]
Discipline classification code
080202; 1405
Abstract
This letter proposes a self-supervised monocular image-to-depth prediction framework that is trained with an end-to-end photometric loss that handles not only 6-DOF camera motion but also 6-DOF moving object instances. Self-supervision is performed by warping the images across a video sequence using depth and scene motion including object instances. One novelty of the proposed method is the use of the multi-head attention of the transformer network that matches moving objects across time and models their interaction and dynamics. This enables accurate and robust pose estimation for each object instance. Most image-to-depth prediction frameworks make the assumption of rigid scenes, which largely degrades their performance with respect to dynamic objects. Only a few state-of-the-art (SOTA) papers have accounted for dynamic objects. The proposed method is shown to outperform these methods on standard benchmarks and the impact of the dynamic motion on these benchmarks is exposed. Furthermore, the proposed image-to-depth prediction framework is also shown to be competitive with SOTA video-to-depth prediction frameworks.
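As a rough illustration of the warping-based self-supervision the abstract describes, the following is a minimal sketch and not the authors' implementation. It assumes a pinhole camera, nearest-neighbour sampling, grayscale images stored as nested lists, and that each pixel carries an instance id selecting a rigid motion (id 0 for the static background moved by the camera, other ids for moving objects). The names `reproject_pixel`, `photometric_l1`, `motions`, and `instance_ids` are hypothetical.

```python
def reproject_pixel(u, v, depth, fx, fy, cx, cy, R, t):
    """Map target pixel (u, v) with known depth into the source image.

    R is a 3x3 rotation (nested lists) and t a 3-vector describing the
    rigid motion from the target camera frame to the source camera frame.
    Returns (u_src, v_src), or None if the point lands behind the camera.
    """
    # Back-project the pixel to a 3D point (pinhole model, an assumption).
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    z = depth
    # Apply the rigid motion.
    xs = R[0][0] * x + R[0][1] * y + R[0][2] * z + t[0]
    ys = R[1][0] * x + R[1][1] * y + R[1][2] * z + t[1]
    zs = R[2][0] * x + R[2][1] * y + R[2][2] * z + t[2]
    if zs <= 0:
        return None
    # Project back to pixel coordinates.
    return (fx * xs / zs + cx, fy * ys / zs + cy)


def photometric_l1(target, source, depth, intrinsics, motions, instance_ids):
    """Mean absolute intensity error between target and warped source.

    motions maps an instance id to an (R, t) pair; instance_ids gives the
    id of every target pixel. Pixels that warp outside the source image
    are skipped.
    """
    fx, fy, cx, cy = intrinsics
    height, width = len(target), len(target[0])
    total, count = 0.0, 0
    for v in range(height):
        for u in range(width):
            R, t = motions[instance_ids[v][u]]
            hit = reproject_pixel(u, v, depth[v][u], fx, fy, cx, cy, R, t)
            if hit is None:
                continue
            us, vs = round(hit[0]), round(hit[1])  # nearest-neighbour sampling
            if 0 <= us < width and 0 <= vs < height:
                total += abs(target[v][u] - source[vs][us])
                count += 1
    return total / count if count else 0.0
```

In a training setup of this kind, the depth map and the per-instance motions would be network outputs and the sampling would need to be differentiable (for example bilinear); the sketch only shows the geometry that links depth, motion, and the photometric error.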
Pages: 10962-10968
Page count: 7