Instance-Aware Multi-Object Self-Supervision for Monocular Depth Prediction

Cited by: 5

Authors:

Boulahbal, Houssem Eddine [1,2,3]

Voicila, Adrian [4]

Comport, Andrew I. [5]

Affiliations:

[1] Cote d'Azur Univ, Renault Software Factory, 2600 Rte Cretes, F-06560 Valbonne, France

[2] Cote d'Azur Univ, CNRS I3S, 2600 Rte Cretes, F-06560 Valbonne, France

[3] 2000 Route Lucioles BP 121, F-06903 Sophia Antipolis, France

[4] Renault Software Factory, 2600 Rte Cretes, F-06560 Valbonne, France

[5] Cote d'Azur Univ, CNRS I3S, 2000 Route Lucioles BP 121, F-06903 Sophia Antipolis, France

Source:

IEEE ROBOTICS AND AUTOMATION LETTERS | 2022 / Vol. 7 / No. 4

Keywords:

Depth prediction; motion prediction; multi-object detection

DOI:

10.1109/LRA.2022.3194951

Chinese Library Classification (CLC) number:

TP24 [Robotics]

Discipline classification codes:

080202; 1405

Abstract:

This letter proposes a self-supervised monocular image-to-depth prediction framework trained with an end-to-end photometric loss that handles not only 6-DOF camera motion but also 6-DOF moving object instances. Self-supervision is performed by warping the images across a video sequence using depth and scene motion, including the motion of object instances. One novelty of the proposed method is the use of the multi-head attention of a transformer network to match moving objects across time and model their interaction and dynamics. This enables accurate and robust pose estimation for each object instance. Most image-to-depth prediction frameworks assume rigid scenes, which largely degrades their performance on dynamic objects. Only a few state-of-the-art (SOTA) papers have accounted for dynamic objects. The proposed method is shown to outperform these methods on standard benchmarks, and the impact of dynamic motion on these benchmarks is exposed. Furthermore, the proposed image-to-depth prediction framework is shown to be competitive with SOTA video-to-depth prediction frameworks.
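
The warping step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name reproject, the pinhole intrinsics K, the 4x4 rigid transforms T_cam and T_obj, and the boolean mask obj_mask are all assumptions made for illustration. Here T_obj is assumed to be the total rigid motion applied to the pixels of one moving instance. The sketch only computes where each source pixel lands in the target view; a photometric loss would then compare image intensities sampled at those coordinates.

```python
import numpy as np

def reproject(depth, K, T_cam, obj_mask=None, T_obj=None):
    """Map each source pixel to its (u, v) coordinates in the target view.

    All names are hypothetical; this is a sketch, not the paper's code.
    """
    h, w = depth.shape
    # Homogeneous pixel grid, shape (3, h*w), ordered row by row.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])
    # Back-project pixels to 3D points in the source camera frame.
    pts = np.linalg.inv(K) @ pix * depth.ravel()
    pts_h = np.vstack([pts, np.ones(h * w)])
    # Static scene: only the camera motion applies.
    moved = T_cam @ pts_h
    if obj_mask is not None and T_obj is not None:
        # Pixels of a moving instance follow their own rigid motion.
        m = obj_mask.ravel()
        moved[:, m] = (T_obj @ pts_h)[:, m]
    # Project the moved points back onto the image plane.
    proj = K @ moved[:3]
    return (proj[:2] / proj[2]).T.reshape(h, w, 2)
```

With an identity transform the function returns the original pixel grid, which is a convenient sanity check before plugging the coordinates into an image sampler.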


Pages: 10962-10968

Page count: 7

