Enhancing Self-supervised Monocular Depth Estimation via Piece-Wise Pose Estimation and Geometric Constraints

Cited by: 0
Authors
Shyam, Pranjay [1 ]
Okon, Alexandre [1 ]
Yoo, HyunJin [1 ]
Affiliations
[1] Faurecia IRYStec Inc, Montreal, PQ, Canada
Source
2024 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION WORKSHOPS, WACVW 2024 | 2024
Keywords
AWARE;
DOI
10.1109/WACVW60836.2024.00030
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Existing single- and multi-frame monocular depth estimation (MDE) approaches lack depth consistency around object edges, and single-frame approaches additionally produce scale-ambiguous depth, albeit at lower computational cost. We revisit the framework design to address these limitations and propose a joint approach that intertwines depth estimation and panoptic segmentation networks. We present an instance-aware patch-based contrastive loss to ensure depth consistency within an object in feature space. This loss disentangles the embedding triplet and independently refines anchor-positive and anchor-negative pairs, yielding coherent depth within objects. Leveraging the panoptic information, we propose masking small objects during photometric loss computation and extracting 6-DoF pose estimates for dynamic objects in a piece-wise manner, thus facilitating depth estimation in dynamic scenes. We demonstrate that this mechanism is suited to both single- and multi-frame MDE. In addition, to ensure scale fidelity in single-frame MDE, we capitalize on the inherent linear relationship between computed depth and ground truth in self-supervised, photometric loss-based MDE. For this, we propose using a multi-frame depth estimation network as a teacher that injects geometric insight into the student MDE via a global scaling factor, thereby producing absolute depth. We further improve the teacher network architecture by introducing a multi-scale feature fusion mechanism that benefits scenarios with significant camera motion. We perform a comprehensive evaluation to validate the efficacy of the proposed mechanism and obtain state-of-the-art performance on the KITTI dataset.
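To make the abstract's teacher-student scale recovery concrete, the following minimal sketch shows one way a global scaling factor could align a scale-ambiguous single-frame (student) depth map with a multi-frame (teacher) depth map. The function names, the median-ratio estimator, and the valid_mask input are illustrative assumptions and do not reproduce the paper's exact formulation; they only exploit the linear relation between self-supervised depth and metric depth noted above.

```python
import torch


def global_scale_factor(student_depth: torch.Tensor,
                        teacher_depth: torch.Tensor,
                        valid_mask: torch.Tensor) -> torch.Tensor:
    """Estimate one global scale aligning the scale-ambiguous student
    (single-frame) depth with the teacher (multi-frame) depth.

    A robust median-ratio estimator over valid pixels is assumed here
    purely for illustration.
    """
    s = student_depth[valid_mask]
    t = teacher_depth[valid_mask]
    return torch.median(t) / torch.median(s).clamp(min=1e-6)


def scale_consistency_loss(student_depth: torch.Tensor,
                           teacher_depth: torch.Tensor,
                           valid_mask: torch.Tensor) -> torch.Tensor:
    """Penalize deviation of the rescaled student depth from the teacher,
    nudging the student toward (approximately) absolute depth."""
    scale = global_scale_factor(student_depth, teacher_depth, valid_mask).detach()
    rescaled = student_depth * scale
    return torch.abs(rescaled - teacher_depth)[valid_mask].mean()
```

In such a setup the scale would typically be detached (as above) so the student is supervised toward the teacher's metric range without the gradient collapsing the scale estimate itself; the actual distillation objective used in the paper may differ.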
Pages: 221-231
Page count: 11