MonoDiffusion: Self-Supervised Monocular Depth Estimation Using Diffusion Model

Cited by: 1
Authors
Shao, Shuwei [1 ,2 ]
Pei, Zhongcai [1 ]
Chen, Weihai [1 ,2 ]
Sun, Dingchi [1 ]
Chen, Peter C. Y. [3 ]
Li, Zhengguo [4 ]
Affiliations
[1] Beihang Univ, Sch Automat Sci & Elect Engn, Beijing 100191, Peoples R China
[2] Beihang Univ, Hangzhou Innovat Inst, Hangzhou 310052, Zhejiang, Peoples R China
[3] Natl Univ Singapore, Dept Mech Engn, Singapore 117575, Singapore
[4] A*STAR, Inst Infocomm Res, Dept 6, Singapore 138632, Singapore
Funding
National Natural Science Foundation of China
Keywords
Diffusion models; Noise reduction; Circuits and systems; Training; Accuracy; Standards; Diffusion processes; Visualization; Transformers; Three-dimensional displays; Monocular depth estimation; conditional diffusion; self-supervised learning
DOI
10.1109/TCSVT.2024.3509619
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronic and Communication Technology]
Discipline Codes
0808; 0809
Abstract
Over the past few years, self-supervised monocular depth estimation has received widespread attention. Most efforts focus on designing different network architectures and loss functions, or on handling edge cases such as occlusion and dynamic objects. In this work, we take another path and propose a novel conditional diffusion-based generative framework for self-supervised monocular depth estimation, dubbed MonoDiffusion. Because depth ground truth is unavailable in the self-supervised setting, we develop a new pseudo ground-truth diffusion process to guide the diffusion model during training. Instead of diffusing at a fixed high resolution, we perform diffusion in a coarse-to-fine manner, which allows faster inference without sacrificing accuracy and in some cases even improves it. Furthermore, we develop a simple yet effective contrastive depth reconstruction mechanism to enhance the denoising ability of the model. Notably, the proposed MonoDiffusion naturally provides depth uncertainty estimates, which are essential in safety-critical applications. Extensive experiments on the KITTI, Make3D and DIML datasets indicate that MonoDiffusion outperforms prior state-of-the-art self-supervised competitors. The source code will be made publicly available upon acceptance.
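To make the pseudo ground-truth diffusion idea from the abstract concrete, the following is a minimal sketch, not the authors' implementation: since true depth is unavailable, a frozen teacher network's prediction serves as a pseudo target for the forward diffusion, and a conditional denoiser learns to recover it from the noisy depth and the input image. All names (ConditionalDenoiser, train_step), the noise schedule, and the x0-prediction L1 objective are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000                                   # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 2e-2, T)      # linear noise schedule (assumed)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class ConditionalDenoiser(nn.Module):
    """Toy denoiser conditioned on the input image; a stand-in for the real network."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1 + 3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, noisy_depth, image, t):
        # A real model would also embed the timestep t; omitted here for brevity.
        return self.net(torch.cat([noisy_depth, image], dim=1))

def train_step(image, teacher, denoiser, optimizer):
    """One training step of the (assumed) pseudo ground-truth diffusion process."""
    with torch.no_grad():
        pseudo_depth = teacher(image)                      # pseudo GT from a frozen teacher
    b = image.shape[0]
    t = torch.randint(0, T, (b,))
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    noise = torch.randn_like(pseudo_depth)
    # Forward diffusion applied to the pseudo ground truth rather than real depth.
    noisy = a_bar.sqrt() * pseudo_depth + (1 - a_bar).sqrt() * noise
    pred = denoiser(noisy, image, t)                       # denoise conditioned on the image
    loss = F.l1_loss(pred, pseudo_depth)                   # x0-prediction objective (assumed)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

At inference, the trained denoiser would iteratively refine a noise map conditioned on the image; the coarse-to-fine schedule, contrastive depth reconstruction, and uncertainty estimation described in the abstract are not reproduced in this sketch.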
Pages: 3664-3678
Page count: 15