MonoVAN: Visual Attention for Self-Supervised Monocular Depth Estimation

Cited by: 12
Authors
Indyk, Ilia [1]
Makarov, Ilya [2]
Affiliations
[1] HSE Univ, Moscow, Russia
[2] Artificial Intelligence Res Inst AIRI, AI Ctr NUST MISiS, Moscow, Russia
Source
2023 IEEE INTERNATIONAL SYMPOSIUM ON MIXED AND AUGMENTED REALITY, ISMAR | 2023
Keywords
Human-centered computing - Human-computer interaction (HCI) - Interaction paradigms - Mixed / augmented reality; Artificial intelligence - Computer vision - Localization, spatial registration and tracking - 3D reconstruction
DOI
10.1109/ISMAR59233.2023.00138
CLC classification
TP3 [Computing technology and computer technology]
Discipline code
0812
Abstract
Depth estimation is crucial in many computer vision applications, including autonomous driving, robotics, and virtual and augmented reality. An accurate scene depth map benefits localization, spatial registration, and tracking: it converts 2D images into precise 3D coordinates for accurate positioning, seamlessly aligns virtual and real objects in applications such as AR, and improves object tracking by distinguishing distances. The self-supervised monocular approach is particularly promising, as it relies solely on a standard RGB camera and thus eliminates the need for complex and expensive data acquisition setups. Recently, transformer-based architectures have become popular for this problem, but despite their high accuracy they suffer from high computational cost and poor perception of fine details, since they focus mainly on global information. In this paper, we propose MonoVAN, a novel fully convolutional network for monocular depth estimation that incorporates a visual attention mechanism and applies super-resolution techniques in the decoder to better capture fine-grained details in depth maps. To the best of our knowledge, this work pioneers the use of convolutional visual attention in the context of depth estimation. Our experiments on the outdoor KITTI benchmark and the indoor NYUv2 dataset show that our approach outperforms the most advanced self-supervised methods, including state-of-the-art models such as the transformer-based VTDepth from ISMAR'22 and the hybrid convolutional-transformer MonoFormer from AAAI'23, while using a comparable or even smaller number of parameters than its competitors. We also validate the impact of each proposed improvement in isolation, providing evidence of its significant contribution. Code and weights are available at https://github.com/IlyaInd/MonoVAN.
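To make the two architectural ingredients named in the abstract concrete, the sketch below shows (i) a decomposed large-kernel attention block in the spirit of the Visual Attention Network, i.e. the kind of convolutional visual attention the abstract refers to, and (ii) a pixel-shuffle upsampling block, a common super-resolution technique for sharpening decoder outputs. This is a minimal illustrative sketch: module names, kernel sizes, and channel counts are assumptions, not code taken from the MonoVAN repository.

```python
import torch
import torch.nn as nn

class LargeKernelAttention(nn.Module):
    """Decomposed large-kernel attention (LKA) in the VAN style:
    a 5x5 depth-wise conv, a 7x7 depth-wise dilated conv (dilation 3),
    and a 1x1 point-wise conv produce a map that gates the input."""
    def __init__(self, channels: int):
        super().__init__()
        self.dw_conv = nn.Conv2d(channels, channels, kernel_size=5,
                                 padding=2, groups=channels)
        self.dw_dilated = nn.Conv2d(channels, channels, kernel_size=7,
                                    padding=9, dilation=3, groups=channels)
        self.pw_conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.pw_conv(self.dw_dilated(self.dw_conv(x)))
        return x * attn  # element-wise gating, i.e. convolutional attention

class PixelShuffleUpsample(nn.Module):
    """Sub-pixel (pixel-shuffle) upsampling, a super-resolution technique
    often used in decoders instead of plain bilinear interpolation."""
    def __init__(self, in_channels: int, out_channels: int, scale: int = 2):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels * scale ** 2,
                              kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.shuffle(self.conv(x))

# Quick shape check on a dummy feature map.
feats = torch.randn(1, 64, 24, 80)
print(LargeKernelAttention(64)(feats).shape)      # torch.Size([1, 64, 24, 80])
print(PixelShuffleUpsample(64, 32)(feats).shape)  # torch.Size([1, 32, 48, 160])
```

In an architecture like the one described, blocks of the first kind would sit in the encoder stages and blocks of the second kind in the decoder's upsampling path; the actual configuration used by the authors should be taken from the linked repository.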
Pages: 1211-1220
Page count: 10