RT-ViT: Real-Time Monocular Depth Estimation Using Lightweight Vision Transformers

Cited by: 9
Authors
Ibrahem, Hatem [1 ]
Salem, Ahmed [1 ,2 ]
Kang, Hyun-Soo [1 ]
Affiliations
[1] Chungbuk Natl Univ, Sch Elect & Comp Engn, Dept Informat & Commun Engn, Cheongju 28644, South Korea
[2] Assiut Univ, Fac Engn, Elect Engn Dept, Assiut 71515, Egypt
Funding
National Research Foundation of Singapore
Keywords
monocular depth estimation; convolutional neural networks; vision transformers; real-time processing;
DOI
10.3390/s22103849
Chinese Library Classification
O65 [Analytical Chemistry]
Subject Classification Codes
070302; 081704
Abstract
Recent research in computer vision has highlighted the effectiveness of vision transformers (ViTs) across many tasks: unlike convolution, which processes an image locally, they can efficiently model the image globally. ViTs outperform convolutional neural networks (CNNs) in accuracy on many computer vision tasks, but their speed remains an issue owing to the heavy use of transformer layers, each of which contains many fully connected layers. We therefore propose a real-time ViT-based monocular depth estimation method (depth estimation from a single RGB image) with encoder-decoder architectures for indoor and outdoor scenes. The main architecture consists of a vision transformer encoder and a convolutional neural network decoder. We first trained the base vision transformer (ViT-b16) with 12 transformer layers, then reduced the transformer layers to six (ViT-s16, the Small ViT) and to four (ViT-t16, the Tiny ViT) to reach real-time processing. We also tried four different configurations of the CNN decoder network. The proposed architectures learn the depth estimation task efficiently and, by taking advantage of the multi-head self-attention module, produce more accurate depth predictions than fully convolutional methods. We trained the proposed encoder-decoder architecture end-to-end on the challenging NYU-depthV2 and CITYSCAPES benchmarks, then evaluated the trained models on the validation and test sets of the same benchmarks, showing that our method outperforms many state-of-the-art depth estimation methods while running in real time (~20 fps). We also present a fast 3D reconstruction experiment (~17 fps) based on the depth estimated by our method, a real-world application of the approach.
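The abstract contrasts the global receptive field of multi-head self-attention with the local processing of convolution. The sketch below, a simplified illustration and not the paper's implementation (the learned query/key/value projections are omitted, and patch embedding is replaced by random tokens), shows how every token in a ViT encoder attends to every other token in a single layer:

```python
import numpy as np

def multi_head_self_attention(x, num_heads):
    """Minimal multi-head self-attention sketch. Each head computes an
    (n, n) attention map over ALL token pairs, so every patch can
    influence every other patch in one layer, unlike a convolution
    whose receptive field is limited to a local kernel."""
    n, d = x.shape
    assert d % num_heads == 0
    head_dim = d // num_heads
    heads = []
    for h in range(num_heads):
        # Illustrative simplification: q = k = v = the raw head slice
        # (a real ViT applies learned linear projections here).
        q = k = v = x[:, h * head_dim:(h + 1) * head_dim]
        scores = q @ k.T / np.sqrt(head_dim)            # (n, n): all pairs
        scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        heads.append(weights @ v)                       # (n, head_dim)
    return np.concatenate(heads, axis=-1)               # (n, d)

# A 224x224 image split into 16x16 patches yields 14*14 = 196 tokens
# (the "16" in ViT-b16/s16/t16 refers to this patch size).
tokens = np.random.default_rng(0).normal(size=(196, 64))
out = multi_head_self_attention(tokens, num_heads=4)
print(out.shape)  # (196, 64)
```

Stacking fewer such layers (12 in ViT-b16, six in ViT-s16, four in ViT-t16) is the paper's route to real-time throughput, since each layer's fully connected sublayers dominate the cost.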
Pages: 17
Related Papers
50 records
  • [1] Real-Time Monocular Depth Estimation Merging Vision Transformers on Edge Devices for AIoT
    Liu, Xihao
    Wei, Wei
    Liu, Cheng
    Peng, Yuyang
    Huang, Jinhao
    Li, Jun
    IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2023, 72
  • [2] MiniNet: An extremely lightweight convolutional neural network for real-time unsupervised monocular depth estimation
    Liu, Jun
    Li, Qing
    Cao, Rui
    Tang, Wenming
    Qiu, Guoping
    ISPRS JOURNAL OF PHOTOGRAMMETRY AND REMOTE SENSING, 2020, 166 : 255 - 267
  • [3] On the robustness of vision transformers for in-flight monocular depth estimation
    Ercolino, Simone
    Devoto, Alessio
    Monorchio, Luca
    Santini, Matteo
    Mazzaro, Silvio
    Scardapane, Simone
    Industrial Artificial Intelligence, 1 (1):
  • [4] LD-Net: A Lightweight Network for Real-Time Self-Supervised Monocular Depth Estimation
    Xiong, Mingkang
    Zhang, Zhenghong
    Zhang, Tao
    Xiong, Huilin
    IEEE SIGNAL PROCESSING LETTERS, 2022, 29 : 882 - 886
  • [5] Towards Real-Time Monocular Depth Estimation For Mobile Systems
    Deldjoo, Yashar
    Di Noia, Tommaso
    Di Sciascio, Eugenio
    Pernisco, Gaetano
    Reno, Vito
    Stella, Ettore
    MULTIMODAL SENSING AND ARTIFICIAL INTELLIGENCE: TECHNOLOGIES AND APPLICATIONS II, 2021, 11785
  • [6] Real-Time Depth Estimation from a Monocular Moving Camera
    Handa, Aniket
    Sharma, Prateek
    CONTEMPORARY COMPUTING, 2012, 306 : 494 - 495
  • [7] OptiDepthNet: A Real-Time Unsupervised Monocular Depth Estimation Network
    Wei, Feng
    Yin, XingHui
    Shen, Jie
    Wang, HuiBin
    WIRELESS PERSONAL COMMUNICATIONS, 2023, 128 (04) : 2831 - 2846
  • [8] Towards real-time unsupervised monocular depth estimation on CPU
    Poggi, Matteo
    Aleotti, Filippo
    Tosi, Fabio
    Mattoccia, Stefano
    2018 IEEE/RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS (IROS), 2018, : 5848 - 5854
  • [9] Real-time monocular depth estimation with adaptive receptive fields
    Ji, Zhenyan
    Song, Xiaojun
    Guo, Xiaoxuan
    Wang, Fangshi
    Armendariz-Inigo, Jose Enrique
    JOURNAL OF REAL-TIME IMAGE PROCESSING, 2021, 18 (04) : 1369 - 1381