Depth-Guided Vision Transformer With Normalizing Flows for Monocular 3D Object Detection

Cited by: 0
Authors
Cong Pan [1 ,2 ]
Junran Peng [3 ]
Zhaoxiang Zhang [4 ,5 ,6 ,7 ]
Affiliations
[1] Center for Research on Intelligent Perception and Computing (CRIPAC), National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA)
[2] School of Future Technology, University of Chinese Academy of Sciences (UCAS)
[3] Huawei Inc.
[4] IEEE
[5] Institute of Automation, Chinese Academy of Sciences (CASIA)
[6] University of Chinese Academy of Sciences (UCAS)
[7] Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences (HKISI CAS)
Funding
National Natural Science Foundation of China
DOI: not available
CLC classification: TP391.41
Subject classification code: 080203
Abstract
Monocular 3D object detection is challenging due to the lack of accurate depth information. Some methods estimate pixel-wise depth maps with off-the-shelf depth estimators and then use them as an additional input to augment the RGB images. Depth-based methods attempt to convert estimated depth maps to pseudo-LiDAR and then apply LiDAR-based object detectors, or focus on learning to fuse images and depth. However, they show limited performance and efficiency as a result of depth inaccuracy and a complex convolution-based fusion mode. Different from these approaches, our proposed depth-guided vision transformer with normalizing flows (NF-DVT) network uses normalizing flows to build priors on depth maps and thereby obtain more accurate depth information. We then develop a novel Swin-Transformer-based backbone with a fusion module that processes RGB image patches and depth map patches in two separate branches and fuses them via cross-attention so the branches exchange information with each other. Furthermore, with the help of pixel-wise relative depth values in the depth maps, we develop new relative position embeddings in the cross-attention mechanism to capture more accurate sequence ordering of input tokens. Our method is the first Swin-Transformer-based backbone architecture for monocular 3D object detection. Experimental results on the KITTI and the challenging Waymo Open datasets show the effectiveness of our proposed method and its superior performance over previous counterparts.
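The abstract does not give implementation details, but the core fusion idea it describes (queries from one branch attending to keys/values of the other so the RGB and depth branches exchange information) can be illustrated with a minimal single-head cross-attention sketch. This is not the authors' NF-DVT code; all names, shapes, and weights below are hypothetical placeholders chosen only to show the mechanism.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_tokens, kv_tokens, Wq, Wk, Wv):
    # Queries come from one branch, keys/values from the other,
    # so each query token absorbs features from the opposite branch.
    Q = q_tokens @ Wq
    K = kv_tokens @ Wk
    V = kv_tokens @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # scaled dot-product
    return softmax(scores) @ V               # shape: (num_q_tokens, d_k)

rng = np.random.default_rng(0)
d, d_k = 32, 16                               # illustrative dimensions
rgb_tokens = rng.standard_normal((49, d))     # e.g. a 7x7 grid of RGB patches
depth_tokens = rng.standard_normal((49, d))   # matching depth-map patches
Wq, Wk, Wv = (rng.standard_normal((d, d_k)) for _ in range(3))

# RGB queries attend over depth keys/values (one direction of the exchange).
fused = cross_attention(rgb_tokens, depth_tokens, Wq, Wk, Wv)
```

In a two-branch design like the one the abstract sketches, this would be applied in both directions (RGB→depth and depth→RGB); the paper additionally injects depth-guided relative position embeddings into the attention scores, which is omitted here.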
Pages: 673-689 (17 pages)