Graph-DETR3D: Rethinking Overlapping Regions for Multi-View 3D Object Detection

Cited by: 15
Authors
Chen, Zehui [1]
Li, Zhenyu [2]
Zhang, Shiquan [3]
Fang, Liangji [3]
Jiang, Qinhong [3]
Zhao, Feng [1]
Affiliations
[1] University of Science and Technology of China, Hefei, People's Republic of China
[2] Harbin Institute of Technology, Harbin, People's Republic of China
[3] SenseTime Research, Hong Kong, People's Republic of China
Source
Proceedings of the 30th ACM International Conference on Multimedia, MM 2022 | 2022
Keywords
3D object detection; Multi-view detection; Transformer
DOI
10.1145/3503161.3547859
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
3D object detection from multiple image views is a fundamental and challenging task for visual scene understanding. However, accurately detecting objects in 3D space from perspective views is extremely difficult due to the lack of depth information. Recently, DETR3D [50] introduced a novel 3D-2D query paradigm for aggregating multi-view images for 3D object detection and achieved state-of-the-art performance. In this paper, through intensive pilot experiments, we quantify the objects located in different regions and find that "truncated instances" (i.e., those at the border regions of each image) are the main bottleneck hindering the performance of DETR3D. Although it merges features from two adjacent views in the overlapping regions, DETR3D still suffers from insufficient feature aggregation and thus misses the chance to fully boost detection performance. To tackle this problem, we propose Graph-DETR3D, which automatically aggregates multi-view imagery information through graph structure learning. It constructs a dynamic 3D graph between each object query and the 2D feature maps to enhance the object representations, especially at the border regions. In addition, Graph-DETR3D benefits from a novel depth-invariant multi-scale training strategy, which maintains visual depth consistency by simultaneously scaling the image size and the object depth. Extensive experiments on the nuScenes dataset demonstrate the effectiveness and efficiency of Graph-DETR3D. Notably, our best model achieves 49.5 NDS on the nuScenes test leaderboard, setting a new state of the art among published image-view 3D object detectors.
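The abstract does not give the formula behind the depth-invariant multi-scale training strategy. The following is only a minimal sketch of one plausible reading, assuming a pinhole camera whose intrinsics are left unchanged: when an image is resized by a factor `scale`, dividing the object depth by the same factor keeps the object's pixel size consistent with its depth label. The function names, the focal length, and the `depth / scale` convention are illustrative assumptions, not taken from the paper.

```python
import math


def apparent_width_px(focal_px, width_m, depth_m):
    """Pinhole projection: pixel width of an object of metric width at a given depth."""
    return focal_px * width_m / depth_m


def rescale_sample(image_size, depth_m, scale):
    """Resize the image by `scale` and adjust the depth label to match.

    Assumption (not from the paper): intrinsics stay fixed, so the depth
    label is divided by `scale` to stay consistent with the resized pixels.
    """
    width, height = image_size
    new_size = (round(width * scale), round(height * scale))
    return new_size, depth_m / scale


# Hypothetical values chosen for illustration only.
focal_px = 1200.0   # focal length in pixels
width_m = 1.8       # metric width of a car
depth_m = 30.0      # original depth label in metres
scale = 1.5         # multi-scale augmentation factor

new_size, new_depth = rescale_sample((1600, 900), depth_m, scale)

# Pixel width actually observed in the resized image.
resized_px = scale * apparent_width_px(focal_px, width_m, depth_m)
# Pixel width the unchanged intrinsics predict at the adjusted depth.
expected_px = apparent_width_px(focal_px, width_m, new_depth)

print(new_size, new_depth, math.isclose(resized_px, expected_px))
```

Under these assumptions the two pixel widths agree, which is the sense in which scaling the image size and the object depth together preserves the relation between appearance and depth.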
Pages: 5999-6008
Page count: 10