DFA3D: 3D Deformable Attention For 2D-to-3D Feature Lifting

Times Cited: 0
Authors
Li, Hongyang [1,3]
Zhang, Hao [2,3]
Zeng, Zhaoyang [3]
Liu, Shilong [3,4]
Li, Feng [3]
Ren, Tianhe [3]
Zhang, Lei [1,3]
Affiliations
[1] South China University of Technology, Guangzhou, China
[2] The Hong Kong University of Science and Technology, Hong Kong, China
[3] International Digital Economy Academy (IDEA), Shenzhen, China
[4] Tsinghua University, Institute for AI, Department of Computer Science and Technology, BNRist Center, Beijing, China
Source
2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023
Keywords
(none listed)
DOI
(not available)
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
In this paper, we propose a new operator, called 3D DeFormable Attention (DFA3D), for 2D-to-3D feature lifting, which transforms multi-view 2D image features into a unified 3D space for 3D object detection. Existing feature lifting approaches, such as Lift-Splat-based and 2D attention-based methods, either use estimated depth to get pseudo-LiDAR features and then splat them to a 3D space, which is a one-pass operation without feature refinement, or ignore depth and lift features by 2D attention mechanisms, which achieves finer semantics while suffering from a depth ambiguity problem. In contrast, our DFA3D-based method first leverages the estimated depth to expand each view's 2D feature map to 3D and then utilizes DFA3D to aggregate features from the expanded 3D feature maps. With the help of DFA3D, the depth ambiguity problem can be effectively alleviated from the root, and the lifted features can be progressively refined layer by layer, thanks to the Transformer-like architecture. In addition, we propose a mathematically equivalent implementation of DFA3D which significantly improves its memory efficiency and computational speed. We integrate DFA3D into several methods that use 2D attention-based feature lifting, with only a few code modifications, and evaluate them on the nuScenes dataset. The experimental results show a consistent improvement of +1.41% mAP on average, and up to +15.1% mAP when high-quality depth information is available, demonstrating the superiority, applicability, and great potential of DFA3D. The code is available at https://github.com/IDEA-Research/3D-deformable-attention.git.
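To make the lifting step concrete, below is a minimal PyTorch sketch of the naive form described above: each view's 2D feature map is expanded along the depth axis by its estimated per-pixel depth distribution, and each query then aggregates a set of 3D sampling points from the expanded volume by trilinear interpolation. All function names (lift_feature_map, dfa3d_single_view) and tensor shapes are illustrative assumptions, not the authors' released API.

```python
import torch
import torch.nn.functional as F

def lift_feature_map(feat_2d, depth_dist):
    """Expand one view's 2D feature map to a 3D volume along the depth axis.

    feat_2d:    (C, H, W)  image features
    depth_dist: (D, H, W)  per-pixel depth distribution (softmax over D bins)
    returns:    (C, D, H, W)  expanded 3D feature map (per-pixel outer product)
    """
    return feat_2d.unsqueeze(1) * depth_dist.unsqueeze(0)

def dfa3d_single_view(feat_3d, sample_uvd, attn_weights):
    """Naive 3D deformable attention over one view's expanded volume.

    feat_3d:      (C, D, H, W)  expanded 3D feature map
    sample_uvd:   (Q, P, 3)     sampling locations (u, v, d), normalized to [-1, 1]
    attn_weights: (Q, P)        attention weights (softmax over the P points)
    returns:      (Q, C)        aggregated feature per query
    """
    Q, P, _ = sample_uvd.shape
    # For 5D input, grid_sample expects a grid of shape (N, D_out, H_out, W_out, 3),
    # with the last dim ordered (x, y, z) = (W, H, D); mode='bilinear' is trilinear here.
    grid = sample_uvd.view(1, Q, P, 1, 3)
    sampled = F.grid_sample(feat_3d.unsqueeze(0), grid,
                            mode='bilinear', align_corners=False)  # (1, C, Q, P, 1)
    sampled = sampled.squeeze(0).squeeze(-1)                       # (C, Q, P)
    # Weighted sum over sampling points for each query.
    return torch.einsum('cqp,qp->qc', sampled, attn_weights)

# Example (hypothetical sizes): 256-dim features, 64 depth bins, 900 queries, 8 points each.
C, D, H, W, Q, P = 256, 64, 32, 88, 900, 8
feat3d = lift_feature_map(torch.randn(C, H, W), torch.randn(D, H, W).softmax(dim=0))
out = dfa3d_single_view(feat3d,
                        torch.rand(Q, P, 3) * 2 - 1,       # (u, v, d) in [-1, 1]
                        torch.randn(Q, P).softmax(dim=-1)) # weights sum to 1 per query
assert out.shape == (Q, C)
```

The explicitly materialized (C, D, H, W) volume is exactly the memory bottleneck the abstract alludes to; the paper's mathematically equivalent implementation avoids building it, e.g. by decomposing each trilinear sample into a bilinear sample of the 2D feature map scaled by an interpolated depth weight.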
Pages: 6661-6670 (10 pages)
References
41 in total (first 10 listed)
[1] Bruls, T. IEEE Intelligent Vehicles Symposium (IV), 2019, p. 302. DOI: 10.1109/IVS.2019.8814056.
[2] Caesar, H.; Bankiti, V.; Lang, A.H.; Vora, S.; Liong, V.E.; Xu, Q.; Krishnan, A.; Pan, Y.; Baldan, G.; Beijbom, O. nuScenes: A multimodal dataset for autonomous driving. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11618-11628.
[3] Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. Computer Vision - ECCV 2020, Part I, LNCS 12346, 2020, pp. 213-229.
[4] Chen, Z.; Li, Z.; Zhang, S.; Fang, L.; Jiang, Q.; Zhao, F. Graph-DETR3D: Rethinking Overlapping Regions for Multi-View 3D Object Detection. Proceedings of the 30th ACM International Conference on Multimedia (MM 2022), 2022, pp. 5999-6008.
[5] Chu, X. arXiv:2301.05711, 2023.
[6] Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 764-773.
[7] Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep Ordinal Regression Network for Monocular Depth Estimation. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2002-2011.
[8] He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. DOI: 10.1109/CVPR.2016.90.
[9] Huang, J. arXiv:2112.11790, 2021.
[10] Huang, J. arXiv:2203.17054, 2022.