VISTA: Boosting 3D Object Detection via Dual Cross-VIew SpaTial Attention

Cited by: 54
Authors
Deng, Shengheng [1]
Liang, Zhihao [1,3]
Sun, Lin [2]
Jia, Kui [1,4]
Affiliations
[1] South China University of Technology, Guangzhou, China
[2] Magic Leap, Sunnyvale, CA, USA
[3] DexForce Technology Co., Ltd., Shenzhen, China
[4] Peng Cheng Laboratory, Shenzhen, China
Source
2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
DOI
10.1109/CVPR52688.2022.00826
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405
Abstract
Detecting objects from LiDAR point clouds is of tremendous significance in autonomous driving. In spite of good progress, accurate and reliable 3D detection has yet to be achieved due to the sparsity and irregularity of LiDAR point clouds. Among existing strategies, multi-view methods have shown great promise by leveraging the more comprehensive information from both the bird's eye view (BEV) and the range view (RV). These multi-view methods either refine the proposals predicted from a single view via fused features, or fuse the features without considering the global spatial context; consequently, their performance is limited. In this paper, we propose to adaptively fuse multi-view features in a global spatial context via Dual Cross-VIew SpaTial Attention (VISTA). The proposed VISTA is a novel plug-and-play fusion module in which the multi-layer perceptron widely adopted in standard attention modules is replaced with a convolutional one. Thanks to the learned attention mechanism, VISTA can produce fused features of high quality for the prediction of proposals. We decouple the classification and regression tasks in VISTA, and apply an additional constraint on the attention variance that enables the attention module to focus on specific targets instead of generic points. We conduct thorough experiments on the nuScenes and Waymo benchmarks; the results confirm the efficacy of our designs. At the time of submission, our method achieves 63.0% overall mAP and 69.8% NDS on the nuScenes benchmark, outperforming all published methods by up to 24% in safety-crucial categories such as cyclist.
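To make the fusion mechanism described in the abstract concrete, below is a minimal PyTorch sketch of a cross-view attention block in which the linear projections of standard attention are replaced with convolutions, together with a per-query attention-variance term. This is an illustrative reading of the abstract, not the authors' implementation: the class name CrossViewSpatialAttention, the 3x3 kernel size, the single-head formulation, and the variance computation are all assumptions.

import torch
import torch.nn as nn

class CrossViewSpatialAttention(nn.Module):
    """Hypothetical sketch: cross-attention from BEV queries to RV keys/values,
    with the linear projections of standard attention replaced by 3x3
    convolutions so each query/key token carries local spatial context."""

    def __init__(self, channels: int):
        super().__init__()
        # Convolutional projections in place of the usual MLP/linear layers.
        self.q_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.k_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.v_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.scale = channels ** -0.5

    def forward(self, bev: torch.Tensor, rv: torch.Tensor):
        # bev: (B, C, Hb, Wb) bird's-eye-view features (queries)
        # rv:  (B, C, Hr, Wr) range-view features (keys/values)
        B, C, Hb, Wb = bev.shape
        q = self.q_conv(bev).flatten(2).transpose(1, 2)   # (B, Nq, C)
        k = self.k_conv(rv).flatten(2)                    # (B, C, Nk)
        v = self.v_conv(rv).flatten(2).transpose(1, 2)    # (B, Nk, C)
        attn = torch.softmax(q @ k * self.scale, dim=-1)  # (B, Nq, Nk)
        fused = (attn @ v).transpose(1, 2).reshape(B, C, Hb, Wb)
        # Variance of each query's attention distribution: a peaked
        # (high-variance) distribution attends to specific targets rather
        # than spreading uniformly over generic points. One plausible way
        # to realise the paper's attention-variance constraint is to add
        # a loss term that rewards this variance; the exact formulation
        # and loss weight here are placeholders.
        attn_var = attn.var(dim=-1).mean()
        return fused, attn_var

# Example usage with assumed shapes: fuse 128-channel BEV and RV maps.
# block = CrossViewSpatialAttention(128)
# fused, var_term = block(torch.randn(2, 128, 64, 64),
#                         torch.randn(2, 128, 32, 256))

In the paper, such a module would be dropped into a multi-view detector as a plug-and-play fusion stage, with separate attention branches for the decoupled classification and regression tasks; that decoupling is omitted from this sketch for brevity.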
Pages: 8438-8447
Page count: 10