LiDAR-based Online 3D Video Object Detection with Graph-based Message Passing and Spatiotemporal Transformer Attention

Cited by: 108

Authors:

Yin, Junbo [1,2]

Shen, Jianbing [1,4]

Guan, Chenye [2,3]

Zhou, Dingfu [2,3]

Yang, Ruigang [2,3,5]

Affiliations:

[1] Beijing Institute of Technology, School of Computer Science, Beijing Laboratory of Intelligent Information Technology, Beijing, China

[2] Baidu Research, Beijing, China

[3] National Engineering Laboratory of Deep Learning Technology and Application, Beijing, China

[4] Inception Institute of Artificial Intelligence, Abu Dhabi, United Arab Emirates

[5] University of Kentucky, Lexington, KY, USA

Source:

2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020) | 2020

DOI:

10.1109/CVPR42600.2020.01151

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Existing LiDAR-based 3D object detectors usually focus on single-frame detection and ignore the spatiotemporal information in consecutive point cloud frames. In this paper, we propose an end-to-end online 3D video object detector that operates on point cloud sequences. The proposed model comprises a spatial feature encoding component and a spatiotemporal feature aggregation component. In the former component, a novel Pillar Message Passing Network (PMPNet) is proposed to encode each discrete point cloud frame. It adaptively collects information for a pillar node from its neighbors by iterative message passing, which effectively enlarges the receptive field of the pillar feature. In the latter component, we propose an Attentive Spatiotemporal Transformer GRU (AST-GRU) to aggregate the spatiotemporal information, which enhances the conventional ConvGRU with an attentive memory gating mechanism. AST-GRU contains a Spatial Transformer Attention (STA) module and a Temporal Transformer Attention (TTA) module, which emphasize the foreground objects and align the dynamic objects, respectively. Experimental results demonstrate that the proposed 3D video object detector achieves state-of-the-art performance on the large-scale nuScenes benchmark.
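The abstract's claim that iterative message passing enlarges a pillar's receptive field can be illustrated with a toy sketch. This is not the PMPNet from the paper: the function name `message_passing`, the mean aggregation, and the plain averaging update are all assumptions chosen for illustration, standing in for whatever learned message and update functions the authors use. The sketch only shows that after k rounds a node's feature can depend on nodes up to k hops away.

```python
from collections import defaultdict


def message_passing(features, edges, steps):
    """Run `steps` rounds of mean-aggregation message passing.

    features: dict mapping node id -> list of floats (toy "pillar" features)
    edges:    iterable of undirected (u, v) pairs
    Hypothetical helper for illustration only.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    state = {node: list(feat) for node, feat in features.items()}
    for _ in range(steps):
        new_state = {}
        for node, feat in state.items():
            nbrs = neighbors[node]
            if not nbrs:
                # Isolated node: nothing to aggregate, keep its feature.
                new_state[node] = list(feat)
                continue
            # Message: mean of the neighbors' current features.
            msg = [
                sum(state[n][d] for n in nbrs) / len(nbrs)
                for d in range(len(feat))
            ]
            # Update: average of own feature and message
            # (an assumed stand-in for a learned, gated update).
            new_state[node] = [(f + m) / 2 for f, m in zip(feat, msg)]
        state = new_state
    return state


# Chain of four toy pillars 0 - 1 - 2 - 3; only pillar 3 carries a signal.
features = {0: [0.0], 1: [0.0], 2: [0.0], 3: [1.0]}
edges = [(0, 1), (1, 2), (2, 3)]

for k in (1, 2, 3):
    print(k, message_passing(features, edges, k)[0])
```

In this chain, pillar 0 is three hops from pillar 3, so its feature stays at zero for the first two rounds and becomes non-zero only on the third, which is the receptive-field growth the abstract describes.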

Pages: 11492-11501

Page count: 10
