Temporal-Channel Transformer for 3D Lidar-Based Video Object Detection for Autonomous Driving

Cited by: 126
Authors
Yuan, Zhenxun [1]
Song, Xiao [2]
Bai, Lei [1]
Wang, Zhe [2]
Ouyang, Wanli [1]
Affiliations
[1] Univ Sydney, Sch Elect & Informat Engn, Sydney, NSW 2006, Australia
[2] Sense Time Grp Ltd, Beijing 100080, Peoples R China
Funding
Australian Research Council
Keywords
Three-dimensional displays; Object detection; Feature extraction; Laser radar; Correlation; Decoding; Head; Lidar-based video; 3D object detection; transformer; temporal-channel attention; CNN;
DOI
10.1109/TCSVT.2021.3082763
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronic & Communication Technology]
Subject Classification Codes
0808; 0809
Abstract
Strong industrial demand for autonomous driving has spurred vigorous interest in 3D object detection and produced many excellent 3D object detection algorithms. However, the vast majority of these algorithms model only single-frame data, ignoring the temporal cues in video sequences. In this work, we propose a new transformer, the Temporal-Channel Transformer (TCTR), to model temporal-channel and spatial relationships for video object detection from Lidar data. A distinctive design of this transformer is that the encoder and the decoder handle different information: the encoder encodes the temporal-channel information of multiple frames, while the decoder decodes the spatial information of the current frame in a voxel-wise manner. Specifically, the temporal-channel encoder exploits the correlations among features from different channels and frames to encode cross-channel and cross-frame information, whereas the spatial decoder decodes the information for each location of the current frame. Before the detection head performs object detection, a gate mechanism re-calibrates the current-frame features, filtering out object-irrelevant information by repeatedly refining the target-frame representation along the up-sampling process. Experimental results show that TCTR achieves state-of-the-art performance among grid voxel-based 3D object detection methods on the nuScenes benchmark.
Pages: 2068-2078 (11 pages)
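
The abstract describes two mechanisms that lend themselves to a compact illustration: attention computed over (frame, channel) tokens rather than spatial tokens, and a sigmoid gate that re-calibrates the current-frame features. Below is a minimal PyTorch sketch of those two ideas under stated assumptions; the module names (TemporalChannelAttention, GatedRecalibration), tensor shapes, and the single-head, single-layer formulation are illustrative choices, not the authors' released implementation.

# Hypothetical sketch of the temporal-channel attention and gating ideas
# from the abstract; shapes and module names are illustrative assumptions.
import torch
import torch.nn as nn


class TemporalChannelAttention(nn.Module):
    """Self-attention where each (frame, channel) pair is one token and the
    flattened spatial map serves as that token's feature vector."""

    def __init__(self, spatial_dim: int):
        super().__init__()
        self.q = nn.Linear(spatial_dim, spatial_dim)
        self.k = nn.Linear(spatial_dim, spatial_dim)
        self.v = nn.Linear(spatial_dim, spatial_dim)
        self.scale = spatial_dim ** -0.5

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, C, H, W) -> tokens: (B, T*C, H*W)
        b, t, c, h, w = feats.shape
        tokens = feats.reshape(b, t * c, h * w)
        q, k, v = self.q(tokens), self.k(tokens), self.v(tokens)
        # Attention weights measure correlations across frames and channels.
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = attn @ v
        return out.reshape(b, t, c, h, w)


class GatedRecalibration(nn.Module):
    """Sigmoid gate that suppresses object-irrelevant responses in the
    current-frame features, conditioned on the aggregated context."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, current: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([current, context], dim=1)))
        return current * g


if __name__ == "__main__":
    b, t, c, h, w = 2, 3, 16, 32, 32
    feats = torch.randn(b, t, c, h, w)          # features of T Lidar frames
    enc = TemporalChannelAttention(spatial_dim=h * w)
    encoded = enc(feats)                         # temporal-channel encoding
    gate = GatedRecalibration(channels=c)
    refined = gate(feats[:, -1], encoded[:, -1])  # re-calibrate current frame
    print(refined.shape)                         # torch.Size([2, 16, 32, 32])

In this sketch each of the T*C channel maps becomes one attention token whose feature vector is the flattened H*W grid, so the attention matrix directly captures correlations across frames and channels, mirroring the temporal-channel encoder described above; a full model would stack such layers and add the voxel-wise spatial decoder and detection head, which are omitted here.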