TM2B: Transformer-Based Motion-to-Box Network for 3D Single Object Tracking on Point Clouds

Cited by: 0
Authors
Xu, Anqi [1]
Nie, Jiahao [1]
He, Zhiwei [1]
Lv, Xudong [1]
Affiliations
[1] Sch Hangzhou Dianzi Univ, Hangzhou 310018, Peoples R China
Source
IEEE ROBOTICS AND AUTOMATION LETTERS | 2024 / Vol. 9 / No. 08
Keywords
Transformers; Accuracy; Three-dimensional displays; Target tracking; Object tracking; Feature extraction; Point cloud compression; 3D single object tracking; motion-to-box; transformer;
DOI
10.1109/LRA.2024.3418274
Chinese Library Classification number
TP24 [Robotics];
Discipline classification code
080202; 1405;
Abstract
3D single object tracking plays a crucial role in numerous applications such as autonomous driving. Recent trackers based on the motion-centric paradigm perform well because they exploit motion cues to infer the target's relative motion across successive frames, which effectively overcomes the significant appearance variations of targets and distractors caused by occlusion. However, such a motion-centric paradigm tends to require a multi-stage motion-to-box process to refine the motion cues, which suffers from tedious hyper-parameter tuning and elaborate subtask designs. In this letter, we propose a novel transformer-based motion-to-box network (TM2B), which employs a learnable relation modeling transformer (LRMT) to generate accurate motion cues without multi-stage refinements. The proposed LRMT contains two novel attention mechanisms: hierarchical interactive attention and learnable query attention. The former builds a learnable, fixed-size sampling set for each query on multi-scale feature maps, enabling each query to adaptively select prominent sampling elements and thus effectively encode multi-scale features in a lightweight manner. The latter calculates the weighted sum of the encoded features with a learnable global query, making it possible to extract valuable motion cues from all available features and thereby achieve accurate object tracking. Extensive experiments demonstrate that TM2B achieves state-of-the-art performance on KITTI, NuScenes and Waymo Open Dataset, while obtaining a significant improvement in inference speed over previous leading methods, achieving 56.8 FPS on a single NVIDIA 1080Ti GPU. The code is available at TM2B.
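As a rough illustration of the "weighted sum of the encoded features with a learnable global query" mentioned in the abstract, the sketch below pools a set of feature vectors with one query vector using scaled dot-product attention. It is not the authors' implementation: the function name `learnable_query_attention`, the feature sizes, the random stand-in values, and the omission of key/value projections are all assumptions made for illustration. In a trained model the query would be a learned parameter.

```python
import math
import random


def learnable_query_attention(features, query):
    """Pool N feature vectors into one vector using a single global query.

    features: list of N vectors (each a list of D floats), standing in for
              the encoded features (hypothetical data, not from the paper).
    query:    list of D floats; a trained parameter in a real model,
              random here.
    Returns (pooled, weights): the weighted sum (D floats) and the N
    softmax attention weights.
    """
    dim = len(query)
    # Scaled dot-product score between the global query and every feature.
    scores = [
        sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
        for feat in features
    ]
    # Numerically stable softmax over the N scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the features, one output value per channel.
    pooled = [
        sum(w * feat[c] for w, feat in zip(weights, features))
        for c in range(dim)
    ]
    return pooled, weights


# Hypothetical input: 16 encoded features with 8 channels each.
random.seed(0)
features = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
query = [random.gauss(0.0, 1.0) for _ in range(8)]
pooled, weights = learnable_query_attention(features, query)
```

Because a single query attends over every feature, the output has a fixed size regardless of how many features are supplied, which is one plausible reason such a design can aggregate cues "from all available features" in one step.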
Pages: 7078 - 7085
Page count: 8
Related papers
50 in total
  • [11] DTSSD: Dual-Channel Transformer-Based Network for Point-Based 3D Object Detection
    Zheng, Zhijie
    Huang, Zhicong
    Zhao, Jingwen
    Hu, Haifeng
    Chen, Dihu
    IEEE SIGNAL PROCESSING LETTERS, 2023, 30 : 798 - 802
  • [12] Relation Graph Network for 3D Object Detection in Point Clouds
    Feng, Mingtao
    Gilani, Syed Zulqarnain
    Wang, Yaonan
    Zhang, Liang
    Mian, Ajmal
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 : 92 - 107
  • [13] Particle Filter Based Object Tracking of 3D Sparse Point Clouds for Autopilot
    Du, Yu
    ShangGuan, Wei
    Chai, LinGuo
    2018 CHINESE AUTOMATION CONGRESS (CAC), 2018, : 1102 - 1107
  • [14] Geo-Localization With Transformer-Based 2D-3D Match Network
    Li, Laijian
    Ma, Yukai
    Tang, Kai
    Zhao, Xiangrui
    Chen, Chao
    Huang, Jianxin
    Mei, Jianbiao
    Liu, Yong
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2023, 8 (08) : 4855 - 4862
  • [15] MD3D: Mixture-Density-Based 3D Object Detection in Point Clouds
    Choi, Jaeseok
    Song, Yeji
    Kim, Yerim
    Yoo, Jaeyoung
    Kwak, Nojun
    IEEE ACCESS, 2022, 10 : 104011 - 104022
  • [16] Transformer-Based Global PointPillars 3D Object Detection Method
    Zhang, Lin
    Meng, Hua
    Yan, Yunbing
    Xu, Xiaowei
    ELECTRONICS, 2023, 12 (14)
  • [17] DA-Net: Density-Aware 3D Object Detection Network for Point Clouds
    Wang, Shuhua
    Lu, Ke
    Xue, Jian
    Zhao, Yang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2025, 27 : 665 - 678
  • [18] Sewer defect detection from 3D point clouds using a transformer-based deep learning model
    Zhou, Yunxiang
    Ji, Ankang
    Zhang, Limao
    AUTOMATION IN CONSTRUCTION, 2022, 136
  • [19] Real-Time 3D Single Object Tracking With Transformer
    Shan, Jiayao
    Zhou, Sifan
    Cui, Yubo
    Fang, Zheng
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 2339 - 2353
  • [20] U-shaped network based on Transformer for 3D point clouds semantic segmentation
    Zhang, Jiazhe
    Li, Xingwei
    Zhao, Xianfa
    Ge, Yizhi
    Zhang, Zheng
    2021 THE 5TH INTERNATIONAL CONFERENCE ON VIDEO AND IMAGE PROCESSING, ICVIP 2021, 2021, : 170 - 176