Exploring Recurrent Long-Term Temporal Fusion for Multi-View 3D Perception

Cited by: 2
Authors
Han, Chunrui [1]
Yang, Jinrong [2]
Sun, Jianjian [1]
Ge, Zheng [1]
Dong, Runpei [3]
Zhou, Hongyu [1]
Mao, Weixin [4]
Peng, Yuang [5]
Zhang, Xiangyu [1]
Affiliations
[1] Megvii Technol, Beijing 100080, Peoples R China
[2] Huazhong Univ Sci & Technol, Wuhan 430074, Peoples R China
[3] Xi An Jiao Tong Univ, Beijing 100084, Peoples R China
[4] Waseda Univ, Fukuoka 8070832, Japan
[5] Tsinghua Univ, Jian 343200, Peoples R China
Source
IEEE ROBOTICS AND AUTOMATION LETTERS | 2024 / Vol. 9 / No. 07
Keywords
Three-dimensional displays; History; Task analysis; Feature extraction; Fuses; Pipelines; Detectors; Multi-view 3D object detection; recurrent network and long-term temporal fusion;
DOI
10.1109/LRA.2024.3401172
Chinese Library Classification (CLC) number
TP24 [Robotics];
Discipline classification code
080202 ; 1405 ;
Abstract
Long-term temporal fusion is a crucial but often overlooked technique in camera-based Bird's-Eye-View (BEV) 3D perception. Most existing methods fuse frames in a parallel manner. While parallel fusion can benefit from long-term information, it suffers from increasing computational and memory overheads as the fusion window size grows. Alternatively, BEVFormer adopts a recurrent fusion pipeline so that history information can be efficiently integrated, yet it fails to benefit from longer temporal frames. In this letter, we explore an embarrassingly simple long-term recurrent fusion strategy built upon the LSS-based methods and find it already able to enjoy the merits of both sides, i.e., rich long-term information and an efficient fusion pipeline. A temporal embedding module is further proposed to improve the model's robustness against occasionally missed frames in practical scenarios. We name this simple but effective fusion pipeline VideoBEV. Experimental results on the nuScenes benchmark show that VideoBEV obtains strong performance on various camera-based 3D perception tasks, including object detection (55.4% mAP and 62.9% NDS), segmentation (48.6% vehicle mIoU), tracking (54.8% AMOTA), and motion prediction (0.80 m minADE and 0.463 EPA).
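The contrast the abstract draws between parallel and recurrent fusion can be illustrated with a toy sketch. This is not the VideoBEV implementation: the class names, the `fuse` blend, and the list-of-floats "BEV features" are all hypothetical stand-ins, and ego-motion alignment and the temporal embedding module are omitted. The sketch only shows why a parallel window must store every frame it fuses, while a recurrent pipeline carries a single fused history feature.

```python
from collections import deque


def fuse(history, current, weight=0.5):
    """Hypothetical stand-in for a learned fusion layer: blend two feature vectors."""
    return [weight * h + (1.0 - weight) * c for h, c in zip(history, current)]


class ParallelFusion:
    """Keeps the last `window` frames and fuses all of them at every step."""

    def __init__(self, window):
        self.frames = deque(maxlen=window)

    def step(self, current):
        self.frames.append(current)
        # Work and memory both scale with the number of frames in the window.
        n = len(self.frames)
        return [sum(f[i] for f in self.frames) / n for i in range(len(current))]

    def stored_frames(self):
        return len(self.frames)


class RecurrentFusion:
    """Keeps one fused history feature, regardless of sequence length."""

    def __init__(self):
        self.state = None

    def step(self, current):
        # The first frame initializes the history; later frames are blended into it.
        if self.state is None:
            self.state = list(current)
        else:
            self.state = fuse(self.state, current)
        return self.state

    def stored_frames(self):
        return 0 if self.state is None else 1


# Toy sequence of eight 2-dimensional "BEV features" (assumed data).
frames = [[float(t), float(t) * 2.0] for t in range(8)]
parallel, recurrent = ParallelFusion(window=8), RecurrentFusion()
for frame in frames:
    parallel_out = parallel.step(frame)
    recurrent_out = recurrent.step(frame)

print("parallel stores", parallel.stored_frames(), "frames")
print("recurrent stores", recurrent.stored_frames(), "fused feature")
```

In this sketch the recurrent state still depends on every past frame, with older frames weighted less, while its memory footprint stays constant as the sequence grows.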
Pages: 6544-6551
Page count: 8
Related papers
  • [1] Multi-View Hierarchical Fusion Network for 3D Object Retrieval and Classification
    Liu, An-An
    Hu, Nian
    Song, Dan
    Guo, Fu-Bin
    Zhou, He-Yu
    Hao, Tong
    IEEE ACCESS, 2019, 7 : 153021 - 153030
  • [2] Dynamic View Aggregation for Multi-View 3D Shape Recognition
    Zhou, Yuan
    Sun, Zhongqi
    Huo, Shuwei
    Kung, Sun-Yuan
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 9163 - 9174
  • [3] Adaptive Multi-View and Temporal Fusing Transformer for 3D Human Pose Estimation
    Shuai, Hui
    Wu, Lele
    Liu, Qingshan
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (04) : 4122 - 4135
  • [4] X-View: Non-Egocentric Multi-View 3D Object Detector
    Xie, Liang
    Xu, Guodong
    Cai, Deng
    He, Xiaofei
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2023, 32 : 1488 - 1497
  • [5] MVF-GNN: Multi-View Fusion With GNN for 3D Semantic Segmentation
    Du, Zhenxiang
    Ren, Minglun
    Chu, Wei
    Chen, Nengying
    IEEE ROBOTICS AND AUTOMATION LETTERS, 2025, 10 (04) : 3262 - 3269
  • [6] Learning Disentangled Representation for Multi-View 3D Object Recognition
    Huang, Jingjia
    Yan, Wei
    Li, Ge
    Li, Thomas
    Liu, Shan
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (02) : 646 - 659
  • [7] Recognition of 3D Object Based on Multi-View Recurrent Neural Networks
    Dong S.
    Li W.-S.
    Zhang W.-Q.
    Zou K.
    Dianzi Keji Daxue Xuebao/Journal of the University of Electronic Science and Technology of China, 2020, 49 (02) : 269 - 275
  • [8] An Improved Multi-View Convolutional Neural Network for 3D Object Retrieval
    He, Xinwei
    Bai, Song
    Chu, Jiajia
    Bai, Xiang
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2020, 29 : 7917 - 7930
  • [9] MVPointNet: Multi-View Network for 3D Object Based on Point Cloud
    Zhou, Weiguo
    Jiang, Xin
    Liu, Yun-Hui
    IEEE SENSORS JOURNAL, 2019, 19 (24) : 12145 - 12152
  • [10] DRCNN: Dynamic Routing Convolutional Neural Network for Multi-View 3D Object Recognition
    Sun, Kai
    Zhang, Jiangshe
    Liu, Junmin
    Yu, Ruixuan
    Song, Zengjie
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 : 868 - 877