Exploring Recurrent Long-Term Temporal Fusion for Multi-View 3D Perception

Cited by: 2
Authors
Han, Chunrui [1]
Yang, Jinrong [2]
Sun, Jianjian [1]
Ge, Zheng [1]
Dong, Runpei [3]
Zhou, Hongyu [1]
Mao, Weixin [4]
Peng, Yuang [5]
Zhang, Xiangyu [1]
Affiliations
[1] Megvii Technol, Beijing 100080, Peoples R China
[2] Huazhong Univ Sci & Technol, Wuhan 430074, Peoples R China
[3] Xi An Jiao Tong Univ, Beijing 100084, Peoples R China
[4] Waseda Univ, Fukuoka 8070832, Japan
[5] Tsinghua Univ, Jian 343200, Peoples R China
Source
IEEE ROBOTICS AND AUTOMATION LETTERS | 2024 / Vol. 9 / No. 07
Keywords
Three-dimensional displays; History; Task analysis; Feature extraction; Fuses; Pipelines; Detectors; Multi-view 3D object detection; recurrent network and long-term temporal fusion
DOI
10.1109/LRA.2024.3401172
Chinese Library Classification (CLC) number
TP24 [Robotics]
Discipline classification code
080202; 1405
Abstract
Long-term temporal fusion is a crucial but often overlooked technique in camera-based Bird's-Eye-View (BEV) 3D perception. Most existing methods fuse frames in a parallel manner. While parallel fusion can benefit from long-term information, it suffers from increasing computational and memory overheads as the fusion window size grows. Alternatively, BEVFormer adopts a recurrent fusion pipeline so that history information can be efficiently integrated, yet it fails to benefit from longer temporal frames. In this letter, we explore an embarrassingly simple long-term recurrent fusion strategy built upon the LSS-based methods and find that it already enjoys the merits of both sides, i.e., rich long-term information and an efficient fusion pipeline. A temporal embedding module is further proposed to improve the model's robustness against occasionally missed frames in practical scenarios. We name this simple but effective fusion pipeline VideoBEV. Experimental results on the nuScenes benchmark show that VideoBEV obtains strong performance on various camera-based 3D perception tasks, including object detection (55.4% mAP and 62.9% NDS), segmentation (48.6% vehicle mIoU), tracking (54.8% AMOTA), and motion prediction (0.80 m minADE and 0.463 EPA).
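The contrast the abstract draws between parallel and recurrent fusion can be illustrated with a toy sketch. This is not the VideoBEV implementation: the function names, the plain averaging, the exponential moving average, and the `keep` parameter are all assumptions made for illustration, and the sketch omits everything specific to the paper, such as BEV feature extraction and the temporal embedding module. It only shows why a windowed scheme must hold `window` frames while a recurrent scheme holds a single running state.

```python
from collections import deque


def parallel_fusion(frames, window):
    """Toy parallel fusion: average each frame with up to `window - 1` earlier ones.

    The buffer holds `window` frames, so memory and per-step work grow
    with the window size. (Hypothetical helper, not from the paper.)
    """
    buffer = deque(maxlen=window)
    fused = []
    for feat in frames:
        buffer.append(feat)
        # Element-wise mean over every frame currently in the window.
        fused.append([sum(vals) / len(buffer) for vals in zip(*buffer)])
    return fused


def recurrent_fusion(frames, keep=0.5):
    """Toy recurrent fusion: blend each frame into one running state.

    Only a single state is stored, so the cost per step is constant no
    matter how much history has been absorbed. `keep` is an assumed
    blending weight, not a parameter from the paper.
    """
    state = None
    fused = []
    for feat in frames:
        if state is None:
            state = list(feat)  # first frame initialises the state
        else:
            state = [keep * s + (1.0 - keep) * x for s, x in zip(state, feat)]
        fused.append(state)
    return fused


if __name__ == "__main__":
    # Three tiny "feature maps", each flattened to two values.
    frames = [[0.0, 4.0], [2.0, 0.0], [4.0, 2.0]]
    print(parallel_fusion(frames, window=2))
    print(recurrent_fusion(frames, keep=0.5))
```

In this sketch the recurrent state still carries a decayed contribution from every earlier frame, whereas the windowed average forgets anything older than `window` frames unless the buffer is enlarged.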
Pages: 6544-6551
Number of pages: 8