Exploring Recurrent Long-Term Temporal Fusion for Multi-View 3D Perception

Cited by: 2
Authors
Han, Chunrui [1]
Yang, Jinrong [2]
Sun, Jianjian [1]
Ge, Zheng [1]
Dong, Runpei [3]
Zhou, Hongyu [1]
Mao, Weixin [4]
Peng, Yuang [5]
Zhang, Xiangyu [1]
Affiliations
[1] Megvii Technol, Beijing 100080, Peoples R China
[2] Huazhong Univ Sci & Technol, Wuhan 430074, Peoples R China
[3] Xi An Jiao Tong Univ, Beijing 100084, Peoples R China
[4] Waseda Univ, Fukuoka 8070832, Japan
[5] Tsinghua Univ, Jian 343200, Peoples R China
Source
IEEE ROBOTICS AND AUTOMATION LETTERS | 2024 / Vol. 9 / No. 07
Keywords
Three-dimensional displays; History; Task analysis; Feature extraction; Fuses; Pipelines; Detectors; Multi-view 3D object detection; recurrent network and long-term temporal fusion;
DOI
10.1109/LRA.2024.3401172
Chinese Library Classification number
TP24 [Robotics];
Discipline classification code
080202 ; 1405 ;
Abstract
Long-term temporal fusion is a crucial but often overlooked technique in camera-based Bird's-Eye-View (BEV) 3D perception. Most existing methods fuse frames in parallel. While parallel fusion can benefit from long-term information, its computational and memory overheads increase as the fusion window grows. Alternatively, BEVFormer adopts a recurrent fusion pipeline so that history information can be integrated efficiently, yet it fails to benefit from longer temporal frames. In this letter, we explore an embarrassingly simple long-term recurrent fusion strategy built upon LSS-based methods and find that it already enjoys the merits of both sides, i.e., rich long-term information and an efficient fusion pipeline. A temporal embedding module is further proposed to improve the model's robustness against occasionally missed frames in practical scenarios. We name this simple but effective fusion pipeline VideoBEV. Experimental results on the nuScenes benchmark show that VideoBEV obtains strong performance on various camera-based 3D perception tasks, including object detection (55.4% mAP and 62.9% NDS), segmentation (48.6% vehicle mIoU), tracking (54.8% AMOTA), and motion prediction (0.80 m minADE and 0.463 EPA).
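The contrast the abstract draws between parallel and recurrent fusion can be illustrated with a toy sketch. This is not the VideoBEV implementation: the function names `parallel_fusion` and `recurrent_fusion`, the `decay` parameter, the averaging rule, and the use of short Python lists in place of BEV feature maps are all assumptions made for illustration, and ego-motion alignment and the temporal embedding module are left out entirely. The only point is that a parallel scheme holds a window of past frames, while a recurrent scheme carries a single state forward.

```python
from collections import deque


def parallel_fusion(frames, window):
    """Toy parallel fusion: average each frame with the previous window - 1 frames.

    Returns the fused outputs and the peak number of frames held in memory,
    which grows with `window`.
    """
    buffer = deque(maxlen=window)  # hypothetical history buffer
    fused, peak = [], 0
    for x in frames:
        buffer.append(x)
        peak = max(peak, len(buffer))
        # Element-wise mean over every frame currently in the window.
        fused.append([sum(col) / len(buffer) for col in zip(*buffer)])
    return fused, peak


def recurrent_fusion(frames, decay=0.5):
    """Toy recurrent fusion: blend each frame into one carried state.

    Returns the fused outputs and the number of states held, which is
    always 1 no matter how many frames have been seen.
    """
    state = None
    fused = []
    for x in frames:
        if state is None:
            state = list(x)  # first frame initialises the state
        else:
            # Assumed update rule: exponential blend of old state and new frame.
            state = [decay * s + (1 - decay) * v for s, v in zip(state, x)]
        fused.append(state)
    return fused, 1


frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(parallel_fusion(frames, window=2))
print(recurrent_fusion(frames, decay=0.5))
```

In this sketch the recurrent state still carries a decayed contribution from every earlier frame, which is the sense in which a recurrent pipeline can reach long-term history at constant memory cost.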
Pages: 6544 - 6551
Page count: 8
Related papers
50 in total
  • [21] Mobile Proxy Caching for Multi-View 3D Videos With Adaptive View Selection
    Yeh, Mengsi
    Wang, Chih-Hang
    Yang, De-Nian
    Lee, Ji-Tang
    Liao, Wanjiun
    IEEE TRANSACTIONS ON MOBILE COMPUTING, 2022, 21 (08) : 2909 - 2921
  • [22] Prior-Guided Multi-View 3D Head Reconstruction
    Wang, Xueying
    Guo, Yudong
    Yang, Zhongqi
    Zhang, Juyong
    IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 4028 - 4040
  • [23] Multi-View Vision Fusion Network: Can 2D Pre-Trained Model Boost 3D Point Cloud Data-Scarce Learning?
    Peng, Haoyang
    Li, Baopu
    Zhang, Bo
    Chen, Xin
    Chen, Tao
    Zhu, Hongyuan
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (07) : 5951 - 5962
  • [24] Deformable convolutional networks for multi-view 3D shape classification
    Ma, Pengfei
    Ma, Jie
    Wang, Xujiao
    Yang, Lichuang
    Wang, Nannan
    ELECTRONICS LETTERS, 2018, 54 (24) : 1373 - 1374
  • [25] Disentangling 3D/4D Facial Affect Recognition With Faster Multi-View Transformer
    Behzad, Muzammil
    Li, Xiaobai
    Zhao, Guoying
    IEEE SIGNAL PROCESSING LETTERS, 2021, 28 : 1913 - 1917
  • [26] Multi-View Tree Structure Learning for 3D Model Retrieval and Classification in Smart City
    Liu, An-An
    Zhao, Zhenlan
    Li, Wenhui
    Song, Dan
    IEEE ACCESS, 2020, 8 : 129743 - 129753
  • [27] Study of 3D Finger Vein Biometrics on Imaging Device Design and Multi-View Verification
    Song, Yizhuo
    Zhao, Pengyang
    Wang, Siqi
    Liao, Qingmin
    Yang, Wenming
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (04) : 3043 - 3048
  • [28] Multi-View 3D Scene Abstraction From Drone-Captured RGB Images
    Jeong, Wooseong
    Kim, Jihun
    Kweon, Hyeokjun
    Yoon, Kuk-Jin
    IEEE ACCESS, 2025, 13 : 27641 - 27656
  • [29] STFNET: Sparse Temporal Fusion for 3D Object Detection in LiDAR Point Cloud
    Meng, Xin
    Zhou, Yuan
    Ma, Jun
    Jiang, Fangdi
    Qi, Yongze
    Wang, Cui
    Kim, Jonghyuk
    Wang, Shifeng
    IEEE SENSORS JOURNAL, 2025, 25 (03) : 5866 - 5877
  • [30] Conditions of a Multi-View 3D Display for Accurate Reproduction of Perceived Glossiness
    Sakano, Yuichi
    Ando, Hiroshi
    IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, 2022, 28 (10) : 3336 - 3350