Exploring Recurrent Long-Term Temporal Fusion for Multi-View 3D Perception

Cited by: 2
Authors
Han, Chunrui [1 ]
Yang, Jinrong [2 ]
Sun, Jianjian [1 ]
Ge, Zheng [1 ]
Dong, Runpei [3 ]
Zhou, Hongyu [1 ]
Mao, Weixin [4 ]
Peng, Yuang [5 ]
Zhang, Xiangyu [1 ]
Affiliations
[1] Megvii Technol, Beijing 100080, Peoples R China
[2] Huazhong Univ Sci & Technol, Wuhan 430074, Peoples R China
[3] Xi An Jiao Tong Univ, Xian 710049, Peoples R China
[4] Waseda Univ, Fukuoka 8070832, Japan
[5] Tsinghua Univ, Beijing 100084, Peoples R China
Source
IEEE ROBOTICS AND AUTOMATION LETTERS | 2024, Vol. 9, Issue 07
Keywords
Three-dimensional displays; History; Task analysis; Feature extraction; Fuses; Pipelines; Detectors; Multi-view 3D object detection; recurrent network and long-term temporal fusion;
DOI
10.1109/LRA.2024.3401172
CLC Classification Number
TP24 [Robotics]
Subject Classification Codes
080202; 1405
Abstract
Long-term temporal fusion is a crucial but often overlooked technique in camera-based Bird's-Eye-View (BEV) 3D perception. Existing methods mostly fuse temporal information in a parallel manner. While parallel fusion can benefit from long-term information, it suffers from increasing computational and memory overheads as the fusion window size grows. Alternatively, BEVFormer adopts a recurrent fusion pipeline so that history information can be efficiently integrated, yet it fails to benefit from longer temporal frames. In this letter, we explore an embarrassingly simple long-term recurrent fusion strategy built upon LSS-based methods and find that it already enjoys the merits of both sides, i.e., rich long-term information and an efficient fusion pipeline. A temporal embedding module is further proposed to improve the model's robustness against occasionally missed frames in practical scenarios. We name this simple but effective fusion pipeline VideoBEV. Experimental results on the nuScenes benchmark show that VideoBEV obtains strong performance on various camera-based 3D perception tasks, including object detection (55.4% mAP and 62.9% NDS), segmentation (48.6% vehicle mIoU), tracking (54.8% AMOTA), and motion prediction (0.80 m minADE and 0.463 EPA).
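To illustrate the idea described in the abstract, below is a minimal sketch (not the authors' implementation) of recurrent BEV temporal fusion: the current frame's BEV feature is fused with a single recurrently propagated history feature, so compute and memory stay constant no matter how many past frames have been seen. The module name, channel sizes, the concat-then-conv fusion operator, and the time-gap-conditioned temporal embedding are all assumptions for illustration.

```python
import torch
import torch.nn as nn


class RecurrentBEVFusion(nn.Module):
    """Sketch of recurrent long-term BEV fusion (hypothetical, PyTorch-style)."""

    def __init__(self, channels: int = 256):
        super().__init__()
        # Fuse current and history BEV features concatenated along channels.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Hypothetical temporal embedding: a learned per-channel scale
        # conditioned on the time gap, meant to down-weight stale history
        # when frames were missed.
        self.temporal_embed = nn.Sequential(nn.Linear(1, channels), nn.Sigmoid())

    def forward(self, bev_curr, bev_hist=None, time_gap=None):
        # bev_curr: (B, C, H, W) BEV feature of the current frame.
        # bev_hist: recurrently carried fused feature from the previous step,
        #           assumed already ego-motion-aligned by the caller.
        if bev_hist is None:
            return bev_curr  # first frame: nothing to fuse yet
        if time_gap is not None:
            scale = self.temporal_embed(time_gap.view(-1, 1))  # (B, C)
            bev_hist = bev_hist * scale[..., None, None]
        fused = self.fuse(torch.cat([bev_curr, bev_hist], dim=1))
        return fused  # carried forward as bev_hist for the next frame
```

In this recurrent scheme the history tensor is updated in place each step, in contrast to parallel fusion, which would concatenate all frames in the window and therefore scale in cost with the window length.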
Pages: 6544-6551
Number of pages: 8