Exploring Recurrent Long-Term Temporal Fusion for Multi-View 3D Perception

Cited by: 2
Authors
Han, Chunrui [1 ]
Yang, Jinrong [2 ]
Sun, Jianjian [1 ]
Ge, Zheng [1 ]
Dong, Runpei [3 ]
Zhou, Hongyu [1 ]
Mao, Weixin [4 ]
Peng, Yuang [5 ]
Zhang, Xiangyu [1 ]
Affiliations
[1] Megvii Technol, Beijing 100080, Peoples R China
[2] Huazhong Univ Sci & Technol, Wuhan 430074, Peoples R China
[3] Xi An Jiao Tong Univ, Xian 710049, Peoples R China
[4] Waseda Univ, Fukuoka 8070832, Japan
[5] Tsinghua Univ, Beijing 100084, Peoples R China
Source
IEEE ROBOTICS AND AUTOMATION LETTERS | 2024, Vol. 9, Issue 07
Keywords
Three-dimensional displays; History; Task analysis; Feature extraction; Fuses; Pipelines; Detectors; Multi-view 3D object detection; recurrent network and long-term temporal fusion;
DOI
10.1109/LRA.2024.3401172
CLC Classification Number
TP24 [Robotics]
Subject Classification Codes
080202; 1405
Abstract
Long-term temporal fusion is a crucial but often overlooked technique in camera-based Bird's-Eye-View (BEV) 3D perception. Existing methods mostly fuse temporal information in a parallel manner. While parallel fusion can benefit from long-term information, it suffers from increasing computational and memory overheads as the fusion window size grows. Alternatively, BEVFormer adopts a recurrent fusion pipeline so that history information can be efficiently integrated, yet it fails to benefit from longer temporal frames. In this letter, we explore an embarrassingly simple long-term recurrent fusion strategy built upon LSS-based methods and find that it already enjoys the merits of both sides, i.e., rich long-term information and an efficient fusion pipeline. A temporal embedding module is further proposed to improve the model's robustness against occasionally missed frames in practical scenarios. We name this simple but effective fusion pipeline VideoBEV. Experimental results on the nuScenes benchmark show that VideoBEV obtains strong performance on various camera-based 3D perception tasks, including object detection (55.4% mAP and 62.9% NDS), segmentation (48.6% vehicle mIoU), tracking (54.8% AMOTA), and motion prediction (0.80 m minADE and 0.463 EPA).
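To illustrate the idea described in the abstract, below is a minimal sketch (not the authors' implementation) of recurrent BEV temporal fusion: the current frame's BEV feature is fused with a single recurrently propagated history feature, so compute and memory stay constant no matter how many past frames have been seen. The module name, channel sizes, the concat-then-conv fusion operator, and the time-gap-conditioned temporal embedding are all assumptions for illustration.

```python
import torch
import torch.nn as nn


class RecurrentBEVFusion(nn.Module):
    """Sketch of recurrent long-term BEV fusion (hypothetical, PyTorch-style)."""

    def __init__(self, channels: int = 256):
        super().__init__()
        # Fuse current and history BEV features concatenated along channels.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Hypothetical temporal embedding: a learned per-channel scale
        # conditioned on the time gap, meant to down-weight stale history
        # when frames were missed.
        self.temporal_embed = nn.Sequential(nn.Linear(1, channels), nn.Sigmoid())

    def forward(self, bev_curr, bev_hist=None, time_gap=None):
        # bev_curr: (B, C, H, W) BEV feature of the current frame.
        # bev_hist: recurrently carried fused feature from the previous step,
        #           assumed already ego-motion-aligned by the caller.
        if bev_hist is None:
            return bev_curr  # first frame: nothing to fuse yet
        if time_gap is not None:
            scale = self.temporal_embed(time_gap.view(-1, 1))  # (B, C)
            bev_hist = bev_hist * scale[..., None, None]
        fused = self.fuse(torch.cat([bev_curr, bev_hist], dim=1))
        return fused  # carried forward as bev_hist for the next frame
```

In this recurrent scheme the history tensor is updated in place each step, in contrast to parallel fusion, which would concatenate all frames in the window and therefore scale in cost with the window length.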
Pages: 6544-6551
Number of pages: 8