TFIENet: Transformer Fusion Information Enhancement Network for Multimodal 3-D Object Detection

Times Cited: 0
Authors
Cao, Feng [1 ]
Jin, Yufeng [2 ]
Tao, Chongben [2 ,3 ]
Luo, Xizhao [4 ]
Gao, Zhen [5 ]
Zhang, Zufeng [6 ]
Zheng, Sifa [6 ]
Zhu, Yuan [7 ]
Affiliations
[1] Shanxi Univ, Sch Comp & Informat Technol, Taiyuan 030006, Peoples R China
[2] Suzhou Univ Sci & Technol, Sch Elect & Informat Engn, Suzhou 215009, Peoples R China
[3] Tsinghua Univ, Suzhou Automobile Res Inst, Suzhou 215134, Peoples R China
[4] Soochow Univ, Sch Comp Sci & Technol, Suzhou 215006, Peoples R China
[5] McMaster Univ, Fac Engn, Hamilton, ON L8S 0A, Canada
[6] Tsinghua Univ, Suzhou Automot Res Inst, Beijing 100084, Peoples R China
[7] Tongji Univ, Coll Automot Studies, Shanghai 201804, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Feature extraction; Transformers; Laser radar; Three-dimensional displays; Object detection; Point cloud compression; Cameras; 3-D object detection; autonomous driving; depth complementation; sensor fusion; transformer;
DOI
10.1109/TIM.2024.3451586
CLC Number
TM [Electrical Engineering]; TN [Electronic and Communication Technology];
Discipline Code
0808 ; 0809 ;
Abstract
During feature-level data fusion in 3-D object detection, misalignment between modalities destroys the correlation between the different modal data, which leads to inaccurate localization of small targets at long distances. To address this problem, a transformer fusion information enhancement network (TFIENet) is proposed. First, the raw point cloud and color images are taken as input and passed through standard feature-extraction backbones to obtain LiDAR point cloud features and image features, respectively. Second, a region proposal network with transformer dual-fusion features is designed, which uses a deformable transformer decoder to fuse the extracted LiDAR point cloud features and image features twice via a deformable attention mechanism. The dual-domain feature information of the LiDAR and camera is then aggregated to generate initial candidate boxes. Next, a feature information enhancement module refines these boxes: it predicts dense depth features using a depth-completion mechanism, and the corresponding dense depth information and semantic feature information are extracted to complete the box refinement. Finally, to align and fuse feature information from different modalities effectively, a multimodal feature cross-attention module (MFCAM) is designed, in which a dynamic cross-attention mechanism captures the correlation between the modalities. Experimental results on the KITTI, nuScenes, and Waymo datasets demonstrate the generality and effectiveness of the proposed TFIENet method, and extensive ablation experiments demonstrate the contribution of each individual module. Experimental results on a real road dataset show that the TFIENet algorithm is robust in complex real-world road environments.
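The MFCAM described in the abstract aligns the two modalities with cross-attention, where features from one sensor query features from the other. The record does not give the module's architecture, so the following is only a minimal single-head sketch of the core operation: the names `cross_attention`, `lidar_feats`, and `image_feats` are illustrative, and the learned query/key/value projections, multi-head structure, and dynamic weighting of the actual MFCAM are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(lidar_feats, image_feats):
    """LiDAR features attend to image features (single head, no learned weights).

    lidar_feats: (N, d) point-cloud features acting as queries.
    image_feats: (M, d) image features acting as keys and values.
    Returns an (N, d) image context vector aligned to each LiDAR feature.
    """
    d = lidar_feats.shape[-1]
    scores = lidar_feats @ image_feats.T / np.sqrt(d)  # (N, M) similarity
    attn = softmax(scores, axis=-1)                    # each row sums to 1
    return attn @ image_feats                          # weighted image context

rng = np.random.default_rng(0)
lidar = rng.normal(size=(5, 16))   # 5 LiDAR proposal features
image = rng.normal(size=(8, 16))   # 8 image patch features
fused = lidar + cross_attention(lidar, image)  # residual fusion of modalities
```

In a trained network the queries, keys, and values would come from learned linear projections, and the attention weights would provide exactly the cross-modal correlation the abstract refers to; the residual sum here stands in for whatever fusion head the paper actually uses.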
Pages: 13
References
64 references in total
[1]   TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers [J].
Bai, Xuyang ;
Hu, Zeyu ;
Zhu, Xinge ;
Huang, Qingqiu ;
Chen, Yilun ;
Fu, Hangbo ;
Tai, Chiew-Lan .
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, :1080-1089
[2]   nuScenes: A multimodal dataset for autonomous driving [J].
Caesar, Holger ;
Bankiti, Varun ;
Lang, Alex H. ;
Vora, Sourabh ;
Liong, Venice Erin ;
Xu, Qiang ;
Krishnan, Anush ;
Pan, Yu ;
Baldan, Giancarlo ;
Beijbom, Oscar .
2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2020), 2020, :11618-11628
[3]   MCHFormer: A Multi-Cross Hybrid Former of Point-Image for 3D Object Detection [J].
Cao, Feng ;
Xue, Jun ;
Tao, Chongben ;
Luo, Xizhao ;
Gao, Zhen ;
Zhang, Zufeng ;
Zheng, Sifa ;
Zhu, Yuan .
IEEE TRANSACTIONS ON INTELLIGENT VEHICLES, 2024, 9 (01) :383-394
[4]  
Cao JieCheng, 2024, IEEE Transactions on Artificial Intelligence, P254, DOI 10.1109/TAI.2023.3237787
[5]   Vision-Enhanced and Consensus-Aware Transformer for Image Captioning [J].
Cao, Shan ;
An, Gaoyun ;
Zheng, Zhenxing ;
Wang, Zhiyong .
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (10) :7005-7018
[6]  
Carion N., 2020, ECCV, P213
[7]   A Transformer-Based Feature Segmentation and Region Alignment Method for UAV-View Geo-Localization [J].
Dai, Ming ;
Hu, Jianhong ;
Zhuang, Jiedong ;
Zheng, Enhui .
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (07) :4376-4389
[8]  
Deng JJ, 2021, AAAI CONF ARTIF INTE, V35, P1201
[9]   Hierarchical Feature Aggregation Based on Transformer for Image-Text Matching [J].
Dong, Xinfeng ;
Zhang, Huaxiang ;
Zhu, Lei ;
Nie, Liqiang ;
Liu, Li .
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (09) :6437-6447
[10]   JTEA: A Joint Trajectory Tracking and Estimation Approach for Low-Observable Micro-UAV Monitoring With 4-D Radar [J].
Fang, Xin ;
He, Min ;
Huang, Darong ;
Zhang, Zhenyuan ;
Ge, Liang ;
Xiao, Guoqing .
IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, 2024, 73 :1-14