TDOcc: Exploit machine learning and big data in multi-view 3D occupancy prediction

Cited by: 1
Authors
Shan, Chun [1 ,2 ]
Zeng, Jian [1 ,2 ]
Liu, Hongming [1 ,2 ,3 ]
Chen, Chuixing [3 ]
Du, Xiaojiang [4 ]
Guizani, Mohsen [5 ]
Affiliations
[1] Guangdong Polytech Normal Univ, Sch Elect & Informat, Guangzhou 510665, Peoples R China
[2] Guangdong Polytech Normal Univ, Guangdong Prov Key Lab Intellectual Property & Big, Guangzhou 510665, Peoples R China
[3] Guangzhou Inst Sci & Technol, Sch Informat & Optoelect Engn, Guangzhou 510540, Peoples R China
[4] Temple Univ, Dept Comp & Informat Sci, Philadelphia, PA 19122 USA
[5] Mohamed bin Zayed Univ Artificial Intelligence, Machine Learning Dept, Abu Dhabi, U Arab Emirates
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2025, Vol. 164
Keywords
Big data; Machine learning; 3D occupancy prediction
DOI
10.1016/j.future.2024.107583
Chinese Library Classification (CLC)
TP301 [Theory and Methods]
Subject Classification Code
081202
Abstract
With the advancement of machine learning and big data technologies, methods based on the Bird's Eye View (BEV) have recently achieved significant breakthroughs in multi-view 3D occupancy prediction. However, BEV-centric 3D occupancy prediction still struggles with feature representation and annotation costs in complex open environments. To overcome these issues and further advance 3D occupancy tasks, this study introduces a novel framework termed TDOcc. Leveraging multi-camera imagery, TDOcc performs 3D semantic occupancy prediction by learning directly in the raw 3D space, thereby maximizing information retention. TDOcc offers two notable advantages. First, it uses dense occupancy labels, which not only enable robust dense occupancy inference but also support comprehensive estimation of objects in the scene. Second, it exploits historical feature information by aligning past and present features through temporal cues, strengthening the feature fusion module. In addition, to address the ill-posed nature of camera-based 3D occupancy prediction, we introduce an enhancement module that operates in the 3D feature space and is applied during training to increase the model's learning capacity. Extensive experiments on the widely used nuScenes dataset demonstrate the effectiveness of our approach. Compared with the recent TPVFormer and OccFormer, our approach improves mean Intersection over Union (mIoU) by 2.0 and 0.8 points, respectively, and reaches performance comparable to state-of-the-art LiDAR-based methods.
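The record gives only the abstract, so the following is a minimal illustrative sketch rather than the authors' implementation. Aligning past and present features through temporal cues is commonly realized by warping the previous frame's voxel feature volume into the current ego frame using the relative ego pose and then fusing the two volumes; the function names, tensor shapes, and grid conventions below are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def warp_voxel_features(prev_feat, rel_pose, voxel_origin, voxel_size):
    """Warp the previous frame's voxel features into the current ego frame.

    prev_feat:    (B, C, Z, Y, X) feature volume from the previous frame.
    rel_pose:     (4, 4) transform mapping current-frame points into the
                  previous ego frame (an illustrative convention).
    voxel_origin: (3,) metric coordinates (x, y, z) of the grid corner.
    voxel_size:   scalar edge length of a voxel in meters.
    """
    B, C, Z, Y, X = prev_feat.shape
    device = prev_feat.device

    # Metric coordinates of current-frame voxel centers.
    gz, gy, gx = torch.meshgrid(
        torch.arange(Z, device=device),
        torch.arange(Y, device=device),
        torch.arange(X, device=device),
        indexing="ij",
    )
    centers = torch.stack([gx, gy, gz], dim=-1).float()          # (Z, Y, X, 3)
    centers = (centers + 0.5) * voxel_size + voxel_origin

    # Transform current voxel centers into the previous ego frame.
    ones = torch.ones(*centers.shape[:-1], 1, device=device)
    homo = torch.cat([centers, ones], dim=-1).reshape(-1, 4)      # (N, 4)
    prev_xyz = (homo @ rel_pose.T)[:, :3].reshape(Z, Y, X, 3)

    # Normalize to [-1, 1] for grid_sample (x, y, z order).
    extent = torch.tensor([X, Y, Z], device=device, dtype=torch.float32)
    norm = (prev_xyz - voxel_origin) / (extent * voxel_size) * 2.0 - 1.0
    grid = norm.unsqueeze(0).expand(B, -1, -1, -1, -1)            # (B, Z, Y, X, 3)

    # Trilinear sampling; voxels outside the previous grid become zero.
    return F.grid_sample(prev_feat, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=False)
```

The reported gains are measured in mIoU over semantic classes. As a point of reference, a standard per-class computation on integer voxel label volumes looks like the generic sketch below (the paper's exact evaluation protocol may differ):

```python
def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Average IoU over classes, skipping voxels labeled ignore_index in gt."""
    keep = gt != ignore_index
    ious = []
    for c in range(num_classes):
        inter = ((pred == c) & (gt == c) & keep).sum()
        union = (((pred == c) | (gt == c)) & keep).sum()
        if union > 0:
            ious.append(inter.item() / union.item())
    return sum(ious) / max(len(ious), 1)
```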
Pages: 12
References
37 in total
  • [1] Mescheder L., Oechsle M., Niemeyer M., Nowozin S., Geiger A., Occupancy networks: Learning 3d reconstruction in function space, pp. 4460-4470, (2019)
  • [2] Peng S., Niemeyer M., Mescheder L., Pollefeys M., Geiger A., Convolutional occupancy networks, Computer Vision–ECCV 2020, Part III, pp. 523-540, (2020)
  • [3] Hu Y., Yang J., Chen L., Li K., Sima C., Zhu X., Chai S., Du S., Lin T., Wang W., et al., Planning-oriented autonomous driving, pp. 17853-17862, (2023)
  • [4] Tong W., Sima C., Wang T., Chen L., Wu S., Deng H., Gu Y., Lu L., Luo P., Lin D., et al., Scene as occupancy, pp. 8406-8415, (2023)
  • [5] Huang J., Huang G., Zhu Z., Ye Y., Du D., Bevdet: High-performance multi-camera 3d object detection in bird-eye-view, (2021)
  • [6] Li Z., Wang W., Li H., Xie E., Sima C., Lu T., Qiao Y., Dai J., Bevformer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers, European Conference on Computer Vision, pp. 1-18, (2022)
  • [7] Huang Y., Zheng W., Zhang Y., Zhou J., Lu J., Tri-perspective view for vision-based 3d semantic occupancy prediction, pp. 9223-9232, (2023)
  • [8] Zuo S., Zheng W., Huang Y., Zhou J., Lu J., Pointocc: Cylindrical tri-perspective view for point-based 3d semantic occupancy prediction, (2023)
  • [9] Zhang Y., Zhu Z., Du D., Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction, pp. 9433-9443, (2023)
  • [10] Caesar H., Bankiti V., Lang A.H., Vora S., Liong V.E., Xu Q., Krishnan A., Pan Y., Baldan G., Beijbom O., nuscenes: A multimodal dataset for autonomous driving, pp. 11621-11631, (2020)