A State Space Model for Multiobject Full 3-D Information Estimation From RGB-D Images

Citations: 0
Authors
Zhou, Jiaming [1]
Zhu, Qing [1]
Wang, Yaonan [1]
Feng, Mingtao [2]
Liu, Jian [1]
Huang, Jianan [1]
Mian, Ajmal [3]
Affiliations
[1] Hunan Univ, Coll Elect & Informat Engn, Natl Engn Res Ctr Robot Visual Percept & Control, Changsha 410082, Peoples R China
[2] Xidian Univ, Sch Artificial Intelligence, Xian 710071, Peoples R China
[3] Univ Western Australia, Dept Comp Sci & Software Engn, Perth, WA 6009, Australia
Funding
Australian Research Council; National Natural Science Foundation of China;
Keywords
Shape; Three-dimensional displays; Solid modeling; Computational modeling; Image reconstruction; Codes; Accuracy; Visualization; Point cloud compression; Head; Mamba; pose estimation; shape reconstruction; state space model (SSM);
DOI
Not available
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Visual understanding of 3-D objects is essential for robotic manipulation, autonomous navigation, and augmented reality. However, existing methods struggle to perform this task both efficiently and accurately in an end-to-end manner. We propose a single-shot method based on the state space model (SSM) that predicts the full 3-D information (pose, size, and shape) of multiple objects from a single RGB-D image. Our method first encodes long-range semantic information from the RGB and depth images separately and then combines them into an integrated latent representation, which a modified SSM processes to infer the full 3-D information through two separate task heads within a unified model. A heatmap (detection) head predicts object centers, and a 3-D information head predicts a matrix detailing the pose, size, and latent shape code of each detected object. We also propose an SSM-based shape autoencoder that learns canonical shape codes from a large database of 3-D point cloud shapes. The end-to-end framework, the modified SSM block, and the SSM-based shape autoencoder are the major contributions of this work. Our design includes scan strategies tailored to the different input data representations, such as RGB-D images and point clouds. Extensive evaluations on the REAL275, CAMERA25, and Wild6D datasets show that our method achieves state-of-the-art performance. On the large-scale Wild6D dataset, our model significantly outperforms the nearest competitor, with improvements of 2.6% and 5.1% on the IoU-50 and 5°10 cm metrics, respectively.
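The two-head output described in the abstract can be illustrated with a minimal decoding sketch. It is not the authors' implementation. It assumes the heatmap is a 2-D grid of center scores and that the 3-D information head stores one flat vector per grid cell, laid out as pose, then size, then shape code. The function names, the 3x3 local-maximum rule, the threshold, and the vector layout (`pose_dim`, `size_dim`) are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

Grid = List[List[float]]


def find_peaks(heatmap: Grid, threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return (row, col) cells that reach the threshold and are strict 3x3 local maxima.

    Assumption: object centers appear as isolated peaks. Flat plateaus
    (equal neighbouring scores) are not reported by this simple rule.
    """
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(h):
        for c in range(w):
            v = heatmap[r][c]
            if v < threshold:
                continue
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(h, r + 2))
                for cc in range(max(0, c - 1), min(w, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(v > n for n in neighbours):
                peaks.append((r, c))
    return peaks


def decode_objects(
    heatmap: Grid,
    info_map: List[List[List[float]]],
    pose_dim: int,
    size_dim: int,
    threshold: float = 0.5,
) -> List[Dict[str, object]]:
    """Read the per-object vector at each detected center and split it.

    Hypothetical layout of each vector: [pose | size | shape code].
    """
    objects = []
    for r, c in find_peaks(heatmap, threshold):
        vec = info_map[r][c]
        objects.append(
            {
                "center": (r, c),
                "score": heatmap[r][c],
                "pose": vec[:pose_dim],
                "size": vec[pose_dim:pose_dim + size_dim],
                "shape_code": vec[pose_dim + size_dim:],
            }
        )
    return objects
```

In this sketch the shape code would still have to be passed through a decoder, such as the shape autoencoder the abstract mentions, to recover a point cloud. That step is omitted here.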
Pages: 2248-2260
Page count: 13