A State Space Model for Multiobject Full 3-D Information Estimation From RGB-D Images

Cited by: 0
Authors
Zhou, Jiaming [1 ]
Zhu, Qing [1 ]
Wang, Yaonan [1 ]
Feng, Mingtao [2 ]
Liu, Jian [1 ]
Huang, Jianan [1 ]
Mian, Ajmal [3 ]
Affiliations
[1] Hunan Univ, Coll Elect & Informat Engn, Natl Engn Res Ctr Robot Visual Percept & Control, Changsha 410082, Peoples R China
[2] Xidian Univ, Sch Artificial Intelligence, Xian 710071, Peoples R China
[3] Univ Western Australia, Dept Comp Sci & Software Engn, Perth, WA 6009, Australia
Funding
Australian Research Council; National Natural Science Foundation of China;
Keywords
Shape; Three-dimensional displays; Solid modeling; Computational modeling; Image reconstruction; Codes; Accuracy; Visualization; Point cloud compression; Head; Mamba; pose estimation; shape reconstruction; state space model (SSM);
DOI
Not available
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Visual understanding of 3-D objects is essential for robotic manipulation, autonomous navigation, and augmented reality. However, existing methods struggle to perform this task efficiently and accurately in an end-to-end manner. We propose a single-shot method based on the state space model (SSM) that predicts the full 3-D information (pose, size, and shape) of multiple objects from a single RGB-D image in an end-to-end manner. Our method first encodes long-range semantic information from the RGB and depth images separately and then combines them into an integrated latent representation, which a modified SSM processes to infer the full 3-D information through two task heads within a unified model. A heatmap/detection head predicts object centers, and a 3-D information head predicts a matrix encoding the pose, size, and latent shape code of each detected object. We also propose an SSM-based shape autoencoder that learns canonical shape codes from a large database of 3-D point cloud shapes. The end-to-end framework, the modified SSM block, and the SSM-based shape autoencoder are the major contributions of this work. Our design includes scan strategies tailored to different input data representations, such as RGB-D images and point clouds. Extensive evaluations on the REAL275, CAMERA25, and Wild6D datasets show that our method achieves state-of-the-art performance. On the large-scale Wild6D dataset, our model significantly outperforms the nearest competitor, with improvements of 2.6% on the IoU-50 metric and 5.1% on the 5° 10 cm metric.
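The two SSM ingredients the abstract mentions, a sequential scan over tokens and scan strategies adapted to image inputs, can be sketched as follows. This is a minimal illustrative sketch only: it uses a fixed Euler-discretized linear SSM and a generic four-direction cross-scan; the function names, parameters, and discretization are assumptions, not the paper's Mamba-style selective implementation.

```python
import numpy as np

def ssm_scan(x, A, B, C, dt=0.1):
    """Run a linear state space model over a 1-D token sequence.

    Illustrative Euler discretization of h' = A h + B x, y = C h.
    A Mamba-style selective SSM would instead use learned,
    input-dependent parameters (hypothetical simplification here).
    """
    d_state = A.shape[0]
    A_bar = np.eye(d_state) + dt * A  # discretized state transition
    B_bar = dt * B                    # discretized input projection
    h = np.zeros(d_state)
    ys = []
    for x_t in x:                     # sequential scan over tokens
        h = A_bar @ h + B_bar @ x_t
        ys.append(C @ h)
    return np.stack(ys)

def cross_scan_2d(feat):
    """Flatten an (H, W, C) feature map into four 1-D token orders
    (row-major, column-major, and their reverses) so a 1-D scan can
    gather 2-D context from several directions. A common trick in
    image SSMs; the paper's exact scan strategies may differ.
    """
    H, W, C = feat.shape
    rows = feat.reshape(H * W, C)                     # left-right, top-down
    cols = feat.transpose(1, 0, 2).reshape(H * W, C)  # top-down, left-right
    return np.stack([rows, rows[::-1], cols, cols[::-1]])
```

For example, a 4x5 feature map with 3 channels yields four sequences of 20 tokens each, and each sequence can then be fed through `ssm_scan` independently before the directional outputs are merged.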
Pages: 2248-2260 (13 pages)
Related Papers
44 entries in total
  • [1] Avetisyan Armen, 2020, Computer Vision - ECCV 2020 16th European Conference. Proceedings. Lecture Notes in Computer Science (LNCS 12367), P596, DOI 10.1007/978-3-030-58542-6_36
  • [2] Brachmann E, 2014, LECT NOTES COMPUT SC, V8690, P536, DOI 10.1007/978-3-319-10605-2_35
  • [3] Chen DS, 2020, PROC CVPR IEEE, P11970, DOI 10.1109/CVPR42600.2020.01199
  • [4] Chen K., 2024, ARXIV
  • [5] Chen Lin, Wang Yaonan, Miao Zhiqiang, Feng Mingtao, Zhou Zhen, Wang Hesheng, Wang Danwei, Toward Safe Distributed Multi-Robot Navigation Coupled With Variational Bayesian Model, IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, 2024, 21(04): 7583-7598
  • [6] Cong Yang, Chen Ronghan, Ma Bingtao, Liu Hongsen, Hou Dongdong, Yang Chenguang, A Comprehensive Study of 3-D Vision-Based Robot Manipulation, IEEE TRANSACTIONS ON CYBERNETICS, 2023, 53(03): 1682-1698
  • [7] Cong Yang, Tian Dongying, Feng Yun, Fan Baojie, Yu Haibin, Speedup 3-D Texture-Less Object Recognition Against Self-Occlusion for Intelligent Manufacturing, IEEE TRANSACTIONS ON CYBERNETICS, 2019, 49(11): 3887-3897
  • [8] Dosovitskiy A., 2021, PROC INT C LEARN REP, DOI DOI 10.48550/ARXIV.2010.11929
  • [9] Feng M., 2021, P IEEE CVF INT C COM, P3722
  • [10] Feng Mingtao, Hou Haoran, Zhang Liang, Guo Yulan, Yu Hongshan, Wang Yaonan, Mian Ajmal, Exploring Hierarchical Spatial Layout Cues for 3D Point Cloud Based Scene Graph Prediction, IEEE TRANSACTIONS ON MULTIMEDIA, 2025, 27: 731-743