Uni6D: A Unified CNN Framework without Projection Breakdown for 6D Pose Estimation

Cited by: 16
Authors
Jiang, Xiaoke [1 ]
Li, Donghai [1 ]
Chen, Hao [1 ]
Zheng, Ye [2 ,3 ]
Zhao, Rui [1 ,4 ]
Wu, Liwei [1 ]
Affiliations
[1] SenseTime Res, Shenzhen, Peoples R China
[2] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[3] Univ Chinese Acad Sci, Beijing, Peoples R China
[4] Shanghai Jiao Tong Univ, Qing Yuan Res Inst, Shanghai, Peoples R China
Source
2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR) | 2022
Keywords
RECOGNITION;
DOI
10.1109/CVPR52688.2022.01089
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
As RGB-D sensors become more affordable, using RGB-D images for high-accuracy 6D pose estimation becomes an increasingly attractive option. State-of-the-art approaches typically use different backbones to extract features for RGB and depth images: a 2D CNN for RGB images, a per-pixel point cloud network for depth data, and a fusion network for feature fusion. We find that the essential reason for using two independent backbones is the "projection breakdown" problem. In the depth image plane, the projected 3D structure of the physical world is preserved by the 1D depth value and its built-in 2D pixel coordinate (UV). Any spatial transformation that modifies UV, such as resize, flip, crop, or pooling operations in the CNN pipeline, breaks the binding between the pixel value and its UV coordinate. As a consequence, the 3D structure is no longer preserved by a modified depth image or feature. To address this issue, we propose a simple yet effective method, denoted Uni6D, that explicitly takes the extra UV data along with the RGB-D images as input. Our method uses a unified CNN framework for 6D pose estimation with a single CNN backbone. In particular, the architecture is based on Mask R-CNN with two extra heads: an RT head that directly predicts the 6D pose, and an abc head that, as an auxiliary module, guides the network to map visible points to their coordinates in the 3D model. This end-to-end approach balances simplicity and accuracy, achieving accuracy comparable to the state of the art with 7.2× faster inference speed on the YCB-Video dataset.
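To make the input construction described above concrete, the following is a minimal NumPy sketch (not from the paper) of how RGB, depth, and explicit per-pixel UV coordinate channels could be stacked into one multi-channel tensor for a single CNN backbone; the function name build_uni6d_input and the exact channel ordering are illustrative assumptions.

```python
import numpy as np

def build_uni6d_input(rgb: np.ndarray, depth: np.ndarray) -> np.ndarray:
    """Stack RGB, depth, and per-pixel UV coordinates into one tensor.

    rgb:   (H, W, 3) float array
    depth: (H, W)    float array (metric depth)
    Returns an (H, W, 6) array with channels [R, G, B, D, U, V].
    Carrying UV as explicit data lets spatial ops (resize, crop, flip,
    pooling) transform the coordinates together with the depth values,
    instead of silently breaking the pixel-value/UV binding.
    """
    h, w = depth.shape
    # Pixel-coordinate grids; these play the role of the "extra UV data".
    v, u = np.meshgrid(np.arange(h, dtype=np.float32),
                       np.arange(w, dtype=np.float32),
                       indexing="ij")
    return np.concatenate(
        [rgb.astype(np.float32),
         depth[..., None].astype(np.float32),
         u[..., None],
         v[..., None]],
        axis=-1,
    )

# Example: a 480x640 RGB-D frame becomes a 6-channel input.
rgb = np.zeros((480, 640, 3), dtype=np.float32)
depth = np.ones((480, 640), dtype=np.float32)
x = build_uni6d_input(rgb, depth)
print(x.shape)  # (480, 640, 6)
```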
Pages: 11164-11174
Page count: 11