Self-supervised Vision Transformers for 3D pose estimation of novel objects

Cited by: 1
Authors
Thalhammer, Stefan [1]
Weibel, Jean-Baptiste [1]
Vincze, Markus [1]
Garcia-Rodriguez, Jose [2]
Affiliations
[1] TU Wien, Automat & Control Inst, Gusshausstr 27-29, A-1040 Vienna, Austria
[2] Univ Alicante, Dept Comp Technol, Carr San Vicente del Raspeig, Alicante 03690, Spain
Keywords
Object pose estimation; Template matching; Vision transformer; Self-supervised learning
DOI
10.1016/j.imavis.2023.104816
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Object pose estimation is important for object manipulation and scene understanding. To improve the general applicability of pose estimators, recent research focuses on providing estimates for novel objects, that is, objects unseen during training. Such works use deep template matching strategies to retrieve the template closest to a query image, which implicitly provides object class and pose. Despite the recent success and improvements of Vision Transformers over CNNs for many vision tasks, the state of the art uses CNN-based approaches for novel object pose estimation. This work evaluates and demonstrates the differences between self-supervised CNNs and Vision Transformers for deep template matching. Specifically, both types of approaches are trained using contrastive learning to match training images against rendered templates of isolated objects. At test time, such templates are matched against query images of known and novel objects under challenging settings, such as clutter, occlusion and object symmetries, using masked cosine similarity. The presented results demonstrate not only that Vision Transformers improve matching accuracy over CNNs but also that, in some cases, pre-trained Vision Transformers do not need fine-tuning to achieve the improvement. Furthermore, we highlight the differences in optimization and network architecture when comparing these two types of networks for deep template matching.
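The test-time matching step described in the abstract can be illustrated with a small sketch. The abstract states only that templates are matched to query images using masked cosine similarity; everything else here is an assumption made for illustration: the per-patch feature layout, the use of the template's object mask to select patches, the averaging over masked patches, and the hypothetical names `cosine`, `masked_cosine_similarity` and `retrieve_template`. It is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def masked_cosine_similarity(query_feats, template_feats, template_mask):
    """Mean patch-wise cosine similarity over patches where the mask is set.

    Assumption: the mask marks patches covered by the rendered object, so
    background clutter in the query does not contribute to the score.
    """
    scores = [
        cosine(q, t)
        for q, t, keep in zip(query_feats, template_feats, template_mask)
        if keep
    ]
    return sum(scores) / len(scores) if scores else 0.0


def retrieve_template(query_feats, templates):
    """Return (index, score) of the best-matching (features, mask) template."""
    scored = [masked_cosine_similarity(query_feats, f, m) for f, m in templates]
    best = max(range(len(scored)), key=scored.__getitem__)
    return best, scored[best]


# Toy example: three patches with 2-D features; the third query patch is clutter.
query = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
templates = [
    ([[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]], [True, True, False]),
    ([[2.0, 0.0], [0.0, 3.0], [0.0, 0.0]], [True, True, False]),
]
best, score = retrieve_template(query, templates)
print(best, score)
```

In this toy setup the second template is retrieved, because its masked patches point in the same directions as the query's. In a template matching pipeline each template would be rendered from a known object and viewpoint, so the retrieved index implicitly provides object class and pose.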
Pages: 9