Self-supervised Vision Transformers for 3D pose estimation of novel objects

Cited by: 1
Authors
Thalhammer, Stefan [1 ]
Weibel, Jean-Baptiste [1 ]
Vincze, Markus [1 ]
Garcia-Rodriguez, Jose [2 ]
Affiliations
[1] TU Wien, Automation & Control Institute, Gusshausstrasse 27-29, A-1040 Vienna, Austria
[2] University of Alicante, Department of Computer Technology, Carretera San Vicente del Raspeig, 03690 Alicante, Spain
Keywords
Object pose estimation; Template matching; Vision transformer; Self-supervised learning
DOI
10.1016/j.imavis.2023.104816
CLC number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Object pose estimation is important for object manipulation and scene understanding. To improve the general applicability of pose estimators, recent research focuses on providing estimates for novel objects, that is, objects unseen during training. Such works use deep template matching strategies to retrieve the template closest to a query image, which implicitly provides the object class and pose. Despite the recent success of Vision Transformers over CNNs on many vision tasks, the state of the art in novel object pose estimation still relies on CNN-based approaches. This work evaluates and demonstrates the differences between self-supervised CNNs and Vision Transformers for deep template matching. Specifically, both types of approaches are trained with contrastive learning to match training images against rendered templates of isolated objects. At test time, these templates are matched against query images of known and novel objects under challenging conditions, such as clutter, occlusion, and object symmetries, using masked cosine similarity. The presented results demonstrate not only that Vision Transformers improve matching accuracy over CNNs, but also that in some cases pre-trained Vision Transformers achieve this improvement without fine-tuning. Furthermore, we highlight the differences in optimization and network architecture when comparing these two types of networks for deep template matching.
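The abstract describes matching masked query crops against rendered templates using cosine similarity over self-supervised ViT features. Below is a minimal sketch of that matching step, assuming a frozen DINO ViT-S/16 backbone loaded from torch.hub and global CLS-token descriptors; the paper's exact backbone, feature layer, and masking scheme may differ, and `embed` and `masked_cosine_match` are illustrative helper names, not the authors' code.

```python
# Sketch: masked-cosine-similarity template matching with a self-supervised ViT.
# Backbone choice (DINO ViT-S/16) and mask handling are illustrative assumptions.
import torch
import torch.nn.functional as F

# Load a frozen self-supervised ViT; any pre-trained feature extractor works here.
backbone = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
backbone.eval()

@torch.no_grad()
def embed(images: torch.Tensor) -> torch.Tensor:
    """Return L2-normalized global descriptors for a batch of images (B, 3, H, W)."""
    feats = backbone(images)            # (B, D) CLS-token embeddings
    return F.normalize(feats, dim=-1)   # unit norm, so dot product = cosine similarity

@torch.no_grad()
def masked_cosine_match(query: torch.Tensor,
                        query_mask: torch.Tensor,
                        templates: torch.Tensor) -> int:
    """Match one query crop against N rendered templates.

    query:      (3, H, W) crop containing the target object
    query_mask: (1, H, W) binary mask suppressing background and clutter
    templates:  (N, 3, H, W) renderings of isolated objects
    Returns the index of the best-matching template, which implicitly
    carries the object class and 3D pose of that rendering.
    """
    masked_query = query * query_mask              # zero out cluttered pixels
    q = embed(masked_query.unsqueeze(0))           # (1, D)
    t = embed(templates)                           # (N, D)
    sims = (q @ t.T).squeeze(0)                    # cosine similarity per template
    return int(sims.argmax().item())
```

In this sketch, the retrieved index maps back to the known class and viewpoint of the rendered template, which is how template retrieval implicitly yields a pose estimate for objects never seen during training.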
Pages: 9