Self-supervised Vision Transformers for 3D pose estimation of novel objects

Cited by: 1
Authors
Thalhammer, Stefan [1]
Weibel, Jean-Baptiste [1]
Vincze, Markus [1]
Garcia-Rodriguez, Jose [2]
Affiliations
[1] TU Wien, Automat & Control Inst, Gusshausstr 27-29, A-1040 Vienna, Austria
[2] Univ Alicante, Dept Comp Technol, Carr San Vicente del Raspeig, Alicante 03690, Spain
Keywords
Object pose estimation; Template matching; Vision transformer; Self-supervised learning
DOI
10.1016/j.imavis.2023.104816
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Object pose estimation is important for object manipulation and scene understanding. In order to improve the general applicability of pose estimators, recent research focuses on providing estimates for novel objects, that is, objects unseen during training. Such works use deep template matching strategies to retrieve the closest template connected to a query image, which implicitly provides object class and pose. Despite the recent success and improvements of Vision Transformers over CNNs for many vision tasks, the state of the art uses CNN-based approaches for novel object pose estimation. This work evaluates and demonstrates the differences between self-supervised CNNs and Vision Transformers for deep template matching. In detail, both types of approaches are trained using contrastive learning to match training images against rendered templates of isolated objects. At test time such templates are matched against query images of known and novel objects under challenging settings, such as clutter, occlusion and object symmetries, using masked cosine similarity. The presented results not only demonstrate that Vision Transformers improve matching accuracy over CNNs but also that for some cases pre-trained Vision Transformers do not need fine-tuning to achieve the improvement. Furthermore, we highlight the differences in optimization and network architecture when comparing these two types of networks for deep template matching.
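The abstract names masked cosine similarity as the test-time matching score but does not spell out its formulation. The following is a minimal sketch of one plausible reading, not the authors' implementation: it assumes each image is encoded as an (H, W, C) grid of local features, that each rendered template comes with a binary object mask on the same grid, and that the score is the mean per-location cosine similarity over the masked locations. The function names `masked_cosine_similarity` and `retrieve_best_template` are hypothetical.

```python
import numpy as np


def masked_cosine_similarity(query_feats, template_feats, template_mask, eps=1e-8):
    """Mean cosine similarity over locations where the template mask is set.

    query_feats, template_feats: arrays of shape (H, W, C) (assumed layout).
    template_mask: array of shape (H, W), nonzero where the object is visible.
    """
    # L2-normalise each local feature vector along the channel axis.
    q = query_feats / (np.linalg.norm(query_feats, axis=-1, keepdims=True) + eps)
    t = template_feats / (np.linalg.norm(template_feats, axis=-1, keepdims=True) + eps)

    # Per-location cosine similarity, shape (H, W).
    per_location = np.sum(q * t, axis=-1)

    # Average only over masked locations, so background does not contribute.
    mask = template_mask.astype(per_location.dtype)
    return float(np.sum(per_location * mask) / (np.sum(mask) + eps))


def retrieve_best_template(query_feats, templates, masks):
    """Return the index of the highest-scoring template and all scores."""
    scores = [
        masked_cosine_similarity(query_feats, t, m)
        for t, m in zip(templates, masks)
    ]
    return int(np.argmax(scores)), scores
```

Under these assumptions, the retrieved template index implicitly carries the object class and pose, as the abstract describes, because each template is rendered from a known object at a known viewpoint. Restricting the average to the template mask is one way to keep clutter around the object from lowering the score.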
Pages: 9