Self-supervised Vision Transformers for 3D pose estimation of novel objects

Cited by: 1
Authors
Thalhammer, Stefan [1 ]
Weibel, Jean-Baptiste [1 ]
Vincze, Markus [1 ]
Garcia-Rodriguez, Jose [2 ]
Affiliations
[1] TU Wien, Automat & Control Inst, Gusshausstr 27-29, A-1040 Vienna, Austria
[2] Univ Alicante, Dept Comp Technol, Carr San Vicente del Raspeig, Alicante 03690, Spain
Keywords
Object pose estimation; Template matching; Vision transformer; Self-supervised learning
DOI
10.1016/j.imavis.2023.104816
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Object pose estimation is important for object manipulation and scene understanding. In order to improve the general applicability of pose estimators, recent research focuses on providing estimates for novel objects, that is, objects unseen during training. Such works use deep template matching strategies to retrieve the template closest to a query image, which implicitly provides object class and pose. Despite the recent success and improvements of Vision Transformers over CNNs for many vision tasks, the state of the art still uses CNN-based approaches for novel object pose estimation. This work evaluates and demonstrates the differences between self-supervised CNNs and Vision Transformers for deep template matching. In detail, both types of approaches are trained using contrastive learning to match training images against rendered templates of isolated objects. At test time, such templates are matched against query images of known and novel objects under challenging settings, such as clutter, occlusion, and object symmetries, using masked cosine similarity. The presented results not only demonstrate that Vision Transformers improve matching accuracy over CNNs, but also that in some cases pre-trained Vision Transformers achieve this improvement without fine-tuning. Furthermore, we highlight the differences in optimization and network architecture when comparing these two types of networks for deep template matching.
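The abstract names two technical components: a contrastive objective that aligns query-image features with rendered-template features during training, and masked cosine similarity for template retrieval at test time. The sketch below illustrates both in PyTorch under stated assumptions: the tensor shapes, the InfoNCE-style formulation, the temperature value, and the helper names `contrastive_loss`, `masked_cosine_similarity`, and `retrieve_template` are illustrative choices, not the authors' exact implementation.

```python
# A minimal sketch of the two components described in the abstract, under
# the assumptions stated above; not the paper's exact implementation.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, tpl_emb, temperature=0.1):
    """InfoNCE-style objective: each image embedding (B, D) is pulled toward
    its own rendered-template embedding and pushed away from the other
    templates in the batch."""
    img = F.normalize(img_emb, dim=1)
    tpl = F.normalize(tpl_emb, dim=1)
    logits = img @ tpl.t() / temperature            # (B, B) similarity matrix
    labels = torch.arange(img.size(0), device=img.device)
    return F.cross_entropy(logits, labels)

def masked_cosine_similarity(query_feat, tpl_feat, tpl_mask):
    """Cosine similarity between spatial feature maps (C, H, W), averaged
    only over the template's object region given by a binary mask (H, W)."""
    q = F.normalize(query_feat.flatten(1), dim=0)   # (C, H*W), unit columns
    t = F.normalize(tpl_feat.flatten(1), dim=0)
    sim = (q * t).sum(dim=0)                        # cosine sim per location
    m = tpl_mask.flatten().float()
    return (sim * m).sum() / m.sum().clamp(min=1.0)

def retrieve_template(query_feat, tpl_feats, tpl_masks):
    """Return the index of the best-matching template; its stored object
    class and viewpoint then serve as the estimate for the query."""
    scores = torch.stack([
        masked_cosine_similarity(query_feat, f, m)
        for f, m in zip(tpl_feats, tpl_masks)
    ])
    return int(scores.argmax())
```

Because the mask restricts the score to the template's object region, background content in the query crop contributes nothing to the match, which is one plausible reading of how matching stays robust under the clutter and occlusion settings the abstract mentions.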
Pages: 9
Related Papers
50 records in total
  • [1] Self-supervised 3D human pose estimation from video
    Gholami, Mohsen
    Rezaei, Ahmad
    Rhodin, Helge
    Ward, Rabab
    Wang, Z. Jane
    NEUROCOMPUTING, 2022, 488: 97-106
  • [2] Self-supervised Vision Transformers for Writer Retrieval
    Raven, Tim
    Matei, Arthur
    Fink, Gernot A.
    DOCUMENT ANALYSIS AND RECOGNITION-ICDAR 2024, PT II, 2024, 14805: 380-396
  • [3] Multi-View 3D Human Pose Estimation with Self-Supervised Learning
    Chang, Inho
    Park, Min-Gyu
    Kim, Jaewoo
    Yoon, Ju Hong
    3RD INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE IN INFORMATION AND COMMUNICATION (IEEE ICAIIC 2021), 2021: 255-257
  • [4] Ssman: self-supervised masked adaptive network for 3D human pose estimation
    Shi, Yu
    Yue, Tianyi
    Zhao, Hu
    He, Guoping
    Ren, Keyan
    MACHINE VISION AND APPLICATIONS, 2024, 35 (03)
  • [5] 3D Human Pose Machines with Self-Supervised Learning
    Wang, Keze
    Lin, Liang
    Jiang, Chenhan
    Qian, Chen
    Wei, Pengxu
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2020, 42 (05): 1069-1082
  • [6] Self-supervised method for 3D human pose estimation with consistent shape and viewpoint factorization
    Ma, Zhichao
    Li, Kan
    Li, Yang
    APPLIED INTELLIGENCE, 2023, 53 (04): 3864-3876
  • [7] Self-supervised vision transformers for semantic segmentation
    Gu, Xianfan
    Hu, Yingdong
    Wen, Chuan
    Gao, Yang
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2025, 251
  • [8] Self-Supervised Vision Transformers for Malware Detection
    Seneviratne, Sachith
    Shariffdeen, Ridwan
    Rasnayaka, Sanka
    Kasthuriarachchi, Nuran
    IEEE ACCESS, 2022, 10: 103121-103135