Efficient Multi-Object Pose Estimation using Multi-Resolution Deformable Attention and Query Aggregation

Cited by: 1
Authors
Periyasamy, Arul Selvam [1]
Tsaturyan, Vladimir [1]
Behnke, Sven [1]
Affiliations
[1] Univ Bonn, Autonomous Intelligent Syst, Bonn, Germany
Source
2023 SEVENTH IEEE INTERNATIONAL CONFERENCE ON ROBOTIC COMPUTING, IRC 2023 | 2023
DOI
10.1109/IRC59093.2023.00047
Chinese Library Classification (CLC) number
TP301 [Theory, methods];
Discipline classification code
081202;
Abstract
Object pose estimation is a long-standing problem in computer vision. Recently, attention-based vision transformer models have achieved state-of-the-art results in many computer vision applications. Exploiting the permutation-invariant nature of the attention mechanism, a family of vision transformer models formulate multi-object pose estimation as a set prediction problem. However, existing vision transformer models for multi-object pose estimation rely exclusively on the attention mechanism. Convolutional neural networks, on the other hand, hard-wire various inductive biases into their architecture. In this paper, we investigate incorporating inductive biases in vision transformer models for multi-object pose estimation, which facilitates learning long-range dependencies while circumventing the costly global attention. In particular, we use multi-resolution deformable attention, where the attention operation is performed only between a few deformed reference points. Furthermore, we propose a query aggregation mechanism that enables increasing the number of object queries without increasing the computational complexity. We evaluate the proposed model on the challenging YCB-Video dataset and report state-of-the-art results.
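The abstract's description of multi-resolution deformable attention can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes single-channel feature maps stored as nested lists, a single query, and sampling offsets and attention scores that a real model would predict from the query. The names `bilinear_sample`, `deformable_attention`, `ref_point`, `offsets`, and `logits` are hypothetical. The point it shows is that the query reads only a few sampled locations per resolution level, so the cost depends on the number of sampling points rather than on the feature-map size.

```python
import math


def bilinear_sample(feat, x, y):
    """Bilinearly interpolate a 2D map (list of rows) at pixel coords (x, y)."""
    h, w = len(feat), len(feat[0])
    # Clamp to the map so out-of-range sampling points read the border value.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * feat[y0][x0] + wx * feat[y0][x1]
    bottom = (1 - wx) * feat[y1][x0] + wx * feat[y1][x1]
    return (1 - wy) * top + wy * bottom


def deformable_attention(feats, ref_point, offsets, logits):
    """Aggregate a few sampled values per resolution level for one query.

    feats:     list of L feature maps, each a list of rows of floats
    ref_point: (x, y) reference point, normalized to [0, 1]
    offsets:   L lists of K (dx, dy) normalized offsets (assumed given here)
    logits:    L lists of K unnormalized attention scores (assumed given here)
    """
    # Softmax jointly over all L * K sampling points.
    flat = [s for level in logits for s in level]
    m = max(flat)
    z = sum(math.exp(s - m) for s in flat)

    out = 0.0
    for feat, level_offsets, level_logits in zip(feats, offsets, logits):
        h, w = len(feat), len(feat[0])
        for (dx, dy), s in zip(level_offsets, level_logits):
            weight = math.exp(s - m) / z
            # Map the normalized sampling location to this level's pixel grid.
            px = (ref_point[0] + dx) * (w - 1)
            py = (ref_point[1] + dy) * (h - 1)
            out += weight * bilinear_sample(feat, px, py)
    return out


# Usage: two resolution levels, two sampling points per level.
fine = [[float(x) for x in range(8)] for _ in range(8)]
coarse = [[float(x) for x in range(4)] for _ in range(4)]
value = deformable_attention(
    [fine, coarse],
    ref_point=(0.5, 0.5),
    offsets=[[(0.1, 0.0), (-0.1, 0.0)], [(0.0, 0.1), (0.0, -0.1)]],
    logits=[[0.0, 0.0], [0.0, 0.0]],
)
```

Global attention would instead compare the query against every location of every feature map. The query aggregation mechanism mentioned in the abstract is not sketched here because the abstract does not specify how it works.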
Pages: 247-254
Page count: 8