Efficient Multi-Object Pose Estimation using Multi-Resolution Deformable Attention and Query Aggregation

Cited by: 1

Authors:

Periyasamy, Arul Selvam [1]

Tsaturyan, Vladimir [1]

Behnke, Sven [1]

Affiliations:

[1] Univ Bonn, Autonomous Intelligent Syst, Bonn, Germany

Source:

2023 SEVENTH IEEE INTERNATIONAL CONFERENCE ON ROBOTIC COMPUTING, IRC 2023 | 2023

Keywords:

DOI:

10.1109/IRC59093.2023.00047

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202 ;

Abstract:

Object pose estimation is a long-standing problem in computer vision. Recently, attention-based vision transformer models have achieved state-of-the-art results in many computer vision applications. Exploiting the permutation-invariant nature of the attention mechanism, a family of vision transformer models formulate multi-object pose estimation as a set prediction problem. However, existing vision transformer models for multi-object pose estimation rely exclusively on the attention mechanism. Convolutional neural networks, on the other hand, hard-wire various inductive biases into their architecture. In this paper, we investigate incorporating inductive biases in vision transformer models for multi-object pose estimation, which facilitates learning long-range dependencies while circumventing the costly global attention. In particular, we use multi-resolution deformable attention, where the attention operation is performed only between a few deformed reference points. Furthermore, we propose a query aggregation mechanism that enables increasing the number of object queries without increasing the computational complexity. We evaluate the proposed model on the challenging YCB-Video dataset and report state-of-the-art results.

Pages: 247-254

Page count: 8

References (35 in total):

[1] Amini A., 2021, GERMAN C PATTERN REC

[2] YOLOPose: Transformer-Based Multi-object 6D Pose Estimation Using Keypoint Regression [J]. Amini, Arash; Periyasamy, Arul Selvam; Behnke, Sven. INTELLIGENT AUTONOMOUS SYSTEMS 17, IAS-17, 2023, 577: 392-406

[3] Behnke S, 2003, LECT NOTES COMPUT SC, V2766, P1

[4] Brachmann Eric, 2020, heiDATA

[5] DSAC - Differentiable RANSAC for Camera Localization [J]. Brachmann, Eric; Krull, Alexander; Nowozin, Sebastian; Shotton, Jamie; Michel, Frank; Gumhold, Stefan; Rother, Carsten. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017: 2492-2500

[6] ConvPoseCNN: Dense Convolutional 6D Object Pose Estimation [J]. Capellen, Catherine; Schwarz, Max; Behnke, Sven. PROCEEDINGS OF THE 15TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER VISION, IMAGING AND COMPUTER GRAPHICS THEORY AND APPLICATIONS, VOL 5: VISAPP, 2020: 162-172

[7] End-to-End Object Detection with Transformers [J]. Carion, Nicolas; Massa, Francisco; Synnaeve, Gabriel; Usunier, Nicolas; Kirillov, Alexander; Zagoruyko, Sergey. COMPUTER VISION - ECCV 2020, PT I, 2020, 12346: 213-229

[8] Cohen Nadav, 2017, INT C LEARNING REPRE

[9] Dosovitskiy A., 2021, INT C LEARNING REPRE, P1

[10] Hinterstoisser S., 2012, 11 ASIAN C COMPUTER, DOI 10.1007/978-3-642-37331-2_42
