HOT-Net: Non-Autoregressive Transformer for 3D Hand-Object Pose Estimation

Cited by: 42
Authors
Huang, Lin [1 ]
Tan, Jianchao [2 ]
Meng, Jingjing [1 ]
Liu, Ji [2 ]
Yuan, Junsong [1 ]
Affiliations
[1] Univ Buffalo SUNY, Buffalo, NY 14260 USA
[2] Kwai Inc, Y Tech, Seattle, WA USA
Source
MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA | 2020
Keywords
3D Hand and Object Poses; Structured Learning; Transformer
DOI
10.1145/3394171.3413775
CLC Number (Chinese Library Classification)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Because we use our hands constantly in daily activities, the analysis of hand-object interactions plays a critical role in many multimedia understanding and interaction applications. Unlike conventional 3D hand-only and object-only pose estimation, estimating the 3D hand-object pose is more challenging due to the mutual occlusions between hand and object, as well as the physical constraints between them. To overcome these issues, we propose to fully exploit the structural correlations among hand joints and object corners in order to obtain more reliable poses. Our work is inspired by structured-output learning models in the sequence transduction field, such as the Transformer encoder-decoder framework. Besides modeling the inherent dependencies in the extracted 2D hand-object pose, our proposed Hand-Object Transformer Network (HOT-Net) also captures the structural correlations among 3D hand joints and object corners. As with the Transformer's autoregressive decoder, considering structured output patterns helps constrain the output space and leads to more robust pose estimation. However, unlike the Transformer's sequential modeling mechanism, HOT-Net adopts a novel non-autoregressive decoding strategy for 3D hand-object pose estimation. Specifically, our model removes the Transformer's dependence on previously generated results and explicitly feeds a reference 3D hand-object pose into the decoding process, providing equivalent target pose patterns so that each 3D keypoint can be localized in parallel. To further improve the physical validity of the estimated hand pose, we propose, in addition to anatomical constraints, a cooperative pose constraint that encourages the hand pose to cooperate with the hand shape in generating the hand mesh. We demonstrate real-time speed and state-of-the-art performance on benchmark hand-object datasets for both 3D hand and object poses.
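The non-autoregressive decoding idea in the abstract can be made concrete with a short sketch. The following PyTorch code is a minimal, hypothetical illustration, not the authors' implementation: it assumes a keypoint layout of 21 hand joints plus 8 object corners (29 keypoints total), an arbitrary model width, and standard torch.nn Transformer modules; the reference-pose source and all names and dimensions are illustrative. The point it demonstrates is the one the abstract describes: each reference 3D keypoint is embedded, tagged with a learned keypoint-identity embedding, and all keypoints are refined jointly in a single decoder pass over encoder features, with no causal mask and no dependence on previously generated outputs.

# Minimal sketch (assumptions noted above) of non-autoregressive pose decoding.
import torch
import torch.nn as nn

NUM_KEYPOINTS = 29   # 21 hand joints + 8 object corners (assumed layout)
D_MODEL = 128        # assumed feature width

class NonAutoregressivePoseDecoder(nn.Module):
    def __init__(self, num_layers: int = 3, num_heads: int = 4):
        super().__init__()
        # Lift each reference 3D keypoint (x, y, z) into the model dimension.
        self.ref_embed = nn.Linear(3, D_MODEL)
        # Learned embedding identifying which joint/corner each slot is.
        self.keypoint_embed = nn.Embedding(NUM_KEYPOINTS, D_MODEL)
        layer = nn.TransformerDecoderLayer(
            d_model=D_MODEL, nhead=num_heads, batch_first=True)
        # No causal mask: every keypoint attends to every other keypoint,
        # so the whole pose is decoded in one parallel pass.
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.head = nn.Linear(D_MODEL, 3)  # regress refined 3D coordinates

    def forward(self, ref_pose_3d, encoder_memory):
        # ref_pose_3d:    (B, 29, 3)   reference 3D hand-object pose
        # encoder_memory: (B, N, 128)  features encoded from the 2D pose
        batch = ref_pose_3d.size(0)
        ids = torch.arange(NUM_KEYPOINTS, device=ref_pose_3d.device)
        queries = (self.ref_embed(ref_pose_3d)
                   + self.keypoint_embed(ids)[None].expand(batch, -1, -1))
        refined = self.decoder(tgt=queries, memory=encoder_memory)
        return self.head(refined)  # (B, 29, 3) refined 3D keypoints

# Usage: one forward pass localizes all keypoints jointly, unlike an
# autoregressive decoder that conditions on previously emitted keypoints.
model = NonAutoregressivePoseDecoder()
out = model(torch.zeros(2, NUM_KEYPOINTS, 3), torch.zeros(2, NUM_KEYPOINTS, D_MODEL))
print(out.shape)  # torch.Size([2, 29, 3])

Because every query slot carries a fixed keypoint identity plus a reference coordinate, the decoder has the "target pose pattern" it needs up front, which is what lets the sequential dependence of a standard Transformer decoder be dropped without losing the structured-output modeling.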
Pages: 3136-3145
Page count: 10