Multi-hypothesis representation learning for transformer-based 3D human pose estimation

Cited by: 21
Authors
Li, Wenhao [1]
Liu, Hong [1]
Tang, Hao [2]
Wang, Pichao [3,4]
Affiliations
[1] Peking Univ, Shenzhen Grad Sch, Key Lab Machine Percept, Beijing, Peoples R China
[2] Swiss Fed Inst Technol, Comp Vis Lab, Zurich, Switzerland
[3] Amazon Prime Video, Seattle, WA USA
[4] Alibaba Grp, Hangzhou, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
3D human pose estimation; Transformer; Multi-hypothesis; Self-hypothesis; Cross-hypothesis;
DOI
10.1016/j.patcog.2023.109631
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Despite significant progress, estimating 3D human poses from monocular videos remains a challenging task due to depth ambiguity and self-occlusion. Most existing works attempt to solve both issues by exploiting spatial and temporal relationships. However, those works ignore the fact that it is an inverse problem where multiple feasible solutions (i.e., hypotheses) exist. To relieve this limitation, we propose a Multi-Hypothesis Transformer that learns spatio-temporal representations of multiple plausible pose hypotheses. In order to effectively model multi-hypothesis dependencies and build strong relationships across hypothesis features, we introduce a one-to-many-to-one three-stage framework: (i) Generate multiple initial hypothesis representations; (ii) Model self-hypothesis communication, merge multiple hypotheses into a single converged representation and then partition it into several diverged hypotheses; (iii) Learn cross-hypothesis communication and aggregate the multi-hypothesis features to synthesize the final 3D pose. Through the above processes, the final representation is enhanced and the synthesized pose is much more accurate. Extensive experiments show that the proposed method achieves state-of-the-art results on two challenging datasets: Human3.6M and MPI-INF-3DHP. The code and models are available at https://github.com/Vegetebird/MHFormer. (c) 2023 Elsevier Ltd. All rights reserved.
Pages: 12
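The abstract describes the one-to-many-to-one three-stage framework only at a high level. The snippet below is a minimal, hypothetical PyTorch sketch of that flow (hypothesis generation, self-hypothesis merge-and-partition, cross-hypothesis aggregation); module names, dimensions, and the attention layout are illustrative assumptions, not the authors' implementation, which is available at the repository linked above.

```python
import torch
import torch.nn as nn


class MultiHypothesisSketch(nn.Module):
    """Illustrative one-to-many-to-one pipeline; NOT the official MHFormer code."""

    def __init__(self, num_joints=17, num_hyp=3, dim=64):
        super().__init__()
        self.num_hyp = num_hyp
        # (i) one 2D pose sequence -> several initial hypothesis representations
        self.embed = nn.ModuleList(
            [nn.Linear(num_joints * 2, dim) for _ in range(num_hyp)]
        )
        # (ii) self-hypothesis communication, then merge (converge) and partition (diverge)
        self.self_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.merge = nn.Linear(num_hyp * dim, dim)
        self.partition = nn.Linear(dim, num_hyp * dim)
        # (iii) cross-hypothesis communication and aggregation into one 3D pose
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(num_hyp * dim, num_joints * 3)

    def forward(self, pose_2d):
        # pose_2d: (batch, frames, joints * 2) -- detected 2D keypoints
        b, t, _ = pose_2d.shape
        # (i) multiple initial hypothesis representations
        hyps = [proj(pose_2d) for proj in self.embed]              # num_hyp x (b, t, dim)
        # (ii) self-hypothesis refinement: temporal self-attention within each hypothesis
        hyps = [h + self.self_attn(h, h, h)[0] for h in hyps]
        merged = self.merge(torch.cat(hyps, dim=-1))               # converged representation
        hyps = torch.chunk(self.partition(merged), self.num_hyp, dim=-1)  # diverged again
        # (iii) cross-hypothesis interaction: each hypothesis attends to the others
        refined = []
        for i, h in enumerate(hyps):
            others = torch.cat([hyps[j] for j in range(self.num_hyp) if j != i], dim=1)
            refined.append(h + self.cross_attn(h, others, others)[0])
        fused = torch.cat(refined, dim=-1)                         # aggregate hypotheses
        return self.head(fused[:, t // 2]).view(b, -1, 3)          # 3D pose of center frame


if __name__ == "__main__":
    model = MultiHypothesisSketch()
    print(model(torch.randn(2, 27, 17 * 2)).shape)  # torch.Size([2, 17, 3])
```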