Efficient Hierarchical Multi-view Fusion Transformer for 3D Human Pose Estimation

Cited by: 5
Authors
Zhou, Kangkang [1 ,2 ]
Zhang, Lijun [1 ,2 ]
Lu, Feng [3 ]
Zhou, Xiang-Dong [1 ]
Shi, Yu [1 ]
Affiliations
[1] Chinese Acad Sci, Chongqing Inst Green & Intelligent Technol, Chongqing, Peoples R China
[2] Univ Chinese Acad Sci, Chongqing Sch, Chongqing, Peoples R China
[3] Tsinghua Univ, Tsinghua Shenzhen Int Grad Sch, Peng Cheng Lab, Shenzhen, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Funding
National Natural Science Foundation of China;
Keywords
3D human pose estimation; multi-view fusion; transformer;
DOI
10.1145/3581783.3612098
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In multi-view 3D human pose estimation (HPE), information from different viewpoints is highly variable due to complex factors such as background and occlusion, making cross-view feature extraction and fusion difficult. Most existing methods either over-rely on camera parameters or extract insufficient semantic features. To address these issues, this paper proposes a hierarchical multi-view fusion transformer (HMVformer) framework for 3D HPE, incorporating cross-view feature fusion into the spatial and temporal feature extraction process in a coarse-to-fine manner. First, global-to-local attention graph features are extracted and combined with the original pose features to better preserve spatial-structural semantic knowledge. Then, several cross-view feature fusion modules are built and embedded into the pose feature extraction pipeline to fuse consistent and distinctive information across viewpoints. Furthermore, sequential temporal information is extracted and fused with the spatial knowledge for feature refinement and depth-uncertainty reduction. Extensive experiments on three popular 3D HPE benchmarks show that HMVformer achieves state-of-the-art results without relying on complex loss functions or camera parameters, proving simple yet effective in mitigating depth ambiguity and improving 3D pose prediction accuracy. Code and models are available(1).
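The cross-view fusion idea in the abstract — pooling joint features across viewpoints with attention, without camera parameters — can be sketched as below. This is a hypothetical minimal illustration, not the authors' implementation: the function name `cross_view_fusion`, the single-head attention, and the residual connection are assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_fusion(feats, Wq, Wk, Wv):
    """Fuse per-view joint features by attending across views.

    feats: (V, J, C) array -- V camera views, J joints, C channels.
    Each joint attends over the same joint in every view, so
    cross-view evidence is pooled without camera calibration.
    """
    V, J, C = feats.shape
    q = (feats @ Wq).transpose(1, 0, 2)  # (J, V, C)
    k = (feats @ Wk).transpose(1, 0, 2)
    v = (feats @ Wv).transpose(1, 0, 2)
    # attention over the view axis, per joint: (J, V, V)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(C), axis=-1)
    fused = attn @ v                     # (J, V, C)
    # residual connection keeps view-specific detail
    return feats + fused.transpose(1, 0, 2)

rng = np.random.default_rng(0)
V, J, C = 4, 17, 32                      # 4 views, 17 joints (Human3.6M skeleton)
feats = rng.standard_normal((V, J, C))
Wq, Wk, Wv = (0.1 * rng.standard_normal((C, C)) for _ in range(3))
out = cross_view_fusion(feats, Wq, Wk, Wv)
print(out.shape)  # (4, 17, 32)
```

In the paper's coarse-to-fine design, modules like this would sit inside the spatial/temporal feature extraction stages; the sketch only shows the view-axis attention step itself.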
Pages: 7512-7520
Page count: 9