Efficient Hierarchical Multi-view Fusion Transformer for 3D Human Pose Estimation

Cited by: 5
Authors
Zhou, Kangkang [1 ,2 ]
Zhang, Lijun [1 ,2 ]
Lu, Feng [3 ]
Zhou, Xiang-Dong [1 ]
Shi, Yu [1 ]
Affiliations
[1] Chinese Acad Sci, Chongqing Inst Green & Intelligent Technol, Chongqing, Peoples R China
[2] Univ Chinese Acad Sci, Chongqing Sch, Chongqing, Peoples R China
[3] Tsinghua Univ, Tsinghua Shenzhen Int Grad Sch, Peng Cheng Lab, Shenzhen, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Funding
National Natural Science Foundation of China;
Keywords
3D human pose estimation; multi-view fusion; transformer;
DOI
10.1145/3581783.3612098
Chinese Library Classification
TP18 [theory of artificial intelligence];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In multi-view 3D human pose estimation (HPE), information from different viewpoints is highly variable due to complex factors such as background and occlusion, making cross-view feature extraction and fusion difficult. Most existing methods either over-rely on camera parameters or extract insufficient semantic features. To address these issues, this paper proposes a hierarchical multi-view fusion transformer (HMVformer) framework for 3D HPE that incorporates cross-view feature fusion into the spatial and temporal feature extraction process in a coarse-to-fine manner. First, global-to-local attention graph features are extracted and combined with the original pose features to better preserve spatial-structural semantic knowledge. Then, several cross-view feature fusion modules are built and embedded into the pose feature extraction to fuse consistent and distinctive information across viewpoints. Furthermore, sequential temporal information is extracted and fused with spatial knowledge for feature refinement and depth-uncertainty reduction. Extensive experiments on three popular 3D HPE benchmarks show that HMVformer achieves state-of-the-art results without relying on complex loss functions or camera parameters, and is simple yet effective in mitigating depth ambiguity and improving 3D pose prediction accuracy. Code and models are available(1).
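The cross-view fusion idea in the abstract can be illustrated with a minimal sketch: a feature vector from a reference view attends over the feature vectors of all views via scaled dot-product attention, producing a fused representation without any camera parameters. This is a toy, dependency-free illustration of the general technique, not the paper's actual module; the function names, feature dimensions, and example values are all hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of attention scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_view_attention(query, view_feats):
    # Fuse per-view features into one vector: the query (e.g. a joint
    # feature from the reference view) attends over all views' features,
    # and the fused output is the attention-weighted sum of those features.
    d = len(query)
    scores = [dot(query, f) / math.sqrt(d) for f in view_feats]
    weights = softmax(scores)
    fused = [sum(w * f[i] for w, f in zip(weights, view_feats))
             for i in range(d)]
    return fused, weights

# Toy example: 3 camera views, 4-dim features; view 0 is the reference.
views = [[1.0, 0.0, 0.5, 0.2],
         [0.9, 0.1, 0.4, 0.3],
         [0.0, 1.0, 0.1, 0.8]]
fused, weights = cross_view_attention(views[0], views)
```

Because the attention weights are computed purely from feature similarity, views that agree with the reference view contribute more to the fused vector, which is one way a fusion module can remain camera-parameter-free.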
Pages: 7512-7520
Page count: 9