TSwinPose: Enhanced monocular 3D human pose estimation with JointFlow

Cited by: 7

Authors:

Li, Muyu [1]

Hu, Henan [2]

Xiong, Jingjing [3]

Zhao, Xudong [1]

Yan, Hong [4,5]

Affiliations:

[1] Dalian Univ Technol, Sch Control Sci & Engn, Dalian 116024, Liaoning, Peoples R China

[2] Dalian Jiaotong Univ, Sch Mech Engn, Dalian 116028, Liaoning, Peoples R China

[3] Hong Kong Appl Sci & Technol Res Inst Co Ltd, Hong Kong, Peoples R China

[4] Ctr Intelligent Multidimens Data Anal Ltd, Hong Kong, Peoples R China

[5] City Univ Hong Kong, Dept Elect Engn, Hong Kong, Peoples R China

Source:

EXPERT SYSTEMS WITH APPLICATIONS | 2024 / Volume 249

Keywords:

Monocular video; 3D human pose estimation; Transformer;

DOI:

10.1016/j.eswa.2024.123545

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Monocular estimation of 3D human poses is challenging due to ambiguity in depths and partial occlusion. Most recent works define this as a 2D-to-3D lifting task, taking 2D key point sequences and using spatial and temporal relationships. However, prior works focus on capturing spatio-temporal correlations but ignore the motion of joints that is needed for continuous estimation. To extend the potential of 2D-to-3D pose estimation, we propose TSwinPose, which learns multi-scale spatio-temporal representations from 2D key point locations and patterns of motion. The input 2D key point sequences are enhanced by JointFlow, which encodes the motion of each human joint. Based on Swin-Transformer, we designed a temporal domain SwinUnet structure to model multi-scale spatio-temporal relationships of human joints across different temporal windows. The final 3D pose generated by multi-stage representations is consistent temporally and has a higher accuracy. Experiments conducted on three benchmark datasets, Human3.6M, MPI-INF-3DHP, and HumanEva-I, demonstrate that TSwinPose achieves performance that is on par with state-of-the-art methods. Moreover, the introduction of JointFlow as a plug-in extension enhances performance significantly, particularly benefiting long-term 2D-to-3D lifting human pose estimation methods.

Pages: 15
