TSwinPose: Enhanced monocular 3D human pose estimation with JointFlow

Cited by: 4

Authors:

Li, Muyu [1]

Hu, Henan [2]

Xiong, Jingjing [3]

Zhao, Xudong [1]

Yan, Hong [4,5]

Affiliations:

[1] Dalian Univ Technol, Sch Control Sci & Engn, Dalian 116024, Liaoning, Peoples R China

[2] Dalian Jiaotong Univ, Sch Mech Engn, Dalian 116028, Liaoning, Peoples R China

[3] Hong Kong Appl Sci & Technol Res Inst Co Ltd, Hong Kong, Peoples R China

[4] Ctr Intelligent Multidimens Data Anal Ltd, Hong Kong, Peoples R China

[5] City Univ Hong Kong, Dept Elect Engn, Hong Kong, Peoples R China

Source:

EXPERT SYSTEMS WITH APPLICATIONS | 2024 / Vol. 249

Keywords:

Monocular video; 3D human pose estimation; Transformer;

DOI:

10.1016/j.eswa.2024.123545

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Monocular estimation of 3D human poses is challenging due to ambiguity in depths and partial occlusion. Most recent works define this as a 2D-to-3D lifting task, taking 2D key point sequences and using spatial and temporal relationships. However, prior works focus on capturing spatio-temporal correlations but ignore the motion of joints that is needed for continuous estimation. To extend the potential of 2D-to-3D pose estimation, we propose TSwinPose, which learns multi-scale spatio-temporal representations from 2D key point locations and patterns of motion. The input 2D key point sequences are enhanced by JointFlow, which encodes the motion of each human joint. Based on Swin-Transformer, we designed a temporal domain SwinUnet structure to model multi-scale spatio-temporal relationships of human joints across different temporal windows. The final 3D pose generated by multi-stage representations is consistent temporally and has a higher accuracy. Experiments conducted on three benchmark datasets, Human3.6M, MPI-INF-3DHP, and HumanEva-I, demonstrate that TSwinPose achieves performance that is on par with state-of-the-art methods. Moreover, the introduction of JointFlow as a plug-in extension enhances performance significantly, particularly benefiting long-term 2D-to-3D lifting human pose estimation methods.


Pages: 15
