STRFormer: Spatial-Temporal-ReTemporal Transformer for 3D human pose estimation

Cited by: 4
Authors
Liu, Xing [1]
Tang, Hao [2]
Affiliations
[1] Tongji Univ, Coll Elect & Informat Engn, Shanghai 201804, Peoples R China
[2] Swiss Fed Inst Technol, Comp Vis Lab, Zurich, Switzerland
Funding
National Natural Science Foundation of China;
Keywords
3D human pose estimation; Spatial-temporal-ReTemporal; Transformer; Reverse temporal encoder; MPJAE loss;
DOI
10.1016/j.imavis.2023.104863
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Transformer-based methods have emerged as the gold standard in 2D-3D human pose estimation from video sequences, largely thanks to their powerful spatial-temporal feature encoders. In the past, researchers have made concerted efforts to engineer spatial and temporal encoders using transformer blocks. This approach involved reshaping the input from per-frame joint information into joint trajectories over time. Despite this, the inherent limitations of the spatial-temporal structure have resulted in inadequate acquisition and utilization of temporal information. To address this issue, our paper proposes a new model, the Spatial-Temporal-ReTemporal Transformer (STRFormer). The model employs two separate temporal transformer blocks to extract temporal motion information from video sequences: one block processes the original video sequence, while the other processes the same video in reversed order. This allows a more thorough exploitation of the temporal information in the sequence. The two temporal blocks are alternated with the spatial block so as to maximize the extraction of temporal-domain information, which leads to a more comprehensive understanding of the pose and its evolution over time. Furthermore, we introduce a novel error metric, Mean Per-Joint Position Acceleration Error (MPJAE). This metric takes into account the body part velocity in adjacent predicted frames, allowing a more detailed evaluation of the predicted poses. We conduct extensive experiments on various open benchmarks to evaluate the effectiveness of our proposed model. The results demonstrate that STRFormer, coupled with the MPJAE loss, achieves highly competitive results compared with other state-of-the-art models. This illustrates its potential and practical applicability in 2D-3D human pose estimation tasks. We plan to release our code publicly for further research.
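
The abstract does not give the formula for MPJAE, so the following is only an illustrative sketch of the general idea it describes: comparing joint motion between adjacent frames rather than joint positions alone. The function names (`position_error`, `velocity_error`, `reverse_time`), the array layout `(frames, joints, 3)`, and the use of a first-order frame difference are all assumptions made for illustration; they are not taken from the paper and may differ from the authors' definition.

```python
import numpy as np


def position_error(pred, target):
    """Mean Euclidean distance per joint and frame (an MPJPE-style error).

    pred, target: arrays of shape (frames, joints, 3). Layout is an assumption.
    """
    return float(np.linalg.norm(pred - target, axis=-1).mean())


def velocity_error(pred, target):
    """Hypothetical motion-aware error, NOT the paper's MPJAE formula.

    Compares the displacement of each joint between adjacent frames in the
    prediction with the same displacement in the ground truth.
    """
    pred_vel = np.diff(pred, axis=0)      # shape (frames - 1, joints, 3)
    target_vel = np.diff(target, axis=0)  # shape (frames - 1, joints, 3)
    return float(np.linalg.norm(pred_vel - target_vel, axis=-1).mean())


def reverse_time(seq):
    """Return the sequence with its frame order reversed."""
    return seq[::-1]


# Toy data: 8 frames, 17 joints (the joint count is an arbitrary choice).
rng = np.random.default_rng(0)
target = rng.normal(size=(8, 17, 3))

# A prediction that is wrong by a constant offset in every frame.
shifted = target + np.array([3.0, 4.0, 0.0])

# A prediction with independent per-frame noise (temporal jitter).
jittery = target + 0.1 * rng.normal(size=target.shape)

print("shifted:", position_error(shifted, target), velocity_error(shifted, target))
print("jittery:", position_error(jittery, target), velocity_error(jittery, target))
```

In this toy setup the constant-offset prediction has a large position error but essentially no velocity error, while the jittery prediction shows the opposite pattern, which is the kind of distinction a motion-aware term is meant to expose. Note also that this particular sketch gives the same value when both sequences are passed through `reverse_time`, since reversing the frame order only flips the sign of each frame difference.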
Pages: 11