Exploiting Temporal Contexts With Strided Transformer for 3D Human Pose Estimation

Cited by: 118
Authors
Li, Wenhao [1 ]
Liu, Hong [1 ]
Ding, Runwei [1 ]
Liu, Mengyuan [2 ]
Wang, Pichao [3 ]
Yang, Wenming [4 ]
Affiliations
[1] Peking Univ, Shenzhen Grad Sch, Key Lab Machine Percept, Beijing, Peoples R China
[2] Sun Yat Sen Univ, Sch Intelligent Syst Engn, Guangzhou, Guangdong, Peoples R China
[3] Alibaba Grp, Bellevue, WA 98004 USA
[4] Tsinghua Univ, Grad Sch Shenzhen, Dept Elect Engn, Shenzhen Engn Lab IS & DRM, Shenzhen, Peoples R China
Funding
National Key R&D Program of China;
Keywords
Transformers; Three-dimensional displays; Pose estimation; Task analysis; Videos; Solid modeling; Computer architecture; 3D human pose estimation; transformer; strided convolution; ACTION RECOGNITION;
DOI
10.1109/TMM.2022.3141231
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Despite the great progress in 3D human pose estimation from videos, it remains an open problem to take full advantage of a redundant 2D pose sequence to learn representative features for generating one 3D pose. To this end, we propose an improved Transformer-based architecture, called Strided Transformer, which simply and effectively lifts a long sequence of 2D joint locations to a single 3D pose. Specifically, a Vanilla Transformer Encoder (VTE) is adopted to model long-range dependencies of 2D pose sequences. To reduce the redundancy of the sequence, fully-connected layers in the feed-forward network of VTE are replaced with strided convolutions to progressively shrink the sequence length and aggregate information from local contexts. The modified VTE is termed the Strided Transformer Encoder (STE), which is built upon the outputs of VTE. STE not only effectively aggregates long-range information into a single-vector representation in a hierarchical global-and-local fashion, but also significantly reduces the computation cost. Furthermore, a full-to-single supervision scheme is designed at both the full-sequence and single-target-frame scales, applied to the outputs of VTE and STE, respectively. This scheme imposes extra temporal smoothness constraints in conjunction with the single-target-frame supervision and hence helps produce smoother and more accurate 3D poses. The proposed Strided Transformer is evaluated on two challenging benchmark datasets, Human3.6M and HumanEva-I, and achieves state-of-the-art results with fewer parameters. Code and models are available at https://github.com/Vegetebird/StridedTransformer-Pose3D.
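The core sequence-reduction idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (the released code is PyTorch-based); the kernel size, stride, channel width, and 27-frame input here are illustrative assumptions, chosen only to show how stacked strided convolutions progressively shrink a feature sequence to the single-vector representation used for the target frame:

```python
import numpy as np

def strided_conv1d(x, w, stride):
    """Valid 1D convolution over the time axis with a given stride.
    x: (T, C_in) pose-feature sequence; w: (k, C_in, C_out) kernel."""
    k = w.shape[0]
    t_out = (x.shape[0] - k) // stride + 1
    out = np.empty((t_out, w.shape[2]))
    for t in range(t_out):
        window = x[t * stride : t * stride + k]       # (k, C_in) local context
        out[t] = np.einsum("kc,kcd->d", window, w)    # aggregate into one step
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((27, 256))      # stand-in for a 27-frame feature sequence
w = rng.standard_normal((3, 256, 256))  # kernel size 3, channels preserved

# Three strided layers progressively shrink the sequence: 27 -> 9 -> 3 -> 1
for _ in range(3):
    x = strided_conv1d(x, w, stride=3)
print(x.shape)  # (1, 256): single-vector representation for the target frame
```

In the paper's design these strided convolutions replace the fully-connected layers of the Transformer feed-forward network, so the reduction happens inside the encoder rather than as a separate pooling stage.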
Pages: 1282-1293
Page count: 12