SlowFastFormer for 3D human pose estimation

Cited by: 5

Authors:

Zhou, Lu [1]

Chen, Yingying [1]

Wang, Jinqiao [1,2,3,4]

Affiliations:

[1] Chinese Acad Sci, Inst Automat, Fdn Model Res Ctr, Beijing 100190, Peoples R China

[2] Univ Chinese Acad Sci, Sch Artificial Intelligence, Beijing 100049, Peoples R China

[3] Wuhan AI Res, Wuhan 430073, Peoples R China

[4] Peng Cheng Lab, Shenzhen 518066, Peoples R China

Source:

COMPUTER VISION AND IMAGE UNDERSTANDING | 2024 / Vol. 243

Funding:

National Natural Science Foundation of China;

Keywords:

SlowFastFormer; Transformer; Blending; 3D human pose estimation; Hierarchical supervision;

DOI:

10.1016/j.cviu.2024.103992

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

3D human pose estimation in videos aims to locate human joints in 3D space given a temporal sequence. Motion information and skeleton context are two significant elements for pose estimation in videos. In this paper, we propose a SlowFastFormer (slow-fast transformer) network in which two branches with different input rates encode these two kinds of context. For the slow branch, skeleton context is well learned at a lower frame rate. For the fast branch, motion information is captured at a higher frame rate. Through these two branches, the different kinds of context are encoded separately. We fuse the two branches at a later stage to fully utilize the skeleton context and motion information. Afterwards, a blending module is developed to promote message exchange among multiple branches. In the blending stage, different kinds of context information are exchanged and the feature representation is enhanced as a result. Lastly, a hierarchical supervision scheme is tailored in which predictions at different levels are inferred in a progressive manner. Our approach achieves competitive performance with lower computational complexity on several benchmarks, namely Human3.6M, MPI-INF-3DHP and HumanEva-I.
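As a rough illustration of the "two branches with different input rates" idea in the abstract, the sketch below samples one pose clip at two temporal rates and then fuses the two streams frame by frame. It is not the authors' implementation: the names `split_rates` and `late_fuse`, the stride value, and the nearest-frame concatenation used for fusion are all assumptions made for illustration, and raw frames stand in for the features that the transformer branches would produce. The sketch also does not encode which of the two streams the paper assigns to which branch.

```python
from typing import List, Sequence, Tuple

Frame = List[float]  # one flattened 2D pose, e.g. [x0, y0, x1, y1, ...]


def split_rates(frames: Sequence[Frame], stride: int) -> Tuple[List[Frame], List[Frame]]:
    """Return (dense, sparse) views of the same clip.

    dense keeps every frame; sparse keeps every `stride`-th frame.
    Both names are hypothetical and not taken from the paper.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    dense = list(frames)
    sparse = list(frames[::stride])
    return dense, sparse


def late_fuse(dense: Sequence[Frame], sparse: Sequence[Frame], stride: int) -> List[Frame]:
    """Concatenate each dense frame with the most recent sparse frame.

    This simple alignment is an assumption; the paper's fusion and
    blending modules are not specified in the abstract.
    """
    return [list(d) + list(sparse[t // stride]) for t, d in enumerate(dense)]


# Toy clip: 8 frames, 2 values per frame.
clip = [[float(t), float(t) * 2.0] for t in range(8)]
dense, sparse = split_rates(clip, stride=4)
fused = late_fuse(dense, sparse, stride=4)
```

With a stride of 4, the sparse stream holds a quarter of the frames, which hints at why processing one branch at a reduced rate can lower computation.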


Pages: 9
