Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation

Cited by: 18
Authors
Li, Wenhao [1 ]
Liu, Mengyuan [1 ]
Liu, Hong [1 ]
Wang, Pichao [2 ]
Cai, Jialun [1 ]
Sebe, Nicu [3 ]
Affiliations
[1] Peking University, Shenzhen Graduate School, National Key Laboratory of General Artificial Intelligence, Shenzhen, China
[2] Amazon Prime Video, Seattle, WA, USA
[3] University of Trento, Trento, Italy
Source
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024) | 2024
Funding
National Natural Science Foundation of China
Keywords
DOI
10.1109/CVPR52733.2024.00064
CLC Classification Number
TP18 [Theory of Artificial Intelligence]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Transformers have been successfully applied in the field of video-based 3D human pose estimation. However, the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this paper, we present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D human pose estimation from videos. Our HoT begins with pruning pose tokens of redundant frames and ends with recovering full-length tokens, leaving only a few pose tokens in the intermediate transformer blocks and thus improving model efficiency. To effectively achieve this, we propose a token pruning cluster (TPC) that dynamically selects a few representative tokens with high semantic diversity while eliminating the redundancy of video frames. In addition, we develop a token recovering attention (TRA) to restore the detailed spatio-temporal information based on the selected tokens, thereby expanding the network output to the original full-length temporal resolution for fast inference. Extensive experiments on two benchmark datasets (i.e., Human3.6M and MPI-INF-3DHP) demonstrate that our method achieves both high efficiency and estimation accuracy compared to the original VPT models. For instance, when applied to MotionBERT and MixSTE on Human3.6M, our HoT can save nearly 50% of the FLOPs without sacrificing accuracy and nearly 40% of the FLOPs with only a 0.2% accuracy drop, respectively. Code and models are available at https://github.com/NationalGAILab/HoT.
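
To make the prune-then-recover pipeline concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: TokenPruningCluster approximates TPC with greedy farthest-point sampling in feature space (the paper's TPC is clustering-based), and TokenRecoveringAttention approximates TRA with learnable full-length queries that cross-attend to the pruned tokens. All class names, the keep ratio, and the tensor sizes are illustrative assumptions.

# Minimal sketch of HoT's prune-then-recover structure (illustrative, not the authors' code).
import torch
import torch.nn as nn


class TokenPruningCluster(nn.Module):
    # Stand-in for the paper's TPC: keeps k representative frame tokens via
    # greedy farthest-point sampling; the real TPC uses a clustering criterion.
    def __init__(self, keep_ratio=0.5):
        super().__init__()
        self.keep_ratio = keep_ratio

    def forward(self, x):                            # x: (B, T, C) pose tokens
        B, T, C = x.shape
        k = max(1, int(T * self.keep_ratio))
        dist = torch.cdist(x, x)                     # (B, T, T) pairwise distances
        # Seed with the token nearest to the sequence mean, then greedily
        # add the token farthest from everything selected so far.
        seed = torch.cdist(x, x.mean(1, keepdim=True)).squeeze(-1).argmin(1)
        idx = torch.zeros(B, k, dtype=torch.long, device=x.device)
        idx[:, 0] = seed
        min_d = dist.gather(1, seed[:, None, None].expand(B, 1, T)).squeeze(1)
        for i in range(1, k):
            nxt = min_d.argmax(1)
            idx[:, i] = nxt
            d_nxt = dist.gather(1, nxt[:, None, None].expand(B, 1, T)).squeeze(1)
            min_d = torch.minimum(min_d, d_nxt)
        idx = idx.sort(1).values                     # restore temporal order
        return x.gather(1, idx[..., None].expand(B, k, C))   # (B, k, C)


class TokenRecoveringAttention(nn.Module):
    # Stand-in for the paper's TRA: learnable full-length queries cross-attend
    # to the pruned tokens, expanding the output back to T frames.
    def __init__(self, dim, num_frames, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.zeros(1, num_frames, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, pruned):                       # pruned: (B, k, C)
        q = self.queries.expand(pruned.size(0), -1, -1)
        out, _ = self.attn(q, pruned, pruned)        # (B, T, C)
        return out


# Usage: prune 81 frame tokens to 40 for the middle blocks, then recover 81.
x = torch.randn(2, 81, 256)
pruned = TokenPruningCluster(keep_ratio=0.5)(x)                       # (2, 40, 256)
recovered = TokenRecoveringAttention(dim=256, num_frames=81)(pruned)  # (2, 81, 256)
print(pruned.shape, recovered.shape)

In the actual framework, the intermediate transformer blocks would run on the pruned (B, k, C) tokens, which is where the FLOPs savings reported in the abstract come from.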
Pages: 604-613
Page count: 10