Transformer-based rapid human pose estimation network

Cited by: 6
Authors
Wang, Dong [1 ]
Xie, Wenjun [2 ,3 ]
Cai, Youcheng [1 ]
Li, Xinjie [1 ]
Liu, Xiaoping [1 ]
Affiliations
[1] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei 230009, Peoples R China
[2] Hefei Univ Technol, Sch Software, Hefei 230009, Peoples R China
[3] Hefei Univ Technol, Anhui Prov Key Lab Ind Safety & Emergency Technol, Hefei 230601, Peoples R China
Source
COMPUTERS & GRAPHICS-UK | 2023, Vol. 116
Keywords
Transformer architecture; Human pose estimation; Inference speed; Computational cost; ACTION RECOGNITION; SKELETON;
DOI
10.1016/j.cag.2023.09.001
CLC number
TP31 [Computer software];
Discipline codes
081202 ; 0835 ;
Abstract
Most current human pose estimation methods pursue high accuracy through large models with intensive computational requirements, resulting in slow inference. Such methods are difficult to adopt in real applications because of their high memory and computational costs. To achieve a trade-off between accuracy and efficiency, we propose TRPose, a Transformer-based network for rapid human pose estimation. TRPose seamlessly combines an early convolutional stage with a later Transformer stage. Concretely, the convolutional stage forms a Rapid Fusion Module (RFM), which efficiently acquires multi-scale features via three parallel convolution branches. The Transformer stage uses multi-resolution Transformers to build a Dual-scale Encoder Module (DEM), which learns long-range dependencies among the whole set of human skeletal keypoints from features at different scales. Experiments show that TRPose achieves 74.3 AP and 73.8 AP on the COCO validation and test-dev sets at 170+ FPS on a GTX 2080Ti, a better efficiency-effectiveness trade-off than most state-of-the-art methods. Our model also outperforms mainstream Transformer-based architectures on the MPII dataset, reaching an 89.9 PCK@0.5 score on the val set without extra data. (c) 2023 Elsevier Ltd. All rights reserved.
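The abstract describes the RFM as fusing multi-scale features from three parallel branches. As a rough illustration of that fuse-across-resolutions idea only, here is a minimal NumPy sketch: three branches produce features at full, half, and quarter resolution, and the coarser maps are upsampled and summed back into the full-resolution map. The function names (`rapid_fusion`, `avg_pool2d`, `upsample`) and the pooling-based branches are hypothetical stand-ins; the paper's actual RFM uses learned convolution branches, which this toy does not model.

```python
import numpy as np

def avg_pool2d(x, k):
    # Average pooling with stride k on a (C, H, W) array (assumes k divides H, W).
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))

def upsample(x, k):
    # Nearest-neighbour upsampling by factor k along H and W.
    return x.repeat(k, axis=1).repeat(k, axis=2)

def rapid_fusion(feat):
    """Toy analogue of a multi-scale fusion module: three parallel branches
    at 1x, 1/2x, and 1/4x resolution, with the coarser branches upsampled
    and summed into the full-resolution feature map."""
    b1 = feat                                  # full-resolution branch
    b2 = upsample(avg_pool2d(feat, 2), 2)      # 1/2-scale branch
    b3 = upsample(avg_pool2d(feat, 4), 4)      # 1/4-scale branch
    return b1 + b2 + b3

# A (channels, H, W) feature map with the 4:3 aspect common in pose inputs.
feat = np.random.rand(16, 64, 48)
fused = rapid_fusion(feat)
print(fused.shape)  # (16, 64, 48): fusion preserves the full resolution
```

The point of the sketch is only that fusing coarse and fine branches keeps the output at full resolution while mixing in context from larger receptive fields; the real module additionally learns the per-branch transformations.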
Pages: 317-326 (10 pages)