Transformer-based rapid human pose estimation network

Cited by: 6
Authors
Wang, Dong [1 ]
Xie, Wenjun [2 ,3 ]
Cai, Youcheng [1 ]
Li, Xinjie [1 ]
Liu, Xiaoping [1 ]
Affiliations
[1] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei 230009, Peoples R China
[2] Hefei Univ Technol, Sch Software, Hefei 230009, Peoples R China
[3] Hefei Univ Technol, Anhui Prov Key Lab Ind Safety & Emergency Technol, Hefei 230601, Peoples R China
Source
COMPUTERS & GRAPHICS-UK | 2023, Vol. 116
Keywords
Transformer architecture; Human pose estimation; Inference speed; Computational cost; ACTION RECOGNITION; SKELETON;
DOI
10.1016/j.cag.2023.09.001
CLC number
TP31 [Computer software];
Discipline codes
081202 ; 0835 ;
Abstract
Most current human pose estimation methods pursue high accuracy through large models with intensive computational requirements, resulting in slow inference. Such methods are difficult to adopt in real applications because of their high memory and computational costs. To achieve a trade-off between accuracy and efficiency, we propose TRPose, a Transformer-based network for rapid human pose estimation. TRPose seamlessly combines an early convolutional stage with a later Transformer stage. Concretely, the convolutional stage forms a Rapid Fusion Module (RFM), which efficiently acquires multi-scale features via three parallel convolution branches. The Transformer stage uses multi-resolution Transformers to build a Dual-scale Encoder Module (DEM), which learns long-range dependencies among the whole set of human skeletal keypoints from features at different scales. Experiments show that TRPose achieves 74.3 AP and 73.8 AP on the COCO validation and test-dev sets at 170+ FPS on a GTX 2080Ti, a better efficiency-effectiveness trade-off than most state-of-the-art methods. Our model also outperforms mainstream Transformer-based architectures on the MPII dataset, reaching an 89.9 PCK@0.5 score on the val set without extra data. (c) 2023 Elsevier Ltd. All rights reserved.
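The abstract describes the RFM as fusing multi-scale features from three parallel branches. As a rough illustration of that fuse-across-resolutions idea only, here is a minimal NumPy sketch: three branches produce features at full, half, and quarter resolution, and the coarser maps are upsampled and summed back into the full-resolution map. The function names (`rapid_fusion`, `avg_pool2d`, `upsample`) and the pooling-based branches are hypothetical stand-ins; the paper's actual RFM uses learned convolution branches, which this toy does not model.

```python
import numpy as np

def avg_pool2d(x, k):
    # Average pooling with stride k on a (C, H, W) array (assumes k divides H, W).
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))

def upsample(x, k):
    # Nearest-neighbour upsampling by factor k along H and W.
    return x.repeat(k, axis=1).repeat(k, axis=2)

def rapid_fusion(feat):
    """Toy analogue of a multi-scale fusion module: three parallel branches
    at 1x, 1/2x, and 1/4x resolution, with the coarser branches upsampled
    and summed into the full-resolution feature map."""
    b1 = feat                                  # full-resolution branch
    b2 = upsample(avg_pool2d(feat, 2), 2)      # 1/2-scale branch
    b3 = upsample(avg_pool2d(feat, 4), 4)      # 1/4-scale branch
    return b1 + b2 + b3

# A (channels, H, W) feature map with the 4:3 aspect common in pose inputs.
feat = np.random.rand(16, 64, 48)
fused = rapid_fusion(feat)
print(fused.shape)  # (16, 64, 48): fusion preserves the full resolution
```

The point of the sketch is only that fusing coarse and fine branches keeps the output at full resolution while mixing in context from larger receptive fields; the real module additionally learns the per-branch transformations.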
Pages: 317-326 (10 pages)