A simple and efficient channel MLP on token for human pose estimation

Cited by: 1

Authors:

Jianglong Huang [1]

Chaoqun Hong [1]

Rongsheng Xie [1]

Lang Ran [1]

Jialong Qian [1]

Affiliations:

[1] School of Computer and Information Engineering, Xiamen University of Technology, Xiamen

Source:

International Journal of Machine Learning and Cybernetics | 2025 / Vol. 16 / Issue 5

Funding:

National Natural Science Foundation of China;

Keywords:

Channel attention; Human pose estimation; Multilayer perceptron; Transformer;

DOI:

10.1007/s13042-024-02483-y

Abstract:

Human pose estimation is crucial to human-centered visual applications. Recently, transformer-based methods have achieved remarkable performance in human pose estimation. Transformers benefit from the self-attention mechanism, which calculates the correlation between keypoints and images. The multi-head attention mechanism extends this idea, allowing the model to extract features from different attention heads. However, as the number of attention heads increases, the model’s capacity to process channel information effectively becomes constrained. To overcome this limitation, a Channel MLP (CM) module is presented, which improves the performance of TokenPose. The CM module consists of a channel attention mechanism integrated with a Multilayer Perceptron (MLP) block. In this way the network evaluates the importance of each channel, producing output features that contain more comprehensive information. The CM module enhances TokenPose’s ability to extract information effectively. Our model achieves 75.2 AP on the COCO test-dev set and 90.4 PCKh@0.5 on the MPII validation set while keeping parameter count and computation similar to TokenPose. © The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024.
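
The abstract describes the CM module only as channel attention integrated with an MLP block and gives no layer-level detail. The following is a minimal NumPy sketch of one plausible reading, not the authors' implementation. Everything beyond "channel attention plus MLP" is an assumption made for illustration: the squeeze-and-excitation style gate (averaging over tokens), the `reduction` ratio, the GELU activation, the residual connection, and the hypothetical names `ChannelMLP` and `channel_gate`.

```python
import numpy as np


def gelu(x):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class ChannelMLP:
    """Hypothetical sketch: channel gate followed by a token-wise MLP."""

    def __init__(self, dim, hidden_dim, reduction=4, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.02
        # Channel-attention weights (assumed squeeze/excite bottleneck)
        self.wa1 = rng.normal(0.0, scale, (dim, dim // reduction))
        self.wa2 = rng.normal(0.0, scale, (dim // reduction, dim))
        # Two-layer MLP weights and biases
        self.w1 = rng.normal(0.0, scale, (dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, scale, (hidden_dim, dim))
        self.b2 = np.zeros(dim)

    def channel_gate(self, x):
        # x has shape (batch, tokens, dim); average over tokens -> (batch, dim)
        squeezed = x.mean(axis=1)
        # ReLU bottleneck, then sigmoid: one weight in (0, 1) per channel
        gate = sigmoid(np.maximum(squeezed @ self.wa1, 0.0) @ self.wa2)
        return gate[:, None, :]  # broadcastable over the token axis

    def __call__(self, x):
        gated = x * self.channel_gate(x)          # reweight channels
        hidden = gelu(gated @ self.w1 + self.b1)  # expand
        return x + hidden @ self.w2 + self.b2     # project back, add residual


# Usage: 2 samples, 5 tokens, 8 channels (arbitrary illustrative sizes)
tokens = np.random.default_rng(1).normal(size=(2, 5, 8))
cm = ChannelMLP(dim=8, hidden_dim=16)
out = cm(tokens)
print(out.shape)
```

The gate yields one scalar per channel and sample, so the MLP sees features already reweighted by estimated channel importance; the output keeps the input shape, which is what would let such a block stand in for a plain MLP in a transformer layer.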

Pages: 3809-3817

Page count: 8
