A simple and efficient channel MLP on token for human pose estimation

Cited by: 1

Authors:

Jianglong Huang [1]

Chaoqun Hong [1]

Rongsheng Xie [1]

Lang Ran [1]

Jialong Qian [1]

Affiliation:

[1] School of Computer and Information Engineering, Xiamen University of Technology, Xiamen

Source:

International Journal of Machine Learning and Cybernetics | 2025 / Vol. 16 / No. 5

Funding:

National Natural Science Foundation of China

Keywords:

Channel attention; Human pose estimation; Multilayer perceptron; Transformer

DOI:

10.1007/s13042-024-02483-y


Abstract:

Human pose estimation is crucial to human-centered visual applications. Recently, transformer-based methods have achieved remarkable performance in human pose estimation. Transformers benefit from the self-attention mechanism, which computes the correlations between keypoint tokens and image features. The multi-head attention mechanism extends this idea further, allowing the model to extract features through different attention heads. However, as the number of attention heads increases, the model's capacity to process channel information effectively becomes constrained. To overcome this limitation, a Channel MLP (CM) module is presented, which improves the performance of TokenPose. The CM module integrates a channel attention mechanism with a Multilayer Perceptron (MLP) block. In this way, the network evaluates the importance of each channel, so that the output features contain more comprehensive information. The CM module thereby enhances TokenPose's ability to extract information effectively. Our model achieves 75.2 AP on the COCO test-dev set and 90.4 PCKh@0.5 on the MPII validation set while keeping a parameter count and computational cost similar to those of TokenPose. © The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024.
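The abstract describes the CM module only at a high level: a channel attention mechanism combined with an MLP block that weighs the importance of each channel. The NumPy sketch below illustrates that idea under explicit assumptions and is not the authors' implementation. It assumes a squeeze-and-excitation style gate (average each channel over all tokens, pass the result through a bottleneck with reduction ratio r, apply a sigmoid) that rescales the output of a standard two-layer token MLP. The function name `channel_mlp`, the weight names, the GELU activation, and the placement of the gate after the MLP are all hypothetical choices made for illustration.

```python
import numpy as np


def gelu(x):
    # tanh approximation of the GELU activation commonly used in transformer MLPs
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_mlp(tokens, w1, b1, w2, b2, wa1, wa2):
    """Hypothetical CM block: token-wise MLP followed by SE-style channel gating.

    tokens: (N, C) array of N tokens (e.g. keypoint and visual tokens), C channels.
    w1: (C, H), b1: (H,), w2: (H, C), b2: (C,): assumed two-layer MLP weights.
    wa1: (C, C // r), wa2: (C // r, C): assumed channel-attention bottleneck weights.
    """
    # 1) standard two-layer MLP applied to every token independently
    out = gelu(tokens @ w1 + b1) @ w2 + b2       # (N, C)
    # 2) squeeze: average each channel over all tokens to get a channel descriptor
    desc = out.mean(axis=0)                      # (C,)
    # 3) excitation: ReLU bottleneck + sigmoid gives one weight in (0, 1) per channel
    weights = sigmoid(np.maximum(desc @ wa1, 0.0) @ wa2)  # (C,)
    # 4) reweight channels; broadcasting applies the same weight to every token
    return out * weights


# Usage with hypothetical toy sizes (not the configuration used in the paper)
rng = np.random.default_rng(0)
N, C, H, r = 20, 8, 16, 2
x = rng.standard_normal((N, C))
w1, b1 = rng.standard_normal((C, H)) * 0.1, np.zeros(H)
w2, b2 = rng.standard_normal((H, C)) * 0.1, np.zeros(C)
wa1, wa2 = rng.standard_normal((C, C // r)), rng.standard_normal((C // r, C))
y = channel_mlp(x, w1, b1, w2, b2, wa1, wa2)
print(y.shape)  # (20, 8)
```

Because the gate is computed from statistics over all tokens, each channel is rescaled by a single factor in (0, 1) that is shared by every token. Where the residual connection and layer normalization sit relative to this gate is not stated in the abstract, so they are left out of the sketch.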


Pages: 3809-3817

Number of pages: 8

References: 48
