A simple and efficient channel MLP on token for human pose estimation

Cited by: 1
Authors
Jianglong Huang [1]
Chaoqun Hong [1]
Rongsheng Xie [1]
Lang Ran [1]
Jialong Qian [1]
Affiliations
[1] School of Computer and Information Engineering, Xiamen University of Technology, Xiamen
Funding
National Natural Science Foundation of China
Keywords
Channel attention; Human pose estimation; Multilayer perceptron; Transformer
DOI
10.1007/s13042-024-02483-y
Abstract
Human pose estimation is crucial to human-centered visual applications. Recently, transformer-based methods have achieved remarkable performance on this task. Transformers benefit from the self-attention mechanism, which computes the correlation between keypoints and images. Multi-head attention extends this idea, allowing the model to extract features from different attention heads. However, as the number of attention heads increases, the model's capacity to process channel information effectively becomes constrained. To overcome this limitation, a Channel MLP (CM) module is presented, which improves the performance of TokenPose. The CM module consists of a channel attention mechanism integrated with a Multilayer Perceptron (MLP) block. In this way the network evaluates the importance of each channel, so the output features contain more comprehensive information and TokenPose extracts information more effectively. Our model achieves 75.2 AP on the COCO test-dev set and 90.4 PCKh@0.5 on the MPII validation set, with a parameter count and computational cost similar to those of TokenPose. © The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2024.
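The abstract describes the CM module only as channel attention combined with an MLP block, so the following is a minimal sketch under stated assumptions rather than the authors' implementation. It assumes a squeeze-and-excitation style gate (average each channel over tokens, then two small fully connected layers and a sigmoid) followed by a per-token two-layer MLP with GELU and a residual connection. All function names, the reduction ratio, the expansion factor, and the random weights are hypothetical.

```python
import math
import random


def matvec(weights, vec):
    """Multiply a (rows x cols) weight matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gelu(x):
    # Exact GELU written with the Gaussian error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def random_matrix(rows, cols, rng, scale=0.1):
    # Stand-in for learned weights; values are arbitrary.
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def channel_weights(tokens, w_down, w_up):
    """Assumed SE-style gate: per-channel mean over tokens, FC, ReLU, FC, sigmoid."""
    num_tokens = len(tokens)
    squeezed = [sum(col) / num_tokens for col in zip(*tokens)]
    hidden = [max(0.0, h) for h in matvec(w_down, squeezed)]
    return [sigmoid(z) for z in matvec(w_up, hidden)]


def channel_mlp(tokens, w_down, w_up, w_fc1, w_fc2):
    """Hypothetical CM-style block: gate the channels, then apply a per-token MLP."""
    gate = channel_weights(tokens, w_down, w_up)
    out = []
    for token in tokens:
        gated = [g * t for g, t in zip(gate, token)]
        hidden = [gelu(h) for h in matvec(w_fc1, gated)]
        mlp_out = matvec(w_fc2, hidden)
        # Residual connection around the gated MLP (an assumption of this sketch).
        out.append([t + m for t, m in zip(token, mlp_out)])
    return out


rng = random.Random(0)
num_tokens, channels, reduction, expansion = 5, 8, 4, 2
tokens = [[rng.uniform(-1, 1) for _ in range(channels)] for _ in range(num_tokens)]
w_down = random_matrix(channels // reduction, channels, rng)
w_up = random_matrix(channels, channels // reduction, rng)
w_fc1 = random_matrix(channels * expansion, channels, rng)
w_fc2 = random_matrix(channels, channels * expansion, rng)

gate = channel_weights(tokens, w_down, w_up)
output = channel_mlp(tokens, w_down, w_up, w_fc1, w_fc2)
```

Here `gate` holds one importance weight in (0, 1) per channel, and `output` keeps the input shape (tokens by channels), so a block of this kind could in principle replace a plain MLP block without changing the surrounding token dimensions.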
Pages: 3809-3817
Page count: 8