YOLO-Rlepose: Improved YOLO Based on Swin Transformer and Rle-Oks Loss for Multi-Person Pose Estimation

Cited by: 5
Authors
Jiang, Yi [1 ]
Yang, Kexin [1 ]
Zhu, Jinlin [1 ]
Qin, Li [2 ]
Affiliations
[1] Harbin Univ Sci & Technol, Dept Commun Engn, Harbin 150080, Peoples R China
[2] Harbin Univ Sci & Technol, Dept Engn Mech, Harbin 150080, Peoples R China
Keywords
human pose estimation; deep learning; convolutional neural network; transformer
DOI
10.3390/electronics13030563
CLC Number (Chinese Library Classification)
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
In recent years, there has been significant progress in human pose estimation, fueled by the widespread adoption of deep convolutional neural networks. Despite these advancements, multi-person 2D pose estimation remains highly challenging due to factors such as occlusion, noise, and non-rigid body movements. Currently, most multi-person pose estimation approaches handle joint localization and association separately. This study proposes a direct regression-based method that estimates 2D human poses from a single image; the authors name this network YOLO-Rlepose. Compared to traditional methods, YOLO-Rlepose leverages Transformer models to better capture global dependencies between image feature blocks and, through a multi-head self-attention mechanism, preserves sufficient spatial information for keypoint detection. To further improve the accuracy of YOLO-Rlepose, this paper proposes the following enhancements. First, this study introduces the C3 module with Swin Transformer (C3STR). This module builds upon the C3 module in You Only Look Once (YOLO) by incorporating a Swin Transformer branch, enhancing the model's ability to capture global and rich contextual information. Second, a novel loss function named Rle-Oks loss is proposed. It facilitates training by learning the underlying output distribution through residual log-likelihood estimation. To assign different weights according to the importance of different keypoints in the human body, this study introduces a weight coefficient into the loss function. Experiments demonstrate the effectiveness of the proposed YOLO-Rlepose model: on the COCO dataset, it outperforms the previous state-of-the-art method by 2.11% in AP.
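To make the Rle-Oks idea (a residual-log-likelihood regression loss with per-keypoint importance weights) concrete, the sketch below shows one possible form of such a loss. It is not the paper's implementation: it keeps only a simple Laplace prior and omits the learned normalizing-flow component of full residual log-likelihood estimation, and the illustrative keypoint weights are derived from the standard COCO OKS sigmas. The class name WeightedRleStyleLoss and all tensor shapes are assumptions made for this example.

```python
import torch
import torch.nn as nn

# COCO per-keypoint falloff constants (sigmas) used by the OKS metric;
# a smaller sigma means a stricter keypoint, so its error is weighted more.
COCO_SIGMAS = torch.tensor([
    0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072, 0.072,
    0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089
])


class WeightedRleStyleLoss(nn.Module):
    """Hypothetical sketch of an RLE-style keypoint regression loss with
    per-keypoint importance weights (not the paper's exact Rle-Oks loss).

    Each keypoint is regressed as (mu, sigma); under a Laplace density the
    negative log-likelihood is |x - mu| / sigma + log(sigma). Full RLE would
    additionally model the residual distribution with a learned flow, which
    is omitted here for brevity.
    """

    def __init__(self, sigmas: torch.Tensor = COCO_SIGMAS):
        super().__init__()
        # Importance weight: inverse of the OKS sigma, normalized to mean 1.
        w = 1.0 / sigmas
        self.register_buffer("kpt_weight", w / w.mean())

    def forward(self, pred_mu, pred_sigma, target, visible):
        # pred_mu, pred_sigma, target: (B, K, 2); visible: (B, K) float mask.
        pred_sigma = pred_sigma.clamp(min=1e-4)
        nll = (target - pred_mu).abs() / pred_sigma + pred_sigma.log()
        nll = nll.sum(dim=-1)                   # (B, K) per-keypoint NLL
        nll = nll * self.kpt_weight * visible   # weight and mask keypoints
        return nll.sum() / visible.sum().clamp(min=1.0)
```

Usage follows the usual pattern, e.g. loss = WeightedRleStyleLoss()(pred_mu, pred_sigma, target, visible); the paper's actual Rle-Oks loss combines the residual likelihood term and the keypoint weight coefficient jointly, which this sketch only approximates.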
Pages: 16