Mobile-friendly and multi-feature aggregation via transformer for human pose estimation

Cited by: 0

Authors:

Li, Biao [1,2]

Tang, Shoufeng [1]

Li, Wenyi [1,2]

Affiliations:

[1] China Univ Min & Technol, Sch Informat & Control Engn, Xuzhou 221116, Peoples R China

[2] Suzhou Univ, Sch Mech & Elect Engn, Suzhou 234000, Peoples R China

Source:

IMAGE AND VISION COMPUTING | 2025 / Volume 153

Keywords:

Human pose estimation; Lightweight network; Multi-feature aggregation; Hybrid architecture;

DOI:

10.1016/j.imavis.2024.105343

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Human pose estimation is pivotal for human-centric visual tasks, yet deploying such models on mobile devices remains challenging because of their high parameter counts and computational demands. In this paper, we study mobile-friendly and multi-feature aggregation architectural designs for human pose estimation and propose a novel model called MobileMultiPose. Specifically, a lightweight aggregation method that incorporates multi-scale and multi-feature information mitigates redundant shallow semantic extraction and local deep semantic constraints. To efficiently aggregate diverse local and global features, we design a lightweight transformer module built on a self-attention mechanism with linear complexity, which achieves deep fusion of shallow and deep semantics. Furthermore, a multi-scale loss supervision method is incorporated into the training process to enhance model performance by facilitating the effective fusion of edge information across scales. Extensive experiments show that the smallest variant of MobileMultiPose outperforms the lightweight models MobileNetv2, ShuffleNetv2, and Small HRNet by 0.7, 5.4, and 10.1 points, respectively, on the COCO validation set, with fewer parameters and FLOPs. The largest MobileMultiPose variant achieves an AP score of 72.4 on the COCO test-dev set. Notably, its parameters and FLOPs are only 16% and 18% of those of HRNet-W32, and 7% and 9% of those of DARK, respectively. We aim to offer novel insights into designing lightweight and efficient feature extraction networks, supporting mobile-friendly model deployment.
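The abstract mentions a self-attention mechanism with linear complexity but does not say which formulation MobileMultiPose uses. As an illustration only, the sketch below shows one generic kernel-based form of linear attention, in which a positive feature map (here elu(x) + 1, an assumption) replaces the softmax so that the key-value summary is computed once and reused for every query. The function names `feature_map` and `linear_attention` are hypothetical and are not taken from the paper.

```python
import math


def feature_map(x):
    # elu(x) + 1, a strictly positive feature map. This choice is an
    # assumption from generic linear attention, not the paper's design.
    return x + 1.0 if x > 0 else math.exp(x)


def linear_attention(Q, K, V):
    """Kernel-based linear attention on plain Python lists.

    Q: n_q x d queries, K: n_k x d keys, V: n_k x m values.
    Cost grows linearly with the number of tokens, because the
    d x m key-value summary is built once and shared by all queries.
    """
    d, m = len(Q[0]), len(V[0])
    phi_q = [[feature_map(x) for x in row] for row in Q]
    phi_k = [[feature_map(x) for x in row] for row in K]

    # Accumulate sum_j phi(k_j) v_j^T (d x m) and sum_j phi(k_j) (d).
    kv = [[0.0] * m for _ in range(d)]
    k_sum = [0.0] * d
    for j in range(len(K)):
        for a in range(d):
            k_sum[a] += phi_k[j][a]
            for b in range(m):
                kv[a][b] += phi_k[j][a] * V[j][b]

    # Each query reads the shared summary and normalises its weights.
    out = []
    for q in phi_q:
        norm = sum(q[a] * k_sum[a] for a in range(d))
        out.append([sum(q[a] * kv[a][b] for a in range(d)) / norm
                    for b in range(m)])
    return out
```

Because the feature map is positive, each output row is a weighted average of the rows of V, the same normalisation property that softmax attention has. The cost of this sketch scales as O(n d m) rather than O(n^2), which is the general reason such modules suit mobile deployment.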


Pages: 12

References: 75 in total

[1] 2D Human Pose Estimation: New Benchmark and State of the Art Analysis [J].

Andriluka, Mykhaylo; Pishchulin, Leonid; Gehler, Peter; Schiele, Bernt.

2014 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2014: 3686-3693

[2]

Akahori W., Hirai T., Morishima S., 2017, Dynamic subtitle placement considering the region of interest and speaker location. In: Imai F., Tremeau A., Braz J. (Eds.), VISIGRAPP 2017 - Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (conference date: 27-02-2017 through 01-03-2017), SciTePress, 102-109

[7] Learning Delicate Local Representations for Multi-person Pose Estimation [J].

Cai, Yuanhao; Wang, Zhicheng; Luo, Zhengxiong; Yin, Binyi; Du, Angang; Wang, Haoqian; Zhang, Xiangyu; Zhou, Xinyu; Zhou, Erjin; Sun, Jian.

COMPUTER VISION - ECCV 2020, PT III, 2020, 12348: 455-472

[8] Human Pose Estimation with Iterative Error Feedback [J].

Carreira, Joao; Agrawal, Pulkit; Fragkiadaki, Katerina; Malik, Jitendra.

2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016: 4733-4742

[9]

Chen K., 2023, RTMPose: Real-time multi-person pose estimation based on MMPose

[10] Cascaded Pyramid Network for Multi-Person Pose Estimation [J].

Chen, Yilun; Wang, Zhicheng; Peng, Yuxiang; Zhang, Zhiqiang; Yu, Gang; Sun, Jian.

2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018: 7103-7112
