Mobile-friendly and multi-feature aggregation via transformer for human pose estimation

Cited by: 0
Authors
Li, Biao [1, 2]
Tang, Shoufeng [1]
Li, Wenyi [1, 2]
Affiliations
[1] China Univ Min & Technol, Sch Informat & Control Engn, Xuzhou 221116, Peoples R China
[2] Suzhou Univ, Sch Mech & Elect Engn, Suzhou 234000, Peoples R China
Keywords
Human pose estimation; Lightweight network; Multi-feature aggregation; Hybrid architecture
DOI
10.1016/j.imavis.2024.105343
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Human pose estimation is pivotal for human-centric visual tasks, yet deploying such models on mobile devices remains challenging due to high parameter counts and computational demands. In this paper, we study Mobile-Friendly and Multi-Feature Aggregation architectural designs for human pose estimation and propose a novel model called MobileMultiPose. Specifically, a lightweight aggregation method that incorporates multi-scale and multi-feature information mitigates redundant shallow semantic extraction and local deep semantic constraints. To efficiently aggregate diverse local and global features, a lightweight transformer module, built on a self-attention mechanism with linear complexity, is designed to achieve deep fusion of shallow and deep semantics. Furthermore, a multi-scale loss supervision method is incorporated into the training process to enhance model performance, facilitating the effective fusion of edge information across various scales. Extensive experiments show that the smallest variant of MobileMultiPose outperforms lightweight models (MobileNetv2, ShuffleNetv2, and Small HRNet) by 0.7, 5.4, and 10.1 points, respectively, on the COCO validation set, with fewer parameters and FLOPs. The largest MobileMultiPose variant achieves an AP score of 72.4 on the COCO test-dev set. Notably, its parameters and FLOPs are only 16% and 18% of those of HRNet-W32, and 7% and 9% of those of DARK, respectively. We aim to offer novel insights into designing lightweight and efficient feature extraction networks, supporting mobile-friendly model deployment.
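The abstract mentions a self-attention mechanism with linear complexity but does not specify its form. As a rough illustration only, the sketch below shows one generic way such a mechanism can be built: kernelized linear attention with an `elu(x) + 1` feature map. This is an assumption chosen for illustration, not the MobileMultiPose module; the function names (`feature_map`, `linear_attention`, `quadratic_reference`) and the tensor shapes are hypothetical.

```python
import numpy as np


def feature_map(x):
    # elu(x) + 1: a strictly positive feature map (an illustrative choice,
    # not taken from the paper).
    return np.where(x > 0, x + 1.0, np.exp(x))


def linear_attention(q, k, v):
    """Kernelized attention with cost linear in the number of tokens n.

    q, k: arrays of shape (n, d); v: array of shape (n, d_v).
    """
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v               # (d, d_v): summarize keys and values once
    z = q @ k.sum(axis=0)      # (n,): per-query normalizer
    return (q @ kv) / z[:, None]


def quadratic_reference(q, k, v):
    # Same result computed through the explicit (n, n) attention matrix,
    # included only to check the linear form.
    q, k = feature_map(q), feature_map(k)
    w = q @ k.T
    w = w / w.sum(axis=1, keepdims=True)
    return w @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(6, 4))
    k = rng.normal(size=(6, 4))
    v = rng.normal(size=(6, 3))
    out = linear_attention(q, k, v)
    print(out.shape, np.allclose(out, quadratic_reference(q, k, v)))
```

The linear form never materializes the n-by-n attention matrix: keys and values are summarized into a d-by-d_v matrix first, so the cost grows with the token count n rather than with n squared, which is the property that makes this family of mechanisms attractive on mobile hardware.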
Pages: 12