UViT: Efficient and lightweight U-shaped hybrid vision transformer for human pose estimation

Cited: 0
Authors
Li B. [1 ,2 ]
Tang S. [1 ]
Li W. [1 ,2 ]
Affiliations
[1] School of Information and Control Engineering, China University of Mining and Technology, Xuzhou
[2] School of Mechanical and Electronic Engineering, Suzhou University, Suzhou
Keywords
attention mechanism; context enhancement; lightweight network; multi-branch structure; pose estimation
DOI
10.3233/JIFS-231440
Abstract
Pose estimation plays a crucial role in human-centered vision applications and has advanced significantly in recent years. However, prevailing approaches rely on highly complex architectures to obtain high scores on benchmark datasets, which hampers deployment on edge devices. In this study, the problem of efficient and lightweight human pose estimation is investigated. The context enhancement module of the U-shaped structure is improved to strengthen multi-scale local modeling, and a lightweight transformer block is designed to enhance both local feature extraction and global modeling. On this basis, a lightweight pose estimation network, the U-shaped Hybrid Vision Transformer (UViT), is developed. The smallest variant, UViT-T, achieves a 3.9% higher AP on the COCO validation set than MobileNetV2, the best-performing member of the MobileNet series, while using fewer parameters and less computation. With a 384×288 input, UViT-T reaches an AP of 70.2 on the COCO test-dev set with only 1.52 M parameters and 2.32 GFLOPs, and its inference speed is approximately twice that of general-purpose networks. This study offers an efficient and lightweight design approach for human pose estimation and provides theoretical support for deployment on edge devices. © 2024 IOS Press. All rights reserved.
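As a rough illustration of the hybrid design described in the abstract (a multi-scale local convolutional branch paired with a lightweight attention branch for global modeling), the following PyTorch sketch is provided. It is not the authors' implementation; the module names (LocalContextBranch, GlobalAttentionBranch, LightweightHybridBlock), channel widths, kernel sizes, and fusion scheme are illustrative assumptions only.

# Minimal, hypothetical sketch of a lightweight hybrid block combining
# local multi-scale convolution with single-head global self-attention.
# All design details here are assumptions, not the UViT implementation.
import torch
import torch.nn as nn


class LocalContextBranch(nn.Module):
    """Multi-scale local modeling via parallel depthwise convolutions."""
    def __init__(self, channels: int):
        super().__init__()
        self.dw3 = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.dw5 = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        # Concatenate the two receptive-field scales, then fuse with a 1x1 conv.
        return self.fuse(torch.cat([self.dw3(x), self.dw5(x)], dim=1))


class GlobalAttentionBranch(nn.Module):
    """Global modeling via single-head self-attention over spatial tokens."""
    def __init__(self, channels: int):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads=1, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)      # (B, H*W, C)
        tokens = self.norm(tokens)
        out, _ = self.attn(tokens, tokens, tokens)
        return out.transpose(1, 2).reshape(b, c, h, w)


class LightweightHybridBlock(nn.Module):
    """Sum the local and global branches with a residual connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.local = LocalContextBranch(channels)
        self.global_ = GlobalAttentionBranch(channels)

    def forward(self, x):
        return x + self.local(x) + self.global_(x)


if __name__ == "__main__":
    # Example: a 48-channel feature map at 1/4 resolution of a 256x192 input.
    feats = torch.randn(1, 48, 64, 48)
    block = LightweightHybridBlock(48)
    print(block(feats).shape)  # torch.Size([1, 48, 64, 48])

The sketch keeps the convolutional branch depthwise-separable and the attention single-headed to stay in the spirit of a parameter- and FLOP-constrained design; the paper itself should be consulted for the actual block structure.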
Pages: 8345-8359
Number of pages: 14