GTIGNet: Global Topology Interaction Graphormer Network for 3D hand pose estimation

Cited: 0
Authors
Liu, Yanjun [1 ]
Fan, Wanshu [1 ]
Wang, Cong [2 ]
Wen, Shixi [3 ]
Yang, Xin [4 ]
Zhang, Qiang [1 ,4 ]
Wei, Xiaopeng [4 ]
Zhou, Dongsheng [1 ,4 ]
Affiliations
[1] Dalian Univ, Sch Software Engn, Natl & Local Joint Engn Lab Comp Aided Design, Dalian, Peoples R China
[2] Ctr Adv Reliabil & Safety CAiRS, Hong Kong, Peoples R China
[3] Dalian Univ, Sch Informat Engn, Dalian, Peoples R China
[4] Dalian Univ Technol, Sch Comp Sci & Technol, Dalian, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
3D hand pose estimation; Transformer; GCN; Topology; 3D computer vision; SIGN-LANGUAGE RECOGNITION;
DOI
10.1016/j.neunet.2025.107221
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Estimating 3D hand poses from monocular RGB images presents a series of challenges, including complex hand structures, self-occlusions, and depth ambiguities. Existing methods often fall short of capturing the long-distance dependencies of both skeletal and non-skeletal connections between hand joints. To address these limitations, we introduce the Global Topology Interaction Graphormer Network (GTIGNet), a novel deep learning architecture designed to improve 3D hand pose estimation. Our model incorporates a Context-Aware Attention Block (CAAB) within the 2D pose estimator to enhance the extraction of multi-scale features, yielding more accurate 2D joint heatmaps to support the subsequent 3D estimation stage. Additionally, we introduce a High-Order Graphormer that explicitly and implicitly models the topological structure of hand joints, thereby enhancing feature interaction. Ablation studies confirm the effectiveness of our approach, and experimental results on four challenging datasets, the Rendered Hand Dataset (RHD), the Stereo Hand Pose Benchmark (STB), the First-Person Hand Action Benchmark (FPHA), and the FreiHAND dataset, indicate that GTIGNet achieves state-of-the-art performance in 3D hand pose estimation. Notably, our model achieves a Mean Per Joint Position Error (MPJPE) of 9.98 mm on RHD, 6.12 mm on STB, 11.15 mm on FPHA, and 10.97 mm on FreiHAND.
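The abstract attributes the gains to a Graphormer that encodes hand-joint topology both explicitly and implicitly, and reports accuracy as MPJPE. The PyTorch sketch below is a rough, hypothetical illustration of those two ideas, not the paper's actual implementation (whose details are not given in this record): attention logits over the standard 21 hand joints receive an explicit bias indexed by skeletal hop distance plus a free, learned bias for non-skeletal dependencies, and the standard MPJPE definition is included for reference. All module and parameter names (`TopologyBiasedAttention`, `HAND_BONES`, `hop_bias`, `implicit_bias`) are illustrative assumptions.

```python
# Hypothetical sketch of Graphormer-style attention with a hand-topology bias.
import torch
import torch.nn as nn

# 21-joint hand skeleton as (parent, child) bone pairs, MANO-style ordering.
HAND_BONES = [
    (0, 1), (1, 2), (2, 3), (3, 4),        # thumb
    (0, 5), (5, 6), (6, 7), (7, 8),        # index
    (0, 9), (9, 10), (10, 11), (11, 12),   # middle
    (0, 13), (13, 14), (14, 15), (15, 16), # ring
    (0, 17), (17, 18), (18, 19), (19, 20), # little
]

def hop_distance_matrix(num_joints: int = 21) -> torch.Tensor:
    """Shortest-path (hop) distance between joints on the skeleton graph,
    computed with Floyd-Warshall."""
    d = torch.full((num_joints, num_joints), float("inf"))
    d.fill_diagonal_(0.0)
    for i, j in HAND_BONES:
        d[i, j] = d[j, i] = 1.0
    for k in range(num_joints):
        d = torch.minimum(d, d[:, k:k + 1] + d[k:k + 1, :])
    return d.long()

class TopologyBiasedAttention(nn.Module):
    """Single-head self-attention whose logits receive (a) an explicit bias
    indexed by skeletal hop distance and (b) an implicit, fully learned
    joint-to-joint bias for non-skeletal dependencies."""
    def __init__(self, dim: int, num_joints: int = 21, max_hops: int = 8):
        super().__init__()
        self.scale = dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One learnable scalar per hop distance (explicit skeletal topology).
        self.hop_bias = nn.Embedding(max_hops + 1, 1)
        # Free-form learned bias (implicit, non-skeletal interactions).
        self.implicit_bias = nn.Parameter(torch.zeros(num_joints, num_joints))
        self.register_buffer(
            "hops", hop_distance_matrix(num_joints).clamp(max=max_hops))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 21, dim) per-joint features.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = (q @ k.transpose(-2, -1)) * self.scale
        logits = logits + self.hop_bias(self.hops).squeeze(-1) + self.implicit_bias
        return self.proj(logits.softmax(dim=-1) @ v)

def mpjpe(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Mean Per Joint Position Error: mean Euclidean distance (e.g., in mm)
    between predicted and ground-truth joints. Shapes: (..., J, 3)."""
    return (pred - gt).norm(dim=-1).mean()

# Usage: refine 21 joint embeddings of width 64.
feats = torch.randn(2, 21, 64)
print(TopologyBiasedAttention(64)(feats).shape)  # torch.Size([2, 21, 64])
```

The hop-distance bias lets every joint attend to every other in one layer while still injecting the skeleton's structure, which is one plausible reading of "explicitly and implicitly models the topological structure" in the abstract.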
Pages: 14