Spatial-Temporal Graph Transformer for Skeleton-Based Sign Language Recognition

Cited by: 0
Authors
Xiao, Zhengye [1 ]
Lin, Shiquan [1 ]
Wan, Xiuan [1 ]
Fang, Yuchun [1 ]
Ni, Lan [2 ]
Affiliations
[1] Shanghai Univ, Sch Comp Engn & Sci, Shanghai, Peoples R China
[2] Shanghai Univ, Coll Liberal Arts, Shanghai, Peoples R China
Source
NEURAL INFORMATION PROCESSING, ICONIP 2022, PT VI | 2023, Vol. 1793
Funding
Natural Science Foundation of Shanghai; National Natural Science Foundation of China;
Keywords
Continuous sign language recognition; Transformer; Graph neural network;
DOI
10.1007/978-981-99-1645-0_12
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
In continuous sign language recognition (CSLR), the skeleton sequence is insensitive to environmental variation and has therefore attracted much attention. Previous studies mainly apply hand-crafted features or spatial-temporal graph convolutional networks to the skeleton modality, neglecting the importance of capturing information between distant nodes and long-term context in CSLR. To learn more robust spatial-temporal features for CSLR, we propose a Spatial-Temporal Graph Transformer (STGT) model for skeleton-based CSLR. With the self-attention mechanism, the human skeleton graph is treated as a fully connected graph, so relationships between distant nodes can be established directly in the spatial dimension. In the temporal dimension, long-term context can be learned easily owing to the nature of the transformer. Moreover, we propose a graph positional embedding and graph multi-head self-attention to help the STGT distinguish the meanings of different nodes. We conduct an ablation study on an action recognition dataset to validate the effectiveness of our method and analyze its advantages. Experimental results on two CSLR datasets demonstrate the superiority of the STGT for skeleton-based CSLR.
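The graph multi-head self-attention and graph positional embedding described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the class name GraphMultiHeadSelfAttention and the hyperparameters (27 joints, 64-dimensional features, 4 heads) are hypothetical choices made only to show how a learnable per-joint embedding plus full self-attention lets every joint attend to every other joint within a frame.

# Minimal sketch (assumed, not from the paper) of spatial graph multi-head
# self-attention over skeleton joints, with a learnable graph positional
# embedding so that structurally different joints remain distinguishable.
import torch
import torch.nn as nn

class GraphMultiHeadSelfAttention(nn.Module):
    def __init__(self, num_joints: int = 27, dim: int = 64, heads: int = 4):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.head_dim = heads, dim // heads
        # Graph positional embedding: one learnable vector per skeleton joint.
        self.graph_pos = nn.Parameter(torch.randn(1, num_joints, dim) * 0.02)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch*frames, num_joints, dim); attention is spatial, within a
        # frame, so the skeleton is treated as a fully connected graph.
        b, n, d = x.shape
        x = x + self.graph_pos                      # inject joint identity
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, n, self.heads, self.head_dim).transpose(1, 2)
        k = k.view(b, n, self.heads, self.head_dim).transpose(1, 2)
        v = v.view(b, n, self.heads, self.head_dim).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        out = attn.softmax(dim=-1) @ v              # links distant joints directly
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.proj(out)

# Usage: 2 clips of 16 frames flattened into the batch dimension.
feats = torch.randn(2 * 16, 27, 64)
print(GraphMultiHeadSelfAttention()(feats).shape)   # torch.Size([32, 27, 64])

Temporal modelling in the paper is handled by the transformer along the frame axis; the sketch above covers only the spatial half, under the stated assumptions.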
Pages: 137-149
Number of pages: 13