Self-Supervised Representation Learning With Spatial-Temporal Consistency for Sign Language Recognition

Cited: 0
Authors
Zhao, Weichao [1 ]
Zhou, Wengang [1 ]
Hu, Hezhen [2 ]
Wang, Min [3 ]
Li, Houqiang [1 ]
Affiliations
[1] Univ Sci & Technol China, MoE Key Lab Brain Inspired Intelligent Percept & C, Hefei 230027, Peoples R China
[2] Univ Texas Austin, Visual Informat Grp, Austin, TX 78705 USA
[3] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230027, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Sign language; Task analysis; Semantics; Representation learning; Knowledge transfer; Feature extraction; Skeleton; Sign language recognition; skeleton-based; self-supervised learning; contrastive learning;
DOI
10.1109/TIP.2024.3416881
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, there have been efforts to improve performance in sign language recognition by designing self-supervised learning methods. However, these methods capture limited information from sign pose data in a frame-wise learning manner, leading to sub-optimal solutions. To this end, we propose a simple yet effective self-supervised contrastive learning framework that excavates rich context via spatial-temporal consistency from two distinct perspectives and learns instance-discriminative representations for sign language recognition. On one hand, since the semantics of sign language are expressed by the cooperation of fine-grained hands and coarse-grained trunks, we utilize information at both granularities and encode it into separate latent spaces. The consistency between hand and trunk features is constrained to encourage learning consistent representations of instance samples. On the other hand, inspired by the complementary property of the motion and joint modalities, we are the first to introduce first-order motion information into sign language modeling. Additionally, we further bridge the interaction between the embedding spaces of the two modalities, facilitating bidirectional knowledge transfer to enhance sign language representation. Our method is evaluated with extensive experiments on four public benchmarks and achieves new state-of-the-art performance by a notable margin. The source code is publicly available at https://github.com/sakura/Code.
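The abstract describes two ingredients that can be sketched concretely: first-order motion as the frame-wise temporal difference of joint coordinates, and a contrastive consistency objective that pulls together embeddings of the same instance from two streams (e.g., hand and trunk). The sketch below is an illustration only, assuming an InfoNCE-style loss and generic array shapes; the function names, the temperature value, and the zero-padding convention are assumptions, not details from the paper.

```python
import numpy as np

def first_order_motion(joints):
    # joints: (T, J, C) skeleton sequence (frames, joints, coordinates).
    # First-order motion = temporal difference between consecutive frames,
    # zero-padded at t=0 so the output keeps the original length T.
    motion = np.diff(joints, axis=0)
    return np.concatenate([np.zeros_like(joints[:1]), motion], axis=0)

def info_nce_consistency(z_a, z_b, temperature=0.1):
    # z_a, z_b: (N, D) embeddings of the same N instances from two views
    # (e.g., hand vs. trunk streams). Row i of z_a and row i of z_b form
    # the positive pair; all other rows act as negatives.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature           # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # cross-entropy on positives
```

Applying the same loss in both directions (swapping `z_a` and `z_b`) would give the bidirectional transfer the abstract mentions; the actual architecture and loss weighting are described in the paper itself.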
Pages: 4188-4201
Page count: 14