Self-Supervised Representation Learning With Spatial-Temporal Consistency for Sign Language Recognition

Cited: 0
Authors
Zhao, Weichao [1 ]
Zhou, Wengang [1 ]
Hu, Hezhen [2 ]
Wang, Min [3 ]
Li, Houqiang [1 ]
Affiliations
[1] Univ Sci & Technol China, MoE Key Lab Brain Inspired Intelligent Percept & C, Hefei 230027, Peoples R China
[2] Univ Texas Austin, Visual Informat Grp, Austin, TX 78705 USA
[3] Hefei Comprehens Natl Sci Ctr, Inst Artificial Intelligence, Hefei 230027, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Sign language; Task analysis; Semantics; Representation learning; Knowledge transfer; Feature extraction; Skeleton; Sign language recognition; skeleton-based; self-supervised learning; contrastive learning;
DOI
10.1109/TIP.2024.3416881
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, there have been efforts to improve performance in sign language recognition by designing self-supervised learning methods. However, these methods capture limited information from sign pose data in a frame-wise learning manner, leading to sub-optimal solutions. To this end, we propose a simple yet effective self-supervised contrastive learning framework that excavates rich context via spatial-temporal consistency from two distinct perspectives and learns instance-discriminative representations for sign language recognition. On one hand, since the semantics of sign language are expressed by the cooperation of fine-grained hands and coarse-grained trunks, we utilize information at both granularities and encode it into separate latent spaces. The consistency between hand and trunk features is constrained to encourage learning consistent representations of instance samples. On the other hand, inspired by the complementary property of the motion and joint modalities, we are the first to introduce first-order motion information into sign language modeling. Additionally, we further bridge the interaction between the embedding spaces of the two modalities, facilitating bidirectional knowledge transfer to enhance sign language representation. Our method is evaluated with extensive experiments on four public benchmarks and achieves new state-of-the-art performance by a notable margin. The source code is publicly available at https://github.com/sakura/Code.
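The abstract describes two ingredients that can be sketched concretely: first-order motion as the frame-wise temporal difference of joint coordinates, and a contrastive consistency objective that pulls together embeddings of the same instance from two streams (e.g., hand and trunk). The sketch below is an illustration only, assuming an InfoNCE-style loss and generic array shapes; the function names, the temperature value, and the zero-padding convention are assumptions, not details from the paper.

```python
import numpy as np

def first_order_motion(joints):
    # joints: (T, J, C) skeleton sequence (frames, joints, coordinates).
    # First-order motion = temporal difference between consecutive frames,
    # zero-padded at t=0 so the output keeps the original length T.
    motion = np.diff(joints, axis=0)
    return np.concatenate([np.zeros_like(joints[:1]), motion], axis=0)

def info_nce_consistency(z_a, z_b, temperature=0.1):
    # z_a, z_b: (N, D) embeddings of the same N instances from two views
    # (e.g., hand vs. trunk streams). Row i of z_a and row i of z_b form
    # the positive pair; all other rows act as negatives.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature           # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # cross-entropy on positives
```

Applying the same loss in both directions (swapping `z_a` and `z_b`) would give the bidirectional transfer the abstract mentions; the actual architecture and loss weighting are described in the paper itself.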
Pages: 4188-4201
Page count: 14