Cross-modal knowledge distillation for continuous sign language recognition

Cited: 0
Authors
Gao, Liqing [1 ]
Shi, Peng [1 ]
Hu, Lianyu [1 ]
Feng, Jichao [1 ]
Zhu, Lei [2 ]
Wan, Liang [1 ]
Feng, Wei [1 ]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Sch Comp Sci & Technol, Tianjin 300350, Peoples R China
[2] Hong Kong Univ Sci & Technol Guangzhou, Guangzhou, Peoples R China
Keywords
Sign language recognition; Knowledge distillation; Cross-modal; Attention mechanism;
DOI
10.1016/j.neunet.2024.106587
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Continuous Sign Language Recognition (CSLR) is a task that converts a sign language video into a gloss sequence. Existing deep-learning-based sign language recognition methods usually rely on large-scale training data and rich supervised information. However, current sign language datasets are limited and are annotated only at the sentence level rather than the frame level. This inadequate supervision poses a serious challenge for sign language recognition and may leave recognition models insufficiently trained. To address the above problems, we propose a cross-modal knowledge distillation method for continuous sign language recognition, which contains two teacher models and one student model. One teacher is the Sign2Text dialogue teacher model, which takes a sign language video and a dialogue sentence as input and outputs the sign language recognition result. The other teacher is the Text2Gloss translation teacher model, which aims to translate a text sentence into a gloss sequence. Both teacher models provide information-rich soft labels to assist the training of the student model, which is a general sign language recognition model. We conduct extensive experiments on multiple commonly used sign language datasets, i.e., PHOENIX 2014T, CSL-Daily, and QSL. The results show that the proposed cross-modal knowledge distillation method can effectively improve sign language recognition accuracy by transferring multi-modal information from the teacher models to the student model. Code is available at https://github.com/glq-1992/cross-modal-knowledge-distillation_new.
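To make the two-teacher setup concrete, the following is a minimal sketch of soft-label distillation from two teachers into one student, assuming PyTorch and temperature-softened KL distillation; every tensor shape, loss weight, and variable name here is an illustrative assumption, not taken from the paper or its released code.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence between temperature-softened teacher and student
    # distributions; the T*T factor keeps gradient magnitudes comparable
    # across temperatures (standard practice since Hinton et al., 2015).
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

# Hypothetical per-frame gloss logits: (batch, frames, gloss vocabulary size).
student_logits = torch.randn(2, 100, 1296, requires_grad=True)
sign2text_logits = torch.randn(2, 100, 1296)   # from a Sign2Text-style teacher
text2gloss_logits = torch.randn(2, 100, 1296)  # from a Text2Gloss-style teacher

# Equal-weighted sum of the two teachers' soft-label losses; in practice this
# would be added to the student's usual CTC loss on sentence-level gloss labels.
loss_kd = 0.5 * distillation_loss(student_logits, sign2text_logits) \
        + 0.5 * distillation_loss(student_logits, text2gloss_logits)
loss_kd.backward()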
Pages: 13