Application of virtual human sign language translation based on speech recognition

Cited by: 2
Authors
Li, Xin [1 ]
Yang, Shuying [1 ]
Guo, Haiming [1 ]
Affiliations
[1] Tianjin Univ Technol, Sch Comp Sci & Engn, Tianjin 300384, Peoples R China
Keywords
Speech recognition; Sign language translation; SSM; FLASH; Neural networks
DOI
10.1016/j.specom.2023.06.001
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
To address the application of speech recognition to sign language translation, we conducted a study in two parts: improving the effectiveness of speech recognition and promoting the application of sign language translation. Mainstream frequency-domain features have achieved great success in speech recognition, but they fail to capture instantaneous transitions in speech; time-domain features make up for this deficiency. To combine the advantages of frequency-domain and time-domain features, an acoustic architecture with a joint time-domain encoder and frequency-domain encoder is proposed. In the time-domain encoder, a new time-domain feature based on the SSM (State-Space Model) is proposed and encoded with a GRU. In the frequency-domain encoder, a new model, ConFLASH, is proposed: a lightweight combination of a CNN and FLASH (a variant of the Transformer). It not only reduces the computational complexity of the Transformer but also effectively integrates the global-modeling strengths of the Transformer with the local-modeling strengths of the CNN. A Transducer structure decodes speech after the two encoders are joined; the resulting acoustic model is named GRU-ConFLASH-Transducer. On a self-built dataset and the open-source speechocean dataset, it achieves its best WERs (Word Error Rates) of 2.6% and 4.7%, respectively. In addition, to better realize the visual application of sign language translation, a 3D virtual human model is designed and developed.
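The abstract outlines a dual-encoder acoustic model but gives no implementation details. Below is a minimal PyTorch sketch of that overall shape, not the authors' code: the SSM front-end is approximated by a learnable Conv1d, FLASH attention by standard nn.MultiheadAttention, the Transducer joint/prediction networks by a plain linear head, and all dimensions (80-dim features, 256 hidden units, 4 blocks, a 5000-token vocabulary) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TimeDomainEncoder(nn.Module):
    """Encodes raw-waveform (time-domain) features with a GRU.

    The paper's SSM-based feature extractor is not public, so a
    learnable 1-D convolution stands in for it here (assumption).
    """
    def __init__(self, feat_dim=80, hidden_dim=256):
        super().__init__()
        self.frontend = nn.Conv1d(1, feat_dim, kernel_size=400, stride=160)
        self.gru = nn.GRU(feat_dim, hidden_dim, num_layers=2, batch_first=True)

    def forward(self, wav):                        # wav: (B, T)
        feats = self.frontend(wav.unsqueeze(1))    # (B, feat_dim, T')
        out, _ = self.gru(feats.transpose(1, 2))   # (B, T', hidden_dim)
        return out

class ConFLASHBlock(nn.Module):
    """One frequency-domain block: a depthwise CNN for local context plus
    multi-head self-attention standing in for FLASH's gated attention."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                          # x: (B, T, dim)
        # Local (CNN) branch with a residual connection.
        x = x + self.conv(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        # Global (attention) branch with a residual connection.
        y = self.norm2(x)
        a, _ = self.attn(y, y, y)
        return x + a

class JointEncoder(nn.Module):
    """Fuses the two encoder streams; in the full model this output would
    feed an RNN-T joint network alongside a prediction network."""
    def __init__(self, mel_dim=80, dim=256, vocab=5000):
        super().__init__()
        self.time_enc = TimeDomainEncoder(hidden_dim=dim)
        self.freq_proj = nn.Linear(mel_dim, dim)
        self.freq_enc = nn.Sequential(*[ConFLASHBlock(dim) for _ in range(4)])
        self.out = nn.Linear(2 * dim, vocab)       # simplified output head

    def forward(self, wav, mel):                   # mel: (B, T', mel_dim)
        t = self.time_enc(wav)
        f = self.freq_enc(self.freq_proj(mel))
        T = min(t.size(1), f.size(1))              # align frame counts
        return self.out(torch.cat([t[:, :T], f[:, :T]], dim=-1))

# Usage sketch: one second of 16 kHz audio and matching log-mel frames.
enc = JointEncoder()
wav = torch.randn(2, 16000)
mel = torch.randn(2, 98, 80)    # 98 frames matches the Conv1d stride above
logits = enc(wav, mel)          # (2, 98, 5000)
```

The concatenation-then-project fusion is one plausible reading of "joining" the encoders; the paper may instead sum, gate, or interleave the streams.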
Pages: 12