Application of virtual human sign language translation based on speech recognition

Cited by: 2
Authors
Li, Xin [1]
Yang, Shuying [1]
Guo, Haiming [1]
Affiliations
[1] Tianjin Univ Technol, Sch Comp Sci & Engn, Tianjin 300384, Peoples R China
Keywords
Speech recognition; Sign language translation; SSM; FLASH; Neural networks
DOI
10.1016/j.specom.2023.06.001
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
To address the application of speech recognition to sign language translation, we conducted a study in two parts: improving the effectiveness of speech recognition and promoting the practical application of sign language translation. Mainstream frequency-domain features have achieved great success in speech recognition, but they fail to capture the instantaneous gaps in speech; time-domain features compensate for this deficiency. To combine the advantages of frequency-domain and time-domain features, we propose an acoustic architecture that joins a time-domain encoder and a frequency-domain encoder. In the time-domain encoder, a new time-domain feature based on an SSM (state-space model) is proposed and encoded with a GRU. In the frequency-domain encoder, a new lightweight model, ConFLASH, is proposed, combining CNN with FLASH (a variant of the Transformer). It not only reduces the computational complexity of the Transformer but also effectively integrates the Transformer's global modeling strength with the CNN's local modeling strength. After the two encoders are joined, a Transducer structure is used for decoding. The resulting acoustic model is named GRU-ConFLASH-Transducer. On a self-built dataset and the open-source speechocean dataset, it achieves the best word error rates (WER) of 2.6% and 4.7%, respectively. In addition, to better realize the visual application of sign language translation, a 3D virtual human model is designed and developed.
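The abstract describes a dual-encoder acoustic model: time-domain features encoded by a GRU, frequency-domain features encoded by ConFLASH (CNN plus FLASH attention), with the two encoder outputs joined and decoded by a Transducer. The sketch below is a minimal illustration of that layout under stated assumptions, not the authors' implementation: all module names, dimensions, and the substitution of standard multi-head attention for FLASH are illustrative choices.

```python
# Hedged sketch of a dual-encoder, transducer-style acoustic model.
# Module names, hyperparameters, and the joint-network layout are assumptions.
import torch
import torch.nn as nn


class TimeDomainEncoder(nn.Module):
    """Encodes a time-domain feature sequence with a stacked GRU."""

    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, num_layers=2, batch_first=True)

    def forward(self, x):                      # x: (batch, time, feat_dim)
        out, _ = self.gru(x)
        return out                             # (batch, time, hidden_dim)


class ConFLASHBlock(nn.Module):
    """Stand-in for one ConFLASH block: a depthwise convolution for local
    modeling followed by self-attention for global modeling. The actual FLASH
    attention variant is replaced by standard multi-head attention here."""

    def __init__(self, dim: int, num_heads: int = 4, kernel_size: int = 15):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (batch, time, dim)
        x = x + self.conv(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        h = self.norm2(x)
        a, _ = self.attn(h, h, h)
        return x + a


class DualEncoderTransducer(nn.Module):
    """Adds the two encoder streams and combines them with a label predictor
    through an additive joint network, in the usual RNN-Transducer layout."""

    def __init__(self, time_feat: int, freq_feat: int, dim: int, vocab: int):
        super().__init__()
        self.time_enc = TimeDomainEncoder(time_feat, dim)
        self.freq_proj = nn.Linear(freq_feat, dim)
        self.freq_enc = nn.Sequential(ConFLASHBlock(dim), ConFLASHBlock(dim))
        self.predictor = nn.GRU(vocab, dim, batch_first=True)
        self.joint = nn.Linear(dim, vocab)

    def forward(self, time_x, freq_x, labels_onehot):
        enc = self.time_enc(time_x) + self.freq_enc(self.freq_proj(freq_x))
        pred, _ = self.predictor(labels_onehot)            # (B, U, dim)
        # Broadcast-add encoder frames and predictor states to form the
        # (time x label) joint lattice scored by the output projection.
        joint = enc.unsqueeze(2) + pred.unsqueeze(1)       # (B, T, U, dim)
        return self.joint(torch.tanh(joint))               # (B, T, U, vocab)


if __name__ == "__main__":
    model = DualEncoderTransducer(time_feat=40, freq_feat=80, dim=128, vocab=32)
    logits = model(torch.randn(2, 100, 40), torch.randn(2, 100, 80),
                   torch.randn(2, 12, 32))
    print(logits.shape)                        # torch.Size([2, 100, 12, 32])
```

The additive joint network is the standard way a Transducer couples an acoustic encoder with a label predictor; here the only departure from a plain RNN-T is that the encoder output is the sum of the time-domain and frequency-domain streams, mirroring the joint-encoder idea in the abstract.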
Pages: 12