Application of virtual human sign language translation based on speech recognition

Cited by: 2
Authors
Li, Xin [1]
Yang, Shuying [1]
Guo, Haiming [1]
Affiliations
[1] Tianjin Univ Technol, Sch Comp Sci & Engn, Tianjin 300384, Peoples R China
Keywords
Speech recognition; Sign language translation; SSM; FLASH; Neural networks
DOI
10.1016/j.specom.2023.06.001
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
To address the application of speech recognition to sign language translation, we conducted a study in two parts: improving the effectiveness of speech recognition and promoting the practical application of sign language translation. Mainstream frequency-domain features have achieved great success in speech recognition, but they fail to capture the instantaneous gaps in speech; time-domain features compensate for this deficiency. To combine the advantages of frequency-domain and time-domain features, we propose an acoustic architecture that joins a time-domain encoder and a frequency-domain encoder. In the time-domain encoder, a new time-domain feature based on an SSM (state-space model) is proposed and encoded with a GRU. In the frequency-domain encoder, a new lightweight model, ConFLASH, is proposed, combining CNN with FLASH (a variant of the Transformer). It not only reduces the computational complexity of the Transformer but also effectively integrates the Transformer's global modeling strength with the CNN's local modeling strength. After the two encoders are joined, a Transducer structure is used for decoding. The resulting acoustic model is named GRU-ConFLASH-Transducer. On a self-built dataset and the open-source speechocean dataset, it achieves the best word error rates (WER) of 2.6% and 4.7%, respectively. In addition, to better realize the visual application of sign language translation, a 3D virtual human model is designed and developed.
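The abstract describes a dual-encoder acoustic model: time-domain features encoded by a GRU, frequency-domain features encoded by ConFLASH (CNN plus FLASH attention), with the two encoder outputs joined and decoded by a Transducer. The sketch below is a minimal illustration of that layout under stated assumptions, not the authors' implementation: all module names, dimensions, and the substitution of standard multi-head attention for FLASH are illustrative choices.

```python
# Hedged sketch of a dual-encoder, transducer-style acoustic model.
# Module names, hyperparameters, and the joint-network layout are assumptions.
import torch
import torch.nn as nn


class TimeDomainEncoder(nn.Module):
    """Encodes a time-domain feature sequence with a stacked GRU."""

    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, num_layers=2, batch_first=True)

    def forward(self, x):                      # x: (batch, time, feat_dim)
        out, _ = self.gru(x)
        return out                             # (batch, time, hidden_dim)


class ConFLASHBlock(nn.Module):
    """Stand-in for one ConFLASH block: a depthwise convolution for local
    modeling followed by self-attention for global modeling. The actual FLASH
    attention variant is replaced by standard multi-head attention here."""

    def __init__(self, dim: int, num_heads: int = 4, kernel_size: int = 15):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (batch, time, dim)
        x = x + self.conv(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        h = self.norm2(x)
        a, _ = self.attn(h, h, h)
        return x + a


class DualEncoderTransducer(nn.Module):
    """Adds the two encoder streams and combines them with a label predictor
    through an additive joint network, in the usual RNN-Transducer layout."""

    def __init__(self, time_feat: int, freq_feat: int, dim: int, vocab: int):
        super().__init__()
        self.time_enc = TimeDomainEncoder(time_feat, dim)
        self.freq_proj = nn.Linear(freq_feat, dim)
        self.freq_enc = nn.Sequential(ConFLASHBlock(dim), ConFLASHBlock(dim))
        self.predictor = nn.GRU(vocab, dim, batch_first=True)
        self.joint = nn.Linear(dim, vocab)

    def forward(self, time_x, freq_x, labels_onehot):
        enc = self.time_enc(time_x) + self.freq_enc(self.freq_proj(freq_x))
        pred, _ = self.predictor(labels_onehot)            # (B, U, dim)
        # Broadcast-add encoder frames and predictor states to form the
        # (time x label) joint lattice scored by the output projection.
        joint = enc.unsqueeze(2) + pred.unsqueeze(1)       # (B, T, U, dim)
        return self.joint(torch.tanh(joint))               # (B, T, U, vocab)


if __name__ == "__main__":
    model = DualEncoderTransducer(time_feat=40, freq_feat=80, dim=128, vocab=32)
    logits = model(torch.randn(2, 100, 40), torch.randn(2, 100, 80),
                   torch.randn(2, 12, 32))
    print(logits.shape)                        # torch.Size([2, 100, 12, 32])
```

The additive joint network is the standard way a Transducer couples an acoustic encoder with a label predictor; here the only departure from a plain RNN-T is that the encoder output is the sum of the time-domain and frequency-domain streams, mirroring the joint-encoder idea in the abstract.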
Pages: 12