Application of virtual human sign language translation based on speech recognition

Cited by: 2
Authors
Li, Xin [1 ]
Yang, Shuying [1 ]
Guo, Haiming [1 ]
Affiliations
[1] Tianjin Univ Technol, Sch Comp Sci & Engn, Tianjin 300384, Peoples R China
Keywords
Speech recognition; Sign language translation; SSM; FLASH; Neural networks
DOI
10.1016/j.specom.2023.06.001
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
To address the application of speech recognition to sign language translation, we conducted a study in two parts: improving the effectiveness of speech recognition and promoting the application of sign language translation. Mainstream frequency-domain features have achieved great success in speech recognition, but they fail to capture instantaneous transitions in speech; time-domain features make up for this deficiency. To combine the advantages of frequency-domain and time-domain features, an acoustic architecture with a joint time-domain encoder and frequency-domain encoder is proposed. In the time-domain encoder, a new time-domain feature based on the SSM (State-Space Model) is proposed and encoded with a GRU. In the frequency-domain encoder, a new model, ConFLASH, is proposed: a lightweight combination of a CNN and FLASH (a variant of the Transformer). It not only reduces the computational complexity of the Transformer but also effectively integrates the global-modeling strengths of the Transformer with the local-modeling strengths of the CNN. A Transducer structure decodes speech after the two encoders are joined; the resulting acoustic model is named GRU-ConFLASH-Transducer. On a self-built dataset and the open-source speechocean dataset, it achieves its best WERs (Word Error Rates) of 2.6% and 4.7%, respectively. In addition, to better realize the visual application of sign language translation, a 3D virtual human model is designed and developed.
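The abstract outlines a dual-encoder acoustic model but gives no implementation details. Below is a minimal PyTorch sketch of that overall shape, not the authors' code: the SSM front-end is approximated by a learnable Conv1d, FLASH attention by standard nn.MultiheadAttention, the Transducer joint/prediction networks by a plain linear head, and all dimensions (80-dim features, 256 hidden units, 4 blocks, a 5000-token vocabulary) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TimeDomainEncoder(nn.Module):
    """Encodes raw-waveform (time-domain) features with a GRU.

    The paper's SSM-based feature extractor is not public, so a
    learnable 1-D convolution stands in for it here (assumption).
    """
    def __init__(self, feat_dim=80, hidden_dim=256):
        super().__init__()
        self.frontend = nn.Conv1d(1, feat_dim, kernel_size=400, stride=160)
        self.gru = nn.GRU(feat_dim, hidden_dim, num_layers=2, batch_first=True)

    def forward(self, wav):                        # wav: (B, T)
        feats = self.frontend(wav.unsqueeze(1))    # (B, feat_dim, T')
        out, _ = self.gru(feats.transpose(1, 2))   # (B, T', hidden_dim)
        return out

class ConFLASHBlock(nn.Module):
    """One frequency-domain block: a depthwise CNN for local context plus
    multi-head self-attention standing in for FLASH's gated attention."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                          # x: (B, T, dim)
        # Local (CNN) branch with a residual connection.
        x = x + self.conv(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        # Global (attention) branch with a residual connection.
        y = self.norm2(x)
        a, _ = self.attn(y, y, y)
        return x + a

class JointEncoder(nn.Module):
    """Fuses the two encoder streams; in the full model this output would
    feed an RNN-T joint network alongside a prediction network."""
    def __init__(self, mel_dim=80, dim=256, vocab=5000):
        super().__init__()
        self.time_enc = TimeDomainEncoder(hidden_dim=dim)
        self.freq_proj = nn.Linear(mel_dim, dim)
        self.freq_enc = nn.Sequential(*[ConFLASHBlock(dim) for _ in range(4)])
        self.out = nn.Linear(2 * dim, vocab)       # simplified output head

    def forward(self, wav, mel):                   # mel: (B, T', mel_dim)
        t = self.time_enc(wav)
        f = self.freq_enc(self.freq_proj(mel))
        T = min(t.size(1), f.size(1))              # align frame counts
        return self.out(torch.cat([t[:, :T], f[:, :T]], dim=-1))

# Usage sketch: one second of 16 kHz audio and matching log-mel frames.
enc = JointEncoder()
wav = torch.randn(2, 16000)
mel = torch.randn(2, 98, 80)    # 98 frames matches the Conv1d stride above
logits = enc(wav, mel)          # (2, 98, 5000)
```

The concatenation-then-project fusion is one plausible reading of "joining" the encoders; the paper may instead sum, gate, or interleave the streams.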
Pages: 12