Performance evaluation of automatic speech recognition systems on integrated noise-network distorted speech

Cited by: 6
Authors
Kumalija, Elhard [1 ]
Nakamoto, Yukikazu [1 ]
Affiliations
[1] Univ Hyogo, Grad Sch Appl Informat, Kobe, Hyogo, Japan
Source
FRONTIERS IN SIGNAL PROCESSING | 2022, Vol. 2
Keywords
audio signal processing; automatic speech recognition; deep learning; speech-to-text; voice over IP;
DOI
10.3389/frsip.2022.999457
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline Classification Codes
0808; 0809;
Abstract
In VoIP applications, such as Interactive Voice Response and VoIP phone-call transcription, speech signals are degraded not only by environmental noise but also by transmission network quality and by distortions introduced by encoding and decoding algorithms. Automatic speech recognition (ASR) systems therefore need to handle integrated noise-network distorted speech. In this study, we present a comparative analysis of a speech-to-text system trained on clean speech against one trained on integrated noise-network distorted speech. Training an ASR model on a noise-network distorted speech dataset improves its robustness. Although the performance of an ASR model trained on clean speech depends on the noise type, this is not the case when the noise is further distorted by network transmission. The model trained on noise-network distorted speech achieved a 60% improvement in word error rate (WER), match error rate (MER), and word information lost (WIL) over the model trained on clean speech. Furthermore, the ASR model trained on noise-network distorted speech tolerated jitter below 20% and packet loss below 15% without a decrease in performance; as jitter and packet loss exceeded those thresholds, WER, MER, and WIL increased in proportion to them. The model trained on noise-network distorted speech also tolerated signal-to-noise ratio (SNR) values of 5 dB and above without loss of performance, independent of noise type.
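For context on the three metrics quoted above: WER, MER, and WIL are all derived from the hits (H), substitutions (S), deletions (D), and insertions (I) of a minimum-edit-distance word alignment, with WER = (S+D+I)/(H+S+D), MER = (S+D+I)/(H+S+D+I), and WIL = 1 - H^2/((H+S+D)(H+S+I)). The Python sketch below is our illustration of these standard definitions, not code from the paper, and the sample utterances are invented for the example.

from typing import List, Tuple

def _align_counts(ref: List[str], hyp: List[str]) -> Tuple[int, int, int, int]:
    # Levenshtein DP table: d[i][j] = edit distance between ref[:i] and hyp[:j].
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # match / substitution
                d[i - 1][j] + 1,                               # deletion
                d[i][j - 1] + 1,                               # insertion
            )
    # Backtrace to count hits (H), substitutions (S), deletions (D), insertions (I).
    h = s = dl = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and d[i][j] == d[i - 1][j - 1]:
            h += 1; i -= 1; j -= 1
        elif i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + 1:
            s += 1; i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dl += 1; i -= 1
        else:
            ins += 1; j -= 1
    return h, s, dl, ins

def wer_mer_wil(reference: str, hypothesis: str) -> Tuple[float, float, float]:
    # Assumes a non-empty reference transcript.
    h, s, d, i = _align_counts(reference.split(), hypothesis.split())
    wer = (s + d + i) / (h + s + d)                  # word error rate
    mer = (s + d + i) / (h + s + d + i)              # match error rate
    wil = 1.0 - h * h / ((h + s + d) * (h + s + i))  # word information lost
    return wer, mer, wil

# Invented example utterances, purely for illustration.
ref = "turn on the kitchen lights please"
hyp = "turn on kitchen light please"
print("WER=%.3f MER=%.3f WIL=%.3f" % wer_mer_wil(ref, hyp))

Unlike WER, which can exceed 1 when the hypothesis contains many insertions, MER and WIL are bounded to [0, 1], which is why robustness studies often report all three together.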
Pages: 10