Robust Automatic Speech Recognition for Call Center Applications

Cited by: 1
Authors
Parra-Gallego, Luis Felipe [1,2]
Arias-Vergara, Tomas [1,3]
Orozco Arroyave, Juan Rafael [1,3]
Affiliations
[1] Univ Antioquia UdeA, GITA Lab Fac Engn, Medellin, Colombia
[2] Konecta Grp SAS, Medellin, Colombia
[3] Friedrich Alexander Univ Erlangen Nurnberg, Pattern Recognit Lab, Erlangen, Germany
Source
APPLIED COMPUTER SCIENCES IN ENGINEERING, WEA 2021 | 2021, Vol. 1431
Keywords
ASR; Noise reduction; Speech enhancement; Speech-to-text
DOI
10.1007/978-3-030-86702-7_7
CLC Classification Number
TP39 [Computer Applications]
Subject Classification Number
081203; 0835
Abstract
This paper focuses on developing an Automatic Speech Recognition (ASR) system that is robust against different noisy scenarios. ASR systems are widely used in call centers to convert telephone recordings into text transcriptions, which are then used as input to automatically evaluate the Quality of Service (QoS). Since QoS and customer satisfaction are evaluated by analyzing the text produced by the ASR system, this process depends heavily on the accuracy of the transcription. Because the calls are usually recorded in non-controlled acoustic conditions, the accuracy of the ASR is typically degraded. To address this problem, we first evaluated four different hybrid architectures: (1) Gaussian Mixture Models (GMM) (baseline), (2) Time Delay Neural Network (TDNN), (3) Long Short-Term Memory (LSTM), and (4) Gated Recurrent Unit (GRU). The evaluation considers a total of 478.6 h of recordings collected in a real call center. Each recording has its respective transcription and a perceptual label for the level of noise present during the phone call, with three levels: low level of noise (LN), medium level of noise (MN), and high level of noise (HN). The LSTM-based model achieved the best performance in the MN and HN scenarios, with word error rates (WER) of 22.55% and 27.99%, respectively. Additionally, we implemented a denoiser based on GRUs to enhance the speech signals, which improved the results by 1.16% in the HN scenario.
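The word error rate (WER) reported in the abstract is conventionally computed as the word-level Levenshtein (edit) distance between the reference transcription and the ASR hypothesis, divided by the number of reference words. A minimal sketch of that standard metric (function and variable names are ours, not from the paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub,                # substitution (or match)
                           dp[i - 1][j] + 1,   # deletion
                           dp[i][j - 1] + 1)   # insertion
    return dp[len(ref)][len(hyp)] / len(ref)
```

Under this metric, the 22.55% WER reported for the MN scenario corresponds to roughly one word-level error for every four to five reference words.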
Pages: 72-83
Page count: 12