Error detection and accuracy estimation in automatic speech recognition using deep bidirectional recurrent neural networks

Cited by: 58
Authors
Ogawa, Atsunori [1 ]
Hori, Takaaki [2 ]
Affiliations
[1] NTT Corp, NTT Commun Sci Labs, 2-4 Hikaridai, Seika, Kyoto, Japan
[2] Mitsubishi Elect Res Labs, 201 Broadway, Cambridge, MA 02139 USA
Keywords
Automatic speech recognition; Error detection; Accuracy estimation; Conditional random fields; Deep bidirectional recurrent neural networks; Confidence measures; Bayes risk
DOI
10.1016/j.specom.2017.02.009
Chinese Library Classification
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Recurrent neural networks (RNNs) have recently been applied as classifiers for sequential labeling problems. In this paper, deep bidirectional RNNs (DBRNNs) are applied to error detection in automatic speech recognition (ASR), which is a sequential labeling problem. We investigate three ASR error detection tasks: confidence estimation, out-of-vocabulary word detection, and error type classification. We also estimate ASR accuracy, i.e. percent correct and word accuracy, from the error type classification results. Experimental results on English and Japanese lecture speech corpora show that the DBRNNs greatly outperform conditional random fields (CRFs) and the other NN structures, i.e. deep feedforward NNs (DNNs) and deep unidirectional RNNs (DURNNs). These improvements arise because the DBRNNs can take a longer bidirectional context of input feature vectors into account and can model highly nonlinear relationships between the input feature vectors and output labels. In detailed analyses, the DBRNNs show better generalization ability than the CRFs, thanks to their ability to represent (embed) words in a low-dimensional continuous-valued vector space. In addition, the superiority of the DBRNNs over the DNNs and DURNNs indicates that the average context length of the input feature vectors required for ASR error detection is only a few time steps, although it can lengthen depending on the situation. (C) 2017 Elsevier B.V. All rights reserved.
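As a concrete illustration of the accuracy estimation step the abstract describes, the sketch below computes percent correct and word accuracy from a sequence of per-word error type labels. This is a minimal sketch under standard ASR scoring definitions (Corr = H/N, Acc = (H - I)/N, with N = H + S + D reference words); the label scheme and the helper `estimate_accuracy` are assumptions for illustration, not code from the paper.

```python
def estimate_accuracy(labels):
    """Estimate (percent correct, word accuracy) from error type labels.

    Labels (assumed scheme): 'C' correct, 'S' substitution,
    'D' deletion, 'I' insertion.
    """
    # Reference word count N covers correct words, substitutions and
    # deletions; insertions have no corresponding reference word.
    n_ref = sum(1 for l in labels if l in ('C', 'S', 'D'))
    h = labels.count('C')   # hits
    i = labels.count('I')   # insertions
    percent_correct = 100.0 * h / n_ref
    word_accuracy = 100.0 * (h - i) / n_ref
    return percent_correct, word_accuracy

pc, wa = estimate_accuracy(list("CCSCDCCI"))
print(round(pc, 1), round(wa, 1))  # → 71.4 57.1
```

In the paper's setting, the label sequence would come from the DBRNN error type classifier rather than from a reference alignment, so these figures become estimates of the recognizer's accuracy without ground-truth transcripts.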
Pages: 70-83
Page count: 14