Improved Deep Duel Model for Rescoring N-best Speech Recognition List Using Backward LSTMLM and Ensemble Encoders

Cited by: 4
Authors
Ogawa, Atsunori [1 ]
Delcroix, Marc [1 ]
Karita, Shigeki [1 ]
Nakatani, Tomohiro [1 ]
Affiliation
[1] NTT Corp, NTT Commun Sci Labs, Kyoto, Japan
Source
INTERSPEECH 2019 | 2019
Keywords
speech recognition; N-best rescoring; deep duel model; backward LSTMLM; ensemble encoders;
DOI
10.21437/Interspeech.2019-1949
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline codes
100104; 100213;
Abstract
We have proposed a neural network (NN) model called a deep duel model (DDM) for rescoring N-best speech recognition hypothesis lists. A DDM is composed of a long short-term memory (LSTM)-based encoder followed by a fully-connected linear layer-based binary-class classifier. Given the feature vector sequences of two hypotheses in an N-best list, the DDM encodes the features and selects the hypothesis that has the lower word error rate (WER) based on the output binary-class probabilities. By repeating this one-on-one hypothesis comparison (duel) for each hypothesis pair in the N-best list, we can find the oracle (lowest WER) hypothesis as the survivor of the duels. We showed that the DDM can exploit the score provided by a forward LSTM-based recurrent NN language model (LSTMLM) as an additional feature to accurately select the hypotheses. In this study, we further improve the selection performance by introducing two modifications, i.e., adding the score provided by a backward LSTMLM, which uses succeeding words to predict the current word, and employing ensemble encoders, which have a high feature encoding capability. By combining these two modifications, our DDM achieves a relative WER reduction of over 10% from a strong baseline obtained using both the forward and backward LSTMLMs.
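The duel-based selection described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the real DDM compares two hypotheses with an LSTM encoder and a binary classifier, whereas here a hypothetical comparator function `prefer_first` stands in for the trained model, and the duels are run as a sequential knockout over the N-best list.

```python
def duel_rescore(nbest, prefer_first):
    """Return the surviving hypothesis after one-on-one duels.

    nbest: list of hypotheses (any type).
    prefer_first(a, b): True if the model predicts that hypothesis `a`
        has the lower word error rate than hypothesis `b`.
    """
    survivor = nbest[0]
    for challenger in nbest[1:]:
        # The loser of each duel is eliminated; the winner advances.
        if not prefer_first(survivor, challenger):
            survivor = challenger
    return survivor

# Toy usage with a stand-in comparator that pretends shorter
# hypotheses have lower WER (purely illustrative).
hyps = ["a b c d", "a b c", "a b c e f"]
best = duel_rescore(hyps, lambda a, b: len(a.split()) <= len(b.split()))
print(best)  # -> "a b c"
```

In the paper the comparator's inputs are feature vector sequences that include forward and backward LSTMLM scores; swapping in such a trained model turns this knockout loop into the N-best rescoring procedure the abstract describes.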
Pages: 3900-3904
Page count: 5