Combination of end-to-end and hybrid models for speech recognition

Cited by: 11
Authors
Wong, Jeremy H. M. [1 ]
Gaur, Yashesh [1 ]
Zhao, Rui [1 ]
Lu, Liang [1 ]
Sun, Eric [1 ]
Li, Jinyu [1 ]
Gong, Yifan [1 ]
Affiliations
[1] Microsoft, Speech & Language Group, Redmond, WA 98052, USA
Source
INTERSPEECH 2020 | 2020
Keywords
Combination; end-to-end; hybrid; minimum Bayes' risk; speech recognition;
DOI
10.21437/Interspeech.2020-2141
CLC Number
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline Code
100104; 100213;
Abstract
Recent studies suggest that it may now be possible to construct end-to-end Neural Network (NN) models that perform on-par with, or even outperform, hybrid models in speech recognition. These models differ in their designs, and as such, may exhibit diverse and complementary error patterns. A combination between the predictions of these models may therefore yield significant gains. This paper studies the feasibility of performing hypothesis-level combination between hybrid and end-to-end NN models. The end-to-end NN models often exhibit a bias in their posteriors toward short hypotheses, and this may adversely affect Minimum Bayes' Risk (MBR) combination methods. MBR training and length normalisation can be used to reduce this bias. Models are trained on Microsoft's 75 thousand hours of anonymised data and evaluated on test sets with 1.8 million words. The results show that significant gains can be obtained by combining the hypotheses of hybrid and end-to-end NN models together.
Pages: 1783-1787
Page count: 5
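The abstract describes hypothesis-level Minimum Bayes' Risk (MBR) combination of a hybrid and an end-to-end system, with length normalisation used to counter the end-to-end bias towards short hypotheses. The Python sketch below illustrates one way such a combination could be set up over N-best lists; the helper names, interpolation weight, and toy N-best lists are assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative sketch only: the N-best lists, interpolation weight, and helper
# names below are assumptions for exposition, not the paper's implementation.
import math


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token sequences."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev_diag, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,                # deletion
                      d[j - 1] + 1,            # insertion
                      prev_diag + (r != h))    # substitution or match
            prev_diag, d[j] = d[j], cur
    return d[-1]


def posteriors(nbest, length_norm=False):
    """Softmax over N-best log-scores; optional per-token length normalisation
    to counter the end-to-end bias towards short hypotheses."""
    scores = [s / max(len(h), 1) if length_norm else s for h, s in nbest]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return {tuple(h): e / z for (h, _), e in zip(nbest, exps)}


def mbr_combine(hybrid_nbest, e2e_nbest, weight=0.5):
    """Return the hypothesis with minimum expected word-error risk under a
    linear interpolation of the two systems' posteriors."""
    p_hyb = posteriors(hybrid_nbest)
    p_e2e = posteriors(e2e_nbest, length_norm=True)
    hyps = set(p_hyb) | set(p_e2e)
    p = {h: weight * p_hyb.get(h, 0.0) + (1.0 - weight) * p_e2e.get(h, 0.0)
         for h in hyps}
    risk = {c: sum(p[h] * edit_distance(h, c) for h in hyps) for c in hyps}
    return min(risk, key=risk.get)


if __name__ == "__main__":
    hybrid = [("the cat sat".split(), -2.1), ("a cat sat".split(), -2.5)]
    e2e = [("the cat sat".split(), -1.0), ("the cat".split(), -0.8)]
    print(" ".join(mbr_combine(hybrid, e2e)))
```

In this sketch, length normalisation simply divides each end-to-end log-score by its token count before the softmax, which is one straightforward way to offset the short-hypothesis bias mentioned in the abstract; the paper also uses MBR training to reduce this bias.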