Combining hybrid DNN-HMM ASR systems with attention-based models using lattice rescoring

Cited by: 6

Authors:

Li, Qiujia [1]

Zhang, Chao [1,2]

Woodland, Philip C. [1]

Affiliations:

[1] Univ Cambridge, Dept Engn, Trumpington St, Cambridge CB2 1PZ, England

[2] Tsinghua Univ, Dept Elect Engn, Beijing 100084, Peoples R China

Source:

SPEECH COMMUNICATION | 2023 / Vol. 147

Keywords:

Speech recognition; System combination; Hybrid DNN-HMM systems; Attention-based encoder-decoder models; Lattice rescoring; SPEECH; NETWORKS

DOI:

10.1016/j.specom.2022.12.002

Chinese Library Classification (CLC):

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

The traditional hybrid deep neural network (DNN)-hidden Markov model (HMM) system and the attention-based encoder-decoder (AED) model are both commonly used automatic speech recognition (ASR) approaches with distinct characteristics and advantages. While hybrid systems operate on a per-frame basis and are highly modularised to leverage external phonetic and linguistic knowledge, AED models operate on a per-label basis and jointly learn the acoustic and language information using a single model in an end-to-end trainable fashion. In this paper, we propose combining these two approaches in a two-pass rescoring framework. The first pass uses hybrid ASR systems to facilitate streaming and controllable ASR, and the second pass rescores the N-best hypotheses or lattices produced by the first-pass hybrid DNN-HMM system with AED models. We also propose an improved algorithm for lattice rescoring with AED models. Experiments on two standard ASR tasks show that the combined two-pass systems achieve competitive performance without using extra speech or text data. For the 80-hour AMI IHM dataset, the combined system has a 13.7% word error rate (WER) on the evaluation set, which is up to a 29% relative WER reduction over the individual systems. For the 300-hour Switchboard dataset, the WERs of the combined system are 5.7% and 12.1% on the Switchboard and CallHome subsets of Hub5'00, and 13.2% and 7.6% on the Switchboard Cellular and Fisher subsets of RT03, which are up to a 33% relative reduction in WER over the individual systems.
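
The N-best variant of the second pass described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how first-pass and AED scores are combined, so the linear interpolation of log-domain scores, the `Hypothesis` class, the `aed_weight` parameter, and the toy scorer `toy_aed_log_prob` are all assumptions made for illustration. The improved lattice rescoring algorithm proposed in the paper is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Hypothesis:
    """One entry of a first-pass N-best list (hypothetical structure)."""
    words: Sequence[str]
    first_pass_score: float  # log-domain score from the hybrid system


def rescore_nbest(
    nbest: List[Hypothesis],
    aed_log_prob: Callable[[Sequence[str]], float],
    aed_weight: float = 0.5,
) -> Hypothesis:
    """Return the hypothesis with the highest interpolated score.

    Assumption: scores are combined by linear interpolation in the log
    domain; the weighting scheme is illustrative, not taken from the paper.
    """
    if not nbest:
        raise ValueError("N-best list is empty")

    def combined(hyp: Hypothesis) -> float:
        return ((1.0 - aed_weight) * hyp.first_pass_score
                + aed_weight * aed_log_prob(hyp.words))

    return max(nbest, key=combined)


def toy_aed_log_prob(words: Sequence[str]) -> float:
    """Stand-in for an AED model's sequence log-probability (made-up values)."""
    table = {("the", "cat", "sat"): -2.0, ("the", "cat", "sad"): -6.0}
    return table.get(tuple(words), -10.0)


# Toy first-pass output: the hybrid system slightly prefers "sad".
nbest = [
    Hypothesis(["the", "cat", "sad"], first_pass_score=-3.0),
    Hypothesis(["the", "cat", "sat"], first_pass_score=-3.5),
]
best = rescore_nbest(nbest, toy_aed_log_prob, aed_weight=0.5)
print(" ".join(best.words))
```

With `aed_weight=0.0` the function falls back to the first-pass ranking, which makes the effect of the second pass easy to inspect on a given N-best list.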

Pages: 12-21

Page count: 10

References: 57

