Combining hybrid DNN-HMM ASR systems with attention-based models using lattice rescoring

Cited by: 7
Authors
Li, Qiujia [1 ]
Zhang, Chao [1 ,2 ]
Woodland, Philip C. [1 ]
Affiliations
[1] Univ Cambridge, Dept Engn, Trumpington St, Cambridge CB2 1PZ, England
[2] Tsinghua Univ, Dept Elect Engn, Beijing 100084, Peoples R China
Keywords
Speech recognition; System combination; Hybrid DNN-HMM systems; Attention-based encoder-decoder models; Lattice rescoring; Speech; Networks
DOI
10.1016/j.specom.2022.12.002
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
The traditional hybrid deep neural network (DNN)-hidden Markov model (HMM) system and the attention-based encoder-decoder (AED) model are both widely used automatic speech recognition (ASR) approaches with distinct characteristics and advantages. While hybrid systems operate at the frame level and are highly modularised, making it easy to leverage external phonetic and linguistic knowledge, AED models operate at the label level and jointly learn acoustic and language information in a single end-to-end trainable model. In this paper, we propose combining these two approaches in a two-pass rescoring framework. The first pass uses hybrid ASR systems to facilitate streaming and controllable ASR, and the second pass rescores the N-best hypotheses or lattices produced by the first-pass hybrid DNN-HMM system with AED models. We also propose an improved algorithm for lattice rescoring with AED models. Experiments show that the combined two-pass systems achieve competitive performance on two standard ASR tasks without using extra speech or text data. For the 80-hour AMI IHM dataset, the combined system achieves a 13.7% word error rate (WER) on the evaluation set, up to a 29% relative WER reduction over the individual systems. For the 300-hour Switchboard dataset, the combined system achieves WERs of 5.7% and 12.1% on the Switchboard and CallHome subsets of Hub5'00, and 13.2% and 7.6% on the Switchboard Cellular and Fisher subsets of RT03, up to a 33% relative WER reduction over the individual systems.
Pages: 12-21 (10 pages)
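To make the combination framework described in the abstract concrete, the sketch below shows second-pass N-best rescoring in its simplest form: the first-pass hybrid DNN-HMM system supplies scored hypotheses, an AED model assigns each hypothesis a log-probability, and the two scores are combined log-linearly before picking the best hypothesis. The function names (`rescore_nbest`, `aed_score`) and the interpolation weight `lam` are illustrative assumptions, not the paper's exact algorithm, which also covers the more general case of lattice rescoring.

```python
# Minimal sketch of second-pass N-best rescoring with an AED model.
# `aed_score` is a hypothetical callable standing in for the AED model's
# log-probability of a hypothesis; `lam` is a tunable interpolation weight,
# not a value taken from the paper.
from typing import Callable, List, Tuple


def rescore_nbest(
    nbest: List[Tuple[str, float]],      # (hypothesis, first-pass log-score) pairs
    aed_score: Callable[[str], float],   # AED log-probability of a hypothesis
    lam: float = 0.5,                    # interpolation weight between the two passes
) -> str:
    """Return the hypothesis with the best log-linear combination of scores."""
    best_hyp, best_score = "", float("-inf")
    for hyp, first_pass_score in nbest:
        combined = (1.0 - lam) * first_pass_score + lam * aed_score(hyp)
        if combined > best_score:
            best_hyp, best_score = hyp, combined
    return best_hyp


# Purely illustrative usage with dummy scores:
# nbest = [("hello world", -12.3), ("hollow world", -12.9)]
# best = rescore_nbest(nbest, aed_score=lambda h: -0.5 * len(h.split()))
```

The log-linear interpolation mirrors the usual way two-pass systems trade off a streaming-friendly first pass against a stronger but non-streaming second-pass model; the weight would normally be tuned on a development set.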