Combining hybrid DNN-HMM ASR systems with attention-based models using lattice rescoring

Cited by: 6
Authors
Li, Qiujia [1]
Zhang, Chao [1, 2]
Woodland, Philip C. [1]
Affiliations
[1] Univ Cambridge, Dept Engn, Trumpington St, Cambridge CB2 1PZ, England
[2] Tsinghua Univ, Dept Elect Engn, Beijing 100084, Peoples R China
Keywords
Speech recognition; System combination; Hybrid DNN-HMM systems; Attention-based encoder-decoder models; Lattice rescore; SPEECH; NETWORKS
DOI
10.1016/j.specom.2022.12.002
Chinese Library Classification number
O42 [Acoustics];
Discipline classification codes
070206; 082403
Abstract
The traditional hybrid deep neural network (DNN)-hidden Markov model (HMM) system and the attention-based encoder-decoder (AED) model are both commonly used automatic speech recognition (ASR) approaches with distinct characteristics and advantages. While hybrid systems operate per frame and are highly modularised to leverage external phonetic and linguistic knowledge, AED models operate on a per-label basis and jointly learn the acoustic and language information using a single model in an end-to-end trainable fashion. In this paper, we propose combining these two approaches in a two-pass rescoring framework. The first pass uses hybrid ASR systems to facilitate streaming and controllable ASR, and the second pass rescores the N-best hypotheses or lattices produced by the first-pass hybrid DNN-HMM system with AED models. We also propose an improved algorithm for lattice rescoring with AED models. Experiments show that the combined two-pass systems achieve competitive performance without using extra speech or text data on two standard ASR tasks. For the 80-hour AMI IHM dataset, the combined system has a 13.7% word error rate (WER) on the evaluation set, which is up to a 29% relative WER reduction over the individual systems. For the 300-hour Switchboard dataset, the WERs of the combined system are 5.7% and 12.1% on the Switchboard and CallHome subsets of Hub5'00, and 13.2% and 7.6% on the Switchboard Cellular and Fisher subsets of RT03, which are up to a 33% relative reduction in WER over the individual systems.
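The N-best variant of the two-pass idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Hypothesis` type, the `rescore_nbest` function, the `aed_weight` parameter, and the log-linear score interpolation are all assumptions made for illustration, and the toy scorer stands in for a real AED model. The improved lattice rescoring algorithm mentioned in the abstract is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Hypothesis:
    """One first-pass hypothesis (hypothetical container, not from the paper)."""
    words: List[str]
    first_pass_score: float  # log-domain score from the hybrid DNN-HMM system


def rescore_nbest(
    nbest: Sequence[Hypothesis],
    aed_log_prob: Callable[[List[str]], float],
    aed_weight: float = 1.0,
) -> Hypothesis:
    """Return the hypothesis with the best combined score.

    Assumption: the two scores are combined by simple log-linear
    interpolation, first_pass_score + aed_weight * aed_log_prob(words).
    """
    if not nbest:
        raise ValueError("N-best list is empty")
    return max(
        nbest,
        key=lambda h: h.first_pass_score + aed_weight * aed_log_prob(h.words),
    )


# Toy second-pass scorer: a lookup table in place of a real AED model.
TOY_AED_SCORES = {"a b": -4.0, "a c": -2.0}


def toy_aed_log_prob(words: List[str]) -> float:
    return TOY_AED_SCORES[" ".join(words)]


nbest_list = [
    Hypothesis(words=["a", "b"], first_pass_score=-10.0),
    Hypothesis(words=["a", "c"], first_pass_score=-10.5),
]
best = rescore_nbest(nbest_list, toy_aed_log_prob, aed_weight=1.0)
```

In this toy setup the first pass prefers the first hypothesis, while the second-pass score is large enough to change the final choice; setting `aed_weight` to 0 recovers the first-pass ranking. In practice the interpolation weight would be tuned on held-out data.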
Pages: 12-21
Number of pages: 10