Combining hybrid DNN-HMM ASR systems with attention-based models using lattice rescoring

Cited by: 7
Authors
Li, Qiujia [1 ]
Zhang, Chao [1 ,2 ]
Woodland, Philip C. [1 ]
Affiliations
[1] Univ Cambridge, Dept Engn, Trumpington St, Cambridge CB2 1PZ, England
[2] Tsinghua Univ, Dept Elect Engn, Beijing 100084, Peoples R China
Keywords
Speech recognition; System combination; Hybrid DNN-HMM systems; Attention-based encoder-decoder models; Lattice rescoring; Speech; Networks
DOI
10.1016/j.specom.2022.12.002
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
The traditional hybrid deep neural network (DNN)-hidden Markov model (HMM) system and the attention-based encoder-decoder (AED) model are both widely used automatic speech recognition (ASR) approaches with distinct characteristics and advantages. While hybrid systems operate at the frame level and are highly modularised, making it easy to leverage external phonetic and linguistic knowledge, AED models operate at the label level and jointly learn acoustic and language information in a single end-to-end trainable model. In this paper, we propose combining these two approaches in a two-pass rescoring framework. The first pass uses hybrid ASR systems to facilitate streaming and controllable ASR, and the second pass rescores the N-best hypotheses or lattices produced by the first-pass hybrid DNN-HMM system with AED models. We also propose an improved algorithm for lattice rescoring with AED models. Experiments show that the combined two-pass systems achieve competitive performance on two standard ASR tasks without using extra speech or text data. For the 80-hour AMI IHM dataset, the combined system achieves a 13.7% word error rate (WER) on the evaluation set, up to a 29% relative WER reduction over the individual systems. For the 300-hour Switchboard dataset, the combined system achieves WERs of 5.7% and 12.1% on the Switchboard and CallHome subsets of Hub5'00, and 13.2% and 7.6% on the Switchboard Cellular and Fisher subsets of RT03, up to a 33% relative WER reduction over the individual systems.
Pages: 12-21 (10 pages)
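To make the combination framework described in the abstract concrete, the sketch below shows second-pass N-best rescoring in its simplest form: the first-pass hybrid DNN-HMM system supplies scored hypotheses, an AED model assigns each hypothesis a log-probability, and the two scores are combined log-linearly before picking the best hypothesis. The function names (`rescore_nbest`, `aed_score`) and the interpolation weight `lam` are illustrative assumptions, not the paper's exact algorithm, which also covers the more general case of lattice rescoring.

```python
# Minimal sketch of second-pass N-best rescoring with an AED model.
# `aed_score` is a hypothetical callable standing in for the AED model's
# log-probability of a hypothesis; `lam` is a tunable interpolation weight,
# not a value taken from the paper.
from typing import Callable, List, Tuple


def rescore_nbest(
    nbest: List[Tuple[str, float]],      # (hypothesis, first-pass log-score) pairs
    aed_score: Callable[[str], float],   # AED log-probability of a hypothesis
    lam: float = 0.5,                    # interpolation weight between the two passes
) -> str:
    """Return the hypothesis with the best log-linear combination of scores."""
    best_hyp, best_score = "", float("-inf")
    for hyp, first_pass_score in nbest:
        combined = (1.0 - lam) * first_pass_score + lam * aed_score(hyp)
        if combined > best_score:
            best_hyp, best_score = hyp, combined
    return best_hyp


# Purely illustrative usage with dummy scores:
# nbest = [("hello world", -12.3), ("hollow world", -12.9)]
# best = rescore_nbest(nbest, aed_score=lambda h: -0.5 * len(h.split()))
```

The log-linear interpolation mirrors the usual way two-pass systems trade off a streaming-friendly first pass against a stronger but non-streaming second-pass model; the weight would normally be tuned on a development set.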