Improving Performance of End-to-End ASR on Numeric Sequences

Cited: 11
Authors:
Peyser, Cal [1 ]
Zhang, Hao [1 ]
Sainath, Tara N. [1 ]
Wu, Zelin [1 ]
Affiliation:
[1] Google Inc, Mountain View, CA 94043 USA
DOI:
10.21437/Interspeech.2019-1345
Abstract:
Recognizing written-domain numeric utterances (e.g., "I need $1.25.") can be challenging for ASR systems, particularly when numeric sequences are not seen during training. This out-of-vocabulary (OOV) issue is addressed in conventional ASR systems by training part of the model on spoken-domain utterances (e.g., "I need one dollar and twenty five cents."), for which numeric sequences are composed of in-vocabulary numbers, and then using an FST verbalizer to denormalize the result. Unfortunately, conventional ASR models are not suitable for the low-memory setting of on-device speech recognition. E2E models such as RNN-T are attractive for on-device ASR, as they fold the AM, PM, and LM of a conventional model into one neural network. However, in the on-device setting the large memory footprint of an FST denormer makes spoken-domain training more difficult. In this paper, we investigate techniques to improve E2E model performance on numeric data. We find that using a text-to-speech system to generate additional numeric training data, as well as using a small-footprint neural network to perform spoken-to-written domain denorming, yields improvement in several numeric classes. In the case of the longest numeric sequences, we see reduction of WER by up to a factor of 8.
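To make the spoken-to-written denorming task concrete, here is a toy rule-based sketch that maps a spoken-domain dollar amount to its written form. This is purely illustrative: the paper uses an FST verbalizer (conventional systems) and a small-footprint neural denormer (E2E systems), not this hand-written mapping, and the function names and vocabulary coverage here are assumptions for the example.

```python
# Toy spoken-to-written denormalizer for simple dollar amounts (0-99
# dollars and cents). Illustrative only; not the paper's FST or
# neural denormer.

UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
         "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
TEENS = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13,
         "fourteen": 14, "fifteen": 15, "sixteen": 16,
         "seventeen": 17, "eighteen": 18, "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

def words_to_int(words):
    """Convert a short number-word sequence (value 0-99) to an integer."""
    total = 0
    for w in words:
        if w in UNITS:
            total += UNITS[w]
        elif w in TEENS:
            total += TEENS[w]
        elif w in TENS:
            total += TENS[w]
        else:
            raise ValueError(f"unknown number word: {w}")
    return total

def denormalize_dollars(spoken):
    """Map e.g. 'one dollar and twenty five cents' -> '$1.25'."""
    tokens = spoken.lower().replace(",", "").split()
    # Split the token stream at the 'dollar(s)' keyword.
    d_idx = next(i for i, t in enumerate(tokens) if t.startswith("dollar"))
    dollars = words_to_int(tokens[:d_idx])
    rest = [t for t in tokens[d_idx + 1:] if t != "and"]
    cents = 0
    if rest and rest[-1].startswith("cent"):
        cents = words_to_int(rest[:-1])
    return f"${dollars}.{cents:02d}"

print(denormalize_dollars("one dollar and twenty five cents"))  # $1.25
```

The fragility of rules like these on long or unusual numeric sequences is exactly why the paper turns to TTS-generated training data and a learned denormer instead.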
Pages: 2185-2189
Page count: 5
Related papers (50 in total):
  • [1] DOES SPEECH ENHANCEMENT WORK WITH END-TO-END ASR OBJECTIVES?: EXPERIMENTAL ANALYSIS OF MULTICHANNEL END-TO-END ASR
    Ochiai, Tsubasa
    Watanabe, Shinji
    Katagiri, Shigeru
    2017 IEEE 27TH INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, 2017,
  • [2] META-LEARNING FOR IMPROVING RARE WORD RECOGNITION IN END-TO-END ASR
    Lux, Florian
    Ngoc Thang Vu
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5974 - 5978
  • [3] Towards Lifelong Learning of End-to-end ASR
    Chang, Heng-Jui
    Lee, Hung-yi
    Lee, Lin-shan
    INTERSPEECH 2021, 2021, : 2551 - 2555
  • [4] Contextual Biasing for End-to-End Chinese ASR
    Zhang, Kai
    Zhang, Qiuxia
    Wang, Chung-Che
    Jang, Jyh-Shing Roger
    IEEE ACCESS, 2024, 12 : 92960 - 92975
  • [5] UNSUPERVISED MODEL ADAPTATION FOR END-TO-END ASR
    Sivaraman, Ganesh
    Casal, Ricardo
    Garland, Matt
    Khoury, Elie
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6987 - 6991
  • [6] End-to-End Topic Classification without ASR
    Dong, Zexian
    Liu, Jia
    Zhang, Wei-Qiang
    2019 IEEE 19TH INTERNATIONAL SYMPOSIUM ON SIGNAL PROCESSING AND INFORMATION TECHNOLOGY (ISSPIT 2019), 2019,
  • [7] Phonemic competition in end-to-end ASR models
    ten Bosch, Louis
    Bentum, Martijn
    Boves, Lou
    INTERSPEECH 2023, 2023, : 586 - 590
  • [8] INCORPORATING WRITTEN DOMAIN NUMERIC GRAMMARS INTO END-TO-END CONTEXTUAL SPEECH RECOGNITION SYSTEMS FOR IMPROVED RECOGNITION OF NUMERIC SEQUENCES
    Haynor, Ben
    Aleksic, Petar S.
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7809 - 7813
  • [9] IMPROVING PROPER NOUN RECOGNITION IN END-TO-END ASR BY CUSTOMIZATION OF THE MWER LOSS CRITERION
    Peyser, Cal
    Sainath, Tara N.
    Pundak, Golan
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, : 7789 - 7793
  • [10] Improving end-to-end performance by active queue management
    Ku, CF
    Chen, SJ
    Ho, JM
    Chang, RI
    AINA 2005: 19th International Conference on Advanced Information Networking and Applications, Vol 2, 2005, : 337 - 340