STREAMING, FAST AND ACCURATE ON-DEVICE INVERSE TEXT NORMALIZATION FOR AUTOMATIC SPEECH RECOGNITION

Cited by: 3
Authors
Gaur, Yashesh [1 ]
Kibre, Nick [1 ]
Xue, Jian [1 ]
Shu, Kangyuan [1 ]
Wang, Yuhui [1 ]
Alphonso, Issac [1 ]
Li, Jinyu [1 ]
Gong, Yifan [1 ]
Affiliation
[1] Microsoft Corp, Redmond, WA 98052 USA
Source
2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT | 2022
Keywords
Inverse Text Normalization; Automatic Speech Recognition; on-device; streaming
DOI
10.1109/SLT54892.2023.10022543
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Automatic Speech Recognition (ASR) systems typically yield output in lexical form. However, humans prefer written-form output. To bridge this gap, ASR systems usually employ Inverse Text Normalization (ITN). In previous works, Weighted Finite State Transducers (WFSTs) have been employed to do ITN. WFSTs are well suited to this task, but their size and run-time costs can make deployment in embedded applications challenging. In this paper, we describe the development of an on-device ITN system that is streaming, lightweight, and accurate. At the core of our system is a streaming transformer tagger that tags lexical tokens from ASR. The tag indicates which ITN category, if any, might be applied. Following that, we apply an ITN-category-specific WFST, only on the tagged text, to reliably perform the ITN conversion. We show that the proposed ITN solution performs on par with strong baselines, while being significantly smaller in size and retaining customization capabilities.
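The tag-then-convert flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: a dictionary lookup stands in for the streaming transformer tagger, a small function stands in for a category-specific WFST, and every name here (`toy_tagger`, `cardinal_converter`, `CONVERTERS`, `apply_itn`, the `CARDINAL` tag) is hypothetical. The sketch processes a whole utterance at once rather than streaming, and only shows how conversion is restricted to tagged spans.

```python
from typing import Callable, Dict, List

# Toy lexicon (an assumption for illustration): units and tens only.
NUMBER_WORDS: Dict[str, int] = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9,
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
}


def toy_tagger(tokens: List[str]) -> List[str]:
    """Stand-in for the tagger: one tag per lexical token, "O" = no ITN."""
    return ["CARDINAL" if t in NUMBER_WORDS else "O" for t in tokens]


def cardinal_converter(span: List[str]) -> str:
    """Stand-in for a category-specific WFST.

    Sums the values in the span, which is only correct for simple
    tens-plus-units phrases such as "twenty three".
    """
    return str(sum(NUMBER_WORDS[t] for t in span))


# One converter per ITN category; untagged text never reaches a converter.
CONVERTERS: Dict[str, Callable[[List[str]], str]] = {
    "CARDINAL": cardinal_converter,
}


def apply_itn(tokens: List[str]) -> str:
    """Tag tokens, then convert each maximal run of identically tagged tokens."""
    tags = toy_tagger(tokens)
    out: List[str] = []
    i = 0
    while i < len(tokens):
        tag = tags[i]
        if tag == "O":
            out.append(tokens[i])  # pass untagged tokens through unchanged
            i += 1
            continue
        j = i
        while j < len(tokens) and tags[j] == tag:
            j += 1  # extend the span while the tag stays the same
        out.append(CONVERTERS[tag](tokens[i:j]))
        i = j
    return " ".join(out)


print(apply_itn("i have twenty three apples".split()))  # i have 23 apples
```

Because each converter only ever sees the span its tag selects, the per-category rules can stay small, which is the size argument the abstract makes for pairing a tagger with category-specific WFSTs.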
Pages: 237-244
Page count: 8