Spelling-Aware Word-Based End-to-End ASR

Cited by: 1
Authors
Egorova, Ekaterina [1]
Vydana, Hari Krishna [1]
Burget, Lukas [1]
Cernocky, Jan Honza [1]
Affiliations
[1] Brno Univ Technol, Fac Informat Technol Speech FIT, CS-61090 Brno, Czech Republic
Funding
European Union Horizon 2020;
Keywords
Training; Vocabulary; Task analysis; Decoding; Predictive models; Training data; Recurrent neural networks; ASR; end-to-end; listen attend and spell architecture; OOV;
DOI
10.1109/LSP.2022.3192199
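Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronics and communication technology];
Discipline classification codes
0808; 0809;
Abstract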
We propose a new end-to-end architecture for automatic speech recognition that expands the "listen, attend and spell" (LAS) paradigm. While the main network is trained to predict words, a secondary speller network is optimized to predict word spellings from inner representations of the main network (e.g., word embeddings or context vectors from the attention module). We show that this joint training improves the word error rate of a word-based system and enables solving additional tasks, such as out-of-vocabulary word detection and recovery. The tests are conducted on the LibriSpeech dataset, which consists of 1000 h of read speech.
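The joint training described in the abstract can be illustrated with a minimal sketch. The abstract states only that the word-predicting network and the speller are trained jointly; the weighted-sum form of the objective, the `speller_weight` parameter, and the function names below are assumptions made for illustration and are not taken from the paper.

```python
import math


def cross_entropy(probs, target):
    """Negative log-probability of the target index under one distribution."""
    return -math.log(probs[target])


def joint_loss(word_probs, word_target, char_probs_seq, char_targets,
               speller_weight=0.5):
    """Hypothetical joint objective for one word position.

    word_probs:     distribution over the word vocabulary (main network)
    char_probs_seq: one distribution per character (speller network)
    speller_weight: assumed interpolation weight, not a value from the paper
    """
    # Word-level loss from the main word-predicting network.
    word_loss = cross_entropy(word_probs, word_target)
    # Character-level loss from the speller, averaged over the spelling.
    char_losses = [cross_entropy(p, t)
                   for p, t in zip(char_probs_seq, char_targets)]
    speller_loss = sum(char_losses) / len(char_losses)
    return word_loss + speller_weight * speller_loss
```

In a real system both terms would be computed from network outputs and backpropagated together, so the speller's gradient also shapes the inner representations of the main network; the sketch shows only how the two losses might be combined.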
Pages: 1729-1733
Number of pages: 5