END-TO-END TRAINING OF A LARGE VOCABULARY END-TO-END SPEECH RECOGNITION SYSTEM

Cited by: 0
Authors
Kim, Chanwoo [1]
Kim, Sungsoo [1]
Kim, Kwangyoun [1]
Kumar, Mehul [1]
Kim, Jiyeon [1]
Lee, Kyungmin [1]
Han, Changwoo [1]
Garg, Abhinav [1]
Kim, Eunhyang [1]
Shin, Minkyoo [1]
Singh, Shatrughan [1]
Heck, Larry [1]
Gowda, Dhananjaya [1]
Affiliations
[1] Samsung Research, Seoul, South Korea
Source
2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019) | 2019
Keywords
end-to-end speech recognition; distributed training; example server; data augmentation; acoustic simulation; deep neural networks
DOI
10.1109/asru46091.2019.9003976
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we present an end-to-end training framework for building state-of-the-art end-to-end speech recognition systems. Our training system utilizes a cluster of Central Processing Units (CPUs) and Graphics Processing Units (GPUs). Data reading, large-scale data augmentation, and neural network parameter updates are all performed "on-the-fly". We use vocal tract length perturbation [1] and an acoustic simulator [2] for data augmentation. The processed features and labels are sent to the GPU cluster. The Horovod allreduce approach is employed to train neural network parameters. We evaluated the effectiveness of our system on the standard LibriSpeech corpus [3] and the 10,000-hr anonymized Bixby English dataset. Our end-to-end speech recognition system built using this training infrastructure achieved a word error rate (WER) of 2.44% on the LibriSpeech test-clean set after applying shallow fusion with a Transformer language model (LM). For the proprietary English Bixby open domain test set, we obtained a WER of 7.92% using a Bidirectional Full Attention (BFA) end-to-end model after applying shallow fusion with an RNN-LM. When the monotonic chunkwise attention (MoChA) based approach is employed for streaming speech recognition, we obtained a WER of 9.95% on the same Bixby open domain test set.
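The abstract reports results "after applying shallow fusion" with an external LM. As a rough illustration only, shallow fusion is commonly described as adding a weighted LM log-probability to the recognizer's log-probability for each candidate token during decoding. The sketch below assumes that common formulation; the function name `shallow_fusion`, the toy vocabulary, the probabilities, and the weight 0.5 are all hypothetical and are not taken from the paper.

```python
import math


def shallow_fusion(asr_log_probs, lm_log_probs, lm_weight):
    """Return fused per-token scores: log P_asr + lm_weight * log P_lm.

    Both inputs are dicts over the same (toy) vocabulary. This is an
    illustrative formulation, not the authors' implementation.
    """
    return {
        tok: asr_log_probs[tok] + lm_weight * lm_log_probs[tok]
        for tok in asr_log_probs
    }


# Hypothetical next-token distributions for one decoding step.
asr = {"their": math.log(0.5), "there": math.log(0.4), "three": math.log(0.1)}
lm = {"their": math.log(0.1), "there": math.log(0.8), "three": math.log(0.1)}

fused = shallow_fusion(asr, lm, lm_weight=0.5)
best_without_lm = max(asr, key=asr.get)
best_with_lm = max(fused, key=fused.get)
```

In this toy step the recognizer alone prefers one token, while the fused score prefers the token the LM considers more likely. In a real decoder the same combination would be applied to every hypothesis inside beam search, with the LM weight tuned on held-out data.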
Pages: 562-569
Page count: 8
Related papers
59 in total
[1] Abadi M, 2016, PROCEEDINGS OF OSDI'16: 12TH USENIX SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMENTATION, P265
[2] [Anonymous], 2019 IEEE AUT SPEECH
[3] [Anonymous], THESIS
[4] [Anonymous], IBM SPECTR VERS 10 R
[5] [Anonymous], IEEE ACM T AUDIO SPE
[6] [Anonymous], INT C LEARN REPR APR
[7] [Anonymous], 2019 IEEE AUT SPEECH
[8] [Anonymous], 2013, INT C MACH LEARN ICM
[9] [Anonymous], 2017, ACCURATE LARGE MINIB
[10] [Anonymous], 2013, P INT C LEARN REPR