Hybrid end-to-end model for Kazakh speech recognition

Cited by: 15
Authors
Mamyrbayev O.Z. [1 ,3 ]
Oralbekova D.O. [1 ,2 ]
Alimhan K. [1 ,4 ]
Nuranbayeva B.M. [5 ]
Affiliations
[1] Institute of Information and Computational Technologies CS MES RK, 28 Shevchenko Str., Almaty
[2] Satbayev University, Almaty
[3] Al-Farabi Kazakh National University, Almaty
[4] L.N. Gumilyov Eurasian National University, Satpayev Str., 2, Nur-Sultan
[5] Caspian University, Dostyk 85A, Almaty
Keywords
Attention; Automatic speech recognition; Connectionist temporal classification; End-to-end; Low resource language
DOI
10.1007/s10772-022-09983-8
Abstract
Modern automatic speech recognition systems based on end-to-end (E2E) models achieve good recognition accuracy for languages that have large corpora of several thousand hours of speech for system training. Such models require a very large amount of training data, which is problematic for low-resource languages such as Kazakh. However, many studies have shown that combining connectionist temporal classification (CTC) with other E2E models improves system performance even with limited training data. To this end, a speech corpus of the Kazakh language was assembled and expanded using augmentation. Our work presents the implementation of a joint model of CTC and the attention mechanism for Kazakh speech recognition, which addresses the problem of fast decoding and training of the system. The results demonstrated that the proposed E2E model, used together with language models, improved system performance and achieved the best result on our dataset for the Kazakh language. In the experiments, the system achieved competitive results in Kazakh speech recognition. © 2022, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
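The joint CTC-attention objective described in the abstract interpolates two losses, L = λ·L_CTC + (1−λ)·L_attention. As an illustrative sketch only (not the authors' implementation), the snippet below computes the CTC term with the standard forward algorithm over a tiny hypothetical two-symbol vocabulary and combines it with a stand-in attention cross-entropy term; all probabilities and the weight λ = 0.3 are toy values chosen for the example.

```python
import math

def ctc_neg_log_likelihood(probs, target, blank=0):
    """CTC loss via the forward (alpha) recursion over the extended
    label sequence with blanks interleaved around every label.

    probs: per-frame probability distributions over the vocabulary.
    target: label sequence without blanks.
    """
    ext = [blank]
    for lab in target:
        ext += [lab, blank]
    S, T = len(ext), len(probs)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][blank]
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]
            if s >= 1:
                a += alpha[t - 1][s - 1]
            # Skip transition is allowed only between distinct non-blank labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * probs[t][ext[s]]
    # Valid alignments end on the last label or the trailing blank.
    total = alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)
    return -math.log(total)

def joint_loss(ctc_nll, att_nll, lam=0.3):
    """Interpolated hybrid objective: lam * L_CTC + (1 - lam) * L_attention."""
    return lam * ctc_nll + (1 - lam) * att_nll

# Toy example: vocabulary {0: blank, 1: 'a'}, two frames, target "a".
# The three alignments a-a, a-blank, blank-a each have probability 0.25,
# so the CTC term is -log(0.75).
frames = [[0.5, 0.5], [0.5, 0.5]]
ctc = ctc_neg_log_likelihood(frames, [1])
att = -math.log(0.8)  # stand-in: decoder's probability of emitting "a"
print(round(joint_loss(ctc, att), 4))  # → 0.2425
```

In a real hybrid system both terms would be differentiable functions of shared encoder states, so one backward pass trains the CTC branch and the attention decoder jointly; the interpolation weight balances CTC's monotonic alignment constraint against the attention decoder's flexibility.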
Pages: 261-270
Number of pages: 9