End-to-End Large Vocabulary Speech Recognition for the Serbian Language

Cited by: 6
Authors
Popovic, Branislav [1, 2]
Pakoci, Edvin [1, 2]
Pekar, Darko [1, 2]
Affiliations
[1] Univ Novi Sad, Dept Power Elect & Telecommun Engn, Fac Tech Sci, Trg Dositeja Obradovica 6, Novi Sad 21000, Serbia
[2] AlfaNum Speech Technol, Bulevar Vojvode Stepe 40, Novi Sad 21000, Serbia
Source
SPEECH AND COMPUTER, SPECOM 2017 | 2017 / Vol. 10458
Keywords
Eesen; End-to-end; LSTM; Speech recognition; Serbian
DOI
10.1007/978-3-319-66429-3_33
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
This paper presents the results of large vocabulary speech recognition for the Serbian language, developed using the Eesen end-to-end framework. Eesen trains a single deep recurrent neural network, containing several bidirectional long short-term memory (LSTM) layers, that models the mapping between speech and a set of context-independent lexicon units. This approach reduces the amount of expert knowledge that is otherwise needed to develop competitive speech recognition systems. Training is based on connectionist temporal classification (CTC), while decoding can make use of weighted finite-state transducers (WFSTs), which provides much faster and more efficient decoding than other similar systems. A corpus of approximately 215 h of audio data (about 171 h of speech and 44 h of silence, from 243 male and 239 female speakers) was used for training (about 90%) and testing (about 10%). On a set of more than 120,000 words, a word error rate of 14.68% and a character error rate of 3.68% are achieved.
Pages: 343-352
Number of pages: 10
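The word error rate and character error rate quoted in the abstract are conventionally computed as an edit distance between the reference and recognized transcripts, divided by the reference length. The record does not describe the scoring procedure used in the paper, so the following is only a minimal sketch of that conventional definition. The function names are hypothetical, and counting spaces as characters in the character error rate is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions and
    deletions needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Edit distance over whitespace-separated words / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def character_error_rate(reference, hypothesis):
    """Edit distance over characters / reference character count.

    Assumption: spaces are counted as ordinary characters.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, `word_error_rate("dobar dan svima", "dobar dan")` counts one deleted word out of three reference words. Because a single wrong character makes the whole word wrong, the character error rate is usually well below the word error rate, which is consistent with the 3.68% and 14.68% figures reported above.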