Automatic Speech Recognition With Very Large Conversational Finnish and Estonian Vocabularies

Cited by: 23
Authors
Enarvi, Seppo [1 ]
Smit, Peter [1 ]
Virpioja, Sami [1 ]
Kurimo, Mikko [1 ]
Affiliations
[1] Aalto Univ, Dept Signal Proc & Acoust, Espoo 02150, Finland
Funding
Academy of Finland
Keywords
Artificial neural networks; automatic speech recognition; language modeling; subword units; word classes; NEURAL-NETWORKS;
DOI
10.1109/TASLP.2017.2743344
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Today, the vocabulary size for language models in large vocabulary speech recognition is typically several hundreds of thousands of words. While this is already sufficient in some applications, the out-of-vocabulary words are still limiting the usability in others. In agglutinative languages the vocabulary for conversational speech should include millions of word forms to cover the spelling variations due to colloquial pronunciations, in addition to the word compounding and inflections. Very large vocabularies are also needed, for example, when the recognition of rare proper names is important. Previously, very large vocabularies have been efficiently modeled in conventional n-gram language models either by splitting words into subword units or by clustering words into classes. While vocabulary size is not as critical anymore in modern speech recognition systems, training time and memory consumption become an issue when state-of-the-art neural network language models are used. In this paper, we investigate techniques that address the vocabulary size issue by reducing the effective vocabulary size and by processing large vocabularies more efficiently. The experimental results in conversational Finnish and Estonian speech recognition indicate that properly defined word classes improve recognition accuracy. Subword n-gram models are not better on evaluation data than word n-gram models constructed from a vocabulary that includes all the words in the training corpus. However, when recurrent neural network (RNN) language models are used, their ability to utilize long contexts gives a larger gain to subword-based modeling. Our best results are from RNN language models that are based on statistical morphs. We show that the suitable size for a subword vocabulary depends on the language. Using time delay neural network acoustic models, we were able to achieve a new state of the art in Finnish and Estonian conversational speech recognition: 27.1% word error rate in the Finnish task and 21.9% in the Estonian task.
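The abstract describes two ways to shrink the effective vocabulary: clustering words into classes and splitting words into subword units. For word classes, the standard class-based bigram factorization is P(w_i | w_{i-1}) ≈ P(w_i | c(w_i)) * P(c(w_i) | c(w_{i-1})), where c(w) is the class of word w. The Python sketch below (not the authors' code) illustrates only the subword route: a greedy longest-match segmentation with a "+" continuation marker so that recognizer output can be rejoined into full words. The tiny morph lexicon and the marking scheme are assumptions for illustration; in the paper the segmentations are learned as statistical morphs rather than written by hand.

    # Minimal sketch, not the authors' implementation.
    # Hypothetical Finnish morph lexicon; real systems learn this statistically.
    MORPHS = {"talo", "i", "ssa", "kirja", "sto"}

    def segment(word, morphs=MORPHS):
        """Greedy left-to-right split of a word into known morphs.
        Non-final pieces get a '+' continuation marker so the word can be rebuilt."""
        pieces, i = [], 0
        while i < len(word):
            for j in range(len(word), i, -1):      # try the longest match first
                if word[i:j] in morphs:
                    pieces.append(word[i:j])
                    i = j
                    break
            else:
                pieces.append(word[i])             # fall back to single characters
                i += 1
        return [p + "+" for p in pieces[:-1]] + [pieces[-1]]

    def join(subwords):
        """Rebuild words from recognizer output: '+' marks a word continuation."""
        words, current = [], ""
        for s in subwords:
            if s.endswith("+"):
                current += s[:-1]
            else:
                words.append(current + s)
                current = ""
        return words

    print(segment("kirjastossa"))                              # ['kirja+', 'sto+', 'ssa']
    print(join(segment("kirjastossa") + segment("talossa")))   # ['kirjastossa', 'talossa']

With a marking scheme like this, the language model vocabulary contains subword units instead of surface word forms, so millions of inflected, compounded, and colloquially spelled word forms can be covered by a vocabulary of a few thousand units.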
Pages: 2085-2097
Page count: 13