Automatic Speech Recognition With Very Large Conversational Finnish and Estonian Vocabularies

Cited by: 23
Authors
Enarvi, Seppo [1]
Smit, Peter [1]
Virpioja, Sami [1]
Kurimo, Mikko [1]
Affiliations
[1] Aalto Univ, Dept Signal Proc & Acoust, Espoo 02150, Finland
Funding
Academy of Finland
Keywords
Artificial neural networks; automatic speech recognition; language modeling; subword units; word classes; NEURAL-NETWORKS;
DOI
10.1109/TASLP.2017.2743344
Chinese Library Classification (CLC) number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Today, the vocabulary size for language models in large vocabulary speech recognition is typically several hundreds of thousands of words. While this is already sufficient in some applications, the out-of-vocabulary words are still limiting the usability in others. In agglutinative languages the vocabulary for conversational speech should include millions of word forms to cover the spelling variations due to colloquial pronunciations, in addition to the word compounding and inflections. Very large vocabularies are also needed, for example, when the recognition of rare proper names is important. Previously, very large vocabularies have been efficiently modeled in conventional n-gram language models either by splitting words into subword units or by clustering words into classes. While vocabulary size is not as critical anymore in modern speech recognition systems, training time and memory consumption become an issue when state-of-the-art neural network language models are used. In this paper, we investigate techniques that address the vocabulary size issue by reducing the effective vocabulary size and by processing large vocabularies more efficiently. The experimental results in conversational Finnish and Estonian speech recognition indicate that properly defined word classes improve recognition accuracy. Subword n-gram models are not better on evaluation data than word n-gram models constructed from a vocabulary that includes all the words in the training corpus. However, when recurrent neural network (RNN) language models are used, their ability to utilize long contexts gives a larger gain to subword-based modeling. Our best results are from RNN language models that are based on statistical morphs. We show that the suitable size for a subword vocabulary depends on the language.
Using time delay neural network acoustic models, we were able to achieve new state of the art in Finnish and Estonian conversational speech recognition, 27.1% word error rate in the Finnish task and 21.9% in the Estonian task.
Pages: 2085-2097
Page count: 13
Related papers
50 in total
  • [21] The 2001 BYBLOS English large vocabulary conversational speech recognition system
    Matsoukas, S
    Colthurst, T
    Kimball, O
    Solomonoff, A
    Richardson, F
    Quillen, C
    Gish, H
    Dognin, P
    2002 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-IV, PROCEEDINGS, 2002, : 721 - 724
  • [22] The BBN Byblos 1997 Large Vocabulary conversational Speech Recognition system
    Zavaliagkos, G
    McDonough, J
    Miller, D
    El-Jaroudi, A
    Billa, J
    Richardson, F
    Ma, K
    Siu, M
    Gish, H
    PROCEEDINGS OF THE 1998 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-6, 1998, : 905 - 908
  • [23] Chinese speech recognition system with very large vocabulary
    Qin, Y
    Mo, FY
    Li, CL
    Guan, DH
    ICSP '96 - 1996 3RD INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING, PROCEEDINGS, VOLS I AND II, 1996, : 817 - 820
  • [24] Evaluation of Automatic Speech Recognition Prototype for Estonian Language in Radiology Domain: A Pilot Study
    Paats, A.
    Alumaee, T.
    Meister, E.
    Fridolin, I.
    16TH NORDIC-BALTIC CONFERENCE ON BIOMEDICAL ENGINEERING, 2015, 48 : 96 - 99
  • [25] Automatic transcription of conversational telephone speech
    Hain, T
    Woodland, PC
    Evermann, G
    Gales, MJF
    Liu, XY
    Moore, GL
    Povey, D
    Wang, L
    IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, 2005, 13 (06): : 1173 - 1185
  • [26] Automatic linguistic segmentation of conversational speech
    Stolcke, A
    Shriberg, E
    ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, 1996, : 1005 - 1008
  • [27] Speech recognition on Mandarin Call Home: A large-vocabulary, conversational, and telephone speech corpus
    Liu, FH
    Picheny, M
    Srinivasa, P
    Monkowski, M
    Chen, JL
    1996 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, CONFERENCE PROCEEDINGS, VOLS 1-6, 1996, : 157 - 160
  • [28] ISOLATED WORD RECOGNITION FOR LARGE VOCABULARIES
    RABINER, LR
    ROSENBERG, AE
    WILPON, JG
    KEILIN, WJ
    BELL SYSTEM TECHNICAL JOURNAL, 1982, 61 (10): : 2989 - 3005
  • [29] Towards Large Vocabulary Automatic Speech Recognition for Latvian
    Salimbajevs, Askars
    Pinnis, Marcis
    HUMAN LANGUAGE TECHNOLOGIES - THE BALTIC PERSPECTIVE, BALTIC HLT 2014, 2014, 268 : 236 - 243
  • [30] Strategies for lexical access to very large vocabularies
    Fissore, Luciano
    Laface, Piero
    Micca, Giorgio
    Pieraccini, Roberto
    CSELT Technical Reports, 1988, 16 (07): : 601 - 609