Automatic Speech Recognition With Very Large Conversational Finnish and Estonian Vocabularies

Cited by: 23
Authors
Enarvi, Seppo [1]
Smit, Peter [1]
Virpioja, Sami [1]
Kurimo, Mikko [1]
Affiliations
[1] Aalto Univ, Dept Signal Proc & Acoust, Espoo 02150, Finland
Funding
Academy of Finland
Keywords
Artificial neural networks; automatic speech recognition; language modeling; subword units; word classes; NEURAL-NETWORKS;
DOI
10.1109/TASLP.2017.2743344
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Today, the vocabulary size for language models in large vocabulary speech recognition is typically several hundreds of thousands of words. While this is already sufficient in some applications, the out-of-vocabulary words are still limiting the usability in others. In agglutinative languages the vocabulary for conversational speech should include millions of word forms to cover the spelling variations due to colloquial pronunciations, in addition to the word compounding and inflections. Very large vocabularies are also needed, for example, when the recognition of rare proper names is important. Previously, very large vocabularies have been efficiently modeled in conventional n-gram language models either by splitting words into subword units or by clustering words into classes. While vocabulary size is not as critical anymore in modern speech recognition systems, training time and memory consumption become an issue when state-of-the-art neural network language models are used. In this paper, we investigate techniques that address the vocabulary size issue by reducing the effective vocabulary size and by processing large vocabularies more efficiently. The experimental results in conversational Finnish and Estonian speech recognition indicate that properly defined word classes improve recognition accuracy. Subword n-gram models are not better on evaluation data than word n-gram models constructed from a vocabulary that includes all the words in the training corpus. However, when recurrent neural network (RNN) language models are used, their ability to utilize long contexts gives a larger gain to subword-based modeling. Our best results are from RNN language models that are based on statistical morphs. We show that the suitable size for a subword vocabulary depends on the language.
Using time delay neural network acoustic models, we were able to achieve new state of the art in Finnish and Estonian conversational speech recognition, 27.1% word error rate in the Finnish task and 21.9% in the Estonian task.
Pages: 2085-2097
Number of pages: 13