Automatic Speech Recognition With Very Large Conversational Finnish and Estonian Vocabularies

Cited by: 23

Authors:

Enarvi, Seppo [1]

Smit, Peter [1]

Virpioja, Sami [1]

Kurimo, Mikko [1]

Affiliations:

[1] Aalto Univ, Dept Signal Proc & Acoust, Espoo 02150, Finland

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2017 / Vol. 25 / No. 11

Funding:

Academy of Finland;

Keywords:

Artificial neural networks; automatic speech recognition; language modeling; subword units; word classes; NEURAL-NETWORKS;

DOI:

10.1109/TASLP.2017.2743344

Chinese Library Classification number:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

Today, the vocabulary size for language models in large vocabulary speech recognition is typically several hundreds of thousands of words. While this is already sufficient in some applications, the out-of-vocabulary words are still limiting the usability in others. In agglutinative languages the vocabulary for conversational speech should include millions of word forms to cover the spelling variations due to colloquial pronunciations, in addition to the word compounding and inflections. Very large vocabularies are also needed, for example, when the recognition of rare proper names is important. Previously, very large vocabularies have been efficiently modeled in conventional n-gram language models either by splitting words into subword units or by clustering words into classes. While vocabulary size is not as critical anymore in modern speech recognition systems, training time and memory consumption become an issue when state-of-the-art neural network language models are used. In this paper, we investigate techniques that address the vocabulary size issue by reducing the effective vocabulary size and by processing large vocabularies more efficiently. The experimental results in conversational Finnish and Estonian speech recognition indicate that properly defined word classes improve recognition accuracy. Subword n-gram models are not better on evaluation data than word n-gram models constructed from a vocabulary that includes all the words in the training corpus. However, when recurrent neural network (RNN) language models are used, their ability to utilize long contexts gives a larger gain to subword-based modeling. Our best results are from RNN language models that are based on statistical morphs. We show that the suitable size for a subword vocabulary depends on the language. Using time delay neural network acoustic models, we were able to achieve new state of the art in Finnish and Estonian conversational speech recognition, 27.1% word error rate in the Finnish task and 21.9% in the Estonian task.

Pages: 2085-2097

Page count: 13
