Automatic Speech Recognition With Very Large Conversational Finnish and Estonian Vocabularies

Cited by: 23

Authors:

Enarvi, Seppo [1]

Smit, Peter [1]

Virpioja, Sami [1]

Kurimo, Mikko [1]

Affiliations:

[1] Aalto Univ, Dept Signal Proc & Acoust, Espoo 02150, Finland

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2017 / Vol. 25 / No. 11

Funding:

Academy of Finland;

Keywords:

Artificial neural networks; automatic speech recognition; language modeling; subword units; word classes; NEURAL-NETWORKS;

DOI:

10.1109/TASLP.2017.2743344

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

Today, the vocabulary size for language models in large vocabulary speech recognition is typically several hundreds of thousands of words. While this is already sufficient in some applications, the out-of-vocabulary words are still limiting the usability in others. In agglutinative languages the vocabulary for conversational speech should include millions of word forms to cover the spelling variations due to colloquial pronunciations, in addition to the word compounding and inflections. Very large vocabularies are also needed, for example, when the recognition of rare proper names is important. Previously, very large vocabularies have been efficiently modeled in conventional n-gram language models either by splitting words into subword units or by clustering words into classes. While vocabulary size is not as critical anymore in modern speech recognition systems, training time and memory consumption become an issue when state-of-the-art neural network language models are used. In this paper, we investigate techniques that address the vocabulary size issue by reducing the effective vocabulary size and by processing large vocabularies more efficiently. The experimental results in conversational Finnish and Estonian speech recognition indicate that properly defined word classes improve recognition accuracy. Subword n-gram models are not better on evaluation data than word n-gram models constructed from a vocabulary that includes all the words in the training corpus. However, when recurrent neural network (RNN) language models are used, their ability to utilize long contexts gives a larger gain to subword-based modeling. Our best results are from RNN language models that are based on statistical morphs. We show that the suitable size for a subword vocabulary depends on the language. Using time delay neural network acoustic models, we were able to achieve new state of the art in Finnish and Estonian conversational speech recognition, 27.1% word error rate in the Finnish task and 21.9% in the Estonian task.


Pages: 2085-2097

Page count: 13
