Deep Neural Network for Automatic Speech Recognition from Indonesian Audio using Several Lexicon Types

Cited by: 1
Authors
Abidin, Taufik Fuadi [1 ]
Misbullah, Alim [1 ]
Ferdhiana, Ridha [2 ]
Aksana, Muammar Zikri [1 ]
Farsiah, Laina [1 ]
Affiliations
[1] Universitas Syiah Kuala, Department of Informatics, Banda Aceh, Indonesia
[2] Universitas Syiah Kuala, Department of Statistics, Banda Aceh, Indonesia
Source
2020 INTERNATIONAL CONFERENCE ON ELECTRICAL ENGINEERING AND INFORMATICS (ICELTICS 2020) | 2020
Keywords
Automatic speech recognition; deep neural networks; lexicons; acoustic and language models;
DOI
10.1109/iceltics50595.2020.9315538
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline Classification Codes
0808; 0809;
Abstract
Recently, automatic speech recognition has benefited from advances in deep neural networks (DNNs) for training and deploying speech recognition models. Speech recognition models enable computers to recognize and translate spoken language into text. In this paper, we present an approach to creating an Indonesian voice-to-text dataset from audio collected from YouTube channels and to evaluating speech recognition models using several lexicon types. The lexicons are created from the unique words of the speech corpus. We compared the performance of Time Delay Neural Network Factorization (TDNNF) and Gaussian Mixture Model - Hidden Markov Model (GMM-HMM) models for monophone, delta+delta-delta, and Speaker Adaptive Training (SAT) configurations, based on word error rate (%WER), when trained on unvalidated and validated datasets. The results showed no significant difference in %WER among the lexicon types. Moreover, the results revealed that models trained on the validated dataset perform better than those trained on the unvalidated dataset. Additionally, the lexicon enmap_kv_vocab_full gave the best result, 29.41% WER, when trained with the TDNNF model on the unvalidated dataset, whereas the lexicon enmap_vocab_1_char gave the best result, 11.35% WER, when trained with the TDNNF model on the validated dataset.
Pages: 113-117 (5 pages)
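
Two concrete ingredients underpin the abstract: lexicons built from the unique words of the speech corpus, and models compared by word error rate (%WER). The Python sketch below is a minimal, hypothetical analogue of those two steps, not code from the paper; the grapheme-style lexicon only loosely mirrors the paper's lexicon types (the exact construction of enmap_kv_vocab_full and enmap_vocab_1_char is not described here), and all function and variable names are illustrative assumptions.

# Illustrative sketch (assumed, not from the paper): build a simple grapheme
# lexicon from the unique words of a transcript corpus and score a hypothesis
# against a reference with word error rate (%WER).

def build_grapheme_lexicon(transcript_lines):
    """Map each unique word to a space-separated character sequence
    (a rough analogue of a one-character-per-unit lexicon)."""
    vocab = set()
    for line in transcript_lines:
        vocab.update(line.lower().split())
    return {word: " ".join(word) for word in sorted(vocab)}

def word_error_rate(reference, hypothesis):
    """%WER = 100 * (substitutions + deletions + insertions) / reference length,
    computed with a standard Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return 100.0 * dp[len(ref)][len(hyp)] / max(len(ref), 1)

if __name__ == "__main__":
    transcripts = ["saya pergi ke pasar", "dia pergi ke sekolah"]
    lexicon = build_grapheme_lexicon(transcripts)
    print(lexicon["pasar"])                                            # p a s a r
    print(word_error_rate("saya pergi ke pasar", "saya pergi pasar"))  # 25.0

The 25.0 %WER in the example simply reflects one deleted word out of a four-word reference; the paper's 11.35% and 29.41% figures presumably come from the same standard formula applied to its validated and unvalidated test sets.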