Speech recognition is a machine's ability to recognize words from speech. Speech recognition technology turns speech into a practical form of human-machine interaction, allowing humans to control smart devices more easily. However, speech variability, such as dialect or accent, vocabulary size, type of recognition, speaking rate, environmental noise, and microphone type, affects the speech recognition rate. In recent years, deep learning approaches such as CNN and BLSTM have been widely used and have delivered significant recognition improvements. Inspired by the strength of CNNs in exploiting local interspectral correlations and capturing frequency variations in speech signals, and of BLSTMs in learning temporal context, this study uses a hybrid CNN-BLSTM model for speech recognition with CTC as the decoder. This study uses continuous speech data in Indonesian with five different dialects, namely Balinese, Bataknese, Javanese, Minangese, and Sundanese. Four test scenarios are carried out sequentially to improve speech recognition performance, covering layer structure, use of dropout, number of filters and units, and type of input features. The first three scenarios use only 13 MFCC coefficients without delta features as the input. The results show that the combination of 2 CNN layers with 64 filters and 2 BLSTM layers with 128 units, with a dropout rate of 0.2 applied to all hidden layers, achieves a WER of 37.31%. The addition of delta and double-delta features further reduces the recognition error, achieving a WER of 10.80%.
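The input-feature step can be illustrated with a minimal sketch of how delta and double-delta coefficients are typically derived from a 13-coefficient MFCC matrix and stacked into a 39-dimensional feature vector per frame. This is not the authors' implementation; the function name and the half-window size N=2 are assumptions based on the standard regression-based delta formula.

```python
import numpy as np

def delta(feat, N=2):
    """Regression-based delta features over a half-window of N frames.
    feat: (frames, coeffs) feature matrix. N=2 is an assumed default."""
    T = feat.shape[0]
    denom = 2 * sum(n * n for n in range(1, N + 1))  # normalizer: 10 for N=2
    padded = np.pad(feat, ((N, N), (0, 0)), mode='edge')  # repeat edge frames
    out = np.zeros_like(feat, dtype=float)
    for t in range(T):
        # weighted slope of coefficients across the +/-N frame window
        out[t] = sum(
            n * (padded[t + N + n] - padded[t + N - n])
            for n in range(1, N + 1)
        ) / denom
    return out

# Example: 100 frames of 13 MFCC coefficients (random stand-in data)
mfcc = np.random.randn(100, 13)
d = delta(mfcc)        # delta (first-order dynamics)
dd = delta(d)          # double delta (second-order dynamics)
features = np.concatenate([mfcc, d, dd], axis=1)  # shape (100, 39)
```

Stacking the static, delta, and double-delta coefficients triples the per-frame dimensionality from 13 to 39, which is the feature change behind the WER drop from 37.31% to 10.80% reported above.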