Indonesian Continuous Speech Recognition Using CNN and Bidirectional LSTM

Cited by: 0
Authors
Naiborhu, Anwar Petrus F. [1 ]
Endah, Sukmawati Nur [1 ]
Affiliations
[1] Diponegoro Univ, Dept Informat, Semarang, Indonesia
Source
2021 5th International Conference on Informatics and Computational Sciences (ICICOS 2021) | 2021
Keywords
CNN; BLSTM; CTC; speech recognition; continuous speech; Bahasa Indonesia; deep neural networks
DOI
10.1109/ICICOS53627.2021.9651902
CLC Classification Number
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Speech recognition is a machine's ability to recognize words from speech. Speech recognition technology turns human-machine interaction into a practical reality, allowing humans to control smart devices more easily. However, variability in speech, such as dialect or accent, vocabulary size, type of recognition, speaking rate, environmental noise, and microphone type, affects the recognition rate. In recent years, deep learning approaches such as CNN and BLSTM have been widely used and have yielded significant recognition improvements. Inspired by the strengths of CNNs in exploiting local spectral correlations and capturing frequency variations in speech signals, and of BLSTMs in learning temporal context, this study uses a hybrid CNN-BLSTM model for speech recognition with CTC as the decoder. The study uses continuous Indonesian speech data covering five dialects: Balinese, Bataknese, Javanese, Minangese, and Sundanese. Four test scenarios were carried out sequentially to improve recognition performance: layer structure, use of dropout, number of filters and units, and type of input features. The first three scenarios use only 13 MFCC coefficients, without delta features, as input. The results show that combining 2 CNN layers with 64 filters and 2 BLSTM layers with 128 units, with a dropout rate of 0.2 on all hidden layers, achieves a WER of 37.31%. Adding delta and double-delta features further reduces the recognition error, achieving a WER of 10.80%.
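The following is a minimal sketch of the best-performing configuration described in the abstract (2 CNN layers with 64 filters, 2 BLSTM layers with 128 units, dropout 0.2 on all hidden layers, per-frame outputs for CTC), assuming TensorFlow/Keras. The kernel sizes, output alphabet, and framework are assumptions not stated in the abstract; this is an illustrative reconstruction, not the authors' implementation.

```python
# Illustrative sketch of the CNN-BLSTM model from the abstract.
# Assumptions (not stated in the paper): 3x3 kernels, a grapheme-level
# output alphabet, and TensorFlow/Keras as the framework.
import tensorflow as tf
from tensorflow.keras import layers

NUM_FEATURES = 39  # 13 MFCC + 13 delta + 13 double-delta (per the abstract)
NUM_CLASSES = 29   # hypothetical: 26 letters + space + apostrophe + CTC blank

def build_cnn_blstm(num_features=NUM_FEATURES, num_classes=NUM_CLASSES):
    # Input: a variable-length sequence of MFCC feature frames
    inputs = layers.Input(shape=(None, num_features))
    # Add a channel axis so Conv2D can slide over (time, frequency)
    x = layers.Reshape((-1, num_features, 1))(inputs)

    # Two CNN layers, 64 filters each, to exploit local spectral correlations
    for _ in range(2):
        x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
        x = layers.Dropout(0.2)(x)

    # Flatten the frequency and channel axes back into one vector per frame
    x = layers.Reshape((-1, num_features * 64))(x)

    # Two BLSTM layers, 128 units each, to model temporal context
    for _ in range(2):
        x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
        x = layers.Dropout(0.2)(x)

    # Per-frame softmax over the output alphabet; CTC aligns during training
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_cnn_blstm()
model.summary()
```

Training would pair this network with a CTC loss (e.g. tf.keras.backend.ctc_batch_cost or tf.nn.ctc_loss) over character label sequences, and the input features would typically be computed with a library such as librosa (librosa.feature.mfcc with n_mfcc=13, plus librosa.feature.delta with order 1 and 2 for the delta and double-delta coefficients).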
Pages: 6