INSTANTANEOUS FREQUENCY FILTER-BANK FEATURES FOR LOW RESOURCE SPEECH RECOGNITION USING DEEP RECURRENT ARCHITECTURES

Cited by: 2
Authors
Nayak, Shekhar [1]
Kumar, C. Shiva [2]
Murty, K. Sri Rama [2]
Affiliations
[1] Samsung R&D Inst India Bangalore, Bangalore, Karnataka, India
[2] Indian Inst Technol Hyderabad, Dept Elect Engn, Hyderabad, Telangana, India
Source
2021 NATIONAL CONFERENCE ON COMMUNICATIONS (NCC) | 2021
Keywords
Instantaneous frequency; feature extraction; RNN; Li-GRU; speech recognition; COMBINATION; EXTRACTION
DOI
10.1109/NCC52529.2021.9530049
CLC number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Recurrent neural networks (RNNs) and their variants have achieved significant success in speech recognition. Long short-term memory (LSTM) networks and gated recurrent units (GRUs) are the two most popular variants; they overcome the vanishing gradient problem of RNNs and learn long-term dependencies effectively. Light gated recurrent units (Li-GRUs) are more compact versions of standard GRUs and have been shown to provide better recognition accuracy with significantly faster training. These RNN-inspired architectures invariably use magnitude-based features, and the phase information is generally ignored. We propose to incorporate features derived from the analytic phase of speech signals for speech recognition using these RNN variants. Instantaneous frequency filter-bank (IFFB) features derived from Fourier transform relations performed on par with the standard MFCC features for recurrent-unit-based acoustic models despite being derived from phase information only. Different system combinations of IFFB features with magnitude-based features provided the lowest phone error rate (PER) of 12.9% and showed relative improvements of up to 16.8% over standalone MFCC features on TIMIT phone recognition using a Li-GRU-based architecture. IFFB features significantly outperformed modified group delay coefficient (MGDC) features in all our experiments.
Pages: 105-110
Page count: 6