Developing speaker independent ASR system using limited data through prosody modification based on fuzzy classification of spectral bins

Cited by: 12
Authors
Shahnawazuddin, S. [1 ]
Adiga, Nagaraj [2 ]
Sai, B. Tarun [1 ]
Ahmad, Waquar [3 ]
Kathania, Hemant K. [4 ]
Affiliations
[1] Natl Inst Technol Patna, Dept ECE, Patna, Bihar, India
[2] Univ Crete, Dept Comp Sci, Iraklion, Greece
[3] Natl Inst Technol Calicut, Dept ECE, Kattangal, India
[4] Natl Inst Technol Sikkim, Dept ECE, Sikkim, India
Keywords
Speaker-independent ASR; Children's speech recognition; Prosody modification; Fuzzy classification; Data augmentation; TIME-SCALE MODIFICATION; SPEECH RECOGNITION; CHILDRENS SPEECH; ADAPTATION; NOISE;
DOI
10.1016/j.dsp.2019.06.015
Chinese Library Classification
TM [Electrical engineering]; TN [Electronics and communication technology];
Subject classification codes
0808 ; 0809 ;
Abstract
The primary motive of this study is to develop an automatic speech recognition (ASR) system using a limited amount of speech data such that it is minimally affected by speaker-dependent acoustic variations. This work focuses on two factors that contribute to inter-speaker variability: pitch and speaking-rate variations. To simulate such a limited-data scenario, an ASR system is trained on adults' speech and tested using speech data from adult as well as child speakers. Compared with the adults' speech test case, recognition rates degrade severely when the test speech is from child speakers. The observed degradation is due to large differences in pitch and speaking rate between adults' and children's speech, along with other factors leading to inter-speaker acoustic variations. To overcome the mismatch in pitch and speaking rate, two different approaches are proposed in this paper. In the first approach, the pitch and speaking rate of the children's speech test set are explicitly modified using a recently proposed prosody modification technique that exploits fuzzy classification of spectral bins. In the second approach, the pitch and speaking rate of the training data are modified to create new versions of the data. To capture greater acoustic variability, the original and the modified versions are then pooled together. The ASR system trained on the augmented data is noted to be more robust to pitch and speaking-rate variations. Consequently, relative improvements of 17% and 31% over the baseline are obtained on decoding the adults' and children's speech test sets, respectively. © 2019 Elsevier Inc. All rights reserved.
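The second approach in the abstract (modify the training data, then pool the original and modified versions) can be illustrated with a minimal sketch. This is not the paper's method: the paper uses a prosody modification technique based on fuzzy classification of spectral bins, which can change pitch and speaking rate separately. The stand-in below uses plain linear-interpolation resampling, which changes duration and pitch together. The function names `resample_speed` and `augment`, the factor values, and the utterance-ID suffix are all hypothetical.

```python
def resample_speed(signal, factor):
    """Naive speed perturbation by linear-interpolation resampling.

    ASSUMPTION: a stand-in for the paper's prosody modification.
    factor > 1 shortens the signal (faster and higher-pitched when played
    at the original sampling rate); factor < 1 lengthens it.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_in = len(signal)
    if n_in == 0:
        return []
    n_out = max(1, round(n_in / factor))
    if n_out == 1:
        return [signal[0]]
    step = (n_in - 1) / (n_out - 1)  # spacing of output samples on the input axis
    out = []
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), n_in - 1)
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        out.append((1 - frac) * signal[lo] + frac * signal[hi])
    return out


def augment(utterances, factors=(0.9, 1.1)):
    """Pool each original utterance with its modified copies.

    `utterances` is a list of (utterance_id, samples) pairs; the factor
    values are illustrative, not taken from the paper.
    """
    pooled = []
    for utt_id, samples in utterances:
        pooled.append((utt_id, samples))  # keep the original version
        for f in factors:
            pooled.append((f"{utt_id}_sp{f}", resample_speed(samples, f)))
    return pooled
```

With two factors, each training utterance yields three pooled versions, so the augmented set is three times the size of the original. The transcripts are unchanged, so each modified copy would reuse the label of its source utterance.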
Pages: 34-42
Number of pages: 9