Multi-setting acoustic feature training for data augmentation of speech recognition

Cited by: 0
Authors
Ueno, Sei [1]
Lee, Akinobu [1]
Affiliations
[1] Nagoya Inst Technol, Gokiso Cho, Showa Ku, Nagoya 466-8555, Japan
Keywords
Speech recognition; Speech synthesis; Speech diversity; Data augmentation; Domain adaptation
DOI
10.1250/ast.e23.70
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
This paper presents simple multi-setting log Mel-scale filter bank (lmfb) training methods to close the gap between real and synthesized speech in automatic speech recognition (ASR) data augmentation. While end-to-end ASR suffers from a shortage of real speech data, its performance has been significantly improved by data synthesis techniques built on text-to-speech (TTS) systems. However, the speech generated by a TTS model is often monotonous and lacks the natural variation of real speech, which degrades ASR performance. We propose using multi-setting lmfb features in a data augmentation scheme to mitigate this problem. Multiple lmfb features are extracted with multiple short-time Fourier transform (STFT) parameter settings, drawn from well-known parameters for both ASR and TTS tasks. In addition, we propose training a single TTS model on multi-setting lmfb features, with the setting ID embedded in the text-to-Mel network. Experimental evaluations showed that both proposed multi-setting training methods achieved better ASR performance than the baseline single-setting training augmentation methods.
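A minimal sketch of the multi-setting lmfb extraction described in the abstract, written in Python with librosa. The two parameter sets below are illustrative stand-ins for typical short-window ASR-style and long-window TTS-style STFT settings; the exact values and the helper name extract_multi_setting_lmfb are assumptions, not the paper's configuration.

import numpy as np
import librosa

# Illustrative STFT settings (assumed, not the paper's exact values):
# a short-window ASR-style setting and a longer-window TTS-style setting.
SETTINGS = [
    {"n_fft": 512, "win_length": 400, "hop_length": 160, "n_mels": 80},
    {"n_fft": 1024, "win_length": 1024, "hop_length": 256, "n_mels": 80},
]

def extract_multi_setting_lmfb(wav_path, sr=16000):
    """Extract one log Mel-scale filter bank (lmfb) feature per STFT setting.

    Returns a list of (setting_id, lmfb) pairs; the setting ID is what the
    paper embeds in the text-to-Mel network of its single multi-setting TTS model.
    """
    y, _ = librosa.load(wav_path, sr=sr)
    features = []
    for setting_id, p in enumerate(SETTINGS):
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=p["n_fft"], win_length=p["win_length"],
            hop_length=p["hop_length"], n_mels=p["n_mels"])
        lmfb = np.log(np.maximum(mel, 1e-10))  # log compression with a numerical floor
        features.append((setting_id, lmfb))
    return features

# Usage: feats = extract_multi_setting_lmfb("utterance.wav")
# yields one lmfb array per setting for the same utterance.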
Pages: 195-203
Page count: 9