Multi-setting acoustic feature training for data augmentation of speech recognition

Cited by: 0
Authors
Ueno, Sei [1]
Lee, Akinobu [1]
Affiliations
[1] Nagoya Inst Technol, Gokiso Cho, Showa Ku, Nagoya 466-8555, Japan
Keywords
Speech recognition; Speech synthesis; Speech diversity; Data augmentation; Domain adaptation
DOI
10.1250/ast.e23.70
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
This paper presents simple multi-setting log Mel-scale filter bank (lmfb) training methods to close the gap between real and synthesized speech in automatic speech recognition (ASR) data augmentation. While end-to-end ASR suffers from a shortage of real speech data, its performance has been significantly improved by data synthesis techniques built on text-to-speech (TTS) systems. However, the speech generated by a TTS model is often monotonous and lacks the natural variation of real speech, which degrades ASR performance. We propose using multi-setting lmfb features in a data augmentation scheme to mitigate this problem. Multiple lmfb features are extracted with multiple short-time Fourier transform (STFT) parameter settings, drawn from well-known parameters for both ASR and TTS tasks. In addition, we propose training a single TTS model on multi-setting lmfb features, with the setting ID embedded in the text-to-Mel network. Experimental evaluations showed that both proposed multi-setting training methods achieved better ASR performance than the baseline single-setting training augmentation methods.
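A minimal sketch of the multi-setting lmfb extraction described in the abstract, written in Python with librosa. The two parameter sets below are illustrative stand-ins for typical short-window ASR-style and long-window TTS-style STFT settings; the exact values and the helper name extract_multi_setting_lmfb are assumptions, not the paper's configuration.

import numpy as np
import librosa

# Illustrative STFT settings (assumed, not the paper's exact values):
# a short-window ASR-style setting and a longer-window TTS-style setting.
SETTINGS = [
    {"n_fft": 512, "win_length": 400, "hop_length": 160, "n_mels": 80},
    {"n_fft": 1024, "win_length": 1024, "hop_length": 256, "n_mels": 80},
]

def extract_multi_setting_lmfb(wav_path, sr=16000):
    """Extract one log Mel-scale filter bank (lmfb) feature per STFT setting.

    Returns a list of (setting_id, lmfb) pairs; the setting ID is what the
    paper embeds in the text-to-Mel network of its single multi-setting TTS model.
    """
    y, _ = librosa.load(wav_path, sr=sr)
    features = []
    for setting_id, p in enumerate(SETTINGS):
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=p["n_fft"], win_length=p["win_length"],
            hop_length=p["hop_length"], n_mels=p["n_mels"])
        lmfb = np.log(np.maximum(mel, 1e-10))  # log compression with a numerical floor
        features.append((setting_id, lmfb))
    return features

# Usage: feats = extract_multi_setting_lmfb("utterance.wav")
# yields one lmfb array per setting for the same utterance.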
Pages: 195-203
Page count: 9