SPEAKER AUGMENTATION FOR LOW RESOURCE SPEECH RECOGNITION

Cited by: 0
Authors
Du, Chenpeng [1]
Yu, Kai [1]
Affiliation
[1] Shanghai Jiao Tong Univ, MoE Key Lab Artificial Intelligence, Dept Comp Sci & Engn, SpeechLab, Shanghai, Peoples R China
Source
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING | 2020
Keywords
Low resource; speech recognition; speech synthesis; variational autoencoder
DOI
10.1109/icassp40776.2020.9053139
Chinese Library Classification number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
Text-to-speech synthesis (TTS) has been used as a data augmentation approach for automatic speech recognition (ASR), leveraging additional texts for ASR training. However, in low-resource tasks, usually only a limited number of speakers are available, leading to a lack of speaker variation in the synthetic speech. In this paper, we propose a novel speaker augmentation approach that can synthesize data with sufficient speaker and text diversity. Here, an end-to-end TTS system is trained with speaker representations from a variational auto-encoder (VAE), which enables TTS to synthesize speech from unseen new speakers by sampling from the trained latent distribution. As a new type of data augmentation approach, speaker augmentation can be combined with traditional feature augmentation approaches, such as SpecAugment. Experiments on a Switchboard task show that, given 50 hours of data, the proposed speaker augmentation with SpecAugment significantly reduces word error rate (WER) by 30% relative compared to the system without any data augmentation, and by about 18% relative compared to the system with SpecAugment.
Pages: 7719-7723
Number of pages: 5