SPEAKER AUGMENTATION FOR LOW RESOURCE SPEECH RECOGNITION

Cited by: 0

Authors:

Du, Chenpeng [1]

Yu, Kai [1]

Affiliations:

[1] Shanghai Jiao Tong Univ, MoE Key Lab Artificial Intelligence, Dept Comp Sci & Engn, SpeechLab, Shanghai, Peoples R China

Source:

2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING | 2020

Keywords:

Low resource; speech recognition; speech synthesis; variational autoencoder;

DOI:

10.1109/icassp40776.2020.9053139

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification code:

070206 ; 082403 ;

Abstract:

Text-to-speech synthesis (TTS) has been used as a data augmentation approach for automatic speech recognition (ASR), leveraging additional texts for ASR training. However, in low-resource tasks, usually only a limited number of speakers are available, so the synthetic speech lacks speaker variation. In this paper, we propose a novel speaker augmentation approach which can synthesize data with sufficient speaker and text diversity. Here, an end-to-end TTS system is trained with speaker representations from a variational auto-encoder (VAE), which enables TTS to synthesize speech from unseen new speakers via sampling from the trained latent distribution. As a new type of data augmentation approach, speaker augmentation can be combined with traditional feature augmentation approaches, such as SpecAugment. Experiments on a Switchboard task show that, given 50 hours of data, the proposed speaker augmentation with SpecAugment significantly reduces word error rate (WER): by 30% relative to the system without any data augmentation, and by about 18% relative to the system with SpecAugment alone.
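
The two relative figures quoted in the abstract can be cross-checked against each other. The Python sketch below derives what they would imply for SpecAugment used on its own. The function names are illustrative and do not come from the paper. The sketch assumes both percentages describe the same final system (speaker augmentation plus SpecAugment). Because the abstract reports no absolute WERs, only ratios are used.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction of new_wer with respect to baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


def implied_specaugment_reduction(combined_vs_none: float,
                                  combined_vs_specaug: float) -> float:
    """Relative reduction of SpecAugment alone vs. no augmentation.

    Assumption (not stated in the abstract): both inputs describe the
    same combined system, so
        combined = (1 - combined_vs_none)    * none
        combined = (1 - combined_vs_specaug) * specaug
    which gives specaug / none = (1 - a) / (1 - b).
    """
    ratio = (1.0 - combined_vs_none) / (1.0 - combined_vs_specaug)
    return 1.0 - ratio


if __name__ == "__main__":
    # 30% and "about 18%" are the figures quoted in the abstract.
    implied = implied_specaugment_reduction(0.30, 0.18)
    print(f"{implied:.3f}")  # prints 0.146
```

Under these assumptions, SpecAugment alone would correspond to roughly a 15% relative WER reduction over the unaugmented system. Since the abstract gives the second figure only as "about 18%", this value is approximate.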


Pages: 7719-7723

Page count: 5

