Pre-training Techniques for Improving Text-to-Speech Synthesis by Automatic Speech Recognition Based Data Enhancement

Cited by: 1
Authors:
Liu, Yazhu [1]
Xue, Shaofei [1]
Tang, Jian [1]
Affiliations:
[1] AIspeech Ltd, Suzhou, Peoples R China
Source:
MAN-MACHINE SPEECH COMMUNICATION, NCMMSC 2022 | 2023, Vol. 1765
Keywords:
Pre-training techniques; neural network; text-to-speech; automatic speech recognition
DOI:
10.1007/978-981-99-2401-1_15
CLC classification:
O42 [Acoustics]
Discipline codes:
070206; 082403
Abstract:
With the development of deep learning, neural network (NN) based text-to-speech (TTS), which adopts deep neural networks as the model backbone for speech synthesis, has become the mainstream technology for TTS. Compared to previous TTS systems based on concatenative synthesis and statistical parametric synthesis, NN based speech synthesis shows conspicuous advantages: it requires less human pre-processing and feature engineering, and produces high-quality voice in terms of both intelligibility and naturalness. However, a robust NN based speech synthesis model typically requires a sizable set of high-quality training data, which is expensive to collect, especially in low-resource scenarios. It is therefore worth investigating how to take advantage of low-quality material such as automatic speech recognition (ASR) data, which can be obtained far more easily than high-quality TTS material. In this paper, we propose a pre-training framework to improve the performance of low-resource speech synthesis. The idea is to extend the training material of the TTS model with an ASR based data augmentation method. Specifically, we first build a frame-wise phoneme classification network on the ASR dataset and extract semi-supervised <linguistic features, audio> paired data from large-scale speech corpora. We then pre-train the NN based TTS acoustic model using the semi-supervised <linguistic features, audio> pairs. Finally, we fine-tune the model with a small amount of available paired data. Experimental results show that our proposed framework enables the TTS model to generate more intelligible and natural speech with the same amount of paired training data.
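The three-step pipeline in the abstract (frame-wise phoneme classification on ASR data, extraction of semi-supervised <linguistic features, audio> pairs, then pre-training followed by fine-tuning of the acoustic model) can be sketched in miniature. This is a toy NumPy illustration only, not the paper's implementation: the "classifier" and "acoustic model" below are simple linear stand-ins for the neural networks, and all names, dimensions, and data are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

N_PHONES = 8    # toy phoneme inventory size (assumption, not from the paper)
FRAME_DIM = 16  # toy acoustic frame dimension (assumption)

def phoneme_posteriors(frames, W):
    """Frame-wise 'phoneme classifier': softmax over a linear projection.
    In the paper this is a network trained on ASR data; here it is a
    fixed random stand-in that just produces per-frame posteriors."""
    logits = frames @ W
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mse(X, Y, W):
    """Mean squared reconstruction error of the linear acoustic model."""
    return float(np.mean((X @ W - Y) ** 2))

def train(X, Y, W=None, lr=0.05, steps=300):
    """Least-squares 'acoustic model' (linguistic features -> audio frames),
    fit by plain gradient descent; a stand-in for NN acoustic model training.
    Passing an existing W continues training, i.e. fine-tuning."""
    if W is None:
        W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(steps):
        W -= lr * (X.T @ (X @ W - Y)) / len(X)
    return W

# Step 1: build the frame-wise phoneme classifier (fixed random weights here).
W_cls = rng.normal(size=(FRAME_DIM, N_PHONES))

# Step 2: extract semi-supervised <linguistic features, audio> pairs from a
# large unlabeled ASR-style corpus: posteriors play the role of the features.
asr_audio = rng.normal(size=(1000, FRAME_DIM))
asr_ling = phoneme_posteriors(asr_audio, W_cls)

# Step 3a: pre-train the acoustic model on the semi-supervised pairs ...
W_pre = train(asr_ling, asr_audio)

# Step 3b: ... then fine-tune on a small high-quality paired TTS set.
tts_audio = rng.normal(size=(50, FRAME_DIM))
tts_ling = phoneme_posteriors(tts_audio, W_cls)
loss_before = mse(tts_ling, tts_audio, W_pre)
W_ft = train(tts_ling, tts_audio, W=W_pre.copy(), steps=100)
loss_after = mse(tts_ling, tts_audio, W_ft)
```

The point of the sketch is the data flow: the large unlabeled corpus contributes pseudo-paired data for pre-training, so the fine-tuning stage needs only a small amount of genuine paired data.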
Pages: 162-172
Page count: 11