Effective Data Augmentation Methods for Neural Text-to-Speech Systems

Cited by: 0

Authors:

Oh, Suhyeon [1]

Kwon, Ohsung [1]

Hwang, Min-Jae [1]

Kim, Jae-Min [1]

Song, Eunwoo [1]

Affiliations:

[1] NAVER Corp, Seongnam, South Korea

Source:

2022 INTERNATIONAL CONFERENCE ON ELECTRONICS, INFORMATION, AND COMMUNICATION (ICEIC) | 2022

Keywords:

speech synthesis; self-augmentation; ranking support vector machine

DOI:

10.1109/ICEIC54506.2022.9748515

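Chinese Library Classification (CLC) number:

TM [Electrical engineering]; TN [Electronics and communication technology]

Discipline classification code:

0808; 0809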

Abstract:

This paper proposes an effective self-augmentation method for improving the quality of neural text-to-speech (TTS) systems. As synthetic speech quality has greatly improved, creating a neural TTS system using synthetic corpora is now possible. However, whether increasing the amount of synthetic data is always beneficial for improving training efficiency has not been verified. Our aim in this study is to select synthetic data whose characteristics are close to those of natural speech. Specifically, we adopt a ranking support vector machine (RankSVM), which is well known for effectively ranking relative attributes among binary classes. By setting the synthetic and recorded corpora as two opposite classes, RankSVM is used to determine how acoustically similar the synthesized speech is to the recorded data. Because training data can be selectively chosen from large-scale synthetic corpora, the performance of the TTS model re-trained on those data improves significantly. Subjective evaluation results verify that the proposed TTS model performs much better than both the original model trained with recorded data alone and a similarly configured system re-trained with all the synthetic data without any selection method.
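The selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-dimensional feature vectors, the hyperparameters, and the function names `train_rank_svm`, `score`, and `select_most_natural` are all hypothetical, and a simple pairwise hinge-loss subgradient trainer stands in for a full RankSVM solver. The sketch learns a linear ranking function that scores recorded utterances above synthetic ones, then keeps the synthetic utterances with the highest scores.

```python
import random

def train_rank_svm(recorded, synthetic, epochs=200, lr=0.05, reg=0.01, seed=0):
    """Learn weights w so that w.x is larger for recorded feature vectors
    than for synthetic ones, using a pairwise hinge loss (assumed setup)."""
    rng = random.Random(seed)
    dim = len(recorded[0])
    w = [0.0] * dim
    # Every (recorded, synthetic) pair is a constraint: recorded ranks higher.
    pairs = [(r, s) for r in recorded for s in synthetic]
    for _ in range(epochs):
        rng.shuffle(pairs)
        for r, s in pairs:
            diff = [a - b for a, b in zip(r, s)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            violated = margin < 1.0
            for k in range(dim):
                # L2 shrinkage, plus a hinge subgradient step if the margin is violated.
                grad = reg * w[k] - (diff[k] if violated else 0.0)
                w[k] -= lr * grad
    return w

def score(w, x):
    """Ranking score: higher means closer to the recorded class."""
    return sum(wi * xi for wi, xi in zip(w, x))

def select_most_natural(w, synthetic, keep):
    """Return indices of the `keep` synthetic items with the highest scores."""
    order = sorted(range(len(synthetic)),
                   key=lambda i: score(w, synthetic[i]), reverse=True)
    return order[:keep]

# Hypothetical toy features (e.g. two utterance-level acoustic statistics).
recorded = [[1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
synthetic = [[0.0, 0.1], [0.8, 0.9], [0.2, -0.1], [0.7, 0.8]]

w = train_rank_svm(recorded, synthetic)
selected = select_most_natural(w, synthetic, keep=2)
print(selected)  # indices of the synthetic items ranked closest to the recorded class
```

In a setup like the one the abstract describes, the selected subset would then be added to the training data for re-training the TTS model; the feature extraction and the selection threshold are left open here because the abstract does not specify them.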


Pages: 4
