Effective Data Augmentation Methods for Neural Text-to-Speech Systems

Cited by: 0

Authors:

Oh, Suhyeon [1]

Kwon, Ohsung [1]

Hwang, Min-Jae [1]

Kim, Jae-Min [1]

Song, Eunwoo [1]

Affiliation:

[1] NAVER Corp, Seongnam, South Korea

Source:

2022 International Conference on Electronics, Information, and Communication (ICEIC) | 2022

Keywords:

speech synthesis; self-augmentation; ranking support vector machine

DOI:

10.1109/ICEIC54506.2022.9748515

Chinese Library Classification (CLC):

TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]

Discipline classification codes:

0808; 0809

Abstract:

This paper proposes an effective self-augmentation method for improving the quality of neural text-to-speech (TTS) systems. As synthetic speech quality has greatly improved, creating a neural TTS system from synthetic corpora is now possible. However, whether increasing the amount of synthetic data is always beneficial for improving training efficiency has not been verified. Our aim in this study is to select only the synthetic data whose characteristics are close to those of natural speech. Specifically, we adopt a ranking support vector machine (RankSVM), which is well known for effectively ranking relative attributes between binary classes. By treating the synthetic and recorded corpora as two opposite classes, RankSVM is used to determine how acoustically similar the synthesized speech is to the recorded data. Because the training data can be selectively chosen from large-scale synthetic corpora, the performance of the TTS model retrained on those data is significantly improved. Subjective evaluation results verify that the proposed TTS model performs much better than both the original model trained on recorded data alone and a similarly configured system retrained on all the synthetic data without any selection method.
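
The selection idea in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the two-dimensional feature vectors, the function names (`train_rank_svm`, `score`, `select_top`), the hyperparameters, and the use of plain subgradient descent in place of a dedicated SVM solver are all assumptions made for illustration. The sketch learns a linear ranking function that places recorded utterances above synthetic ones by a margin, then keeps the synthetic candidates that score highest, that is, those ranked closest to the recorded class.

```python
def train_rank_svm(recorded, synthetic, epochs=200, lr=0.1, reg=0.01):
    """Learn weights w so that w.x_recorded exceeds w.x_synthetic by a margin.

    Minimizes 0.5 * reg * ||w||^2 plus the mean pairwise hinge loss
    using plain subgradient descent (an assumption, not a tuned solver).
    """
    dim = len(recorded[0])
    w = [0.0] * dim
    # One difference vector per (recorded, synthetic) pair.
    diffs = [[r - s for r, s in zip(rec, syn)]
             for rec in recorded for syn in synthetic]
    n = len(diffs)
    for _ in range(epochs):
        grad = [reg * wi for wi in w]
        for d in diffs:
            margin = sum(wi * di for wi, di in zip(w, d))
            if margin < 1.0:  # pair violates the ranking margin
                for i in range(dim):
                    grad[i] -= d[i] / n
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def score(w, x):
    """Ranking score; higher means closer to the recorded class."""
    return sum(wi * xi for wi, xi in zip(w, x))


def select_top(w, candidates, k):
    """Return indices of the k highest-scoring synthetic candidates."""
    order = sorted(range(len(candidates)),
                   key=lambda i: score(w, candidates[i]), reverse=True)
    return order[:k]


# Hypothetical toy features; a real system would use acoustic features.
recorded = [[1.0, 0.2], [0.9, -0.1], [1.1, 0.0]]
synthetic = [[0.0, 0.1], [0.2, -0.2], [-0.1, 0.0]]
w = train_rank_svm(recorded, synthetic)

# Hypothetical pool of new synthetic utterances to choose from.
candidates = [[0.8, 0.0], [0.1, 0.0], [0.5, 0.0], [-0.2, 0.0]]
selected = select_top(w, candidates, k=2)
print(selected)
```

In this toy setup the selected indices would then point to the synthetic utterances added to the retraining set; how many to keep (`k` here) is a free choice that the abstract does not specify.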


Pages: 4
