Fine-grained Style Modeling, Transfer and Prediction in Text-to-Speech Synthesis via Phone-Level Content-Style Disentanglement

Cited: 8
Authors
Tan, Daxin [1 ]
Lee, Tan [1 ]
Affiliations
[1] Chinese Univ Hong Kong, Dept Elect Engn, Hong Kong, Peoples R China
Source
INTERSPEECH 2021 | 2021
Keywords
speech synthesis; style transfer; prosody;
DOI
10.21437/Interspeech.2021-1129
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject Classification Codes
100104 ; 100213 ;
Abstract
This paper presents a novel design of a neural network system for fine-grained style modeling, transfer and prediction in expressive text-to-speech (TTS) synthesis. Fine-grained modeling is realized by extracting style embeddings from the mel-spectrograms of phone-level speech segments. Collaborative learning and adversarial learning strategies are applied to achieve effective disentanglement of content and style factors in speech and to alleviate the "content leakage" problem in style modeling. The proposed system can be used for varying-content speech style transfer in the single-speaker scenario. The results of objective and subjective evaluation show that our system performs better than other fine-grained speech style transfer models, especially in the aspect of content preservation. By incorporating a style predictor, the proposed system can also be used for text-to-speech synthesis. Audio samples are provided for system demonstration(1).
Pages: 4683-4687
Page count: 5