Fine-grained Style Modeling, Transfer and Prediction in Text-to-Speech Synthesis via Phone-Level Content-Style Disentanglement

Cited: 8
Authors
Tan, Daxin [1 ]
Lee, Tan [1 ]
Affiliations
[1] Chinese Univ Hong Kong, Dept Elect Engn, Hong Kong, Peoples R China
Source
INTERSPEECH 2021 | 2021
Keywords
speech synthesis; style transfer; prosody;
DOI
10.21437/Interspeech.2021-1129
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject Classification Codes
100104 ; 100213 ;
Abstract
This paper presents a novel design of a neural network system for fine-grained style modeling, transfer and prediction in expressive text-to-speech (TTS) synthesis. Fine-grained modeling is realized by extracting style embeddings from the mel-spectrograms of phone-level speech segments. Collaborative learning and adversarial learning strategies are applied to achieve effective disentanglement of content and style factors in speech and to alleviate the "content leakage" problem in style modeling. The proposed system can be used for varying-content speech style transfer in the single-speaker scenario. The results of objective and subjective evaluation show that our system performs better than other fine-grained speech style transfer models, especially in the aspect of content preservation. By incorporating a style predictor, the proposed system can also be used for text-to-speech synthesis. Audio samples are provided for system demonstration(1).
Pages: 4683-4687
Page count: 5