Cross-Speaker Emotion Transfer Through Information Perturbation in Emotional Speech Synthesis

Cited by: 5
Authors
Lei, Yi [1]
Yang, Shan [2]
Zhu, Xinfa [1]
Xie, Lei [1]
Su, Dan [2]
Affiliations
[1] Northwestern Polytech Univ, Xian 710129, Peoples R China
[2] Tencent AI Lab, Beijing 100086, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
Timbre; Spectrogram; Perturbation methods; Generators; Speech synthesis; Adaptation models; Acoustics; Cross-speaker emotion transfer; emotional TTS; information perturbation; speech synthesis; RECOGNITION;
DOI
10.1109/LSP.2022.3203888
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Discipline Codes
0808; 0809;
Abstract
By borrowing emotional expressions from an emotional source speaker, cross-speaker emotion transfer is an effective way to produce emotional speech for target speakers without emotional training data. Since the emotion and timbre of the source speaker are heavily entangled in speech, existing approaches often struggle to balance speaker similarity and emotional expressiveness in the synthetic speech of the target speaker. In this letter, we propose to disentangle timbre and emotion through information perturbation to conduct cross-speaker emotion transfer, which effectively learns the emotional expression of the source speaker while maintaining the timbre of the target speaker. Specifically, we separately perturb the timbre- and emotion-related features (e.g., formant and pitch) of the source speech to obtain and model timbre- and emotion-independent signals, based on which the proposed model can deliver emotional expression for target speakers. Experimental results demonstrate that the proposed approach significantly outperforms the baselines in terms of naturalness and similarity, indicating the effectiveness of information perturbation for cross-speaker emotion transfer.
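Illustrative sketch. The letter itself does not include code, but the Python sketch below shows one common way such waveform-level information perturbation can be realized: random peaking-EQ filtering as a rough stand-in for formant/timbre perturbation, and random pitch shifting as a stand-in for pitch (emotion-related) perturbation. Every function name, parameter range, and the dummy input signal are assumptions for illustration only, not the authors' actual recipe.

# Hypothetical information-perturbation sketch (assumed, not the paper's code).
import numpy as np
import librosa
from scipy.signal import lfilter


def peaking_eq(y, sr, centre_hz, gain_db, q=1.0):
    # One RBJ-cookbook peaking-EQ biquad: boost/cut the band around centre_hz.
    amp = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * np.pi * centre_hz / sr
    alpha = np.sin(w0) / (2.0 * q)
    b = np.array([1.0 + alpha * amp, -2.0 * np.cos(w0), 1.0 - alpha * amp])
    a = np.array([1.0 + alpha / amp, -2.0 * np.cos(w0), 1.0 - alpha / amp])
    return lfilter(b / a[0], a / a[0], y)


def perturb_timbre(y, sr, rng):
    # Smear speaker-timbre cues with a cascade of randomly placed, randomly
    # gained EQ peaks over the formant region; prosody is left largely intact.
    out = y.copy()
    for centre_hz in rng.uniform(200.0, 4000.0, size=4):
        out = peaking_eq(out, sr, centre_hz, gain_db=rng.uniform(-8.0, 8.0))
    return out


def perturb_pitch(y, sr, rng):
    # Corrupt pitch-related (emotion/prosody) cues with a random semitone shift.
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-4.0, 4.0))


if __name__ == "__main__":
    sr = 16000
    rng = np.random.default_rng(0)
    t = np.arange(sr, dtype=np.float32) / sr
    wav = 0.1 * np.sin(2.0 * np.pi * 220.0 * t)   # dummy 1-second signal
    emotion_view = perturb_timbre(wav, sr, rng)   # timbre-independent view
    timbre_view = perturb_pitch(wav, sr, rng)     # pitch/emotion-independent view
    print(emotion_view.shape, timbre_view.shape)

In a full system of this kind, the timbre-perturbed view would typically feed an emotion/prosody encoder and the pitch-perturbed view a timbre/content encoder, so that each branch only sees the cues it is meant to model.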
Pages: 1948-1952
Page count: 5
Related Papers
50 records in total
  • [1] Cross-Speaker Emotion Disentangling and Transfer for End-to-End Speech Synthesis
    Li, Tao
    Wang, Xinsheng
    Xie, Qicong
    Wang, Zhichao
    Xie, Lei
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30: 1448-1460
  • [2] METTS: Multilingual Emotional Text-to-Speech by Cross-Speaker and Cross-Lingual Emotion Transfer
    Zhu, Xinfa
    Lei, Yi
    Li, Tao
    Zhang, Yongmao
    Zhou, Hongbin
    Lu, Heng
    Xie, Lei
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32: 1506-1518
  • [3] Cross-speaker Emotion Transfer Based On Prosody Compensation for End-to-End Speech Synthesis
    Li, Tao
    Wang, Xinsheng
    Xie, Qicong
    Wang, Zhichao
    Jiang, Mingqi
    Xie, Lei
    INTERSPEECH 2022, 2022: 5498-5502
  • [4] Cross-speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis
    Pan, Shifeng
    He, Lei
    INTERSPEECH 2021, 2021: 4678-4682
  • [5] iEmoTTS: Toward Robust Cross-Speaker Emotion Transfer and Control for Speech Synthesis Based on Disentanglement Between Prosody and Timbre
    Zhang, Guangyan
    Qin, Ying
    Zhang, Wenjie
    Wu, Jialun
    Li, Mei
    Gai, Yutao
    Jiang, Feijun
    Lee, Tan
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2023, 31: 1693-1705
  • [6] ACCENT CONVERSION THROUGH CROSS-SPEAKER ARTICULATORY SYNTHESIS
    Aryal, Sandesh
    Gutierrez-Osuna, Ricardo
    2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2014,
  • [7] Daft-Exprt: Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis
    Zaidi, Julian
    Seute, Hugo
    van Niekerk, Benjamin
    Carbonneau, Marc-Andre
    INTERSPEECH 2022, 2022: 4591-4595
  • [8] CROSS-SPEAKER STYLE TRANSFER FOR TEXT-TO-SPEECH USING DATA AUGMENTATION
    Ribeiro, Manuel Sam
    Roth, Julian
    Comini, Giulia
    Huybrechts, Goeric
    Gabrys, Adam
    Lorenzo-Trueba, Jaime
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022: 6797-6801
  • [9] Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis
    Qiang, Chunyu
    Yang, Peng
    Che, Hao
    Wang, Xiaorui
    Wang, Zhongyuan
    2022 13TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2022: 61-65
  • [10] Incorporating Cross-speaker Style Transfer for Multi-language Text-to-Speech
    Shang, Zengqiang
    Huang, Zhihua
    Zhang, Haozhe
    Zhang, Pengyuan
    Yan, Yonghong
    INTERSPEECH 2021, 2021: 1619-1623