Cross-Speaker Emotion Transfer Through Information Perturbation in Emotional Speech Synthesis

Cited by: 5
Authors
Lei, Yi [1 ]
Yang, Shan [2 ]
Zhu, Xinfa [1 ]
Xie, Lei [1 ]
Su, Dan [2 ]
Affiliations
[1] Northwestern Polytechnical University, Xi'an 710129, China
[2] Tencent AI Lab, Beijing 100086, China
Funding
National Key Research and Development Program of China;
Keywords
Timbre; Spectrogram; Perturbation methods; Generators; Speech synthesis; Adaptation models; Acoustics; Cross-speaker emotion transfer; emotional TTS; information perturbation; speech synthesis; recognition
DOI
10.1109/LSP.2022.3203888
CLC Classification
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Subject Classification Codes
0808; 0809;
Abstract
By borrowing emotional expressions from an emotional source speaker, cross-speaker emotion transfer is an effective way to produce emotional speech for target speakers that have no emotional training data. Since the emotion and timbre of the source speaker are heavily entangled in speech, existing approaches often struggle to balance speaker similarity against emotional expressiveness in the target speaker's synthetic speech. In this letter, we propose to disentangle timbre and emotion through information perturbation for cross-speaker emotion transfer, which effectively learns the emotional expression of the source speaker while maintaining the timbre of the target speaker. Specifically, we separately perturb the timbre-related and emotion-related features (e.g., formant and pitch) of the source speech to obtain and model timbre-independent and emotion-independent signals, based on which the proposed model can deliver emotional expression for target speakers. Experimental results demonstrate that the proposed approach significantly outperforms the baselines in terms of naturalness and similarity, indicating the effectiveness of information perturbation for cross-speaker emotion transfer.
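
The core idea named in the abstract, feature-level information perturbation, can be illustrated with a short sketch. The snippet below is not the authors' implementation; it shows one plausible way to realize the two perturbations the abstract mentions: randomly shifting formants to suppress timbre cues (a timbre-independent signal) and randomly rescaling the pitch median and range to suppress emotion-related prosody (an emotion-independent signal). The use of librosa and praat-parselmouth, the function names, the perturbation ranges, the branch assignments in the comments, and the file name source_emotional.wav are all illustrative assumptions rather than details from the letter.

    # Minimal sketch of information perturbation (assumed design, not the paper's code).
    import numpy as np
    import librosa
    import parselmouth
    from parselmouth.praat import call

    def perturb_formants(wav, sr, rng):
        """Timbre perturbation (assumed): randomly shift formants, keep pitch and duration."""
        ratio = float(rng.uniform(0.85, 1.15))  # illustrative formant-shift range
        snd = parselmouth.Sound(wav, sampling_frequency=sr)
        # Praat "Change gender": pitch floor, pitch ceiling, formant shift ratio,
        # new pitch median (0 = unchanged), pitch range factor, duration factor.
        out = call(snd, "Change gender", 75, 600, ratio, 0, 1.0, 1.0)
        return out.values.flatten()

    def perturb_pitch(wav, sr, rng):
        """Emotion/prosody perturbation (assumed): randomly rescale pitch median and range."""
        snd = parselmouth.Sound(wav, sampling_frequency=sr)
        pitch = call(snd, "To Pitch", 0.0, 75, 600)
        median_f0 = call(pitch, "Get quantile", 0, 0, 0.5, "Hertz")
        if np.isnan(median_f0):  # unvoiced clip: nothing to perturb
            return wav.copy()
        median_scale = float(rng.uniform(0.8, 1.2))  # illustrative median shift
        range_factor = float(rng.uniform(0.5, 1.5))  # illustrative range compression/expansion
        out = call(snd, "Change gender", 75, 600, 1.0, median_f0 * median_scale, range_factor, 1.0)
        return out.values.flatten()

    if __name__ == "__main__":
        wav, sr = librosa.load("source_emotional.wav", sr=16000)  # hypothetical input file
        rng = np.random.default_rng(0)
        timbre_independent = perturb_formants(wav, sr, rng)   # e.g., fed to an emotion encoder
        emotion_independent = perturb_pitch(wav, sr, rng)     # e.g., fed to a timbre/content encoder

In a pipeline of this kind, the perturbed signals would feed separate encoders so that the learned emotion and timbre representations no longer leak the other factor; which perturbed signal feeds which branch is an assumption here.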
Pages: 1948-1952
Number of pages: 5
Related Papers
50 records in total
  • [41] Mongolian emotional speech synthesis based on transfer learning and emotional embedding
    Huang, Aihong
    Bao, Feilong
    Gao, Guanglai
    Shan, Yu
    Liu, Rui
    2021 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2021, : 78 - 83
  • [42] Cross-Corpus Speech Emotion Recognition Based on Causal Emotion Information Representation
    Fu, Hongliang
    Li, Qianqian
    Tao, Huawei
    Zhu, Chunhua
    Xie, Yue
    Guo, Ruxue
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2024, E107D (08) : 1097 - 1100
  • [43] EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis
    Tang, Haobin
    Zhang, Xulong
    Wang, Jianzong
    Cheng, Ning
    Xiao, Jing
    INTERSPEECH 2023, 2023, : 12 - 16
  • [44] Emotional transplant in statistical speech synthesis based on emotion additive model
    Ohtani, Yamato
    Nasu, Yu
    Morita, Masahiro
    Akamine, Masami
    16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015, : 274 - 278
  • [45] Effective Zero-Shot Multi-Speaker Text-to-Speech Technique Using Information Perturbation and a Speaker Encoder
    Bang, Chae-Woon
    Chun, Chanjun
    SENSORS, 2023, 23 (23)
  • [46] EMOTION CONTROLLABLE SPEECH SYNTHESIS USING EMOTION-UNLABELED DATASET WITH THE ASSISTANCE OF CROSS-DOMAIN SPEECH EMOTION RECOGNITION
    Cai, Xiong
    Dai, Dongyang
    Wu, Zhiyong
    Li, Xiang
    Li, Jingbei
    Meng, Helen
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5734 - 5738
  • [47] Controllable Emotion Transfer For End-to-End Speech Synthesis
    Li, Tao
    Yang, Shan
    Xue, Liumeng
    Xie, Lei
    2021 12TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2021,
  • [48] Emotion Recognition in Speech using Cross-Modal Transfer in the Wild
    Albanie, Samuel
    Nagrani, Arsha
    Vedaldi, Andrea
    Zisserman, Andrew
    PROCEEDINGS OF THE 2018 ACM MULTIMEDIA CONFERENCE (MM'18), 2018, : 292 - 301
  • [49] MULTI-SPEAKER EMOTIONAL SPEECH SYNTHESIS WITH FINE-GRAINED PROSODY MODELING
    Lu, Chunhui
    Wen, Xue
    Liu, Ruolan
    Chen, Xiao
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5729 - 5733
  • [50] MULTI-SPEAKER EMOTIONAL ACOUSTIC MODELING FOR CNN-BASED SPEECH SYNTHESIS
    Choi, Heejin
    Park, Sangjun
    Park, Jinuk
    Hahn, Minsoo
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6950 - 6954