Cross-Speaker Emotion Transfer Through Information Perturbation in Emotional Speech Synthesis

Cited by: 5
Authors
Lei, Yi [1]
Yang, Shan [2]
Zhu, Xinfa [1]
Xie, Lei [1]
Su, Dan [2]
Affiliations
[1] Northwestern Polytechnical University, Xi'an 710129, China
[2] Tencent AI Lab, Beijing 100086, China
Funding
National Key Research and Development Program of China;
Keywords
Timbre; Spectrogram; Perturbation methods; Generators; Speech synthesis; Adaptation models; Acoustics; Cross-speaker emotion transfer; emotional TTS; information perturbation; speech synthesis; RECOGNITION;
DOI
10.1109/LSP.2022.3203888
Chinese Library Classification (CLC)
TM [Electrical Technology]; TN [Electronic Technology, Communication Technology];
Discipline classification codes
0808; 0809;
Abstract
By borrowing emotional expressions from an emotional source speaker, cross-speaker emotion transfer is an effective way to produce emotional speech for target speakers who have no emotional training data. Since the emotion and timbre of the source speaker are heavily entangled in speech, existing approaches often struggle to balance speaker similarity against emotional expressiveness in the target speaker's synthetic speech. In this letter, we propose to disentangle timbre and emotion through information perturbation for cross-speaker emotion transfer, which effectively learns the emotional expression of the source speaker while maintaining the timbre of the target speaker. Specifically, we separately perturb the timbre- and emotion-related features (e.g., formant and pitch) of the source speech to obtain and model timbre- and emotion-independent signals, based on which the proposed model can deliver the emotional expression to target speakers. Experimental results demonstrate that the proposed approach significantly outperforms the baselines in terms of naturalness and similarity, indicating the effectiveness of information perturbation for cross-speaker emotion transfer.
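To illustrate the kind of feature perturbation the abstract describes, the following is a minimal Python sketch that randomly rescales formants (a timbre cue) and the median pitch (an emotion/prosody cue) with praat-parselmouth. The perturbation operators, ranges, and the perturb_formants_and_pitch helper are assumptions made for illustration only; they are not the authors' published configuration.

import math
import random
import parselmouth
from parselmouth.praat import call

def perturb_formants_and_pitch(wav_path,
                               formant_shift_range=(0.85, 1.15),
                               pitch_shift_range=(0.75, 1.25)):
    """Randomly rescale formants and median pitch so that downstream
    encoders see audio with the corresponding factor perturbed."""
    snd = parselmouth.Sound(wav_path)

    # Estimate the median F0 so the pitch perturbation is relative to the utterance.
    pitch = call(snd, "To Pitch", 0.0, 75, 600)
    median_f0 = call(pitch, "Get quantile", 0, 0, 0.5, "Hertz")

    formant_ratio = random.uniform(*formant_shift_range)
    pitch_ratio = random.uniform(*pitch_shift_range)
    # In Praat, a new pitch median of 0 leaves the median unchanged;
    # used here as a fallback when F0 is undefined (e.g., unvoiced input).
    new_median = 0.0 if math.isnan(median_f0) else median_f0 * pitch_ratio

    # Praat's "Change gender" jointly rescales the formants and the pitch median.
    perturbed = call(snd, "Change gender", 75, 600,
                     formant_ratio,  # formant shift ratio (timbre perturbation)
                     new_median,     # new pitch median in Hz (pitch perturbation)
                     1.0,            # pitch range factor (unchanged)
                     1.0)            # duration factor (unchanged)
    return perturbed.values.T.squeeze(), snd.sampling_frequency

One plausible use of such perturbed signals is to feed formant-perturbed audio to an emotion/prosody encoder and pitch-perturbed audio to a timbre encoder so that each branch becomes invariant to the perturbed factor; the actual model architecture and perturbation settings are those described in the letter.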
Pages: 1948 - 1952
Number of pages: 5