MsEmoTTS: Multi-Scale Emotion Transfer, Prediction, and Control for Emotional Speech Synthesis

Cited by: 32
Authors
Lei, Yi [1 ]
Yang, Shan [2 ]
Wang, Xinsheng [3 ,4 ]
Xie, Lei [1 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, ASGO, Audio Speech & Language Proc Grp, Xian 710072, Peoples R China
[2] Tencent AI Lab, Beijing 100086, Peoples R China
[3] Xi An Jiao Tong Univ, Sch Software Engn, Xian 710049, Peoples R China
[4] Northwestern Polytech Univ, Sch Comp Sci, Xian 710072, Peoples R China
Keywords
Speech synthesis; Predictive models; Analytical models; Virtual assistants; Speech; Feature extraction; Decoding; emotional speech synthesis; emotion strengths; multi-scale; PROSODY;
DOI
10.1109/TASLP.2022.3145293
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Expressive synthetic speech is essential for many human-computer interaction and audio broadcast scenarios, and thus synthesizing expressive speech has attracted much attention in recent years. Previous methods performed expressive speech synthesis either with explicit labels or with a fixed-length style embedding extracted from reference audio, both of which can only learn an average style and thus ignore the multi-scale nature of speech prosody. In this paper, we propose MsEmoTTS, a multi-scale emotional speech synthesis framework that models emotion at different levels. Specifically, the proposed method builds on a typical attention-based sequence-to-sequence model and introduces three modules: a global-level emotion presenting module (GM), an utterance-level emotion presenting module (UM), and a local-level emotion presenting module (LM), which model the global emotion category, the utterance-level emotion variation, and the syllable-level emotion strength, respectively. Besides modeling emotion at different levels, the proposed method also allows synthesizing emotional speech in different ways, i.e., transferring the emotion from reference audio, predicting the emotion from the input text, and controlling the emotion strength manually. Extensive experiments on a Chinese emotional speech corpus demonstrate that the proposed method outperforms the compared reference-audio-based and text-based methods on emotion-transfer and text-based emotion-prediction speech synthesis, respectively. The experiments also show that the proposed method can control the emotion expression flexibly, and detailed analysis confirms the effectiveness of each module and the soundness of the overall design.
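Based only on the architecture described in the abstract, the following is a minimal, hypothetical PyTorch sketch of the multi-scale conditioning idea: the GM supplies a global emotion-category embedding, the UM derives one utterance-level emotion vector, and the LM predicts a per-syllable emotion strength that can be overridden for manual control. All module internals, dimensions, names, and the fusion scheme are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical reconstruction of the GM/UM/LM conditioning idea from the
# abstract; layer sizes and the fusion scheme are assumptions, not the paper's code.
import torch
import torch.nn as nn


class MultiScaleEmotionConditioner(nn.Module):
    def __init__(self, num_emotions=5, text_dim=256, emo_dim=64):
        super().__init__()
        # GM: one learned embedding per global emotion category.
        self.gm = nn.Embedding(num_emotions, emo_dim)
        # UM: one utterance-level emotion vector from pooled text features.
        self.um = nn.Sequential(nn.Linear(text_dim, emo_dim), nn.Tanh())
        # LM: a scalar emotion strength in [0, 1] for each syllable-level unit.
        self.lm = nn.Sequential(nn.Linear(text_dim, 1), nn.Sigmoid())
        self.proj = nn.Linear(text_dim + emo_dim, text_dim)

    def forward(self, text_feats, emotion_id, strength=None):
        # text_feats: (B, T, text_dim) syllable-level encoder outputs.
        g = self.gm(emotion_id)                  # (B, emo_dim) global category
        u = self.um(text_feats.mean(dim=1))      # (B, emo_dim) utterance variation
        # LM predicts strengths from text; passing `strength` instead enables
        # the manual strength control the abstract mentions.
        s = self.lm(text_feats) if strength is None else strength  # (B, T, 1)
        emo = (g + u).unsqueeze(1) * s           # broadcast to (B, T, emo_dim)
        return self.proj(torch.cat([text_feats, emo], dim=-1))


enc = torch.randn(2, 10, 256)                   # fake encoder output, 10 syllables
cond = MultiScaleEmotionConditioner()
y_pred = cond(enc, torch.tensor([1, 3]))        # strengths predicted from text
y_ctrl = cond(enc, torch.tensor([1, 3]), torch.full((2, 10, 1), 0.8))  # manual control
```

In the emotion-transfer setting described in the abstract, the utterance-level vector and local strengths would be extracted from reference audio rather than predicted from text; the sketch predicts both from text for brevity.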
Pages: 853 - 864
Number of pages: 12
Related Papers
50 records in total
  • [1] FINE-GRAINED EMOTION STRENGTH TRANSFER, CONTROL AND PREDICTION FOR EMOTIONAL SPEECH SYNTHESIS
    Lei, Yi
    Yang, Shan
    Xie, Lei
    2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, : 423 - 430
  • [2] ED-TTS: MULTI-SCALE EMOTION MODELING USING CROSS-DOMAIN EMOTION DIARIZATION FOR EMOTIONAL SPEECH SYNTHESIS
    Tang, Haobin
    Zhang, Xulong
    Cheng, Ning
    Xiao, Jing
    Wang, Jianzong
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024), 2024, : 12146 - 12150
  • [3] A Lightweight Multi-Scale Model for Speech Emotion Recognition
    Li, Haoming
    Zhao, Daqi
    Wang, Jingwen
    Wang, Deqiang
    IEEE ACCESS, 2024, 12 : 130228 - 130240
  • [4] Towards Multi-Scale Style Control for Expressive Speech Synthesis
    Li, Xiang
    Song, Changhe
    Li, Jingbei
    Wu, Zhiyong
    Jia, Jia
    Meng, Helen
    INTERSPEECH 2021, 2021, : 4673 - 4677
  • [5] Multi-Scale Temporal Transformer For Speech Emotion Recognition
    Li, Zhipeng
    Xing, Xiaofen
    Fang, Yuanbo
    Zhang, Weibin
    Fan, Hengsheng
    Xu, Xiangmin
    INTERSPEECH 2023, 2023, : 3652 - 3656
  • [6] A Multi-scale Fusion Framework for Bimodal Speech Emotion Recognition
    Chen, Ming
    Zhao, Xudong
    INTERSPEECH 2020, 2020, : 374 - 378
  • [7] EFFICIENT SPEECH EMOTION RECOGNITION USING MULTI-SCALE CNN AND ATTENTION
    Peng, Zixuan
    Lu, Yu
    Pan, Shengfeng
    Liu, Yunfeng
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 3020 - 3024
  • [8] Multi-scale discrepancy adversarial network for cross-corpus speech emotion recognition
    Zheng, Wanlu
    Zheng, Wenming
    Zong, Yuan
    VIRTUAL REALITY & INTELLIGENT HARDWARE (KeAi), 2021, 3 (1): : 65 - 75
  • [9] Multi-scale Context Based Attention for Dynamic Music Emotion Prediction
    Ma, Ye
    Li, Xinxing
    Xu, Mingxing
    Jia, Jia
    Cai, Lianhong
    PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 1443 - 1450
  • [10] Learning multi-scale features for speech emotion recognition with connection attention mechanism
    Chen, Zengzhao
    Li, Jiawen
    Liu, Hai
    Wang, Xuyang
    Wang, Hu
    Zheng, Qiuyu
    EXPERT SYSTEMS WITH APPLICATIONS, 2023, 214