CONTROLLING EMOTION STRENGTH WITH RELATIVE ATTRIBUTE FOR END-TO-END SPEECH SYNTHESIS

Cited by: 0
Authors
Zhu, Xiaolian [1 ,2 ]
Yang, Shan [1 ]
Yang, Geng [1 ]
Xie, Lei [1 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Xian, Peoples R China
[2] Hebei Univ Econ & Business, Publ Comp Educ Ctr, Shijiazhuang, Hebei, Peoples R China
Source
2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019) | 2019
Keywords
Emotion strength; relative attributes; speech synthesis; text-to-speech; end-to-end;
DOI
10.1109/asru46091.2019.9003829
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recently, attention-based end-to-end speech synthesis has achieved superior performance compared to traditional speech synthesis models, and several approaches, such as global style tokens, have been proposed to explore the style controllability of end-to-end models. Although the existing methods show good performance in style disentanglement and transfer, they are still unable to explicitly control the emotion of the generated speech. In this paper, we focus on the subtle control of expressive speech synthesis, where the emotion category and strength can be easily controlled with a discrete emotional vector and a continuous simple scalar, respectively. The continuous strength controller is learned by a ranking function according to the relative attribute measured on an emotion dataset. Our method automatically learns the relationship between low-level acoustic features and high-level subtle emotion strength. Experiments show that our method can effectively improve the controllability of an expressive end-to-end model.
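The strength controller described in the abstract builds on relative attributes: a ranking function trained on ordered pairs of samples (stronger vs. weaker emotion) whose score is then squashed into a continuous scalar. A minimal sketch of that idea, using a RankSVM-style pairwise hinge loss; this is not the authors' implementation, and the function names and the toy four-dimensional "acoustic" features are illustrative assumptions:

```python
import numpy as np

def fit_ranker(X_strong, X_weak, lr=0.01, epochs=300, C=1.0):
    """Learn a linear ranking function w (RankSVM-style) so that
    w @ x_strong > w @ x_weak + 1 for each ordered pair, by gradient
    descent on the pairwise hinge loss with L2 regularization."""
    diffs = X_strong - X_weak          # one ordered (strong - weak) pair per row
    w = np.zeros(diffs.shape[1])
    for _ in range(epochs):
        margins = diffs @ w
        violated = margins < 1.0       # pairs still inside the margin
        grad = w - C * diffs[violated].sum(axis=0)
        w -= lr * grad
    return w

def strength_scalar(x, w, lo, hi):
    """Map a ranking score w @ x to a [0, 1] emotion-strength scalar."""
    return float(np.clip((x @ w - lo) / (hi - lo), 0.0, 1.0))

# Toy data: "strong" samples differ from "weak" ones along one feature dim.
rng = np.random.default_rng(0)
X_weak = rng.normal(size=(60, 4))
X_strong = rng.normal(size=(60, 4))
X_strong[:, 0] += 3.0

w = fit_ranker(X_strong, X_weak)
scores = np.concatenate([X_strong, X_weak]) @ w
s = strength_scalar(X_strong[0], w, scores.min(), scores.max())
```

At synthesis time, the normalized score `s` would play the role of the continuous strength scalar fed alongside the discrete emotion vector.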
Pages: 192-199 (8 pages)
Related Papers
50 records
[41]   Testing the Limits of Representation Mixing for Pronunciation Correction in End-to-End Speech Synthesis [J].
Fong, Jason ;
Taylor, Jason ;
King, Simon .
INTERSPEECH 2020, 2020, :4019-4023
[42]   ATTENTION-AUGMENTED END-TO-END MULTI-TASK LEARNING FOR EMOTION PREDICTION FROM SPEECH [J].
Zhang, Zixing ;
Wu, Bingwen ;
Schuller, Bjoern .
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, :6705-6709
[43]   Insights on Neural Representations for End-to-End Speech Recognition [J].
Ollerenshaw, Anna ;
Jalal, Asif ;
Hain, Thomas .
INTERSPEECH 2021, 2021, :4079-4083
[44]   Hybrid end-to-end model for Kazakh speech recognition [J].
Mamyrbayev, O.Z. ;
Oralbekova, D.O. ;
Alimhan, K. ;
Nuranbayeva, B.M. .
International Journal of Speech Technology, 2023, 26 (02) :261-270
[45]   MINTZAI: End-to-end Deep Learning for Speech Translation [J].
Etchegoyhen, Thierry ;
Arzelus, Haritz ;
Gete, Harritxu ;
Alvarez, Aitor ;
Hernaez, Inma ;
Navas, Eva ;
Gonzalez-Docasal, Ander ;
Osacar, Jaime ;
Benites, Edson ;
Ellakuria, Igor ;
Calonge, Eusebi ;
Martin, Maite .
PROCESAMIENTO DEL LENGUAJE NATURAL, 2020, (65) :97-100
[46]   Towards End-to-End Speech-to-Text Summarization [J].
Monteiro, Raul ;
Pernes, Diogo .
TEXT, SPEECH, AND DIALOGUE, TSD 2023, 2023, 14102 :304-316
[47]   A COMPARATIVE STUDY ON END-TO-END SPEECH TO TEXT TRANSLATION [J].
Bahar, Parnia ;
Bieschke, Tobias ;
Ney, Hermann .
2019 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU 2019), 2019, :792-799
[48]   IMPROVING END-TO-END SPEECH SYNTHESIS WITH LOCAL RECURRENT NEURAL NETWORK ENHANCED TRANSFORMER [J].
Zheng, Yibin ;
Li, Xinhui ;
Xie, Fenglong ;
Lu, Li .
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020, :6734-6738
[49]   Towards end-to-end speech recognition with transfer learning [J].
Qin, Chu-Xiong ;
Qu, Dan ;
Zhang, Lian-Hai .
EURASIP JOURNAL ON AUDIO SPEECH AND MUSIC PROCESSING, 2018,
[50]   Combination of end-to-end and hybrid models for speech recognition [J].
Wong, Jeremy H. M. ;
Gaur, Yashesh ;
Zhao, Rui ;
Lu, Liang ;
Sun, Eric ;
Li, Jinyu ;
Gong, Yifan .
INTERSPEECH 2020, 2020, :1783-1787