Emotional transplant in statistical speech synthesis based on emotion additive model

Times cited: 0
Authors
Ohtani, Yamato [1]
Nasu, Yu [1]
Morita, Masahiro [1]
Akamine, Masami [1]
Affiliation
[1] Toshiba Co Ltd, Corporate R&D Ctr, Knowledge Media Lab, Tokyo, Japan
Source
16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5 | 2015
Keywords
speech synthesis; emotional speech synthesis; emotional transplant; hidden Markov model; eigenvoice; SPEAKER ADAPTATION; EXPRESSIONS; FEATURES; STYLES; HSMM;
DOI
Not available
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
This paper proposes a novel method to transplant emotions to a new speaker in statistical speech synthesis based on an emotion additive model (EAM), which represents the differences between emotional and neutral voices. The method trains the EAM using neutral and emotional speech data from multiple speakers and applies it to a neutral voice model of a new (target) speaker. However, the speaker mismatch between the EAM and the target neutral voice model degrades speech quality to some extent. To alleviate this mismatch, we introduce an eigenvoice technique into the framework. We build neutral voice models and EAMs for multiple speakers, and construct an eigenvoice space consisting of the neutral voice models and EAMs. To transplant the emotion to the target speaker, the proposed method estimates the weights of the eigenvoices from the target speaker's neutral speech data based on a maximum likelihood criterion. The EAM of the target speaker is obtained by applying the estimated weights to the EAM parameters of the eigenvoice space. Emotional speech is then generated using this EAM and the neutral voice model. Experimental results show that the proposed method performs emotional speech synthesis with reasonable emotions and high speech quality.
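The eigenvoice-based transplant summarized in the abstract can be sketched with a few lines of linear algebra. This is a minimal illustration under stated assumptions, not the authors' implementation: each voice model is reduced to a supervector of Gaussian mean vectors (M Gaussians of dimension D), covariances are diagonal and held fixed, the eigenvoices are taken as the top principal components of the training speakers' joint [neutral | EAM] supervectors, and the weights come from the standard closed-form maximum-likelihood eigenvoice solution using the target's neutral occupancy and first-order statistics. The function names (`build_eigenvoice_space`, `ml_eigenvoice_weights`, `transplant_emotion`, `demo`) and the synthetic data in `demo` are hypothetical.

```python
import numpy as np

# Hypothetical sketch: mean-only supervectors, fixed diagonal covariances,
# eigenvoices from PCA of joint [neutral | EAM] supervectors.


def build_eigenvoice_space(neutral, eam, n_eigen):
    """PCA over joint [neutral | EAM] supervectors of the training speakers.

    neutral, eam: arrays of shape (S, M*D), one row per training speaker.
    Returns the mean joint supervector (2*M*D,) and the top n_eigen
    eigenvoices as rows of an (n_eigen, 2*M*D) array.
    """
    joint = np.hstack([neutral, eam])
    mean = joint.mean(axis=0)
    # Right singular vectors of the centred data are the principal directions.
    _, _, vt = np.linalg.svd(joint - mean, full_matrices=False)
    return mean, vt[:n_eigen]


def ml_eigenvoice_weights(mean_n, eig_n, gamma, first_order, var):
    """Maximum-likelihood eigenvoice weights from target *neutral* statistics.

    mean_n:      (M, D)    neutral part of the mean supervector
    eig_n:       (K, M, D) neutral part of each eigenvoice
    gamma:       (M,)      occupancy of each Gaussian on the target data
    first_order: (M, D)    sum over frames of gamma_m(t) * o_t
    var:         (M, D)    diagonal covariances, held fixed
    Solves (sum_m gamma_m E_m S_m^-1 E_m^T) w = sum_m E_m S_m^-1 (s_m - gamma_m mu_m).
    """
    K = eig_n.shape[0]
    A = np.zeros((K, K))
    b = np.zeros(K)
    for m in range(mean_n.shape[0]):
        E_scaled = eig_n[:, m, :] / var[m]          # (K, D): rows scaled by S_m^-1
        A += gamma[m] * E_scaled @ eig_n[:, m, :].T
        b += E_scaled @ (first_order[m] - gamma[m] * mean_n[m])
    return np.linalg.solve(A, b)


def transplant_emotion(mean, eigvoices, target_neutral, gamma, first_order, var):
    """Estimate the target EAM and add it to the target neutral means."""
    M, D = target_neutral.shape
    half = M * D
    w = ml_eigenvoice_weights(
        mean[:half].reshape(M, D),
        eigvoices[:, :half].reshape(-1, M, D),
        gamma, first_order, var,
    )
    # The same weights are applied to the EAM half of the eigenvoice space.
    target_eam = (mean[half:] + w @ eigvoices[:, half:]).reshape(M, D)
    return target_neutral + target_eam, target_eam


def demo(seed=0):
    """Synthetic check: all speakers lie in a 2-D joint subspace (an assumption)."""
    rng = np.random.default_rng(seed)
    M, D, S, K = 4, 3, 8, 2
    base = rng.normal(size=2 * M * D)
    basis = rng.normal(size=(K, 2 * M * D))
    joint = base + rng.normal(size=(S + 1, K)) @ basis   # last row is the target
    train, target = joint[:S], joint[S]
    mean, eig = build_eigenvoice_space(train[:, :M * D], train[:, M * D:], K)
    target_neutral = target[:M * D].reshape(M, D)
    gamma = rng.uniform(50, 200, size=M)
    first_order = gamma[:, None] * target_neutral         # noise-free statistics
    var = rng.uniform(0.5, 2.0, size=(M, D))
    _, eam = transplant_emotion(mean, eig, target_neutral, gamma, first_order, var)
    return eam, target[M * D:].reshape(M, D)
```

Only the neutral half of each eigenvoice enters the weight estimate, so the target needs no emotional recordings; the EAM half is read off with the same weights and added to the neutral means. In the noise-free synthetic `demo`, where every speaker lies in a two-dimensional joint subspace, the predicted EAM should match the true one up to rounding error; with real data the match would only be approximate.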
Pages: 274–278
Number of pages: 5