Statistical voice conversion with WaveNet-based waveform generation

Cited by: 64
Authors
Kobayashi, Kazuhiro [1]
Hayashi, Tomoki [2]
Tamamori, Akira [3]
Toda, Tomoki [1]
Affiliations
[1] Nagoya Univ, Informat Technol Ctr, Nagoya, Aichi, Japan
[2] Nagoya Univ, Grad Sch Informat Sci, Nagoya, Aichi, Japan
[3] Nagoya Univ, Inst Innovat Future Soc, Nagoya, Aichi, Japan
Source
18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION | 2017
Keywords
voice conversion; WaveNet; vocoder; Gaussian mixture model; deep neural networks; PLUS NOISE MODEL; SPARSE REPRESENTATION; VOCODER;
DOI
10.21437/Interspeech.2017-986
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
This paper presents a statistical voice conversion (VC) technique with WaveNet-based waveform generation. VC based on a Gaussian mixture model (GMM) makes it possible to convert the speaker identity of a source speaker into that of a target speaker. However, in the conventional vocoding process, factors such as F0 extraction errors, parameterization errors, and over-smoothing of the converted feature trajectory introduce modeling errors in the speech waveform, which usually degrade the sound quality of the converted voice. To address this issue, we apply a direct waveform generation technique based on a WaveNet vocoder to VC. In the proposed method, the acoustic features of the source speaker are first converted into those of the target speaker with the GMM. The waveform samples of the converted voice are then generated by the WaveNet vocoder conditioned on the converted acoustic features. To investigate the modeling accuracy of the converted speech waveform, we compare several types of acoustic features for training and synthesis with the WaveNet vocoder. The experimental results confirm that the proposed VC technique achieves higher conversion accuracy for speaker individuality, with sound quality comparable to that of the conventional VC technique.
Pages: 1138-1142
Page count: 5
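
For illustration only (not the authors' code): a minimal Python sketch of the two-stage pipeline the abstract describes, assuming a joint-density GMM with the standard minimum-mean-square-error mapping for the feature-conversion stage; the WaveNet vocoder stage is indicated only by a comment, since it is a separate neural model. The function names (`train_joint_gmm`, `gmm_convert`) and the toy data are hypothetical.

```python
# Illustrative sketch of: (1) GMM-based conversion of time-aligned source
# acoustic features into target features via the MMSE mapping over a joint
# GMM, and (2) a WaveNet vocoder (placeholder comment only) that would
# generate waveform samples conditioned on the converted features.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture


def train_joint_gmm(src_feats, tgt_feats, n_mix=4):
    """Fit a GMM on time-aligned joint [source; target] feature frames."""
    joint = np.hstack([src_feats, tgt_feats])                    # (T, Dx + Dy)
    return GaussianMixture(n_components=n_mix, covariance_type="full",
                           random_state=0).fit(joint)


def gmm_convert(gmm, src_feats):
    """MMSE mapping: y_hat = sum_m P(m | x) * E[y | x, m] under the joint GMM."""
    dim = src_feats.shape[1]
    means, covs, weights = gmm.means_, gmm.covariances_, gmm.weights_
    mu_x, mu_y = means[:, :dim], means[:, dim:]
    cov_xx, cov_yx = covs[:, :dim, :dim], covs[:, dim:, :dim]

    # Mixture posteriors P(m | x) from the source marginal of the joint GMM.
    lik = np.stack([multivariate_normal(mu_x[m], cov_xx[m]).pdf(src_feats)
                    for m in range(gmm.n_components)], axis=1)   # (T, M)
    post = weights * lik
    post /= post.sum(axis=1, keepdims=True)

    converted = np.zeros((src_feats.shape[0], mu_y.shape[1]))
    for m in range(gmm.n_components):
        # Conditional mean E[y | x, m] = mu_y + Sigma_yx Sigma_xx^-1 (x - mu_x)
        A = cov_yx[m] @ np.linalg.inv(cov_xx[m])                 # (Dy, Dx)
        converted += post[:, m:m + 1] * (mu_y[m] + (src_feats - mu_x[m]) @ A.T)
    return converted


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src = rng.standard_normal((400, 8))                  # stand-in mel-cepstra
    tgt = src @ rng.standard_normal((8, 8)) * 0.3 + rng.standard_normal((400, 8)) * 0.5
    gmm = train_joint_gmm(src, tgt)
    converted = gmm_convert(gmm, src)                    # (400, 8) target-like
    # Second stage (not shown): a WaveNet vocoder conditioned on `converted`
    # (and auxiliary features such as F0) would generate the waveform samples.
    print(converted.shape)
```

The sketch covers only the spectral-feature mapping; the paper's experiments additionally compare which types of acoustic features to use for training and conditioning the WaveNet vocoder.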