The USTC System for Voice Conversion Challenge 2016: Neural Network Based Approaches for Spectrum, Aperiodicity and F0 Conversion

Cited by: 4
Authors
Chen, Ling-Hui [1, 2]
Liu, Li-Juan [2 ]
Ling, Zhen-Hua [1 ]
Jiang, Yuan [2 ]
Dai, Li-Rong [1 ]
Affiliations
[1] Univ Sci & Technol China, Natl Engn Lab Speech & Language Informat Proc, Hefei, Anhui, Peoples R China
[2] IFLYTEK Res, Hefei, Anhui, Peoples R China
Source
17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES | 2016
Keywords
voice conversion; frequency warping; DNN; RNN; LSTM
DOI
10.21437/Interspeech.2016-456
Chinese Library Classification (CLC) number
O42 [Acoustics]
Subject classification code
070206; 082403
Abstract
This paper introduces the methods we adopt to build our system for the evaluation event of Voice Conversion Challenge (VCC) 2016. We propose to use neural network-based approaches to convert both spectral and excitation features. First, the generatively trained deep neural network (GTDNN) is adopted for spectral envelope conversion after the spectral envelopes have been pre-processed by frequency warping. Second, we propose to use a recurrent neural network (RNN) with long short-term memory (LSTM) cells for F0 trajectory conversion. In addition, we adopt a DNN for band aperiodicity conversion. Both internal tests and formal VCC evaluation results demonstrate the effectiveness of the proposed methods.
Pages: 1642-1646
Page count: 5