The USTC System for Voice Conversion Challenge 2016: Neural Network Based Approaches for Spectrum, Aperiodicity and F0 Conversion

Cited by: 4
Authors
Chen, Ling-Hui [1, 2]
Liu, Li-Juan [2 ]
Ling, Zhen-Hua [1 ]
Jiang, Yuan [2 ]
Dai, Li-Rong [1 ]
Affiliations
[1] Univ Sci & Technol China, Natl Engn Lab Speech & Language Informat Proc, Hefei, Anhui, Peoples R China
[2] IFLYTEK Res, Hefei, Anhui, Peoples R China
Source
17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES | 2016
Keywords
voice conversion; frequency warping; DNN; RNN; LSTM;
DOI
10.21437/Interspeech.2016-456
Chinese Library Classification number
O42 [Acoustics];
Subject classification code
070206 ; 082403 ;
Abstract
This paper introduces the methods we adopt to build our system for the evaluation event of Voice Conversion Challenge (VCC) 2016. We propose to use neural network-based approaches to convert both spectral and excitation features. First, the generatively trained deep neural network (GTDNN) is adopted for spectral envelope conversion after the spectral envelopes have been pre-processed by frequency warping. Second, we propose to use a recurrent neural network (RNN) with long short-term memory (LSTM) cells for F0 trajectory conversion. In addition, we adopt a DNN for band aperiodicity conversion. Both internal tests and formal VCC evaluation results demonstrate the effectiveness of the proposed methods.
Pages: 1642-1646
Page count: 5