IMPROVING SEQUENCE-TO-SEQUENCE VOICE CONVERSION BY ADDING TEXT-SUPERVISION

Cited by: 0
Authors
Zhang, Jing-Xuan [1]
Ling, Zhen-Hua [1]
Jiang, Yuan [2]
Liu, Li-Juan [2]
Liang, Chen [3]
Dai, Li-Rong [1]
Affiliations
[1] Univ Sci & Technol China, Natl Engn Lab Speech & Language Informat Proc, Hefei, Anhui, Peoples R China
[2] iFLYTEK Co Ltd, iFLYTEK Res, Hefei, Anhui, Peoples R China
[3] Anhui Sci & Technol Res Inst, Hefei, Anhui, Peoples R China
Source
2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2019
Keywords
sequence-to-sequence; neural network; voice conversion; text-supervision; deep neural networks
DOI
Not available
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
This paper presents methods of making use of text supervision to improve the performance of sequence-to-sequence (seq2seq) voice conversion. Compared with conventional frame-to-frame voice conversion approaches, the seq2seq acoustic modeling method proposed in our previous work achieved higher naturalness and similarity. In this paper, we further improve its performance by utilizing the text transcriptions of the parallel training data. First, a multi-task learning structure is designed, which adds auxiliary classifiers to the middle layers of the seq2seq model and predicts linguistic labels as a secondary task. Second, a data-augmentation method is proposed, which utilizes text alignment to produce extra parallel sequences for model training. Experiments are conducted to evaluate the proposed methods with training sets of different sizes. Experimental results show that multi-task learning with linguistic labels is effective at reducing the errors of seq2seq voice conversion. The data-augmentation method can further improve the performance of seq2seq voice conversion when only 50 or 100 training utterances are available.
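The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the auxiliary loss weight of 0.1, the use of cross-entropy for the auxiliary classifier, and the choice to cut fragments at randomly selected aligned text-unit boundaries are all assumptions made for illustration; the abstract only states that auxiliary classifiers predict linguistic labels and that text alignment is used to produce extra parallel sequences.

```python
import math
import random


def cross_entropy(probs, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[label])


def multitask_loss(acoustic_loss, aux_probs, aux_labels, weight=0.1):
    """Main conversion loss plus a weighted auxiliary classification loss.

    aux_probs:  one class distribution per step, as produced by a
                classifier attached to a middle layer (assumed setup).
    aux_labels: the linguistic label index for each step.
    weight:     hypothetical value, not taken from the paper.
    """
    aux = sum(cross_entropy(p, y) for p, y in zip(aux_probs, aux_labels))
    aux /= len(aux_labels)
    return acoustic_loss + weight * aux


def augment_pairs(src_frames, tgt_frames, src_align, tgt_align, n_pairs, rng):
    """Cut extra parallel fragments at text-aligned boundaries.

    src_align / tgt_align hold one (start, end) frame span per text unit,
    in the same order for both speakers (assumed alignment format).
    """
    assert len(src_align) == len(tgt_align)
    n_units = len(src_align)
    pairs = []
    for _ in range(n_pairs):
        i = rng.randrange(n_units)      # first text unit of the fragment
        j = rng.randrange(i, n_units)   # last text unit of the fragment
        s0, s1 = src_align[i][0], src_align[j][1]
        t0, t1 = tgt_align[i][0], tgt_align[j][1]
        pairs.append((src_frames[s0:s1], tgt_frames[t0:t1]))
    return pairs


# Toy data: integers stand in for acoustic feature frames.
src = list(range(10))
tgt = list(range(100, 114))
src_align = [(0, 3), (3, 7), (7, 10)]
tgt_align = [(0, 5), (5, 9), (9, 14)]
pairs = augment_pairs(src, tgt, src_align, tgt_align, 4, random.Random(0))
loss = multitask_loss(1.0, [[0.5, 0.5]], [0], weight=0.1)
```

Each fragment pair covers the same span of text units for both speakers, so it can be added to the training set as a shorter parallel sequence. Frame-level details such as feature type and alignment source are left out because the abstract does not specify them.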
Pages: 6785-6789
Number of pages: 5