VOICE CONVERSION BASED ON NON-NEGATIVE MATRIX FACTORIZATION USING PHONEME-CATEGORIZED DICTIONARY

Cited by: 0
Authors
Aihara, Ryo [1]
Nakashika, Toru [1]
Takiguchi, Tetsuya [1]
Ariki, Yasuo [1]
Affiliation
[1] Kobe Univ, Grad Sch Syst Informat, Nada Ku, Kobe, Hyogo 6578501, Japan
Source
2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2014
Keywords
voice conversion; sparse representation; non-negative matrix factorization; sub-dictionary
DOI
Not available
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
This paper presents an exemplar-based voice conversion (VC) method that uses a phoneme-categorized dictionary. Sparse-representation-based VC using non-negative matrix factorization (NMF) is employed for spectral conversion between different speakers. In our previous NMF-based VC method, source and target exemplars are extracted from parallel training data, in which the source and target speakers utter the same texts. The input source signal is represented by the source exemplars and their weights, and the converted speech is then constructed from the target exemplars and the weights estimated for the source exemplars. However, this exemplar-based approach must hold all the training exemplars (frames), which may cause phoneme mismatches between the input signal and the selected exemplars. To reduce this mismatch in phoneme alignment, we propose a phoneme-categorized sub-dictionary and a dictionary selection method using NMF. Using the sub-dictionary improves VC performance over conventional NMF-based VC. The effectiveness of the method was confirmed by comparison with a conventional Gaussian mixture model (GMM)-based method and a conventional NMF-based method.
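The exemplar-based conversion described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the KL-divergence multiplicative update with an L1 sparsity penalty is a common choice for fixed-dictionary NMF and is assumed here, and all function names, parameters, and array shapes are hypothetical. The abstract does not state how a sub-dictionary is selected, so picking the sub-dictionary with the lowest per-frame reconstruction divergence is likewise an assumption made only for illustration.

```python
import numpy as np

def estimate_activations(X, D, n_iter=200, sparsity=0.1, eps=1e-12):
    """Estimate non-negative weights H such that X ~= D @ H, with D held fixed.

    X: (n_bins, n_frames) non-negative input spectra.
    D: (n_bins, n_exemplars) non-negative dictionary of exemplars.
    Uses the multiplicative update for KL divergence with an L1 penalty
    (an assumed objective; the abstract does not specify one).
    """
    rng = np.random.default_rng(0)
    H = rng.random((D.shape[1], X.shape[1])) + eps
    col_sums = D.sum(axis=0)[:, None]
    for _ in range(n_iter):
        ratio = X / (D @ H + eps)
        H *= (D.T @ ratio) / (col_sums + sparsity)
    return H

def convert(X_src, D_src, D_tgt, **kwargs):
    """Represent the input with source exemplars, then rebuild with target exemplars."""
    H = estimate_activations(X_src, D_src, **kwargs)
    return D_tgt @ H, H

def kl_per_frame(X, Y, eps=1e-12):
    """Generalized KL divergence between X and Y, summed over bins for each frame."""
    return (X * np.log((X + eps) / (Y + eps)) - X + Y).sum(axis=0)

def convert_with_subdictionaries(X_src, sub_dicts, **kwargs):
    """Convert each frame with one phoneme sub-dictionary.

    sub_dicts maps a phoneme label to a (D_src, D_tgt) pair of parallel exemplars.
    Selection rule (an assumption): lowest per-frame reconstruction divergence.
    """
    labels = list(sub_dicts)
    costs, outputs = [], []
    for label in labels:
        D_src, D_tgt = sub_dicts[label]
        Y, H = convert(X_src, D_src, D_tgt, **kwargs)
        costs.append(kl_per_frame(X_src, D_src @ H))
        outputs.append(Y)
    best = np.argmin(np.stack(costs), axis=0)  # index of chosen sub-dictionary per frame
    Y_out = np.empty_like(outputs[0])
    for t, p in enumerate(best):
        Y_out[:, t] = outputs[p][:, t]
    return Y_out, [labels[p] for p in best]
```

Because the source and target exemplars are parallel (column k of the target dictionary corresponds to column k of the source dictionary), the same weights can be reused on the target side. Restricting each frame to one sub-dictionary keeps its weights from spreading over exemplars of other phonemes, which is the mismatch the abstract aims to reduce.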
Pages: 5