Learning Structured Sparse Representations for Voice Conversion

Cited by: 9
Authors
Ding, Shaojin [1]
Zhao, Guanlong [1]
Liberatore, Christopher [1]
Gutierrez-Osuna, Ricardo [1]
Affiliations
[1] Texas A&M Univ, Dept Comp Sci & Engn, College Stn, TX 77843 USA
Funding
U.S. National Science Foundation
Keywords
Dictionaries; Phonetics; Training; Machine learning; Encoding; Speech processing; Clustering algorithms; Voice conversion; Sparse coding; Sparse representation; Dictionary learning; Matrix factorization; Neural networks
DOI
10.1109/TASLP.2019.2955289
Chinese Library Classification
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Sparse-coding techniques for voice conversion assume that an utterance can be decomposed into a sparse code that carries only linguistic content, and a dictionary of atoms that captures the speaker's characteristics. However, conventional dictionary-construction and sparse-coding algorithms rarely meet this assumption. As a result, the sparse code is no longer speaker-independent, which degrades voice-conversion performance. In this paper, we propose a Cluster-Structured Sparse Representation (CSSR) that improves the speaker independence of the representations. CSSR consists of two complementary components: a Cluster-Structured Dictionary Learning module that groups atoms in the dictionary into clusters, and a Cluster-Selective Objective Function that encourages each speech frame to be represented by atoms from a small number of clusters. We conducted four experiments on the CMU ARCTIC corpus to evaluate the proposed method. In a first ablation study, results show that each of the two CSSR components enhances speaker independence, and that combining both components leads to further improvements. In a second experiment, we find that CSSR uses increasingly larger dictionaries more efficiently than phoneme-based representations by allowing finer-grained decompositions of speech sounds. In a third experiment, results from objective and subjective measurements show that CSSR outperforms prior voice-conversion methods, improving the acoustic quality of the synthesized speech while retaining the target speaker's voice identity. Finally, we show that CSSR captures latent (i.e., phonetic) information in the speech signal.
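To make the core idea concrete, the following is a minimal sketch, not the authors' algorithm: it illustrates cluster-selective sparse coding of a single speech frame, where the code's support is greedily restricted to the few dictionary clusters that best reconstruct the frame. The function name, the greedy cluster-scoring heuristic, and the least-squares solve are all illustrative assumptions; the paper's actual objective function and optimization differ.

```python
import numpy as np

def cluster_selective_code(x, D, cluster_ids, n_clusters_active=2):
    """Sparse-code frame x over dictionary D (atoms as columns),
    restricting the support to a small number of atom clusters.

    Illustrative stand-in for a cluster-selective objective: score
    each cluster by its own least-squares residual, keep the best
    few clusters, then solve a restricted least-squares problem.
    """
    # Score each cluster by how well its atoms alone reconstruct x.
    scores = {}
    for k in np.unique(cluster_ids):
        Dk = D[:, cluster_ids == k]
        ck, *_ = np.linalg.lstsq(Dk, x, rcond=None)
        scores[k] = np.linalg.norm(x - Dk @ ck)
    # Keep the n_clusters_active clusters with the lowest residual.
    best = sorted(scores, key=scores.get)[:n_clusters_active]
    support = np.isin(cluster_ids, best)
    # Solve for coefficients on the selected atoms only; all other
    # code entries stay exactly zero (cluster-structured sparsity).
    code = np.zeros(D.shape[1])
    cs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
    code[support] = cs
    return code
```

A frame synthesized from atoms of a single cluster is then recovered with its code supported entirely on that cluster, which is the property the paper's objective is designed to encourage across an utterance.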
Pages: 343-354
Page count: 12