Pre-Training of DNN-Based Speech Synthesis Based on Bidirectional Conversion between Text and Speech

Cited: 1
Authors
Sone, Kentaro [1 ]
Nakashika, Toru [1 ]
Affiliations
[1] University of Electro-Communications, Graduate School of Informatics and Engineering, Tokyo 182-8585, Japan
Funding
Japan Science and Technology Agency (JST);
Keywords
speech synthesis; generative models; Boltzmann distributions; pre-training methods; deep neural networks;
DOI
10.1587/transinf.2018EDP7344
CLC Number
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Conventional approaches to statistical parametric speech synthesis use context-dependent hidden Markov models (HMMs) clustered with decision trees to generate speech parameters from linguistic features. However, decision trees cannot always model the complex context dependencies of linguistic features efficiently. An alternative scheme that replaces decision trees with deep neural networks (DNNs) has been proposed to overcome this difficulty: by training the network to represent high-dimensional feedforward dependencies from linguistic features to acoustic features, DNN-based speech synthesis systems convert text into speech. To improve the naturalness of the synthesized speech, this paper presents a novel pre-training method for DNN-based statistical parametric speech synthesis systems. In our method, a deep relational model (DRM), which represents the joint probability of two sets of visible variables, is applied to describe the joint distribution of acoustic and linguistic features. Like a DNN, a DRM consists of several hidden layers, but it has two visible layers. Whereas a DNN represents feedforward dependencies from one set of visible variables (inputs) to another (outputs), a DRM can represent the bidirectional dependencies between the two sets of visible variables. During maximum-likelihood (ML) training, the model optimizes the parameters of its deep architecture (the connection weights between adjacent layers and the biases) by considering the bidirectional conversion between 1) acoustic features given linguistic features and 2) linguistic features given the acoustic features that the model itself generates. Because it takes into account whether the generated acoustic features are recognizable, our method obtains reasonable parameters for speech synthesis. Experimental results on a speech synthesis task show that DNN-based systems pre-trained with the proposed method outperformed randomly initialized DNN-based systems, especially when the amount of training data was limited. Additionally, speaker-dependent speech recognition experiments show that our method outperformed DNN-based systems when its initial parameters were set to the same values as in the synthesis experiments.
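To make the pre-training idea concrete, below is a minimal sketch (Python/NumPy) of a Boltzmann-style joint model with two visible layers: linguistic features x and acoustic features y sharing one hidden layer h. This is an illustrative assumption, not the authors' DRM: the real model is deep, while this sketch uses a single hidden layer trained with one-step contrastive divergence (CD-1) on binary toy data, and all names and sizes are hypothetical. What it demonstrates is the bidirectional reconstruction (x from y and y from x through the shared hidden code) whose learned weights can then initialize a feedforward DNN.

# Minimal sketch (NOT the authors' implementation) of pre-training with a
# joint model over two visible layers: linguistic features x and acoustic
# features y, sharing one hidden layer h. The single-hidden-layer model,
# CD-1 training, binary units, and all sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class JointRBM:
    """Two visible layers (x: linguistic, y: acoustic) joined by one hidden layer."""
    def __init__(self, nx, ny, nh, lr=0.01):
        self.Wx = rng.normal(0, 0.01, (nx, nh))  # x-to-hidden weights
        self.Wy = rng.normal(0, 0.01, (ny, nh))  # y-to-hidden weights
        self.bx = np.zeros(nx)                   # visible biases for x
        self.by = np.zeros(ny)                   # visible biases for y
        self.bh = np.zeros(nh)                   # hidden biases
        self.lr = lr

    def hidden_prob(self, x, y):
        # Hidden activations depend on BOTH visible layers.
        return sigmoid(x @ self.Wx + y @ self.Wy + self.bh)

    def cd1_step(self, x, y):
        """One CD-1 update: reconstruct both visible layers through h, so the
        model must explain x given y AND y given x (the bidirectional idea)."""
        h0 = self.hidden_prob(x, y)
        x1 = sigmoid(h0 @ self.Wx.T + self.bx)   # reconstructed linguistic features
        y1 = sigmoid(h0 @ self.Wy.T + self.by)   # reconstructed acoustic features
        h1 = self.hidden_prob(x1, y1)
        n = x.shape[0]
        self.Wx += self.lr * (x.T @ h0 - x1.T @ h1) / n
        self.Wy += self.lr * (y.T @ h0 - y1.T @ h1) / n
        self.bx += self.lr * (x - x1).mean(axis=0)
        self.by += self.lr * (y - y1).mean(axis=0)
        self.bh += self.lr * (h0 - h1).mean(axis=0)

# Toy data: binary "linguistic" and "acoustic" feature vectors.
x = (rng.random((64, 30)) > 0.5).astype(float)
y = (rng.random((64, 20)) > 0.5).astype(float)
model = JointRBM(nx=30, ny=20, nh=40)
for _ in range(100):
    model.cd1_step(x, y)

# The learned weights can then initialize a feedforward DNN x -> h -> y:
dnn_W1, dnn_W2 = model.Wx.copy(), model.Wy.T.copy()

In the paper's actual method, the DRM is deep, and its ML-trained connection weights and biases serve as the initial parameters of the synthesis DNN, which is then fine-tuned on the linguistic-to-acoustic mapping.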
Pages: 1546-1553
Number of pages: 8