Syllable-level representations of suprasegmental features for DNN-based text-to-speech synthesis

Cited by: 3
Authors
Ribeiro, Manuel Sam [1]
Watts, Oliver [1]
Yamagishi, Junichi [1, 2]
Affiliations
[1] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland
[2] Natl Inst Informat, Tokyo, Japan
Source
17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES | 2016
Funding
Swiss National Science Foundation; Engineering and Physical Sciences Research Council (UK)
Keywords
speech synthesis; prosody; deep neural networks; suprasegmental representations
DOI
10.21437/Interspeech.2016-1034
Chinese Library Classification
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
A top-down hierarchical system based on deep neural networks is investigated for the modeling of prosody in speech synthesis. Suprasegmental features are processed separately from segmental features, and a compact distributed representation of high-level units is learned at the syllable level. The suprasegmental representation is then integrated into a frame-level network. Objective measures show that balancing segmental and suprasegmental features can be useful for the frame-level network. Additional features incorporated into the hierarchical system are then tested. At the syllable level, a bag-of-phones representation is proposed and, at the word level, embeddings learned from text sources are used. It is shown that the hierarchical system is able to leverage new features at higher levels more efficiently than a system which exploits them directly at the frame level. A perceptual evaluation of the proposed systems is conducted and followed by a discussion of the results.
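The abstract names a bag-of-phones representation at the syllable level but does not define it in this record. A minimal sketch of one plausible reading follows, assuming the representation is a count vector over a fixed phone inventory for the phones in a syllable. The inventory `PHONE_INVENTORY` and the function `bag_of_phones` are hypothetical names introduced for illustration only, not the authors' implementation.

```python
from collections import Counter

# Hypothetical toy phone inventory (assumption); the phone set used in the
# paper is not given in this record.
PHONE_INVENTORY = ["p", "t", "k", "s", "ae", "ih", "n"]


def bag_of_phones(syllable_phones, inventory=PHONE_INVENTORY):
    """Return a count vector over `inventory` for the phones of one syllable.

    Phone order within the syllable is discarded; only counts are kept.
    This is an assumed interpretation of "bag-of-phones".
    """
    counts = Counter(syllable_phones)
    unknown = set(counts) - set(inventory)
    if unknown:
        raise ValueError(f"phones not in inventory: {sorted(unknown)}")
    return [counts[phone] for phone in inventory]


# Example: the syllable /k ae t/ mapped onto the toy inventory.
vector = bag_of_phones(["k", "ae", "t"])
```

Under this reading, such a vector would be one of the syllable-level inputs, kept separate from the frame-level segmental features.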
Pages: 3186-3190
Page count: 5