Syllable-level representations of suprasegmental features for DNN-based text-to-speech synthesis

Cited by: 3

Authors:

Ribeiro, Manuel Sam [1]

Watts, Oliver [1]

Yamagishi, Junichi [1,2]

Affiliations:

[1] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland

[2] Natl Inst Informat, Tokyo, Japan

Source:

17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES | 2016

Funding:

Swiss National Science Foundation; Engineering and Physical Sciences Research Council (UK);

Keywords:

speech synthesis; prosody; deep neural networks; suprasegmental representations;

DOI:

10.21437/Interspeech.2016-1034

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

A top-down hierarchical system based on deep neural networks is investigated for the modeling of prosody in speech synthesis. Suprasegmental features are processed separately from segmental features, and a compact distributed representation of high-level units is learned at the syllable level. The suprasegmental representation is then integrated into a frame-level network. Objective measures show that balancing segmental and suprasegmental features can be useful for the frame-level network. Additional features incorporated into the hierarchical system are then tested. At the syllable level, a bag-of-phones representation is proposed and, at the word level, embeddings learned from text sources are used. It is shown that the hierarchical system is able to leverage new features at higher levels more efficiently than a system which exploits them directly at the frame level. A perceptual evaluation of the proposed systems is conducted and followed by a discussion of the results.
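
The abstract names a bag-of-phones representation at the syllable level but does not define it in this record. The sketch below shows one plausible reading, by analogy with bag-of-words: each syllable becomes a vector of phone counts over a fixed inventory, discarding phone order. The inventory `PHONE_INVENTORY` and the function `bag_of_phones` are hypothetical names chosen for illustration and are not taken from the paper.

```python
from collections import Counter

# Hypothetical toy phone inventory (an assumption for illustration);
# a real system would use the full phone set of its pronunciation lexicon.
PHONE_INVENTORY = ["k", "ae", "t", "s", "ih", "ng"]


def bag_of_phones(syllable_phones, inventory=PHONE_INVENTORY):
    """Return a count vector over `inventory` for the phones in one syllable.

    Phone order within the syllable is discarded, as in a bag-of-words model.
    """
    counts = Counter(syllable_phones)
    unknown = set(counts) - set(inventory)
    if unknown:
        raise ValueError(f"phones not in inventory: {sorted(unknown)}")
    return [counts[phone] for phone in inventory]


# A syllable made of the phones k, ae, t.
vector = bag_of_phones(["k", "ae", "t"])
print(vector)  # [1, 1, 1, 0, 0, 0]
```

Under this assumption, a vector of this kind would be one of the inputs to the syllable-level network, whose learned compact representation is then passed on to the frame-level network described in the abstract.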

Pages: 3186-3190

Page count: 5
