The Romanian speech synthesis (RSS) corpus: Building a high quality HMM-based speech synthesis system using a high sampling rate

Cited by: 51
Authors
Stan, Adriana [1]
Yamagishi, Junichi [2]
King, Simon [2]
Aylett, Matthew [3]
Affiliations
[1] Tech Univ Cluj Napoca, Dept Commun, Cluj Napoca 400027, Romania
[2] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9AB, Midlothian, Scotland
[3] CereProc Ltd, Edinburgh EH8 9LE, Midlothian, Scotland
Keywords
Speech synthesis; HTS; Romanian; HMMs; Sampling frequency; Auditory scale
DOI
10.1016/j.specom.2010.12.002
Chinese Library Classification (CLC) number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
This paper first introduces a newly-recorded high quality Romanian speech corpus designed for speech synthesis, called "RSS", along with Romanian front-end text processing modules and HMM-based synthetic voices built from the corpus. All of these are now freely available for academic use in order to promote Romanian speech technology research. The RSS corpus comprises 3500 training sentences and 500 test sentences uttered by a female speaker and was recorded using multiple microphones at 96 kHz sampling frequency in a hemianechoic chamber. The details of the new Romanian text processor we have developed are also given. Using the database, we then revisit some basic configuration choices of speech synthesis, such as waveform sampling frequency and auditory frequency warping scale, with the aim of improving speaker similarity, which is an acknowledged weakness of current HMM-based speech synthesisers. As we demonstrate using perceptual tests, these configuration choices can make substantial differences to the quality of the synthetic speech. Contrary to common practice in automatic speech recognition, higher waveform sampling frequencies can offer enhanced feature extraction and improved speaker similarity for HMM-based speech synthesis. (C) 2010 Elsevier B.V. All rights reserved.
Pages: 442-450
Page count: 9
Related papers
31 in total
[1]  
[Anonymous], 1995, PROC EUROPEAN C SPEE
[2]  
[Anonymous], P INT C SPOK LANG PR
[3]  
[Anonymous], 1984, OLSHEN STONE CLASSIF, DOI 10.2307/2530946
[4]  
Aylett M. P., 2007, AISB, P174
[5]   The SUS test: A method for the assessment of text-to-speech synthesis intelligibility using Semantically Unpredictable Sentences [J].
Benoit, C ;
Grice, M ;
Hazan, V .
SPEECH COMMUNICATION, 1996, 18 (04) :381-392
[6]  
BURILEANU D, 1999, P EUROSPEECH 99 BUD, P2063
[7]   MAXIMUM LIKELIHOOD FROM INCOMPLETE DATA VIA EM ALGORITHM [J].
DEMPSTER, AP ;
LAIRD, NM ;
RUBIN, DB .
JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-METHODOLOGICAL, 1977, 39 (01) :1-38
[8]  
Fant G, 2005, TEXT SPEECH LANG TEC, V24, P199
[9]  
FERENCZ A, 1997, THESIS U CLUJ NAPOCA
[10]  
Frunza O., 2005, P EUROLAN 2005 WORKS