A new Korean corpus-based text-to-speech system

Cited by: 0
Authors
Kim S. [1]
Lee Y. [1]
Hirose K. [2]
Affiliations
[1] Spoken Language Processing Team, Electronics and Telecommunications Research Institute
[2] Department of Frontier Informatics, School of Frontier Sciences, University of Tokyo
Keywords
Korean; Prosody; Synthesis; TTS
DOI
10.1023/A:1015454829127
Abstract
This paper describes a new Korean Text-to-Speech (TTS) system based on a large speech corpus. Conventional concatenative TTS systems still produce machine-like synthetic speech. This poor naturalness is caused by excessive prosodic modification of units drawn from a small speech database. To cope with this problem, we used a dynamic unit selection method based on a large speech database, without prosodic modification. The proposed TTS system adopts triphones as synthesis units. We designed a new sentence set that maximizes phonetic or prosodic coverage of Korean triphones. All the utterances were segmented automatically into phonemes using a speech recognizer. With the segmented phonemes, we assigned a synthesis unit cost of zero when two synthesis units were adjacent in an original utterance. This reduces the number of concatenation points at which mismatches may occur. We also present data on how major prosodic variations are realized by taking prosodic phrase break strength into account. Phrase breaks were divided into four strength levels based on pause length. Using phrase break strength, triphones were further classified to reflect major prosodic variations. To predict phrase break strength from text, we adopted an HMM-like Part-of-Speech (POS) sequence model. The model achieved 73.5% accuracy for 4-level break strength prediction. For unit selection, a Viterbi beam search was performed to find the most appropriate triphone sequence, that is, the one with the minimum continuation cost of prosody and spectrum at concatenation boundaries. In an informal listening test, the proposed Korean corpus-based TTS system showed better naturalness than the conventional demisyllable-based one.
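The unit selection step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Unit` fields, the `join_cost` function, and the beam width are all hypothetical, and the pitch-only join cost is a stand-in for the paper's continuation cost of prosody and spectrum. The one detail taken from the abstract is that two units that were adjacent in the same corpus utterance join at zero cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One candidate triphone instance from the corpus (all fields hypothetical)."""
    triphone: str    # triphone label, e.g. "k-a+m"
    utt_id: int      # corpus utterance the unit was cut from
    position: int    # index of the unit within that utterance
    f0_start: float  # pitch at the left boundary (Hz)
    f0_end: float    # pitch at the right boundary (Hz)


def join_cost(prev: Unit, cur: Unit) -> float:
    """Assumed continuation cost between two consecutive selected units."""
    # Units that were neighbours in the original recording join for free.
    if prev.utt_id == cur.utt_id and cur.position == prev.position + 1:
        return 0.0
    # Stand-in for a prosody + spectrum mismatch measure: pitch jump only.
    return abs(prev.f0_end - cur.f0_start)


def select_units(targets, inventory, beam_width=5):
    """Viterbi search with beam pruning; returns (total_cost, unit_path)."""
    beam = [(0.0, [])]  # surviving hypotheses: (accumulated cost, path)
    for triphone in targets:
        candidates = inventory.get(triphone, [])
        if not candidates:
            raise KeyError(f"no candidates for triphone {triphone!r}")
        best = {}
        for cand in candidates:
            # Viterbi step: keep only the cheapest predecessor per candidate.
            options = [
                (cost + (join_cost(path[-1], cand) if path else 0.0), path)
                for cost, path in beam
            ]
            cost, path = min(options, key=lambda item: item[0])
            best[cand] = (cost, path + [cand])
        # Beam pruning: keep the beam_width cheapest hypotheses.
        beam = sorted(best.values(), key=lambda item: item[0])[:beam_width]
    return beam[0]


# Toy inventory: utterance 1 contains a, b, c in a row, so it should win.
inventory = {
    "a": [Unit("a", 1, 0, 100.0, 120.0), Unit("a", 2, 0, 100.0, 110.0)],
    "b": [Unit("b", 1, 1, 200.0, 210.0), Unit("b", 3, 0, 111.0, 115.0)],
    "c": [Unit("c", 1, 2, 300.0, 310.0), Unit("c", 3, 1, 150.0, 160.0)],
}
total_cost, path = select_units(["a", "b", "c"], inventory)
print(total_cost, [(u.utt_id, u.position) for u in path])
```

In this toy inventory the search picks the three units from utterance 1, because every join between them costs zero. A real system would replace `join_cost` with a measure combining prosodic and spectral distance, and might add a target cost per unit, which this sketch omits.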
Pages: 105-116
Number of pages: 11