A unit selection text-to-speech-and-singing synthesis framework from neutral speech: proof of concept

Cited by: 1

Authors:

Freixes, Marc [1]

Alias, Francesc [1]

Claudi Socoro, Joan [1]

Affiliations:

[1] La Salle Univ Ramon Llull, Grup Recerca Tecnol Media GTM, Quatre Camins 30, Barcelona 08022, Spain

Source:

EURASIP Journal on Audio, Speech, and Music Processing | 2019 / Vol. 2019 / Issue 01

Keywords:

Text-to-speech; Unit selection; Speech synthesis; Singing synthesis; Speech-to-singing; Voice synthesis system; Plus noise model; Quality

DOI:

10.1186/s13636-019-0163-y

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

Text-to-speech (TTS) synthesis systems have been widely used in general-purpose applications based on the generation of speech. Nonetheless, there are some domains, such as storytelling or voice output aid devices, which may also require singing. To enable a corpus-based TTS system to sing, a supplementary singing database should be recorded. This solution, however, might be too costly for eventual singing needs, or even unfeasible if the original speaker is unavailable or unable to sing properly. This work introduces a unit selection-based text-to-speech-and-singing (US-TTS&S) synthesis framework, which integrates speech-to-singing (STS) conversion to enable the generation of both speech and singing from an input text and a score, respectively, using the same neutral speech corpus. The viability of the proposal is evaluated considering three vocal ranges and two tempos on a proof-of-concept implementation using a 2.6-h Spanish neutral speech corpus. The experiments show that challenging STS transformation factors are required to sing beyond the corpus vocal range and/or with notes longer than 150 ms. While score-driven US configurations allow the reduction of pitch-scale factors, time-scale factors are not reduced due to the short length of the spoken vowels. Moreover, in the MUSHRA test, text-driven and score-driven US configurations obtain similar naturalness rates of around 40 for all the analysed scenarios. Although these naturalness scores are far from those of Vocaloid, the singing scores of around 60 which were obtained validate that the framework could reasonably address eventual singing needs.


Pages: 14

50 items in total

[31] On the Impact of Annotation Errors on Unit-Selection Speech Synthesis
Matousek, Jindrich
Tihelka, Daniel
Smidl, Lubos
TEXT, SPEECH AND DIALOGUE, TSD 2012, 2012, 7499 : 456 - 463
[32] Unit Selection based Speech Synthesis for Poor Channel Condition
Cen, Ling
Dong, Minghui
Chan, Paul
Li, Haizhou
INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 2035 - 2038
[33] PREDICTING SPECTRAL AND PROSODIC PARAMETERS FOR UNIT SELECTION IN SPEECH SYNTHESIS
Dong, Minghui
Li, Haizhou
2008 6TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, 2008, : 133 - 136
[34] A method for combining intonation modelling and speech unit selection in corpus-based speech synthesis systems
Diaz, Francisco Campillo
Rodriguez Banga, Eduardo
SPEECH COMMUNICATION, 2006, 48 (08) : 941 - 956
[35] A Small Footprint Hybrid Statistical and Unit Selection Text-to-Speech Synthesis System for Turkish
Guner, Ekrem
Demiroglu, Cenk
COMPUTER AND INFORMATION SCIENCES II, 2012, : 85 - 91
[36] Unsupervised features from text for speech synthesis in a speech-to-speech translation system
Watts, Oliver
Zhou, Bowen
12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, 2011, : 2164 - 2167
[37] Applying Scalable Phonetic Context Similarity in Unit Selection of Concatenative Text-to-Speech
Zhang, Wei
Cui, Xiaodong
11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, 2010, : 154 - 157
[38] Minimum unit selection error training for HMM-based unit selection speech synthesis system
Ling, Zhen-Hua
Wang, Ren-Hua
2008 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-12, 2008, : 3949 - 3952
[39] COMPRESSED SENSING FOR UNIT SELECTION BASED SPEECH SYNTHESIS
Sharma, Pulkit
Abrol, Vinayak
Sao, Anil Kumar
2015 23RD EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2015, : 1731 - 1735
[40] On the Role of Spectral Dynamics in Unit Selection Speech Synthesis
Kirkpatrick, Barry
O'Brien, Darragh
Scaife, Ronan
Errity, Andrew
INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4, 2007, : 2029 - 2032
