Evaluation of Expressive Speech Synthesis With Voice Conversion and Copy Resynthesis Techniques

Cited by: 32

Authors:

Turk, Oytun [1]

Schroeder, Marc [2]

Affiliations:

[1] Sensory Inc, Portland, OR 97209 USA

[2] DFKI GmbH Language Technol Lab, Speech Grp, D-66123 Saarbrucken, Germany

Source:

IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2010 / Vol. 18 / No. 5

Keywords:

Expressive speech synthesis; prosody; voice conversion; voice quality transformation

DOI:

10.1109/TASL.2010.2041113

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Generating expressive synthetic voices requires carefully designed databases that contain a sufficient amount of expressive speech material. This paper investigates voice conversion and modification techniques to reduce database collection and processing efforts while maintaining acceptable quality and naturalness. In a factorial design, we study the relative contributions of voice quality and prosody as well as the amount of distortion introduced by the respective signal manipulation steps. The unit selection engine in our open source and modular text-to-speech (TTS) framework MARY is extended with voice quality transformation using either GMM-based prediction or vocal tract copy resynthesis. These algorithms are then cross-combined with various prosody copy resynthesis methods. The overall expressive speech generation process functions as a postprocessing step on TTS outputs to transform neutral synthetic speech into aggressive, cheerful, or depressed speech. Cross-combinations of voice quality and prosody transformation algorithms are compared in listening tests for perceived expressive style and quality. The results show that there is a tradeoff between identification and naturalness. Combined modeling of both voice quality and prosody leads to the best identification scores at the expense of the lowest naturalness ratings. The fine detail of both voice quality and prosody, as preserved by the copy synthesis, did contribute to better identification compared with the approximate models.


Pages: 965-973

Page count: 9

