Evaluation of Expressive Speech Synthesis With Voice Conversion and Copy Resynthesis Techniques

Cited by: 32
Authors
Turk, Oytun [1]
Schroeder, Marc [2]
Affiliations
[1] Sensory Inc, Portland, OR 97209 USA
[2] DFKI GmbH Language Technol Lab, Speech Grp, D-66123 Saarbrucken, Germany
Source
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2010 / Vol. 18 / Issue 05
Keywords
Expressive speech synthesis; prosody; voice conversion; voice quality transformation;
DOI
10.1109/TASL.2010.2041113
Chinese Library Classification number
O42 [Acoustics];
Subject classification code
070206 ; 082403 ;
Abstract
Generating expressive synthetic voices requires carefully designed databases that contain a sufficient amount of expressive speech material. This paper investigates voice conversion and modification techniques to reduce database collection and processing efforts while maintaining acceptable quality and naturalness. In a factorial design, we study the relative contributions of voice quality and prosody as well as the amount of distortion introduced by the respective signal manipulation steps. The unit selection engine in our open-source, modular text-to-speech (TTS) framework MARY is extended with voice quality transformation using either GMM-based prediction or vocal tract copy resynthesis. These algorithms are then cross-combined with various prosody copy resynthesis methods. The overall expressive speech generation process functions as a postprocessing step on TTS outputs to transform neutral synthetic speech into aggressive, cheerful, or depressed speech. Cross-combinations of voice quality and prosody transformation algorithms are compared in listening tests for perceived expressive style and quality. The results show that there is a tradeoff between identification and naturalness. Combined modeling of both voice quality and prosody leads to the best identification scores at the expense of the lowest naturalness ratings. The fine detail of both voice quality and prosody, as preserved by copy synthesis, contributed to better identification than the approximate models.
Pages: 965-973
Number of pages: 9
Related papers
50 in total
  • [31] A Study of Speech Phase in Dysarthria Voice Conversion System
    Chen, Ko-Chiang
    Han, Ji-Yan
    Jhang, Sin-Hua
    Lai, Ying-Hui
    FUTURE TRENDS IN BIOMEDICAL AND HEALTH INFORMATICS AND CYBERSECURITY IN MEDICAL DEVICES, ICBHI 2019, 2020, 74 : 219 - 226
  • [32] ON USING BACKPROPAGATION FOR SPEECH TEXTURE GENERATION AND VOICE CONVERSION
    Chorowski, Jan
    Weiss, Ron J.
    Saurous, Rif A.
    Bengio, Samy
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 2256 - 2260
  • [33] PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion
    Deng, Yimin
    Tang, Huaizhen
    Zhang, Xulong
    Wang, Jianzong
    Cheng, Ning
    Xiao, Jing
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 184 - 192
  • [34] Voice Conversion for Improving Perceived Likability of Uttered Speech
    Horiike, Shinya
    Morise, Masanori
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2020, E103D (05): : 1199 - 1202
  • [35] Electrolaryngeal Speech Enhancement Based on Statistical Voice Conversion
    Nakamura, Keigo
    Toda, Tomoki
    Saruwatari, Hiroshi
    Shikano, Kiyohiro
    INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 1443 - 1446
  • [36] EXPRESSIVE VOICE CONVERSION: A JOINT FRAMEWORK FOR SPEAKER IDENTITY AND EMOTIONAL STYLE TRANSFER
    Du, Zongyang
    Sisman, Berrak
    Zhou, Kun
    Li, Haizhou
    2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 594 - 601
  • [37] Multi-MelGAN Voice Conversion for the Creation of Under-Resourced Child Speech Synthesis
    Govender, Avashna
    Paul, Dipjyoti
    2022 IST-AFRICA CONFERENCE, 2022,
  • [38] Towards Glottal Source Controllability in Expressive Speech Synthesis
    Lorenzo-Trueba, Jaime
    Barra-Chicote, Roberto
    Raitio, Tuomo
    Obin, Nicolas
    Alku, Paavo
    Yamagishi, Junichi
    Montero, Juan M.
    13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, 2012, : 1618 - 1621
  • [39] Runtime and Speech Quality Survey of a Voice Conversion Method
    Jokisch, Oliver
    Birhanu, Yitagessu
    Hoffmann, Ruediger
    2013 IEEE EUROCON, 2013, : 1684 - 1688
  • [40] JOINT AND ADVERSARIAL TRAINING WITH ASR FOR EXPRESSIVE SPEECH SYNTHESIS
    Zhang, Kaili
    Gong, Cheng
    Lu, Wenhuan
    Wang, Longbiao
    Wei, Jianguo
    Liu, Dawei
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 6322 - 6326