Evaluation of Expressive Speech Synthesis With Voice Conversion and Copy Resynthesis Techniques

Cited by: 32
Authors
Turk, Oytun [1]
Schroeder, Marc [2]
Affiliations
[1] Sensory Inc, Portland, OR 97209 USA
[2] DFKI GmbH Language Technol Lab, Speech Grp, D-66123 Saarbrucken, Germany
Source
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2010 / Vol. 18 / No. 5
Keywords
Expressive speech synthesis; prosody; voice conversion; voice quality transformation;
DOI
10.1109/TASL.2010.2041113
Chinese Library Classification (CLC) number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Generating expressive synthetic voices requires carefully designed databases that contain a sufficient amount of expressive speech material. This paper investigates voice conversion and modification techniques to reduce database collection and processing efforts while maintaining acceptable quality and naturalness. In a factorial design, we study the relative contributions of voice quality and prosody as well as the amount of distortion introduced by the respective signal manipulation steps. The unit selection engine in our open-source, modular text-to-speech (TTS) framework MARY is extended with voice quality transformation using either GMM-based prediction or vocal tract copy resynthesis. These algorithms are then cross-combined with various prosody copy resynthesis methods. The overall expressive speech generation process functions as a postprocessing step on TTS outputs to transform neutral synthetic speech into aggressive, cheerful, or depressed speech. Cross-combinations of voice quality and prosody transformation algorithms are compared in listening tests for perceived expressive style and quality. The results show that there is a tradeoff between identification and naturalness. Combined modeling of both voice quality and prosody leads to the best identification scores at the expense of the lowest naturalness ratings. The fine detail of both voice quality and prosody, as preserved by copy synthesis, contributed to better identification than the approximate models.
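As a rough illustration of the cross-combination described in the abstract, the sketch below enumerates a factorial grid of listening-test conditions. It is not the authors' experimental setup: the two voice quality methods and the three target styles come from the abstract, but the "none" baselines, the prosody level names, and the function `build_conditions` are hypothetical placeholders, since the abstract only mentions "various prosody copy resynthesis methods".

```python
from itertools import product

# Voice quality options named in the abstract; "none" is an assumed baseline.
VOICE_QUALITY = ["none", "gmm_prediction", "vocal_tract_copy"]
# Hypothetical prosody levels; the abstract does not list the actual methods.
PROSODY = ["none", "f0_copy", "f0_and_duration_copy"]
# Target expressive styles named in the abstract.
STYLES = ["aggressive", "cheerful", "depressed"]


def build_conditions(styles=STYLES, voice_quality=VOICE_QUALITY, prosody=PROSODY):
    """Return every (style, voice quality, prosody) combination as a dict."""
    return [
        {"style": s, "voice_quality": vq, "prosody": p}
        for s, vq, p in product(styles, voice_quality, prosody)
    ]


if __name__ == "__main__":
    for condition in build_conditions():
        print(condition)
```

With three levels per factor this yields 3 x 3 x 3 = 27 conditions; the real study may have used a different number of levels.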
Pages: 965-973
Page count: 9