Evaluation of Expressive Speech Synthesis With Voice Conversion and Copy Resynthesis Techniques

Cited by: 32
Authors
Turk, Oytun [1]
Schroeder, Marc [2]
Affiliations
[1] Sensory Inc, Portland, OR 97209 USA
[2] DFKI GmbH Language Technol Lab, Speech Grp, D-66123 Saarbrucken, Germany
Source
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2010 / Volume 18 / Issue 05
Keywords
Expressive speech synthesis; prosody; voice conversion; voice quality transformation;
DOI
10.1109/TASL.2010.2041113
Chinese Library Classification (CLC) number
O42 [Acoustics];
Subject classification codes
070206 ; 082403 ;
Abstract
Generating expressive synthetic voices requires carefully designed databases that contain a sufficient amount of expressive speech material. This paper investigates voice conversion and modification techniques to reduce database collection and processing efforts while maintaining acceptable quality and naturalness. In a factorial design, we study the relative contributions of voice quality and prosody as well as the amount of distortion introduced by the respective signal manipulation steps. The unit selection engine in our open source and modular text-to-speech (TTS) framework MARY is extended with voice quality transformation using either GMM-based prediction or vocal tract copy resynthesis. These algorithms are then cross-combined with various prosody copy resynthesis methods. The overall expressive speech generation process functions as a postprocessing step on TTS outputs to transform neutral synthetic speech into aggressive, cheerful, or depressed speech. Cross-combinations of voice quality and prosody transformation algorithms are compared in listening tests for perceived expressive style and quality. The results show that there is a tradeoff between identification and naturalness. Combined modeling of both voice quality and prosody leads to the best identification scores at the expense of the lowest naturalness ratings. The fine detail of both voice quality and prosody, as preserved by the copy synthesis, contributed to better identification compared to the approximate models.
Pages: 965 - 973
Page count: 9
Related papers
50 items in total
  • [41] Automatic Speech Disentanglement for Voice Conversion using Rank Module and Speech Augmentation
    Liu, Zhonghua
    Wang, Shijun
    Chen, Ning
    INTERSPEECH 2023, 2023, : 2298 - 2302
  • [42] EfficientTTS 2: Variational End-to-End Text-to-Speech Synthesis and Voice Conversion
    Miao, Chenfeng
    Zhu, Qingying
    Chen, Minchuan
    Ma, Jun
    Wang, Shaojun
    Xiao, Jing
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 1650 - 1661
  • [43] Expressive Prosody for Unit-selection Speech Synthesis
    Strom, Volker
    Clark, Robert
    King, Simon
    INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, 2006, : 1296 - 1299
  • [44] Voice Conversion Using Speech-to-Speech Neuro-Style Transfer
    AlBadawy, Ehab A.
    Lyu, Siwei
    INTERSPEECH 2020, 2020, : 4726 - 4730
  • [45] Performance Evaluation for Voice Conversion Systems
    Ganchev, Todor
    Lazaridis, Alexandros
    Mporas, Iosif
    Fakotakis, Nikos
    TEXT, SPEECH AND DIALOGUE, PROCEEDINGS, 2008, 5246 : 317 - 324
  • [46] The role of prosody and voice quality in indirect storytelling speech: Annotation methodology and expressive categories
    Montano, Raul
    Alias, Francesc
    SPEECH COMMUNICATION, 2016, 85 : 8 - 18
  • [47] Improvement of time alignment of the speech signals to be used in voice conversion
    Mozaffari, Fatemeh
    Sayadian, Abolghasem
    INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY, 2018, 21 (01) : 79 - 84
  • [48] Modeling the Acoustic Correlates of Expressive Elements in Text Genres for Expressive Text-to-Speech Synthesis
    Yang, Hongwu
    Meng, Helen M.
    Cai, Lianhong
    INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, 2006, : 1806 - 1809
  • [49] Audio-Visual Mandarin Electrolaryngeal Speech Voice Conversion
    Chien, Yung-Lun
    Chen, Hsin-Hao
    Yen, Ming-Chi
    Tsai, Shu-Wei
    Wang, Hsin-Min
    Tsao, Yu
    Chi, Tai-Shih
    INTERSPEECH 2023, 2023, : 5023 - 5026
  • [50] TEXT-INFORMED SPEECH INPAINTING VIA VOICE CONVERSION
    Prablanc, Pierre
    Ozerov, Alexey
    Duong, Ngoc Q. K.
    Perez, Patrick
    2016 24TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2016, : 878 - 882