Probabilistic Amplitude Demodulation Features in Speech Synthesis for Improving Prosody

Cited by: 0
Authors
Lazaridis, Alexandros [1]
Cernak, Milos [1]
Garner, Philip N. [1]
Affiliations
[1] Idiap Res Inst, Martigny, Switzerland
Source
17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES | 2016
Funding
Swiss National Science Foundation;
Keywords
Probabilistic amplitude demodulation; speech synthesis; deep neural networks; speech prosody;
DOI
10.21437/Interspeech.2016-258
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
Amplitude demodulation (AM) is a signal decomposition technique by which a signal can be decomposed into a product of two signals, i.e., a quickly varying carrier and a slowly varying modulator. In this work, probabilistic amplitude demodulation (PAD) features are used to improve prosody in speech synthesis. PAD is applied iteratively to generate syllable and stress amplitude modulations in a cascade manner. The PAD features are used as a secondary input scheme along with the standard text-based input features in statistical parametric speech synthesis. Specifically, deep neural network (DNN)-based speech synthesis is used to evaluate the importance of these features. Objective evaluation has shown that the proposed system using the PAD features mainly improves prosody modelling; it outperforms the baseline system by approximately 5% in terms of relative reduction in root mean square error (RMSE) of the fundamental frequency (F0). The significance of this improvement is validated by subjective evaluation of the overall speech quality, achieving a preference score of 38.6% versus 19.5% with respect to the baseline system in an ABX test.
Pages: 2298 - 2302
Page count: 5
Related papers
50 in total
  • [21] UNSUPERVISED WORD-LEVEL PROSODY TAGGING FOR CONTROLLABLE SPEECH SYNTHESIS
    Guo, Yiwei
    Du, Chenpeng
    Yu, Kai
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 7597 - 7601
  • [22] Phonetics and Machine Learning: Hierarchical Modelling of Prosody in Statistical Speech Synthesis
    Vainio, Martti
    STATISTICAL LANGUAGE AND SPEECH PROCESSING, SLSP 2014, 2014, 8791 : 37 - 54
  • [23] Technical and Phonetic Aspects of Speech Quality Assessment: The Case of Prosody Synthesis
    Tuckova, Jana
    Holub, Jan
    Dubeda, Tomas
    CROSS-MODAL ANALYSIS OF SPEECH, GESTURES, GAZE AND FACIAL EXPRESSIONS, 2009, 5641 : 126 - +
  • [24] Eye Tracking for the Online Evaluation of Prosody in Speech Synthesis: Not So Fast!
    White, Michael
    Rajkumar, Rajakrishnan
    Ito, Kiwako
    Speer, Shari R.
    INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 2491 - 2494
  • [25] Probabilistic Linear Discriminant Analysis with Bottleneck Features for Speech Recognition
    Lu, Liang
    Renals, Steve
    15TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2014), VOLS 1-4, 2014, : 910 - 914
  • [26] Improving Trajectory Modelling for DNN-Based Speech Synthesis by Using Stacked Bottleneck Features and Minimum Generation Error Training
    Wu, Zhizheng
    King, Simon
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2016, 24 (07) : 1255 - 1265
  • [27] Improving speech synthesis with discourse relations
    Aubin, Adele
    Cervone, Alessandra
    Watts, Oliver
    King, Simon
    INTERSPEECH 2019, 2019, : 4470 - 4474
  • [28] Multiple-prosody speech databases and their effectiveness in high-quality speech synthesis at arbitrary rates
    Masuda, T
    Toda, T
    Kawanami, H
    Saruwatari, H
    Shikano, K
    ELECTRONICS AND COMMUNICATIONS IN JAPAN PART II-ELECTRONICS, 2005, 88 (09) : 38 - 47
  • [29] Two-Stage Prosody Prediction for Emotional Text-to-Speech Synthesis
    Tang, Hao
    Zhou, Xi
    Odisio, Matthias
    Hasegawa-Johnson, Mark
    Huang, Thomas S.
    INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, 2008, : 2138 - 2141
  • [30] Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling
    Jiang, Yuepeng
    Li, Tao
    Yang, Fengyu
    Xie, Lei
    Menge, Meng
    Wang, Yujun
    INTERSPEECH 2024, 2024, : 2300 - 2304