Probabilistic Amplitude Demodulation Features in Speech Synthesis for Improving Prosody

Cited: 0
Authors
Lazaridis, Alexandros [1]
Cernak, Milos [1]
Garner, Philip N. [1]
Affiliations
[1] Idiap Res Inst, Martigny, Switzerland
Source
17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES | 2016
Funding
Swiss National Science Foundation;
Keywords
Probabilistic amplitude demodulation; speech synthesis; deep neural networks; speech prosody;
DOI
10.21437/Interspeech.2016-258
CLC Number
O42 [Acoustics];
Discipline Codes
070206; 082403;
Abstract
Amplitude demodulation (AM) is a signal decomposition technique by which a signal can be decomposed into a product of two signals, i.e., a quickly varying carrier and a slowly varying modulator. In this work, probabilistic amplitude demodulation (PAD) features are used to improve prosody in speech synthesis. PAD is applied iteratively to generate syllable- and stress-level amplitude modulations in a cascade manner. The PAD features are used as a secondary input scheme along with the standard text-based input features in statistical parametric speech synthesis. Specifically, deep neural network (DNN)-based speech synthesis is used to evaluate the importance of these features. Objective evaluation has shown that the proposed system using the PAD features mainly improves prosody modelling; it outperforms the baseline system by approximately 5% in terms of relative reduction in root mean square error (RMSE) of the fundamental frequency (F0). The significance of this improvement is validated by subjective evaluation of overall speech quality, achieving a 38.6% versus 19.5% preference score with respect to the baseline system in an ABX test.
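As a rough illustration of the decomposition the abstract describes (signal ≈ slowly varying modulator × quickly varying carrier), the sketch below performs a simple Hilbert-envelope amplitude demodulation in Python. This is not the paper's probabilistic (PAD) method, which fits the modulator with a probabilistic model and applies the demodulation iteratively in a cascade over syllable and stress time scales; it only shows the carrier/modulator factorization itself. All names here are illustrative.

```python
# Minimal Hilbert-envelope amplitude demodulation sketch (NOT the PAD method
# from the paper): factor a signal into a slowly varying modulator (envelope)
# and a quickly varying carrier, with x ~= modulator * carrier.
import numpy as np

def analytic_signal(x):
    """Analytic signal via the FFT (one-sided spectrum doubling)."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(X * h)

def demodulate(x, eps=1e-12):
    """Return (modulator, carrier) so that x ~= modulator * carrier."""
    env = np.abs(analytic_signal(x))       # slowly varying modulator
    carrier = x / np.maximum(env, eps)     # quickly varying carrier
    return env, carrier

# Example: a 10 Hz tone modulated by a slow 1 Hz envelope.
t = np.linspace(0.0, 1.0, 1000, endpoint=False)
mod = 1.0 + 0.5 * np.sin(2 * np.pi * 1.0 * t)   # ground-truth modulator
x = mod * np.sin(2 * np.pi * 10.0 * t)
env, carrier = demodulate(x)
print(np.max(np.abs(env * carrier - x)))  # reconstruction error (near zero)
```

Because the modulator here is strictly band-limited well below the carrier frequency, the Hilbert envelope recovers it almost exactly; PAD instead poses the modulator as a latent variable and infers it probabilistically, which lets the same machinery be cascaded across time scales (syllable, then stress).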
Pages: 2298-2302
Page count: 5
Related Papers
50 records in total
  • [31] Cross-speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis
    Pan, Shifeng
    He, Lei
    INTERSPEECH 2021, 2021, : 4678 - 4682
  • [32] Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit
    Zeng, Zhen
    Wang, Jianzong
    Cheng, Ning
    Xiao, Jing
    INTERSPEECH 2020, 2020, : 4422 - 4426
  • [33] ARTICULATORY FEATURES FOR EXPRESSIVE SPEECH SYNTHESIS
    Black, Alan W.
    Bunnell, H. Timothy
    Dou, Ying
    Muthukumar, Prasanna Kumar
    Metze, Florian
    Perry, Daniel
    Polzehl, Tim
    Prahallad, Kishore
    Steidl, Stefan
    Vaughn, Callie
    2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2012, : 4005 - 4008
  • [34] Effectiveness of Speech Mode Adaptation for Improving Dialogue Speech Synthesis
    Kaya, Kazuki
    Mori, Hiroki
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2019, E102D (10): : 2064 - 2066
  • [35] Prosody-controllable gender-ambiguous speech synthesis: a tool for investigating implicit bias in speech perception
    Szekely, Eva
    Gustafson, Joakim
    Torre, Ilaria
    INTERSPEECH 2023, 2023, : 1234 - 1238
  • [36] Automatic Prosody Generation for Serbo-Croatian Speech Synthesis Based on Regression Trees
    Secujski, Milan
    Pekar, Darko
    Jakovljevic, Niksa
    12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, 2011, : 3164 - +
  • [37] Decoupled Pronunciation and Prosody Modeling in Meta-Learning-Based Multilingual Speech Synthesis
    Peng, Yukun
    Ling, Zhenhua
    INTERSPEECH 2022, 2022, : 4257 - 4261
  • [38] Using Automatic Stress Extraction from Audio for Improved Prosody Modelling in Speech Synthesis
    Szaszak, Gyorgy
    Beke, Andras
    Olaszy, Gabor
    Toth, Balint Pal
    16TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2015), VOLS 1-5, 2015, : 2227 - 2231
  • [39] Fine-grained prosody modeling in neural speech synthesis using ToBI representation
    Zou, Yuxiang
    Liu, Shichao
    Yin, Xiang
    Lin, Haopeng
    Wang, Chunfeng
    Zhang, Haoyu
    Ma, Zejun
    INTERSPEECH 2021, 2021, : 3146 - 3150
  • [40] Excitation modelling using epoch features for statistical parametric speech synthesis
    Reddy, M. Kiran
    Rao, K. Sreenivasa
    COMPUTER SPEECH AND LANGUAGE, 2020, 60