UNSUPERVISED WORD-LEVEL PROSODY TAGGING FOR CONTROLLABLE SPEECH SYNTHESIS

Cited by: 6

Authors:

Guo, Yiwei [1]

Du, Chenpeng [1]

Yu, Kai [1]

Affiliations:

[1] Shanghai Jiao Tong Univ, AI Inst, Dept Comp Sci & Engn, MoE Key Lab Artificial Intelligence, X LANCE Lab, Shanghai, Peoples R China

Source:

2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2022

Keywords:

Prosody control; prosody tagging; word-level prosody; speech synthesis

DOI:

10.1109/ICASSP43922.2022.9746323

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Although word-level prosody modeling in neural text-to-speech (TTS) has been investigated in recent research for diverse speech synthesis, it is still challenging to control speech synthesis manually without a specific reference. This is largely due to the lack of word-level prosody tags. In this work, we propose a novel two-stage approach for unsupervised word-level prosody tagging: we first group the words into different types with a decision tree according to their phonetic content, and then cluster the prosodies with a GMM within each type of word separately. This design is based on the assumption that the prosodies of different types of words, such as long or short words, should be tagged with different label sets. Furthermore, a TTS system is trained with the derived word-level prosody tags for controllable speech synthesis. Experiments on LJSpeech show that the TTS model trained with word-level prosody tags not only achieves better naturalness than a typical FastSpeech2 model, but also gains the ability to manipulate word-level prosody.
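The two-stage idea in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the hand-written two-question tree in `word_type` stands in for a learned decision tree, each word's prosody is reduced to a single hypothetical scalar, and the names `word_type`, `fit_gmm_1d`, `predict`, and `tag_words` are invented here. The GMM is a plain 1-D EM fit written with the standard library only.

```python
import math
from collections import defaultdict


def word_type(phonemes):
    """Stage 1: route a word to a leaf by asking about its phonetic content.

    Hypothetical hand-written tree on phoneme count; a real tree would be learned.
    """
    n = len(phonemes)
    if n <= 3:
        return "short"
    return "medium" if n <= 6 else "long"


def _gauss(x, mean, var):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(xs, k, iters=100, var_floor=1e-4):
    """Stage 2: fit a k-component 1-D GMM with EM.

    Returns (weight, mean, variance) tuples sorted by mean.
    """
    xs = sorted(xs)
    n = len(xs)
    mu = sum(xs) / n
    var0 = max(sum((x - mu) ** 2 for x in xs) / n, var_floor)
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]  # quantile initialisation
    variances = [var0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            p = [w * _gauss(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pj / total for pj in p] if total > 0 else [1.0 / k] * k)
        # M-step: re-estimate weights, means, and variances
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty component unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(spread, var_floor)
    return sorted(zip(weights, means, variances), key=lambda c: c[1])


def predict(gmm, x):
    """Index of the component with the highest weighted density at x."""
    scores = [w * _gauss(x, m, v) for w, m, v in gmm]
    return max(range(len(gmm)), key=scores.__getitem__)


def tag_words(words, k=2):
    """words: list of (phonemes, prosody_value). Returns one tag string per word."""
    by_type = defaultdict(list)
    for phonemes, value in words:
        by_type[word_type(phonemes)].append(value)
    # A separate GMM per word type, so each type gets its own label set
    gmms = {t: fit_gmm_1d(vals, min(k, len(vals))) for t, vals in by_type.items()}
    return [f"{word_type(ph)}-{predict(gmms[word_type(ph)], v)}" for ph, v in words]
```

Calling `tag_words` on a list of `(phonemes, prosody_value)` pairs yields tags such as `short-0` or `long-1`. Because each word type is clustered on its own, the index after the hyphen is only meaningful within that type, which mirrors the abstract's assumption that different types of words should use different label sets.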


Pages: 7597-7601

Page count: 5
