Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling

Times cited: 0
Authors
Jiang, Yuepeng [1 ]
Li, Tao [1 ]
Yang, Fengyu [2 ]
Xie, Lei [1 ]
Meng, Meng [2 ]
Wang, Yujun [2 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Software, Audio Speech & Language Proc Grp ASLP NPU, Xian, Peoples R China
[2] Xiaomi AI Lab, Beijing, Peoples R China
Source
INTERSPEECH 2024 | 2024
Keywords
speech synthesis; zero-shot; prosody modeling; denoising diffusion probabilistic model;
DOI
10.21437/Interspeech.2024-2506
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Recent research in zero-shot speech synthesis has made significant progress in speaker similarity. However, current efforts focus on timbre generalization rather than prosody modeling, which results in limited naturalness and expressiveness. To address this, we introduce a novel speech synthesis model trained on large-scale datasets that incorporates both timbre and hierarchical prosody modeling. As timbre is a global attribute closely linked to expressiveness, we adopt a global vector to model speaker timbre while guiding prosody modeling. Moreover, since prosody exhibits both global consistency and local variation, we introduce a diffusion model as the pitch predictor and employ a prosody adaptor to model prosody hierarchically, further enhancing the prosody quality of the synthesized speech. Experimental results show that our model not only maintains comparable timbre quality to the baseline but also exhibits better naturalness and expressiveness. The synthesized samples can be found at: https://rxy-j.github.io/HPMD-TTS/
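The abstract's pitch predictor is based on a denoising diffusion probabilistic model (DDPM). As a minimal sketch of the forward (noising) process such a model is trained against, the snippet below corrupts a toy F0 contour with Gaussian noise under a linear schedule. The step count, schedule endpoints, and contour shape are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

# Linear noise schedule (a common DDPM choice; values are assumptions).
T = 1000                                  # number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)        # per-step noise variances
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)           # abar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return xt, noise

rng = np.random.default_rng(0)
f0 = np.sin(np.linspace(0.0, 3.0, 100))   # toy pitch (F0) contour
xt, eps = q_sample(f0, t=T - 1, rng=rng)
# By the final step, sqrt(abar_T) is tiny, so x_T is almost pure noise;
# the denoising network is trained to predict eps from (x_t, t) and
# reverse this process at synthesis time.
```

In the paper's setting the denoiser would additionally be conditioned on text and the global timbre vector; that conditioning is omitted here for brevity.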
Pages: 2300-2304
Page count: 5