Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources

Cited by: 1
Authors
Barakat, Huda [1 ]
Turk, Oytun [2 ]
Demiroglu, Cenk [2 ]
Affiliations
[1] Ozyegin Univ, Dept Comp Sci, TR-34794 Istanbul, Turkiye
[2] Ozyegin Univ, Dept Elect & Elect Engn, TR-34794 Istanbul, Turkiye
Keywords
Speech synthesis; Expressive speech; Emotional speech; Deep learning; Emotional expressions; Style; Text; Model; Representations; Network; Quality
DOI
10.1186/s13636-024-00329-7
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Speech synthesis has made significant strides thanks to the transition from machine learning to deep learning models. Contemporary text-to-speech (TTS) models can generate speech of exceptionally high quality, closely mimicking human speech. Nevertheless, given the wide array of applications now employing TTS models, high-quality speech generation alone is no longer sufficient: present-day TTS models must also produce expressive speech that conveys various speaking styles and emotions, as human speech does. Consequently, researchers have concentrated in recent years on developing more effective models for expressive speech synthesis. This paper presents a systematic review of the literature on expressive speech synthesis models published within the last 5 years, with a particular emphasis on deep learning-based approaches. We offer a comprehensive classification scheme for these models and provide concise descriptions of the models in each category. Additionally, we summarize the principal challenges encountered in this research domain and outline the strategies reported in the literature for addressing them. In Section 8, we pinpoint research gaps in the field that warrant further exploration. Our objective is to provide a comprehensive overview of this active research area and thereby guide interested researchers and future work in the field.
Pages: 34