Improving Sequence-to-sequence Tibetan Speech Synthesis with Prosodic Information

Cited by: 1
Authors
Zhang, Weizhao [1 ]
Yang, Hongwu [2 ]
Affiliations
[1] Northwest Normal Univ, Coll Phys & Elect Engn, Lanzhou 730070, Peoples R China
[2] Northwest Normal Univ, Sch Educ Technol, Lanzhou, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Sequence-to-sequence speech synthesis; Tibetan speech synthesis; prosodic information fusion; low-resource language; adaptation; speaker
DOI
10.1145/3616012
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
There are about 6,000 languages worldwide, most of which are low-resource languages. Although current speech synthesis (text-to-speech, TTS) for major languages (e.g., Mandarin, English, French) has achieved good results, the voice quality of TTS for low-resource languages (e.g., Tibetan) still needs further improvement. Because prosody plays a significant role in natural speech, this article proposes two sequence-to-sequence (seq2seq) Tibetan TTS models with prosodic information fusion to improve the voice quality of synthesized Tibetan speech. We first constructed a large-scale Tibetan corpus for seq2seq TTS. We then designed a prosody generator to extract prosodic information from Tibetan sentences. Finally, we trained two seq2seq Tibetan TTS models by fusing prosodic information at two levels: feature-level and model-level fusion. The experimental results showed that both proposed seq2seq Tibetan TTS models, which fuse prosodic information, effectively improve the voice quality of synthesized speech. Furthermore, the model-level prosodic information fusion needs only 60%~70% of the training data to synthesize a voice comparable to the baseline seq2seq Tibetan TTS. Therefore, the proposed prosodic information fusion methods can improve the voice quality of synthesized speech for low-resource languages.
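The feature-level fusion described in the abstract can be illustrated with a minimal sketch: per-phoneme prosody vectors produced by a prosody generator are concatenated with the phoneme embeddings before they enter the seq2seq encoder. The function name, array shapes, and embedding dimensions below are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def fuse_prosody_features(phoneme_emb: np.ndarray,
                          prosody_emb: np.ndarray) -> np.ndarray:
    """Feature-level prosodic fusion (illustrative sketch).

    Concatenates per-phoneme prosody vectors (e.g., tone/break features
    from a prosody generator) with the phoneme embeddings along the
    feature axis, yielding the fused input for the seq2seq encoder.
    """
    # Both inputs must cover the same phoneme sequence.
    assert phoneme_emb.shape[0] == prosody_emb.shape[0]
    return np.concatenate([phoneme_emb, prosody_emb], axis=-1)

# Toy example: 5 phonemes, 8-dim text embeddings, 3-dim prosody vectors.
phonemes = np.random.randn(5, 8)
prosody = np.random.randn(5, 3)
fused = fuse_prosody_features(phonemes, prosody)
print(fused.shape)  # (5, 11)
```

Model-level fusion, by contrast, would condition the synthesis model itself on the prosody generator's output rather than concatenating features at the input.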
Pages: 13