Improving Sequence-to-sequence Tibetan Speech Synthesis with Prosodic Information

Cited by: 1
Authors
Zhang, Weizhao [1 ]
Yang, Hongwu [2 ]
Affiliations
[1] Northwest Normal Univ, Coll Phys & Elect Engn, Lanzhou 730070, Peoples R China
[2] Northwest Normal Univ, Sch Educ Technol, Lanzhou, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Sequence-to-sequence speech synthesis; Tibetan speech synthesis; prosodic information fusion; low-resource language; adaptation; speaker
DOI
10.1145/3616012
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
There are about 6,000 languages worldwide, most of which are low-resource languages. Although speech synthesis (or text-to-speech, TTS) for major languages (e.g., Mandarin, English, French) has achieved good results, the voice quality of TTS for low-resource languages (e.g., Tibetan) still needs further improvement. Because prosody plays a significant role in natural speech, this article proposes two sequence-to-sequence (seq2seq) Tibetan TTS models with prosodic information fusion to improve the voice quality of synthesized Tibetan speech. We first constructed a large-scale Tibetan corpus for seq2seq TTS. Then we designed a prosody generator to extract prosodic information from Tibetan sentences. Finally, we trained two seq2seq Tibetan TTS models that fuse prosodic information at the feature level and at the model level, respectively. The experimental results showed that both proposed models effectively improve the voice quality of synthesized speech. Furthermore, the model-level prosodic information fusion needs only 60%~70% of the training data to synthesize a voice similar to that of the baseline seq2seq Tibetan TTS. Therefore, the proposed prosodic information fusion methods can improve the voice quality of synthesized speech for low-resource languages.
Pages: 13
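The abstract describes the two fusion strategies only at a high level. The following is a minimal, hypothetical PyTorch sketch of what feature-level fusion (concatenating prosodic embeddings with phoneme embeddings at the encoder input) versus model-level fusion (a separate prosody encoder whose output is combined with the text encoder's states) could look like; all module names, dimensions, and the GRU encoders are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch (not the paper's implementation): two ways to fuse
# prosodic labels into a seq2seq TTS encoder. Sizes are arbitrary.
import torch
import torch.nn as nn

class FeatureLevelFusion(nn.Module):
    # Feature-level fusion: prosodic embeddings are concatenated with
    # phoneme embeddings before the sequence enters the encoder.
    def __init__(self, n_phones=64, n_prosody=8, d_model=256):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, d_model)
        self.prosody_emb = nn.Embedding(n_prosody, d_model // 4)
        self.proj = nn.Linear(d_model + d_model // 4, d_model)  # back to encoder width

    def forward(self, phones, prosody):
        x = torch.cat([self.phone_emb(phones), self.prosody_emb(prosody)], dim=-1)
        return self.proj(x)  # fused input sequence for the seq2seq encoder

class ModelLevelFusion(nn.Module):
    # Model-level fusion: a separate prosody encoder runs in parallel with
    # the text encoder, and their hidden states are combined afterwards.
    def __init__(self, n_phones=64, n_prosody=8, d_model=256):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, d_model)
        self.prosody_emb = nn.Embedding(n_prosody, d_model)
        self.text_enc = nn.GRU(d_model, d_model, batch_first=True)
        self.prosody_enc = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, phones, prosody):
        text_h, _ = self.text_enc(self.phone_emb(phones))
        pros_h, _ = self.prosody_enc(self.prosody_emb(prosody))
        return text_h + pros_h  # fused encoder states fed to attention/decoder

if __name__ == "__main__":
    phones = torch.randint(0, 64, (2, 10))   # dummy phoneme ids (batch=2, len=10)
    prosody = torch.randint(0, 8, (2, 10))   # dummy per-phoneme prosodic labels
    print(FeatureLevelFusion()(phones, prosody).shape)  # torch.Size([2, 10, 256])
    print(ModelLevelFusion()(phones, prosody).shape)    # torch.Size([2, 10, 256])

Additive combination of the two encoder outputs is only one plausible choice here; concatenation with a projection, or attention over the prosody states, would fit the abstract's high-level description equally well.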