Text-to-Speech for Low-Resource Agglutinative Language With Morphology-Aware Language Model Pre-Training

Cited by: 8
Authors
Liu, Rui [1 ]
Hu, Yifan [1 ]
Zuo, Haolin [1 ]
Luo, Zhaojie [2 ]
Wang, Longbiao [3 ]
Gao, Guanglai [1 ]
Affiliations
[1] Inner Mongolia Univ, Dept Comp Sci, Hohhot 010021, Peoples R China
[2] Osaka Univ, SANKEN, Osaka 5670047, Japan
[3] Tianjin Univ, Coll Intelligence & Comp, Tianjin Key Lab Cognit Comp & Applicat, Tianjin 300072, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Text-to-speech (TTS); agglutinative; morphology; language modeling; pre-training
DOI
10.1109/TASLP.2023.3348762
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Text-to-Speech (TTS) aims to convert input text into a human-like voice. With the development of deep learning, encoder-decoder TTS models achieve superior naturalness in mainstream languages such as Chinese and English, and the linguistic-information learning capability of the text encoder is key to this performance. However, for TTS in low-resource agglutinative languages, the scale of the paired <text, speech> data is limited. How to extract rich linguistic information from small-scale text data to enhance the naturalness of the synthesized speech is therefore an urgent issue. In this paper, we first collect a large unlabeled text corpus for BERT-like language model pre-training, and then use the trained language model to extract deep linguistic features from the input text of the TTS model to improve the naturalness of the final synthesized speech. To fully exploit the prosody-related linguistic information in agglutinative languages, we incorporate morphological information into language model training and construct a morphology-aware masking based BERT model (MAM-BERT). Experimental results with various advanced TTS models validate the effectiveness of our approach, and further comparisons across data scales confirm its effectiveness in low-resource scenarios.
Pages: 1075-1087
Page count: 13
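
The abstract's key technical point is that MAM-BERT replaces BERT's token-level random masking with masking at morpheme granularity, so the pre-trained model must recover whole morphemes (stems and suffixes) of the agglutinative language rather than isolated subwords. The sketch below illustrates one way such morphology-aware masked-LM targets could be built; it assumes morpheme boundaries come from an external morphological analyzer, and the 80/10/10 replacement split, the function name, and the toy tokens are illustrative choices, not the paper's actual implementation.

import random

def morphology_aware_mask(tokens, morpheme_spans, vocab,
                          mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Build masked-LM inputs/targets by masking whole morphemes at once.

    tokens         : subword tokens of one sentence (list of str)
    morpheme_spans : (start, end) index pairs, one per morpheme, assumed to
                     come from an external morphological analyzer
    vocab          : token list used for the "random replacement" case
    Returns (masked_tokens, labels); labels hold the original token at masked
    positions and None elsewhere (positions the loss should ignore).
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)

    for start, end in morpheme_spans:
        if rng.random() >= mask_prob:              # most morphemes stay intact
            continue
        choice = rng.random()                      # one decision per morpheme
        for i in range(start, end):
            labels[i] = tokens[i]                  # model must recover these
            if choice < 0.8:
                masked[i] = mask_token             # 80%: replace with [MASK]
            elif choice < 0.9:
                masked[i] = rng.choice(vocab)      # 10%: random token
            # remaining 10%: keep the original token unchanged
    return masked, labels


if __name__ == "__main__":
    # Toy example: "ro ##ot" is one multi-token morpheme, the rest are suffix
    # morphemes; the tokens are placeholders, not real words of any language.
    tokens = ["ro", "##ot", "##suf1", "##suf2", "word", "##suf3"]
    spans = [(0, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
    masked, labels = morphology_aware_mask(tokens, spans, vocab=tokens,
                                           mask_prob=0.5)
    print(masked)
    print(labels)

Masking the whole span of a morpheme, rather than independent subword tokens, is what pushes the model toward suffix-level structure, which the abstract identifies as the prosody-related linguistic information in agglutinative languages.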