Text-to-Speech for Low-Resource Agglutinative Language With Morphology-Aware Language Model Pre-Training

Cited by: 8
Authors
Liu, Rui [1 ]
Hu, Yifan [1 ]
Zuo, Haolin [1 ]
Luo, Zhaojie [2 ]
Wang, Longbiao [3 ]
Gao, Guanglai [1 ]
Affiliations
[1] Inner Mongolia Univ, Dept Comp Sci, Hohhot 010021, Peoples R China
[2] Osaka Univ, SANKEN, Osaka 5670047, Japan
[3] Tianjin Univ, Coll Intelligence & Comp, Tianjin Key Lab Cognit Comp & Applicat, Tianjin 300072, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Text-to-speech (TTS); agglutinative; morphology; language modeling; pre-training
DOI
10.1109/TASLP.2023.3348762
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Text-to-Speech (TTS) aims to convert input text into a human-like voice. With the development of deep learning, encoder-decoder TTS models achieve superior naturalness in mainstream languages such as Chinese and English, and the linguistic-information learning capability of the text encoder is key to this performance. However, for TTS in low-resource agglutinative languages, the scale of the paired <text, speech> data is limited. How to extract rich linguistic information from small-scale text data to enhance the naturalness of the synthesized speech is therefore an urgent issue. In this paper, we first collect a large unlabeled text corpus for BERT-like language model pre-training, and then use the trained language model to extract deep linguistic features from the input text of the TTS model to improve the naturalness of the final synthesized speech. To fully exploit the prosody-related linguistic information in agglutinative languages, we incorporate morphological information into language model training and construct a morphology-aware masking based BERT model (MAM-BERT). Experimental results with various advanced TTS models validate the effectiveness of our approach, and further comparisons across data scales confirm its effectiveness in low-resource scenarios.
Pages: 1075-1087
Page count: 13
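
The abstract's key technical point is that MAM-BERT replaces BERT's token-level random masking with masking at morpheme granularity, so the pre-trained model must recover whole morphemes (stems and suffixes) of the agglutinative language rather than isolated subwords. The sketch below illustrates one way such morphology-aware masked-LM targets could be built; it assumes morpheme boundaries come from an external morphological analyzer, and the 80/10/10 replacement split, the function name, and the toy tokens are illustrative choices, not the paper's actual implementation.

import random

def morphology_aware_mask(tokens, morpheme_spans, vocab,
                          mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Build masked-LM inputs/targets by masking whole morphemes at once.

    tokens         : subword tokens of one sentence (list of str)
    morpheme_spans : (start, end) index pairs, one per morpheme, assumed to
                     come from an external morphological analyzer
    vocab          : token list used for the "random replacement" case
    Returns (masked_tokens, labels); labels hold the original token at masked
    positions and None elsewhere (positions the loss should ignore).
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)

    for start, end in morpheme_spans:
        if rng.random() >= mask_prob:              # most morphemes stay intact
            continue
        choice = rng.random()                      # one decision per morpheme
        for i in range(start, end):
            labels[i] = tokens[i]                  # model must recover these
            if choice < 0.8:
                masked[i] = mask_token             # 80%: replace with [MASK]
            elif choice < 0.9:
                masked[i] = rng.choice(vocab)      # 10%: random token
            # remaining 10%: keep the original token unchanged
    return masked, labels


if __name__ == "__main__":
    # Toy example: "ro ##ot" is one multi-token morpheme, the rest are suffix
    # morphemes; the tokens are placeholders, not real words of any language.
    tokens = ["ro", "##ot", "##suf1", "##suf2", "word", "##suf3"]
    spans = [(0, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
    masked, labels = morphology_aware_mask(tokens, spans, vocab=tokens,
                                           mask_prob=0.5)
    print(masked)
    print(labels)

Masking the whole span of a morpheme, rather than independent subword tokens, is what pushes the model toward suffix-level structure, which the abstract identifies as the prosody-related linguistic information in agglutinative languages.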