A STUDY ON NEURAL-NETWORK-BASED TEXT-TO-SPEECH ADAPTATION TECHNIQUES FOR VIETNAMESE

Cited by: 0
|
Authors
Pham Ngoc Phuong [1 ]
Chung Tran Quang [2 ]
Quoc Truong Do [2 ]
Mai Chi Luong [3 ]
Affiliations
[1] Thai Nguyen Univ, Thai Nguyen, Vietnam
[2] Vietnam Artificial Intelligence Solut, VAIS, Hanoi, Vietnam
[3] Vietnam Acad Sci & Technol, Inst Informat Technol, Hanoi, Vietnam
Source
2021 24TH CONFERENCE OF THE ORIENTAL COCOSDA INTERNATIONAL COMMITTEE FOR THE CO-ORDINATION AND STANDARDISATION OF SPEECH DATABASES AND ASSESSMENT TECHNIQUES (O-COCOSDA) | 2021
Keywords
Speaker adaptation; Multi-pass fine-tune; TTS adaptation; Vietnamese TTS corpus; SPEAKER ADAPTATION;
DOI
10.1109/O-COCOSDA202152914.2021.9660445
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification codes
100104; 100213
Abstract
One of the main goals of text-to-speech (TTS) adaptation techniques is to produce a model that can generate good-quality audio from a small amount of training data. TTS systems for resource-rich languages achieve good quality because large amounts of data are available, whereas training models on small (low-resource) datasets is difficult and often yields low-quality speech. One approach to overcoming this data limitation is fine-tuning; however, it still requires a pre-trained model that has already learned from a large amount of data. This paper presents two contributions: (1) a study of the amount of data needed for a traditional fine-tuning method for Vietnamese, in which we vary the data and run training for a few additional iterations; and (2) a new fine-tuning pipeline that allows us to borrow a pre-trained model from English and adapt it to any Vietnamese voice with a very small amount of data while still maintaining good synthetic speech quality. Our experiments show that with only 4 minutes of data we can synthesize a new voice with a good similarity score, and with 16 minutes of data the model generates audio with a MOS of 3.8.
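To make the fine-tuning idea described in the abstract concrete, below is a minimal sketch, assuming a PyTorch-style training loop: a model pre-trained on a large (e.g. English) corpus is adapted to a small target-speaker set in multiple passes with decreasing learning rates. The stand-in model, tensors, and hyper-parameters are hypothetical placeholders and do not reflect the authors' actual architecture or settings.

```python
# Minimal sketch of a multi-pass fine-tuning loop (illustrative only).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for an acoustic model pre-trained on a large source-language corpus.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))

# Stand-in for a few minutes of target-speaker adaptation data:
# (linguistic/acoustic input features, target mel-spectrogram frames).
features = torch.randn(64, 80)
mels = torch.randn(64, 80)
loader = DataLoader(TensorDataset(features, mels), batch_size=8, shuffle=True)

criterion = nn.MSELoss()

def fine_tune_pass(model, loader, lr, epochs):
    """Run one fine-tuning pass over the small adaptation set."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
    return model

# Pass 1: adapt the whole network at a moderate learning rate.
model = fine_tune_pass(model, loader, lr=1e-3, epochs=5)
# Pass 2: refine with a lower learning rate for a few more iterations,
# mirroring the idea of running additional passes on the adaptation data.
model = fine_tune_pass(model, loader, lr=1e-4, epochs=5)
```

In a real TTS stack, the random tensors above would be replaced by the pre-trained checkpoint and the few-minute Vietnamese adaptation set.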
Pages: 199-205
Page count: 7
Related Papers
50 records in total
  • [31] Parameter-Efficient Learning for Text-to-Speech Accent Adaptation
    Yang, Li-Jen
    Yang, Chao-Han Huck
    Chien, Jen-Tzung
    INTERSPEECH 2023, 2023, : 4354 - 4358
  • [32] A novel prosody adaptation method for Mandarin concatenation-based text-to-speech system
    Yu, Jian
    Tao, Jianhua
    ACOUSTICAL SCIENCE AND TECHNOLOGY, 2009, 30 (01) : 33 - 41
  • [34] A Hierarchical Neural-Network-Based Document Representation Approach for Text Classification
    Zheng, Jianming
    Guo, Yupu
    Feng, Chong
    Chen, Honghui
    MATHEMATICAL PROBLEMS IN ENGINEERING, 2018, 2018
  • [35] Convolutional recurrent neural network with attention for Vietnamese speech to text problem in the operating room
    Dat T.T.
    Dang L.T.A.
    Sang V.N.T.
    Thuy L.N.L.
    Bao P.T.
    International Journal of Intelligent Information and Database Systems, 2021, 14 (03) : 294 - 314
  • [36] Overview of current text-to-speech techniques .1. Text and linguistic analysis
    Edgington, M
    Lowry, A
    Jackson, P
    Breen, AP
    Minnis, S
    BT TECHNOLOGY JOURNAL, 1996, 14 (01): : 68 - 83
  • [37] A Comparative Study of Text-to-Speech Systems in LabVIEW
    Panoiu, Manuela
    Rat, Cezara-Liliana
    Panoiu, Caius
    SOFT COMPUTING APPLICATIONS, (SOFA 2014), VOL 1, 2016, 356 : 3 - 11
  • [38] Overview of current text-to-speech techniques .2. Prosody and speech generation
    Edgington, M
    Lowry, A
    Jackson, P
    Breen, AP
    Minnis, S
    BT TECHNOLOGY JOURNAL, 1996, 14 (01): : 84 - 99
  • [40] An overview of natural language processing techniques in text-to-speech systems
    Külekci, MO
    Oflazer, K
    PROCEEDINGS OF THE IEEE 12TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE, 2004, : 454 - 457