A STUDY ON NEURAL-NETWORK-BASED TEXT-TO-SPEECH ADAPTATION TECHNIQUES FOR VIETNAMESE

Cited by: 0
|
Authors
Pham Ngoc Phuong [1 ]
Chung Tran Quang [2 ]
Quoc Truong Do [2 ]
Mai Chi Luong [3 ]
Affiliations
[1] Thai Nguyen Univ, Thai Nguyen, Vietnam
[2] Vietnam Artificial Intelligence Solut, VAIS, Hanoi, Vietnam
[3] Vietnam Acad Sci & Technol, Inst Informat Technol, Hanoi, Vietnam
Source
2021 24TH CONFERENCE OF THE ORIENTAL COCOSDA INTERNATIONAL COMMITTEE FOR THE CO-ORDINATION AND STANDARDISATION OF SPEECH DATABASES AND ASSESSMENT TECHNIQUES (O-COCOSDA) | 2021
Keywords
Speaker adaptation; Multi-pass fine-tune; TTS adaptation; Vietnamese TTS corpus; SPEAKER ADAPTATION;
DOI
10.1109/O-COCOSDA202152914.2021.9660445
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline classification codes
100104; 100213
Abstract
One of the main goals of text-to-speech (TTS) adaptation techniques is to produce a model that can generate good-quality audio from a small amount of training data. TTS systems for resource-rich languages achieve good quality because large amounts of data are available, whereas training models on small (low-resource) datasets is difficult and often yields low-quality speech. One approach to overcoming this data limitation is fine-tuning; however, it still requires a pre-trained model that has already learned from a large amount of data. This paper presents two contributions: (1) a study of the amount of data needed for a traditional fine-tuning method for Vietnamese, in which we vary the data and run training for a few additional iterations; and (2) a new fine-tuning pipeline that allows us to borrow a pre-trained model from English and adapt it to any Vietnamese voice with a very small amount of data while still maintaining good synthetic speech quality. Our experiments show that with only 4 minutes of data we can synthesize a new voice with a good similarity score, and with 16 minutes of data the model generates audio with a MOS of 3.8.
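To make the fine-tuning idea described in the abstract concrete, below is a minimal sketch, assuming a PyTorch-style training loop: a model pre-trained on a large (e.g. English) corpus is adapted to a small target-speaker set in multiple passes with decreasing learning rates. The stand-in model, tensors, and hyper-parameters are hypothetical placeholders and do not reflect the authors' actual architecture or settings.

```python
# Minimal sketch of a multi-pass fine-tuning loop (illustrative only).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for an acoustic model pre-trained on a large source-language corpus.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))

# Stand-in for a few minutes of target-speaker adaptation data:
# (linguistic/acoustic input features, target mel-spectrogram frames).
features = torch.randn(64, 80)
mels = torch.randn(64, 80)
loader = DataLoader(TensorDataset(features, mels), batch_size=8, shuffle=True)

criterion = nn.MSELoss()

def fine_tune_pass(model, loader, lr, epochs):
    """Run one fine-tuning pass over the small adaptation set."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
    return model

# Pass 1: adapt the whole network at a moderate learning rate.
model = fine_tune_pass(model, loader, lr=1e-3, epochs=5)
# Pass 2: refine with a lower learning rate for a few more iterations,
# mirroring the idea of running additional passes on the adaptation data.
model = fine_tune_pass(model, loader, lr=1e-4, epochs=5)
```

In a real TTS stack, the random tensors above would be replaced by the pre-trained checkpoint and the few-minute Vietnamese adaptation set.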
Pages: 199-205
Page count: 7
Related Papers
50 records in total
  • [31] Parameter-Efficient Learning for Text-to-Speech Accent Adaptation
    Yang, Li-Jen
    Yang, Chao-Han Huck
    Chien, Jen-Tzung
    INTERSPEECH 2023, 2023, : 4354 - 4358
  • [32] A novel prosody adaptation method for Mandarin concatenation-based text-to-speech system
    Yu, Jian
    Tao, Jianhua
    ACOUSTICAL SCIENCE AND TECHNOLOGY, 2009, 30 (01) : 33 - 41
  • [34] A Hierarchical Neural-Network-Based Document Representation Approach for Text Classification
    Zheng, Jianming
    Guo, Yupu
    Feng, Chong
    Chen, Honghui
    MATHEMATICAL PROBLEMS IN ENGINEERING, 2018, 2018
  • [35] Convolutional recurrent neural network with attention for Vietnamese speech to text problem in the operating room
    Dat T.T.
    Dang L.T.A.
    Sang V.N.T.
    Thuy L.N.L.
    Bao P.T.
    International Journal of Intelligent Information and Database Systems, 2021, 14 (03) : 294 - 314
  • [36] Overview of current text-to-speech techniques .1. Text and linguistic analysis
    Edgington, M
    Lowry, A
    Jackson, P
    Breen, AP
    Minnis, S
    BT TECHNOLOGY JOURNAL, 1996, 14 (01): : 68 - 83
  • [37] A Comparative Study of Text-to-Speech Systems in LabVIEW
    Panoiu, Manuela
    Rat, Cezara-Liliana
    Panoiu, Caius
    SOFT COMPUTING APPLICATIONS, (SOFA 2014), VOL 1, 2016, 356 : 3 - 11
  • [38] Overview of current text-to-speech techniques .2. Prosody and speech generation
    Edgington, M
    Lowry, A
    Jackson, P
    Breen, AP
    Minnis, S
    BT TECHNOLOGY JOURNAL, 1996, 14 (01): : 84 - 99
  • [40] An overview of natural language processing techniques in text-to-speech systems
    Külekci, MO
    Oflazer, K
    PROCEEDINGS OF THE IEEE 12TH SIGNAL PROCESSING AND COMMUNICATIONS APPLICATIONS CONFERENCE, 2004, : 454 - 457