A STUDY ON NEURAL-NETWORK-BASED TEXT-TO-SPEECH ADAPTATION TECHNIQUES FOR VIETNAMESE

Cited by: 0
Authors
Pham Ngoc Phuong [1]
Chung Tran Quang [2]
Quoc Truong Do [2]
Mai Chi Luong [3]
Affiliations
[1] Thai Nguyen Univ, Thai Nguyen, Vietnam
[2] Vietnam Artificial Intelligence Solut, VAIS, Hanoi, Vietnam
[3] Vietnam Acad Sci & Technol, Inst Informat Technol, Hanoi, Vietnam
Source
2021 24TH CONFERENCE OF THE ORIENTAL COCOSDA INTERNATIONAL COMMITTEE FOR THE CO-ORDINATION AND STANDARDISATION OF SPEECH DATABASES AND ASSESSMENT TECHNIQUES (O-COCOSDA) | 2021
Keywords
Speaker adaptation; Multi-pass fine-tuning; TTS adaptation; Vietnamese TTS corpus
DOI
10.1109/O-COCOSDA202152914.2021.9660445
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline Classification Codes
100104; 100213;
Abstract
One of the main goals of text-to-speech (TTS) adaptation is to produce a model that can generate good-quality audio from a small amount of training data. TTS systems for resource-rich languages achieve high quality because large training datasets are available, whereas training on small (low-resource) datasets is difficult and often yields low-quality speech. One approach to overcoming this data limitation is fine-tuning; however, it still requires a model pre-trained on a large amount of data. This paper presents two contributions: (1) a study of how much data a traditional fine-tuning method needs for Vietnamese, in which we vary the data and run training for a few more iterations; and (2) a new fine-tuning pipeline that borrows a model pre-trained on English and adapts it to any Vietnamese voice with a very small amount of data while maintaining good synthetic speech quality. Our experiments show that with only 4 minutes of data we can synthesize a new voice with a good similarity score, and with 16 minutes of data the model can generate audio with a MOS of 3.8.
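The pipeline the abstract describes can be sketched at a very high level: pretrain on a large source dataset, then adapt to a tiny target dataset in several short passes with a decaying learning rate. The toy below uses a 1-D linear regressor in place of a TTS model; all data, pass counts, and learning rates are illustrative assumptions, not taken from the paper.

```python
# Toy sketch of multi-pass fine-tuning: "pretrain" on a large
# source dataset, then adapt to a tiny target dataset in a few
# short passes, halving the learning rate each pass so the
# pretrained weights are only gently adjusted.

def sgd_pass(w, b, data, lr, epochs):
    """Plain per-sample SGD on (x, y) pairs; returns updated (w, b)."""
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

def mse(w, b, data):
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)

# Large source dataset (stands in for a rich-resource corpus): y = 2x + 1.
source = [(x / 10.0, 2.0 * x / 10.0 + 1.0) for x in range(20)]
# Tiny target dataset (stands in for a few minutes of the new voice):
# y = 2.5x + 0.5.
target = [(x / 10.0, 2.5 * x / 10.0 + 0.5) for x in range(8)]

# "Pretraining" from scratch on the large dataset.
w, b = sgd_pass(0.0, 0.0, source, lr=0.05, epochs=50)

# Multi-pass fine-tuning on the small dataset with a decaying rate.
loss_before = mse(w, b, target)
for p in range(3):
    w, b = sgd_pass(w, b, target, lr=0.05 / (2 ** p), epochs=10)
loss_after = mse(w, b, target)

print(loss_before, loss_after)  # fine-tuning reduces loss on the target data
```

A real TTS adaptation pipeline would instead fine-tune the weights of a neural acoustic model (and typically freeze or reduce the learning rate of shared components), but the control flow (pretrain, then repeated short adaptation passes on the small target set) is the same shape.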
Pages: 199-205
Page count: 7