Towards Universal Text-to-Speech

Cited by: 18
Authors
Yang, Jingzhou [1]
He, Lei [1]
Affiliations
[1] Microsoft, Beijing, Peoples R China
Source
INTERSPEECH 2020 | 2020
Keywords
multilingual; speech synthesis; neural text-to-speech; transfer learning; FUNDAMENTAL-FREQUENCY; ENGLISH;
DOI
10.21437/Interspeech.2020-1590
Chinese Library Classification (CLC) number
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline classification code
100104 ; 100213 ;
Abstract
This paper studies a multilingual sequence-to-sequence text-to-speech framework towards universal modeling, one that is able to synthesize speech for any speaker in any language using a single model. The framework consists of a transformer-based acoustic predictor and a WaveNet neural vocoder, with global conditions from speaker and language networks. It is examined on a massive TTS dataset with around 1250 hours of data from 50 language locales, in which the amount of data per locale is highly unbalanced. Although the multilingual model exhibits a transfer learning ability that benefits the low-resource languages, data imbalance still undermines model performance. A data balance training strategy is successfully applied and effectively improves the voice quality of the low-resource languages. Furthermore, this paper examines the capacity of the model to extend to new speakers and languages, as a key step towards universal modeling. Experiments show that extending to a new speaker is feasible with 20 seconds of data, and to a new language with 6 minutes.
Pages: 3171-3175
Number of pages: 5