Towards Zero-Shot Multi-Speaker Multi-Accent Text-to-Speech Synthesis

Cited by: 2
Authors
Zhang, Mingyang [1 ]
Zhou, Xuehao [2 ]
Wu, Zhizheng [1 ]
Li, Haizhou [1 ,2 ]
Affiliations
[1] Chinese Univ Hong Kong, Shenzhen Res Inst Big Data, Sch Data Sci, Shenzhen 518172, Peoples R China
[2] Natl Univ Singapore, Singapore 117583, Singapore
Funding
National Natural Science Foundation of China;
Keywords
Accent speech synthesis; limited data; multi accent modelling; text-to-speech;
DOI
10.1109/LSP.2023.3292740
Chinese Library Classification (CLC)
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Discipline Codes
0808; 0809;
Abstract
This letter presents a framework for zero-shot multi-speaker, multi-accent neural text-to-speech synthesis. It employs an encoder-decoder architecture with an accent classifier that controls pronunciation variation at the encoder output. The encoder and decoder are pre-trained on a large-scale multi-speaker corpus, and the attention-based decoder takes the accent-informed encoder outputs to generate accented prosody. The framework can be fine-tuned with limited training data from multiple accents and is able to generate accented speech for unseen speakers. Both objective and subjective evaluations confirm its effectiveness.
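The abstract's dataflow (encoder, accent classifier, accent-informed encoder outputs, attention-based decoder conditioned on a speaker vector) can be sketched as follows. This is a minimal illustrative toy with random weights and toy dimensions, not the paper's actual model; all function names, shapes, and the uniform-attention placeholder are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not taken from the paper)
T, D_TEXT, D_HID, N_ACCENTS, D_SPK, D_MEL = 6, 16, 32, 3, 8, 10

def encoder(phoneme_emb, W_enc):
    """Text encoder: projects phoneme embeddings to hidden states."""
    return np.tanh(phoneme_emb @ W_enc)                    # (T, D_HID)

def accent_classifier(enc_out, W_cls):
    """Predicts an accent posterior from mean-pooled encoder states."""
    logits = enc_out.mean(axis=0) @ W_cls                  # (N_ACCENTS,)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def decoder_step(context, spk_emb, W_dec):
    """One decoder step: attention context + speaker vector -> mel frame."""
    return np.concatenate([context, spk_emb]) @ W_dec      # (D_MEL,)

# Random parameters stand in for the pre-trained weights.
W_enc = rng.normal(size=(D_TEXT, D_HID))
W_cls = rng.normal(size=(D_HID, N_ACCENTS))
accent_table = rng.normal(size=(N_ACCENTS, D_HID))         # accent embeddings
W_dec = rng.normal(size=(D_HID + D_SPK, D_MEL))

phonemes = rng.normal(size=(T, D_TEXT))
spk_emb = rng.normal(size=(D_SPK,))                        # zero-shot speaker vector

enc_out = encoder(phonemes, W_enc)
posterior = accent_classifier(enc_out, W_cls)
# Accent-informed encoder outputs: add the expected accent embedding.
enc_accented = enc_out + posterior @ accent_table

# Uniform attention over encoder states (placeholder for learned attention).
context = enc_accented.mean(axis=0)
mel_frame = decoder_step(context, spk_emb, W_dec)
print(mel_frame.shape)  # (10,)
```

In this sketch, conditioning the decoder on an external speaker vector rather than a speaker ID is what permits zero-shot synthesis for unseen speakers; the accent posterior mixes the accent embeddings so the encoder output carries pronunciation variation.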
Pages: 947 - 951 (5 pages)
Related Papers
50 items total
  • [41] MULTI-RATE ATTENTION ARCHITECTURE FOR FAST STREAMABLE TEXT-TO-SPEECH SPECTRUM MODELING
    He, Qing
    Xiu, Zhiping
    Koehler, Thilo
    Wu, Jilong
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5689 - 5693
  • [42] WAVEFORM GENERATION FOR TEXT-TO-SPEECH SYNTHESIS USING PITCH-SYNCHRONOUS MULTI-SCALE GENERATIVE ADVERSARIAL NETWORKS
    Juvela, Lauri
    Bollepalli, Bajibabu
    Yamagishi, Junichi
    Alku, Paavo
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 6915 - 6919
  • [43] THE THU-HCSI MULTI-SPEAKER MULTI-LINGUAL FEW-SHOT VOICE CLONING SYSTEM FOR LIMMITS'24 CHALLENGE
    Zhou, Yixuan
    Zhou, Shuoyi
    Lei, Shun
    Wu, Zhiyong
    Wu, Menglin
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW 2024, 2024, : 71 - 72
  • [44] Cross-lingual Text-To-Speech Synthesis via Domain Adaptation and Perceptual Similarity Regression in Speaker Space
    Xin, Detai
    Saito, Yuki
    Takamichi, Shinnosuke
    Koriyama, Tomoki
    Saruwatari, Hiroshi
    INTERSPEECH 2020, 2020, : 2947 - 2951
  • [45] ZMM-TTS: Zero-Shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-Supervised Discrete Speech Representations
    Gong, Cheng
    Wang, Xin
    Cooper, Erica
    Wells, Dan
    Wang, Longbiao
    Dang, Jianwu
    Richmond, Korin
    Yamagishi, Junichi
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 4036 - 4051
  • [46] An RNN-based Quantized F0 Model with Multi-tier Feedback Links for Text-to-Speech Synthesis
    Wang, Xin
    Takaki, Shinji
    Yamagishi, Junichi
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 1059 - 1063
  • [47] Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet
    Zhang, Mingyang
    Wang, Xin
    Fang, Fuming
    Li, Haizhou
    Yamagishi, Junichi
    INTERSPEECH 2019, 2019, : 1298 - 1302
  • [48] Multi-stage attention for fine-grained expressivity transfer in multispeaker text-to-speech system
    Kulkarni, Ajinkya
    Colotte, Vincent
    Jouvet, Denis
    2022 30TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2022), 2022, : 180 - 184
  • [49] MRMI-TTS: Multi-Reference Audios and Mutual Information Driven Zero-Shot Voice Cloning
    Chen, Yi Ting
    Li, Wanting
    Tang, Buzhou
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2024, 23 (05)
  • [50] H4C-TTS: Leveraging Multi-Modal Historical Context for Conversational Text-to-Speech
    Seong, Donghyun
    Chang, Joon-Hyuk
    INTERSPEECH 2024, 2024, : 4933 - 4937