NNSPEECH: SPEAKER-GUIDED CONDITIONAL VARIATIONAL AUTOENCODER FOR ZERO-SHOT MULTI-SPEAKER TEXT-TO-SPEECH

Cited by: 7
Authors
Zhao, Botao [1 ,2 ]
Zhang, Xulong [1 ]
Wang, Jianzong [1 ]
Cheng, Ning [1 ]
Xiao, Jing [1 ]
Affiliations
[1] Ping An Technol Shenzhen Co Ltd, Shenzhen, Guangdong, Peoples R China
[2] Fudan Univ, Inst Sci & Technol Brain Inspired Intelligence, Shanghai, Peoples R China
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022
Keywords
zero-shot; multi-speaker text-to-speech; conditional variational autoencoder;
DOI
10.1109/ICASSP43922.2022.9746875
CLC Number
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Multi-speaker text-to-speech (TTS) with only a few adaptation utterances is a challenge in practical applications. To address this, we propose a zero-shot multi-speaker TTS model, named nnSpeech, that can synthesize the voice of a new speaker without fine-tuning, using only one adaptation utterance. Instead of relying on a speaker representation module to extract the characteristics of new speakers, our method is based on a speaker-guided conditional variational autoencoder and generates a latent variable Z that contains both speaker characteristics and content information. The distribution of Z is approximated by another latent variable conditioned on a reference mel-spectrogram and the phoneme sequence. Experiments on an English corpus, a Mandarin corpus, and a cross-dataset setting show that our model can generate natural speech similar to the target speaker from a single adaptation utterance.
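The abstract describes a speaker-guided conditional variational autoencoder (CVAE) in which a latent variable Z carries both speaker and content information, and its distribution is approximated by a variable conditioned on a reference mel-spectrogram and phonemes. The following is a minimal, hypothetical PyTorch sketch of that general idea, not the authors' implementation: the module names, layer sizes, frame-level simplification, and the Gaussian parameterization of the prior and posterior are all assumptions made for illustration.

    # Hypothetical sketch of a speaker-guided CVAE in the spirit of the abstract.
    # A latent z carries speaker + content information; its posterior is
    # conditioned on a (reference) mel-spectrogram and phoneme features.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpeakerGuidedCVAE(nn.Module):
        def __init__(self, phoneme_dim=256, mel_dim=80, latent_dim=128):
            super().__init__()
            # Posterior q(z | mel_target, phoneme): used during training
            self.post_net = nn.Sequential(
                nn.Linear(mel_dim + phoneme_dim, 256), nn.ReLU(),
                nn.Linear(256, 2 * latent_dim),   # mean and log-variance
            )
            # Speaker-guided prior p(z | mel_ref, phoneme): used at inference
            self.prior_net = nn.Sequential(
                nn.Linear(mel_dim + phoneme_dim, 256), nn.ReLU(),
                nn.Linear(256, 2 * latent_dim),
            )
            # Decoder p(mel | z, phoneme)
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim + phoneme_dim, 256), nn.ReLU(),
                nn.Linear(256, mel_dim),
            )

        def forward(self, phoneme, mel_target, mel_ref):
            # Posterior from the target mel-spectrogram and phoneme content
            mu_q, logvar_q = self.post_net(
                torch.cat([mel_target, phoneme], dim=-1)).chunk(2, dim=-1)
            # Prior guided by a reference utterance of the (possibly unseen) speaker
            mu_p, logvar_p = self.prior_net(
                torch.cat([mel_ref, phoneme], dim=-1)).chunk(2, dim=-1)
            # Reparameterization trick: sample z from the posterior
            z = mu_q + torch.randn_like(mu_q) * torch.exp(0.5 * logvar_q)
            mel_pred = self.decoder(torch.cat([z, phoneme], dim=-1))
            # Reconstruction loss + KL(posterior || speaker-guided prior)
            recon = F.l1_loss(mel_pred, mel_target)
            kl = 0.5 * (logvar_p - logvar_q
                        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                        - 1).sum(-1).mean()
            return recon + kl

At synthesis time, z would be sampled from the speaker-guided prior, which needs only a single reference mel-spectrogram from the unseen speaker plus the phoneme content; this matches the zero-shot, one-utterance adaptation setting the abstract describes.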
Pages: 4293-4297 (5 pages)
Related Papers (14 in total)
  • [1] Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech
    Yoon, Hyungchan
    Kim, Changhwan
    Song, Eunwoo
    Yoon, Hyun-Wook
    Kang, Hong-Goo
    INTERSPEECH 2023, 2023, : 4299 - 4303
  • [2] SC-CNN: Effective Speaker Conditioning Method for Zero-Shot Multi-Speaker Text-to-Speech Systems
    Yoon, Hyungchan
    Kim, Changhwan
    Um, Seyun
    Yoon, Hyun-Wook
    Kang, Hong-Goo
    IEEE SIGNAL PROCESSING LETTERS, 2023, 30 : 593 - 597
  • [3] Zero-Shot Normalization Driven Multi-Speaker Text to Speech Synthesis
    Kumar, Neeraj
    Narang, Ankur
    Lall, Brejesh
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1679 - 1693
  • [4] ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis
    Xue, Jinlong
    Deng, Yayue
    Han, Yichen
    Li, Ya
    Sun, Jianqing
    Liang, Jiaen
    2022 13TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2022, : 230 - 234
  • [5] Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis
    Zhou, Yixuan
    Song, Changhe
    Li, Xiang
    Zhang, Luwen
    Wu, Zhiyong
    Bian, Yanyao
    Su, Dan
    Meng, Helen
    INTERSPEECH 2022, 2022, : 2573 - 2577
  • [6] INVESTIGATING ON INCORPORATING PRETRAINED AND LEARNABLE SPEAKER REPRESENTATIONS FOR MULTI-SPEAKER MULTI-STYLE TEXT-TO-SPEECH
    Chien, Chung-Ming
    Lin, Jheng-Hao
    Huang, Chien-yu
    Hsu, Po-chun
    Lee, Hung-yi
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 8588 - 8592
  • [7] EXACT PROSODY CLONING IN ZERO-SHOT MULTISPEAKER TEXT-TO-SPEECH
    Lux, Florian
    Koch, Julia
    Vu, Ngoc Thang
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 962 - 969
  • [8] Generalizable Zero-Shot Speaker Adaptive Speech Synthesis with Disentangled Representations
    Wang, Wenbin
    Song, Yang
    Jha, Sanjay
    INTERSPEECH 2023, 2023, : 4454 - 4458
  • [9] Hierarchical Timbre-Cadence Speaker Encoder for Zero-shot Speech Synthesis
    Lee, Joun Yeop
    Bae, Jae-Sung
    Mun, Seongkyu
    Lee, Jihwan
    Lee, Ji-Hyun
    Cho, Hoon-Young
    Kim, Chanwoo
    INTERSPEECH 2023, 2023, : 4334 - 4338
  • [10] Multi-Level Temporal-Channel Speaker Retrieval for Zero-Shot Voice Conversion
    Wang, Zhichao
    Xue, Liumeng
    Kong, Qiuqiang
    Xie, Lei
    Chen, Yuanzhe
    Tian, Qiao
    Wang, Yuping
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 2926 - 2937