NNSPEECH: SPEAKER-GUIDED CONDITIONAL VARIATIONAL AUTOENCODER FOR ZERO-SHOT MULTI-SPEAKER TEXT-TO-SPEECH

Cited by: 7
Authors
Zhao, Botao [1 ,2 ]
Zhang, Xulong [1 ]
Wang, Jianzong [1 ]
Cheng, Ning [1 ]
Xiao, Jing [1 ]
Affiliations
[1] Ping An Technol Shenzhen Co Ltd, Shenzhen, Guangdong, Peoples R China
[2] Fudan Univ, Inst Sci & Technol Brain Inspired Intelligence, Shanghai, Peoples R China
Source
2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022
Keywords
zero-shot; multi-speaker text-to-speech; conditional variational autoencoder;
DOI
10.1109/ICASSP43922.2022.9746875
CLC Number
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Multi-speaker text-to-speech (TTS) with only a few adaptation utterances is a challenge in practical applications. To address this, we propose a zero-shot multi-speaker TTS model, named nnSpeech, that can synthesize the voice of a new speaker without fine-tuning, using only one adaptation utterance. Instead of relying on a speaker representation module to extract the characteristics of new speakers, our method is based on a speaker-guided conditional variational autoencoder and generates a latent variable Z that contains both speaker characteristics and content information. The distribution of Z is approximated by another latent variable conditioned on a reference mel-spectrogram and the phoneme sequence. Experiments on an English corpus, a Mandarin corpus, and a cross-dataset setting show that our model can generate natural speech similar to the target speaker from a single adaptation utterance.
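The abstract describes a speaker-guided conditional variational autoencoder (CVAE) in which a latent variable Z carries both speaker and content information, and its distribution is approximated by a variable conditioned on a reference mel-spectrogram and phonemes. The following is a minimal, hypothetical PyTorch sketch of that general idea, not the authors' implementation: the module names, layer sizes, frame-level simplification, and the Gaussian parameterization of the prior and posterior are all assumptions made for illustration.

    # Hypothetical sketch of a speaker-guided CVAE in the spirit of the abstract.
    # A latent z carries speaker + content information; its posterior is
    # conditioned on a (reference) mel-spectrogram and phoneme features.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpeakerGuidedCVAE(nn.Module):
        def __init__(self, phoneme_dim=256, mel_dim=80, latent_dim=128):
            super().__init__()
            # Posterior q(z | mel_target, phoneme): used during training
            self.post_net = nn.Sequential(
                nn.Linear(mel_dim + phoneme_dim, 256), nn.ReLU(),
                nn.Linear(256, 2 * latent_dim),   # mean and log-variance
            )
            # Speaker-guided prior p(z | mel_ref, phoneme): used at inference
            self.prior_net = nn.Sequential(
                nn.Linear(mel_dim + phoneme_dim, 256), nn.ReLU(),
                nn.Linear(256, 2 * latent_dim),
            )
            # Decoder p(mel | z, phoneme)
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim + phoneme_dim, 256), nn.ReLU(),
                nn.Linear(256, mel_dim),
            )

        def forward(self, phoneme, mel_target, mel_ref):
            # Posterior from the target mel-spectrogram and phoneme content
            mu_q, logvar_q = self.post_net(
                torch.cat([mel_target, phoneme], dim=-1)).chunk(2, dim=-1)
            # Prior guided by a reference utterance of the (possibly unseen) speaker
            mu_p, logvar_p = self.prior_net(
                torch.cat([mel_ref, phoneme], dim=-1)).chunk(2, dim=-1)
            # Reparameterization trick: sample z from the posterior
            z = mu_q + torch.randn_like(mu_q) * torch.exp(0.5 * logvar_q)
            mel_pred = self.decoder(torch.cat([z, phoneme], dim=-1))
            # Reconstruction loss + KL(posterior || speaker-guided prior)
            recon = F.l1_loss(mel_pred, mel_target)
            kl = 0.5 * (logvar_p - logvar_q
                        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                        - 1).sum(-1).mean()
            return recon + kl

At synthesis time, z would be sampled from the speaker-guided prior, which needs only a single reference mel-spectrogram from the unseen speaker plus the phoneme content; this matches the zero-shot, one-utterance adaptation setting the abstract describes.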
Pages: 4293-4297 (5 pages)
Related Papers (14 in total)
  • [1] Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech
    Yoon, Hyungchan
    Kim, Changhwan
    Song, Eunwoo
    Yoon, Hyun-Wook
    Kang, Hong-Goo
    INTERSPEECH 2023, 2023, : 4299 - 4303
  • [2] SC-CNN: Effective Speaker Conditioning Method for Zero-Shot Multi-Speaker Text-to-Speech Systems
    Yoon, Hyungchan
    Kim, Changhwan
    Um, Seyun
    Yoon, Hyun-Wook
    Kang, Hong-Goo
    IEEE SIGNAL PROCESSING LETTERS, 2023, 30 : 593 - 597
  • [3] Zero-Shot Normalization Driven Multi-Speaker Text to Speech Synthesis
    Kumar, Neeraj
    Narang, Ankur
    Lall, Brejesh
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1679 - 1693
  • [4] ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis
    Xue, Jinlong
    Deng, Yayue
    Han, Yichen
    Li, Ya
    Sun, Jianqing
    Liang, Jiaen
    2022 13TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2022, : 230 - 234
  • [5] Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis
    Zhou, Yixuan
    Song, Changhe
    Li, Xiang
    Zhang, Luwen
    Wu, Zhiyong
    Bian, Yanyao
    Su, Dan
    Meng, Helen
    INTERSPEECH 2022, 2022, : 2573 - 2577
  • [6] INVESTIGATING ON INCORPORATING PRETRAINED AND LEARNABLE SPEAKER REPRESENTATIONS FOR MULTI-SPEAKER MULTI-STYLE TEXT-TO-SPEECH
    Chien, Chung-Ming
    Lin, Jheng-Hao
    Huang, Chien-yu
    Hsu, Po-chun
    Lee, Hung-yi
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 8588 - 8592
  • [7] EXACT PROSODY CLONING IN ZERO-SHOT MULTISPEAKER TEXT-TO-SPEECH
    Lux, Florian
    Koch, Julia
    Vu, Ngoc Thang
    2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022, : 962 - 969
  • [8] Generalizable Zero-Shot Speaker Adaptive Speech Synthesis with Disentangled Representations
    Wang, Wenbin
    Song, Yang
    Jha, Sanjay
    INTERSPEECH 2023, 2023, : 4454 - 4458
  • [9] Hierarchical Timbre-Cadence Speaker Encoder for Zero-shot Speech Synthesis
    Lee, Joun Yeop
    Bae, Jae-Sung
    Mun, Seongkyu
    Lee, Jihwan
    Lee, Ji-Hyun
    Cho, Hoon-Young
    Kim, Chanwoo
    INTERSPEECH 2023, 2023, : 4334 - 4338
  • [10] Multi-Level Temporal-Channel Speaker Retrieval for Zero-Shot Voice Conversion
    Wang, Zhichao
    Xue, Liumeng
    Kong, Qiuqiang
    Xie, Lei
    Chen, Yuanzhe
    Tian, Qiao
    Wang, Yuping
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 2926 - 2937