Normalization Driven Zero-shot Multi-Speaker Speech Synthesis

Cited by: 7
Authors
Kumar, Neeraj [1 ,2 ]
Goel, Srishti [1 ]
Narang, Ankur [1 ]
Lall, Brejesh [2 ]
Affiliations
[1] Hike Private Ltd, New Delhi, India
[2] Indian Inst Technol, Delhi, India
Source
INTERSPEECH 2021 | 2021
Keywords
Speech synthesis; normalization; transfer learning; wav2vec2.0 based speaker encoder; angular softmax;
DOI
10.21437/Interspeech.2021-441
CLC classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline codes
100104; 100213;
Abstract
In this paper, we present a novel zero-shot multi-speaker speech synthesis approach (ZSM-SS) that leverages a normalization architecture and a speaker encoder together with a non-autoregressive, multi-head-attention-driven encoder-decoder. Given an input text and a reference speech sample of an unseen person, ZSM-SS can generate speech in that person's style in a zero-shot manner. Additionally, we demonstrate how the affine parameters of the normalization capture prosodic features such as energy and fundamental frequency in a disentangled fashion and can be used to generate morphed speech output. We demonstrate the efficacy of the proposed architecture on the multi-speaker VCTK [1] and LibriTTS [2] datasets, using multiple quantitative metrics that measure generated-speech distortion and MOS, along with a speaker-embedding analysis of the proposed speaker encoder model.
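As a rough illustration of the normalization idea the abstract describes (this is not the authors' released code), the sketch below shows one common way speaker-conditional affine normalization is implemented: the scale and shift of a layer norm are predicted from a speaker embedding instead of being learned as fixed per-layer parameters. All module names, dimensions, and the PyTorch framing are assumptions made for illustration only.

# Hypothetical sketch: speaker-conditional affine normalization.
# gamma (scale) and beta (shift) are predicted from a speaker embedding,
# so speaker identity enters the model through the normalization layer.
import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    def __init__(self, hidden_dim: int, speaker_dim: int):
        super().__init__()
        # Normalize without built-in affine parameters; the affine
        # transform is supplied by the speaker embedding instead.
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_gamma = nn.Linear(speaker_dim, hidden_dim)
        self.to_beta = nn.Linear(speaker_dim, hidden_dim)

    def forward(self, x: torch.Tensor, speaker_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim); speaker_emb: (batch, speaker_dim)
        gamma = self.to_gamma(speaker_emb).unsqueeze(1)  # (batch, 1, hidden_dim)
        beta = self.to_beta(speaker_emb).unsqueeze(1)
        return gamma * self.norm(x) + beta

# Toy usage with assumed dimensions: a 256-dim speaker embedding
# (e.g. pooled wav2vec 2.0 features) conditioning encoder states.
layer = ConditionalLayerNorm(hidden_dim=384, speaker_dim=256)
x = torch.randn(2, 100, 384)   # encoder hidden states
spk = torch.randn(2, 256)      # reference-speaker embedding
out = layer(x, spk)            # (2, 100, 384)

Because the affine parameters are the only place speaker information enters the normalized features in this sketch, they offer a natural handle for the kind of disentangled prosody control (energy, fundamental frequency) that the abstract attributes to the normalization's affine parameters.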
Pages: 1354 - 1358
Page count: 5
Related papers
33 records in total
  • [1] [Anonymous], 2020, XLSR
  • [2] [Anonymous], 2015, Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
  • [3] [Anonymous], 2018, Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
  • [4] [Anonymous], 2018, Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
  • [5] [Anonymous], 2019, PyWorld
  • [6] Arik S., 2018, Neural Voice Cloning with a Few Samples
  • [7] Arik S. Ö., 2017, Advances in Neural Information Processing Systems, Vol. 30
  • [8] Ba J., 2016, Layer Normalization, arXiv:1607.06450
  • [9] Baevski A., 2020, wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
  • [10] Chen M., 2020, MultiSpeech: Multi-Speaker Text to Speech with Transformer