ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis

Times cited: 6
Authors
Xue, Jinlong [1]
Deng, Yayue [1, 2]
Han, Yichen [1]
Li, Ya [1]
Sun, Jianqing [3]
Liang, Jiaen [3]
Affiliations
[1] Beijing Univ Posts & Telecommun, Sch Artificial Intelligence, Beijing, Peoples R China
[2] Beijing Language & Culture Univ, Beijing, Peoples R China
[3] Unisound Technol Co Ltd, Beijing, Peoples R China
Source
2022 13TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP) | 2022
Keywords
multi-speaker text-to-speech; speaker representation; MOS prediction; RECOGNITION;
DOI
10.1109/ISCSLP57327.2022.10037956
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In recent years, neural network based methods for multi-speaker text-to-speech synthesis (TTS) have made significant progress. However, the current speaker encoder models used in these methods still cannot capture enough speaker information. In this paper, we focus on accurate speaker encoder modeling and propose an end-to-end method that can generate better similarity for both seen and unseen speakers. The proposed architecture consists of three separately trained components: a speaker encoder based on the state-of-the-art ECAPA-TDNN model which is derived from speaker verification task, a FastSpeech2 based synthesizer, and a HiFi-GAN vocoder. The comparison among different speaker encoder models shows our proposed method can achieve better speaker similarity. To efficiently evaluate our synthesized speech, we are the first to adopt and evaluate different deep learning based automatic MOS evaluation methods to assess our results, and these methods show great potential in automatic speech quality assessment.
Pages: 230-234
Page count: 5
Related papers
50 in total
  • [1] ECAPA-TDNN Embeddings for Speaker Diarization
    Dawalatabad, Nauman
    Ravanelli, Mirco
    Grondin, Francois
    Thienpondt, Jenthe
    Desplanques, Brecht
    Na, Hwidong
    INTERSPEECH 2021, 2021: 3560-3564
  • [2] Multi-speaker Emotional Text-to-speech Synthesizer
    Cho, Sungjae
    Lee, Soo-Young
    INTERSPEECH 2021, 2021: 2337-2338
  • [3] Multi-Speaker Text-to-Speech Training With Speaker Anonymized Data
    Huang, Wen-Chin
    Wu, Yi-Chiao
    Toda, Tomoki
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31: 2995-2999
  • [4] Multi-speaker Text-to-speech Synthesis Using Deep Gaussian Processes
    Mitsui, Kentaro
    Koriyama, Tomoki
    Saruwatari, Hiroshi
    INTERSPEECH 2020, 2020: 2032-2036
  • [5] Data Augmentation with ECAPA-TDNN Architecture for Automatic Speaker Recognition
    Li, Pinyan
    Hoi, Lap Man
    Wang, Yapeng
    Im, Sio Kei
    2023 12TH INTERNATIONAL CONFERENCE ON RENEWABLE ENERGY RESEARCH AND APPLICATIONS, ICRERA, 2023: 414-420
  • [6] ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification
    Desplanques, Brecht
    Thienpondt, Jenthe
    Demuynck, Kris
    INTERSPEECH 2020, 2020: 3830-3834
  • [7] Multi-Scene Robust Speaker Verification System Built on Improved ECAPA-TDNN
    Xuan, Xi
    Jin, Rong
    Xuan, Tingyu
    Du, Guolei
    Xuan, Kaisheng
    2022 IEEE 6TH ADVANCED INFORMATION TECHNOLOGY, ELECTRONIC AND AUTOMATION CONTROL CONFERENCE (IAEAC), 2022: 1689-1693
  • [8] ECAPA-TDNN Based Depression Detection from Clinical Speech
    Wang, Dong
    Ding, Yanhui
    Zhao, Qing
    Yang, Peilin
    Tan, Shuping
    Li, Ya
    INTERSPEECH 2022, 2022: 3333-3337
  • [9] DFR-ECAPA: Diffusion Feature Refinement for Speaker Verification Based on ECAPA-TDNN
    Gao, Ya
    Song, Wei
    Zhao, Xiaobing
    Liu, Xiangchun
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT X, 2024, 14434: 457-468
  • [10] Multi-Lingual Multi-Speaker Text-to-Speech Synthesis for Voice Cloning with Online Speaker Enrollment
    Liu, Zhaoyu
    Mak, Brian
    INTERSPEECH 2020, 2020: 2932-2936