ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis

Cited by: 6

Authors:

Xue, Jinlong ^{[1]}

Deng, Yayue ^{[1,2]}

Han, Yichen ^{[1]}

Li, Ya ^{[1]}

Sun, Jianqing ^{[3]}

Liang, Jiaen ^{[3]}

Affiliations:

[1] Beijing Univ Posts & Telecommun, Sch Artificial Intelligence, Beijing, Peoples R China

[2] Beijing Language & Culture Univ, Beijing, Peoples R China

[3] Unisound Technol Co Ltd, Beijing, Peoples R China

Source:

2022 13TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP) | 2022

Keywords:

multi-speaker text-to-speech; speaker representation; MOS prediction; RECOGNITION

DOI:

10.1109/ISCSLP57327.2022.10037956

Chinese Library Classification:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

In recent years, neural-network-based methods for multi-speaker text-to-speech (TTS) synthesis have made significant progress. However, the speaker encoder models used in these methods still fail to capture enough speaker information. In this paper, we focus on accurate speaker encoder modeling and propose an end-to-end method that achieves better speaker similarity for both seen and unseen speakers. The proposed architecture consists of three separately trained components: a speaker encoder based on the state-of-the-art ECAPA-TDNN model, which is derived from the speaker verification task; a FastSpeech2-based synthesizer; and a HiFi-GAN vocoder. A comparison among different speaker encoder models shows that our proposed method achieves better speaker similarity. To evaluate our synthesized speech efficiently, we are the first to adopt and compare different deep-learning-based automatic MOS evaluation methods to assess our results, and these methods show great potential for automatic speech quality assessment.

Pages: 230-234

Page count: 5
