ZERO-SHOT TEXT-TO-SPEECH SYNTHESIS CONDITIONED USING SELF-SUPERVISED SPEECH REPRESENTATION MODEL

Cited by: 1

Authors:

Fujita, Kenichi [1]

Ashihara, Takanori [1]

Kanagawa, Hiroki [1]

Moriya, Takafumi [1]

Ijima, Yusuke [1]

Affiliations:

[1] NTT Corp, Tokyo, Japan

Source:

2023 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW | 2023

Keywords:

Speech synthesis; self-supervised learning model; speaker embeddings; zero-shot TTS;

DOI:

10.1109/ICASSPW59220.2023.10193459

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Discipline classification codes:

070206 ; 082403 ;

Abstract:

This paper proposes a zero-shot text-to-speech (TTS) conditioned by a self-supervised speech-representation model acquired through self-supervised learning (SSL). Conventional methods with embedding vectors from x-vector or global style tokens still have a gap in reproducing the speaker characteristics of unseen speakers. A novel point of the proposed method is the direct use of the SSL model to obtain embedding vectors from speech representations trained with a large amount of data. We also introduce the separate conditioning of acoustic features and a phoneme duration predictor to obtain the disentangled embeddings between rhythm-based speaker characteristics and acoustic-feature-based ones. The disentangled embeddings will enable us to achieve better reproduction performance for unseen speakers and rhythm transfer conditioned by different speeches. Objective and subjective evaluations showed that the proposed method can synthesize speech with improved similarity and achieve speech-rhythm transfer.

Pages: 5

50 items in total

[21] StyleFusion TTS: Multimodal Style-Control and Enhanced Feature Fusion for Zero-Shot Text-to-Speech Synthesis
Chen, Zhiyong
Li, Xinnuo
Ai, Zhiqi
Xu, Shugong
PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT XI, 2025, 15041 : 263 - 277
[22] Effective Zero-Shot Multi-Speaker Text-to-Speech Technique Using Information Perturbation and a Speaker Encoder
Bang, Chae-Woon
Chun, Chanjun
SENSORS, 2023, 23 (23)
[23] MultiCQA: Zero-Shot Transfer of Self-Supervised Text Matching Models on a Massive Scale
Rueckle, Andreas
Pfeiffer, Jonas
Gurevych, Iryna
PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 2471 - 2486
[24] Speech Enhancement with Zero-Shot Model Selection
Zezario, Ryandhimas E.
Fuh, Chiou-Shann
Wang, Hsin-Min
Tsao, Yu
29TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2021), 2021, : 491 - 495
[25] INJECTING TEXT IN SELF-SUPERVISED SPEECH PRETRAINING
Chen, Zhehuai
Zhang, Yu
Rosenberg, Andrew
Ramabhadran, Bhuvana
Wang, Gary
Moreno, Pedro
2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 251 - 258
[26] MIIPHER: A ROBUST SPEECH RESTORATION MODEL INTEGRATING SELF-SUPERVISED SPEECH AND TEXT REPRESENTATIONS
Koizumi, Yuma
Zen, Heiga
Karita, Shigeki
Ding, Yifan
Yatabe, Kohei
Morioka, Nobuyuki
Zhang, Yu
Han, Wei
Bapna, Ankur
Bacchiani, Michiel
2023 IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS, WASPAA, 2023,
[27] Zero-Shot Normalization Driven Multi-Speaker Text to Speech Synthesis
Kumar, Neeraj
Narang, Ankur
Lall, Brejesh
IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1679 - 1693
[28] Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis
Zhou, Yixuan
Song, Changhe
Li, Xiang
Zhang, Luwen
Wu, Zhiyong
Bian, Yanyao
Su, Dan
Meng, Helen
INTERSPEECH 2022, 2022, : 2573 - 2577
[29] Self-Supervised Speech Representation Learning: A Review
Mohamed, Abdelrahman
Lee, Hung-yi
Borgholt, Lasse
Havtorn, Jakob D.
Edin, Joakim
Igel, Christian
Kirchhoff, Katrin
Li, Shang-Wen
Livescu, Karen
Maaloe, Lars
Sainath, Tara N.
Watanabe, Shinji
IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2022, 16 (06) : 1179 - 1210
[30] AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios
Wu, Yihan
Tan, Xu
Li, Bohan
He, Lei
Zhao, Sheng
Song, Ruihua
Qin, Tao
Liu, Tie-Yan
INTERSPEECH 2022, 2022, : 2568 - 2572
