ZERO-SHOT TEXT-TO-SPEECH SYNTHESIS CONDITIONED USING SELF-SUPERVISED SPEECH REPRESENTATION MODEL

Cited by: 1
Authors
Fujita, Kenichi [1]
Ashihara, Takanori [1]
Kanagawa, Hiroki [1]
Moriya, Takafumi [1]
Ijima, Yusuke [1]
Affiliations
[1] NTT Corp, Tokyo, Japan
Source
2023 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING WORKSHOPS, ICASSPW | 2023
Keywords
Speech synthesis; self-supervised learning model; speaker embeddings; zero-shot TTS;
DOI
10.1109/ICASSPW59220.2023.10193459
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
This paper proposes a zero-shot text-to-speech (TTS) conditioned by a self-supervised speech-representation model acquired through self-supervised learning (SSL). Conventional methods with embedding vectors from x-vector or global style tokens still have a gap in reproducing the speaker characteristics of unseen speakers. A novel point of the proposed method is the direct use of the SSL model to obtain embedding vectors from speech representations trained with a large amount of data. We also introduce the separate conditioning of acoustic features and a phoneme duration predictor to obtain the disentangled embeddings between rhythm-based speaker characteristics and acoustic-feature-based ones. The disentangled embeddings will enable us to achieve better reproduction performance for unseen speakers and rhythm transfer conditioned by different speeches. Objective and subjective evaluations showed that the proposed method can synthesize speech with improved similarity and achieve speech-rhythm transfer.
Pages: 5
Related papers
50 items in total
  • [21] StyleFusion TTS: Multimodal Style-Control and Enhanced Feature Fusion for Zero-Shot Text-to-Speech Synthesis
    Chen, Zhiyong
    Li, Xinnuo
    Ai, Zhiqi
    Xu, Shugong
    PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2024, PT XI, 2025, 15041 : 263 - 277
  • [22] Effective Zero-Shot Multi-Speaker Text-to-Speech Technique Using Information Perturbation and a Speaker Encoder
    Bang, Chae-Woon
    Chun, Chanjun
    SENSORS, 2023, 23 (23)
  • [23] MultiCQA: Zero-Shot Transfer of Self-Supervised Text Matching Models on a Massive Scale
    Rueckle, Andreas
    Pfeiffer, Jonas
    Gurevych, Iryna
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 2471 - 2486
  • [24] Speech Enhancement with Zero-Shot Model Selection
    Zezario, Ryandhimas E.
    Fuh, Chiou-Shann
    Wang, Hsin-Min
    Tsao, Yu
    29TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO 2021), 2021, : 491 - 495
  • [25] INJECTING TEXT IN SELF-SUPERVISED SPEECH PRETRAINING
    Chen, Zhehuai
    Zhang, Yu
    Rosenberg, Andrew
    Ramabhadran, Bhuvana
    Wang, Gary
    Moreno, Pedro
    2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), 2021, : 251 - 258
  • [26] MIIPHER: A ROBUST SPEECH RESTORATION MODEL INTEGRATING SELF-SUPERVISED SPEECH AND TEXT REPRESENTATIONS
    Koizumi, Yuma
    Zen, Heiga
    Karita, Shigeki
    Ding, Yifan
    Yatabe, Kohei
    Morioka, Nobuyuki
    Zhang, Yu
    Han, Wei
    Bapna, Ankur
    Bacchiani, Michiel
    2023 IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS, WASPAA, 2023,
  • [27] Zero-Shot Normalization Driven Multi-Speaker Text to Speech Synthesis
    Kumar, Neeraj
    Narang, Ankur
    Lall, Brejesh
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2022, 30 : 1679 - 1693
  • [28] Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis
    Zhou, Yixuan
    Song, Changhe
    Li, Xiang
    Zhang, Luwen
    Wu, Zhiyong
    Bian, Yanyao
    Su, Dan
    Meng, Helen
    INTERSPEECH 2022, 2022, : 2573 - 2577
  • [29] Self-Supervised Speech Representation Learning: A Review
    Mohamed, Abdelrahman
    Lee, Hung-yi
    Borgholt, Lasse
    Havtorn, Jakob D.
    Edin, Joakim
    Igel, Christian
    Kirchhoff, Katrin
    Li, Shang-Wen
    Livescu, Karen
    Maaloe, Lars
    Sainath, Tara N.
    Watanabe, Shinji
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2022, 16 (06) : 1179 - 1210
  • [30] AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios
    Wu, Yihan
    Tan, Xu
    Li, Bohan
    He, Lei
    Zhao, Sheng
    Song, Ruihua
    Qin, Tao
    Liu, Tie-Yan
    INTERSPEECH 2022, 2022, : 2568 - 2572