Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation

Cited by: 0

Authors:

Tu, Tao [1]

Chen, Yuan-Jui [1]

Liu, Alexander H. [1]

Lee, Hung-yi [1]

Affiliations:

[1] Natl Taiwan Univ, Coll Elect Engn & Comp Sci, Taipei, Taiwan

Source:

INTERSPEECH 2020 | 2020

Keywords:

multi-speaker speech synthesis; semi-supervised learning; discrete speech representation

DOI:

10.21437/Interspeech.2020-1824

Chinese Library Classification (CLC) codes:

R36 [Pathology]; R76 [Otorhinolaryngology]

Discipline classification codes:

100104; 100213

Abstract:

Recently, end-to-end multi-speaker text-to-speech (TTS) systems have achieved success in settings where large amounts of high-quality speech and corresponding transcriptions are available. However, the laborious process of collecting paired data prevents many institutes from building high-performing multi-speaker TTS systems. In this work, we propose a semi-supervised learning approach for multi-speaker TTS. A multi-speaker TTS model can learn from untranscribed audio via the proposed encoder-decoder framework with discrete speech representation. The experimental results demonstrate that with only an hour of paired speech data, whether from multiple speakers or a single speaker, the proposed model can generate intelligible speech in different voices. We found that the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy. In addition, our analysis reveals that the speaker characteristics of the paired data affect the effectiveness of semi-supervised TTS.

Pages: 3191-3195

Page count: 5

