Reinforce-Aligner: Reinforcement Alignment Search for Robust End-to-End Text-to-Speech

Cited by: 5
Authors
Chung, Hyunseung [1 ]
Lee, Sang-Hoon [2 ]
Lee, Seong-Whan [1 ,2 ]
Affiliations
[1] Korea Univ, Dept Artificial Intelligence, Seoul, South Korea
[2] Korea Univ, Dept Brain & Cognit Engn, Seoul, South Korea
Source
INTERSPEECH 2021, 2021
Keywords
text-to-speech; reinforcement learning
DOI
10.21437/Interspeech.2021-831
CLC Number
R36 [Pathology]; R76 [Otorhinolaryngology]
Discipline Code
100104; 100213
Abstract
Text-to-speech (TTS) synthesis is the process of producing synthesized speech from text or phoneme input. Traditional TTS models involve multiple processing steps and require external aligners, which provide attention alignments between phoneme and frame sequences. Because each additional step increases complexity and reduces efficiency, there is growing demand in modern synthesis pipelines for end-to-end TTS with efficient internal aligners. In this work, we propose an end-to-end text-to-waveform network with a novel reinforcement-learning-based duration search method. Our proposed generator is feed-forward, and the aligner trains an agent to make optimal duration predictions by receiving active feedback on the actions it takes to maximize cumulative reward. We demonstrate that the accurate phoneme-to-frame alignments generated by the trained agent enhance the fidelity and naturalness of the synthesized audio. Experimental results also show the superiority of our proposed model over other state-of-the-art TTS models with internal and external aligners.
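The abstract describes an agent that learns per-phoneme durations by maximizing cumulative reward. The paper's actual reward and policy design are not given in this record, so the following is only a minimal REINFORCE-style sketch under hypothetical assumptions: durations are discrete frame counts drawn from one categorical distribution per phoneme, and the reward is the (made-up) negative mismatch between the predicted total duration and a reference frame count (`target_total`).

```python
import numpy as np

rng = np.random.default_rng(0)

n_phonemes = 6      # length of the input phoneme sequence
max_duration = 10   # durations are discrete frame counts 1..max_duration
target_total = 30   # hypothetical reference: total frames in the ground-truth audio

# Policy: one categorical distribution over durations per phoneme.
logits = np.zeros((n_phonemes, max_duration))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

lr = 0.5
baseline = 0.0  # moving-average reward baseline to reduce gradient variance
for step in range(2000):
    probs = softmax(logits)
    # Action: sample one duration index per phoneme from the policy.
    actions = np.array([rng.choice(max_duration, p=p) for p in probs])
    durations = actions + 1
    # Hypothetical reward: negative mismatch with the reference frame count.
    reward = -abs(int(durations.sum()) - target_total)
    baseline = 0.9 * baseline + 0.1 * reward
    advantage = reward - baseline
    # REINFORCE: grad of log pi(a) w.r.t. logits is one_hot(a) - probs.
    grad = -probs
    grad[np.arange(n_phonemes), actions] += 1.0
    logits += lr * advantage * grad

probs = softmax(logits)
learned = probs.argmax(axis=-1) + 1
print(learned, learned.sum())  # total duration should end up near target_total
```

The moving-average baseline is a standard variance-reduction trick; the real model would replace the toy reward with feedback derived from the TTS training objective.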
Pages: 3635-3639 (5 pages)
Related Papers (25 total)
  • [1] Myanmar Text-to-Speech Synthesis Using End-to-End Model
    Qin, Qinglai
    Yang, Jian
    Li, Peiying
    2020 4TH INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND INFORMATION RETRIEVAL, NLPIR 2020, 2020, : 6 - 11
  • [2] EXPLORING END-TO-END NEURAL TEXT-TO-SPEECH SYNTHESIS FOR ROMANIAN
    Dumitrache, Marius
    Rebedea, Traian
    PROCEEDINGS OF THE 15TH INTERNATIONAL CONFERENCE LINGUISTIC RESOURCES AND TOOLS FOR NATURAL LANGUAGE PROCESSING, 2020, : 93 - 102
  • [3] On the Training and Testing Data Preparation for End-to-End Text-to-Speech Application
    Duc Chung Tran
    Khan, M. K. A. Ahamed
    Sridevi, S.
    2020 11TH IEEE CONTROL AND SYSTEM GRADUATE RESEARCH COLLOQUIUM (ICSGRC), 2020, : 73 - 75
  • [4] Optimization for Low-Resource Speaker Adaptation in End-to-End Text-to-Speech
    Hong, Changi
    Lee, Jung Hyuk
    Jeon, Moongu
    Kim, Hong Kook
    2024 IEEE 21ST CONSUMER COMMUNICATIONS & NETWORKING CONFERENCE, CCNC, 2024, : 1060 - 1061
  • [5] NaturalSpeech: End-to-End Text-to-Speech Synthesis With Human-Level Quality
    Tan, Xu
    Chen, Jiawei
    Liu, Haohe
    Cong, Jian
    Zhang, Chen
    Liu, Yanqing
    Wang, Xi
    Leng, Yichong
    Yi, Yuanhao
    He, Lei
    Zhao, Sheng
    Qin, Tao
    Soong, Frank
    Liu, Tie-Yan
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (06) : 4234 - 4245
  • [6] Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech
    Yoon, Hyungchan
    Um, Seyun
    Kim, Changhwan
    Kang, Hong-Goo
    INTERSPEECH 2023, 2023, : 3023 - 3027
  • [7] SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech
    Cho, Hyunjae
    Jung, Wonbin
    Lee, Junhyeok
    Woo, Sang Hoon
    INTERSPEECH 2022, 2022, : 1 - 5
  • [8] EfficientTTS 2: Variational End-to-End Text-to-Speech Synthesis and Voice Conversion
    Miao, Chenfeng
    Zhu, Qingying
    Chen, Minchuan
    Ma, Jun
    Wang, Shaojun
    Xiao, Jing
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 1650 - 1661
  • [9] Multi speaker text-to-speech synthesis using generalized end-to-end loss function
    Nazir, Owais
    Malik, Aruna
    Singh, Samayveer
    Pathan, Al-Sakib Khan
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (24) : 64205 - 64222
  • [10] END-TO-END TEXT-TO-SPEECH USING LATENT DURATION BASED ON VQ-VAE
    Yasuda, Yusuke
    Wang, Xin
    Yamagishi, Junichi
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 5694 - 5698