Reinforce-Aligner: Reinforcement Alignment Search for Robust End-to-End Text-to-Speech

Cited by: 5

Authors:

Chung, Hyunseung [1]

Lee, Sang-Hoon [2]

Lee, Seong-Whan [1, 2]

Affiliations:

[1] Korea Univ, Dept Artificial Intelligence, Seoul, South Korea

[2] Korea Univ, Dept Brain & Cognit Engn, Seoul, South Korea

Source:

INTERSPEECH 2021 | 2021

Keywords:

text to speech; reinforcement learning;

DOI:

10.21437/Interspeech.2021-831

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology];

Discipline classification code:

100104 ; 100213 ;

Abstract:

Text-to-speech (TTS) synthesis is the process of producing synthesized speech from text or phoneme input. Traditional TTS models contain multiple processing steps and require external aligners, which provide attention alignments of phoneme-to-frame sequences. As the complexity increases and efficiency decreases with every additional step, there is expanding demand in modern synthesis pipelines for end-to-end TTS with efficient internal aligners. In this work, we propose an end-to-end text-to-waveform network with a novel reinforcement learning based duration search method. Our proposed generator is feed-forward and the aligner trains the agent to make optimal duration predictions by receiving active feedback from actions taken to maximize cumulative reward. We demonstrate accurate alignments of phoneme-to-frame sequence generated from trained agents enhance fidelity and naturalness of synthesized audio. Experimental results also show the superiority of our proposed model compared to other state-of-the-art TTS models with internal and external aligners.
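As a rough illustration of what a phoneme-to-frame alignment derived from duration predictions looks like, the sketch below expands per-phoneme frame counts into a frame-level index sequence. It is a generic duration-expansion example, not the authors' implementation. The function name `durations_to_alignment` and the sample durations are hypothetical, and this record does not describe the paper's reinforcement learning agent or reward function, so neither is modeled here.

```python
from typing import List


def durations_to_alignment(durations: List[int]) -> List[int]:
    """Expand per-phoneme frame counts into a frame-to-phoneme index list.

    Hypothetical helper for illustration only: durations[i] is the number
    of output frames assigned to phoneme i.
    """
    alignment: List[int] = []
    for phoneme_index, n_frames in enumerate(durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Repeat the phoneme index once for every frame it covers.
        alignment.extend([phoneme_index] * n_frames)
    return alignment


if __name__ == "__main__":
    # Three phonemes lasting 2, 1 and 3 frames respectively (made-up values).
    print(durations_to_alignment([2, 1, 3]))
```

Under this framing, a duration search adjusts the integer entries of `durations` so that the expanded sequence lines up with the target speech frames.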


Pages: 3635-3639

Page count: 5
