Reinforce-Aligner: Reinforcement Alignment Search for Robust End-to-End Text-to-Speech

Cited by: 5
Authors
Chung, Hyunseung [1]
Lee, Sang-Hoon [2]
Lee, Seong-Whan [1, 2]
Affiliations
[1] Korea Univ, Dept Artificial Intelligence, Seoul, South Korea
[2] Korea Univ, Dept Brain & Cognit Engn, Seoul, South Korea
Source
INTERSPEECH 2021 | 2021
Keywords
text to speech; reinforcement learning;
DOI
10.21437/Interspeech.2021-831
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Discipline classification codes
100104 ; 100213 ;
Abstract
Text-to-speech (TTS) synthesis is the process of producing synthesized speech from text or phoneme input. Traditional TTS models contain multiple processing steps and require external aligners, which provide attention alignments of phoneme-to-frame sequences. As complexity increases and efficiency decreases with every additional step, there is growing demand in modern synthesis pipelines for end-to-end TTS with efficient internal aligners. In this work, we propose an end-to-end text-to-waveform network with a novel reinforcement learning based duration search method. Our proposed generator is feed-forward, and the aligner trains an agent to make optimal duration predictions by receiving active feedback from the actions it takes to maximize cumulative reward. We demonstrate that accurate phoneme-to-frame alignments generated by trained agents enhance the fidelity and naturalness of synthesized audio. Experimental results also show the superiority of our proposed model compared to other state-of-the-art TTS models with internal and external aligners.
Pages: 3635 - 3639
Number of pages: 5