Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS

Cited by: 24
Authors
He, Mutian [1 ]
Deng, Yan [2 ]
He, Lei [2 ]
Affiliations
[1] Beihang Univ, Sch Comp Sci & Engn, Beijing, Peoples R China
[2] Microsoft, Beijing, Peoples R China
Source
INTERSPEECH 2019 | 2019
Keywords
sequence-to-sequence model; attention; speech synthesis;
DOI
10.21437/Interspeech.2019-1972
Chinese Library Classification (CLC)
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject Classification Codes
100104 ; 100213 ;
Abstract
Neural TTS has demonstrated a strong capability to generate human-like speech with high quality and naturalness, but its generalization to out-of-domain texts remains a challenging task with respect to the design of attention-based sequence-to-sequence acoustic models. Various errors occur on inputs with unseen context, including attention collapse, skipping, and repeating, which limit broader applications. In this paper, we propose a novel stepwise monotonic attention method for sequence-to-sequence acoustic modeling to improve robustness on out-of-domain inputs. The method exploits the strictly monotonic nature of TTS by constraining monotonic hard attention so that the alignment between input and output sequences must not only be monotonic but also skip no input. Soft attention can be used to avoid the mismatch between training and inference. Experimental results show that the proposed method achieves significant improvements in robustness on out-of-domain scenarios for phoneme-based models, without any regression on the in-domain naturalness test.
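The core constraint described in the abstract (attention may only stay at the current input position or advance by exactly one, never skipping) can be expressed as a per-step recurrence over the soft alignment distribution. The sketch below is an illustrative NumPy implementation of that expected-alignment update, not the authors' released code; the function name `stepwise_monotonic_align` and the variable `p_stay` (the per-position probability of staying, typically produced by a sigmoid over attention energies) are assumptions for illustration.

```python
import numpy as np

def stepwise_monotonic_align(prev_alpha, p_stay):
    """One decoder step of a stepwise monotonic attention recurrence
    (soft expectation over the hard stay/move choice).

    prev_alpha : (T_enc,) alignment distribution at the previous step
    p_stay     : (T_enc,) probability that attention stays at each position

    alpha[j] = prev_alpha[j] * p_stay[j]
             + prev_alpha[j-1] * (1 - p_stay[j-1])

    so probability mass either stays at position j or moves forward by
    exactly one input position -- monotonic, with no skipping.
    """
    moved = np.roll(prev_alpha * (1.0 - p_stay), 1)
    moved[0] = 0.0  # no mass can enter position 0 from the left
    return prev_alpha * p_stay + moved

# usage: attention starts fully aligned to the first input token
alpha = np.array([1.0, 0.0, 0.0, 0.0])
p_stay = np.array([0.3, 0.5, 0.5, 0.5])
alpha = stepwise_monotonic_align(alpha, p_stay)
```

Because each position only keeps its own mass or receives mass from its immediate left neighbor, the distribution stays normalized and the alignment can never collapse, skip, or move backward, which matches the robustness constraints the abstract describes.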
Pages: 1293-1297
Page count: 5
Related Papers
50 records in total
  • [1] Sequence-to-Sequence Acoustic Modeling with Semi-Stepwise Monotonic Attention for Speech Synthesis
    Zhou, Xiao
    Ling, Zhenhua
    Hu, Yajun
    Dai, Lirong
    APPLIED SCIENCES-BASEL, 2021, 11 (21):
  • [2] Prosodic Features Control by Symbols as Input of Sequence-to-Sequence Acoustic Modeling for Neural TTS
    Kurihara, Kiyoshi
    Seiyama, Nobumasa
    Kumano, Tadashi
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2021, E104D (02) : 302 - 311
  • [3] FORWARD ATTENTION IN SEQUENCE-TO-SEQUENCE ACOUSTIC MODELING FOR SPEECH SYNTHESIS
    Zhang, Jing-Xuan
    Ling, Zhen-Hua
    Dai, Li-Rong
    2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2018, : 4789 - 4793
  • [4] EFFECT OF DATA REDUCTION ON SEQUENCE-TO-SEQUENCE NEURAL TTS
    Latorre, Javier
    Lachowicz, Jakub
    Lorenzo-Trueba, Jaime
    Merritt, Thomas
    Drugman, Thomas
    Ronanki, Srikanth
    Klimkov, Viacheslav
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019, : 7075 - 7079
  • [5] Sequence-to-Sequence Acoustic Modeling for Voice Conversion
    Zhang, Jing-Xuan
    Ling, Zhen-Hua
    Liu, Li-Juan
    Jiang, Yuan
    Dai, Li-Rong
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2019, 27 (03) : 631 - 644
  • [6] An Analysis of "Attention" in Sequence-to-Sequence Models
    Prabhavalkar, Rohit
    Sainath, Tara N.
    Li, Bo
    Rao, Kanishka
    Jaitly, Navdeep
    18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, : 3702 - 3706
  • [7] Plasma confinement mode classification using a sequence-to-sequence neural network with attention
    Matos, F.
    Menkovski, V.
    Pau, A.
    Marceca, G.
    Jenko, F.
    NUCLEAR FUSION, 2021, 61 (04)
  • [8] Deterministic Attention for Sequence-to-Sequence Constituent Parsing
    Ma, Chunpeng
    Liu, Lemao
    Tamura, Akihiro
    Zhao, Tiejun
    Sumita, Eiichiro
    THIRTY-FIRST AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 3237 - 3243
  • [9] Tagging Malware Intentions by Using Attention-Based Sequence-to-Sequence Neural Network
    Huang, Yi-Ting
    Chen, Yu-Yuan
    Yang, Chih-Chun
    Sun, Yeali
    Hsiao, Shun-Wen
    Chen, Meng Chang
    INFORMATION SECURITY AND PRIVACY, ACISP 2019, 2019, 11547 : 660 - 668
  • [10] Sequence-to-Sequence Model with Attention for Time Series Classification
    Tang, Yujin
    Xu, Jianfeng
    Matsumoto, Kazunori
    Ono, Chihiro
    2016 IEEE 16TH INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW), 2016, : 503 - 510