Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS

Cited by: 26
Authors
He, Mutian [1 ]
Deng, Yan [2 ]
He, Lei [2 ]
Affiliations
[1] Beihang Univ, Sch Comp Sci & Engn, Beijing, Peoples R China
[2] Microsoft, Beijing, Peoples R China
Source
INTERSPEECH 2019 | 2019
Keywords
sequence-to-sequence model; attention; speech synthesis;
DOI
10.21437/Interspeech.2019-1972
Chinese Library Classification
R36 [Pathology]; R76 [Otorhinolaryngology];
Subject Classification Codes
100104 ; 100213 ;
Abstract
Neural TTS has demonstrated strong capabilities to generate human-like speech with high quality and naturalness, yet its generalization to out-of-domain texts remains a challenging task with regard to the design of attention-based sequence-to-sequence acoustic models. Various errors occur on inputs with unseen context, including attention collapse, skipping, and repeating, which limit broader applications. In this paper, we propose a novel stepwise monotonic attention method for sequence-to-sequence acoustic modeling to improve robustness on out-of-domain inputs. The method exploits the strictly monotonic nature of TTS alignments by constraining monotonic hard attention so that the alignment between the input and output sequences is not only monotonic but also allows no skipping over inputs. Soft attention can be used to avoid the mismatch between training and inference. Experimental results show that the proposed method achieves significant improvements in robustness on out-of-domain scenarios for phoneme-based models, without any regression in the in-domain naturalness test.
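The alignment constraint described in the abstract (attention advances by at most one input position per decoder step, so it can neither skip nor move backward) can be sketched as an expected soft-alignment recurrence. This is a hedged illustration of the general idea, not the paper's implementation: the function name, the `p_move` parameterization, and the hard initialization at the first input token are assumptions made for the sketch.

```python
import numpy as np

def stepwise_monotonic_soft_alignment(p_move):
    """Expected soft alignment under a stepwise monotonic constraint.

    p_move[i, j] is the (assumed) probability that, after decoder step i,
    attention moves from encoder position j to j + 1; otherwise it stays
    at j. Because each step moves by at most one position, the alignment
    is monotonic and can never skip an input token.

    Returns alpha[i, j], the probability that decoder step i attends to
    encoder position j.
    """
    T_out, T_in = p_move.shape
    alpha = np.zeros((T_out, T_in))
    alpha[0, 0] = 1.0  # assume the alignment starts at the first input token
    for i in range(1, T_out):
        p = p_move[i - 1].copy()
        p[-1] = 0.0  # mass at the last input token cannot move further
        stay = alpha[i - 1] * (1.0 - p)          # mass that stays at j
        move = np.zeros(T_in)
        move[1:] = alpha[i - 1, :-1] * p[:-1]    # mass that moves j -> j+1
        alpha[i] = stay + move
    return alpha
```

With a uniform move probability of 0.5 over a 3-token input, the alignment mass spreads forward one position at a time and every row of `alpha` still sums to 1, illustrating the no-skipping, no-mass-loss property that distinguishes this scheme from unconstrained soft attention.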
Pages: 1293-1297
Page count: 5