A Unified Accent Estimation Method Based on Multi-Task Learning for Japanese Text-to-Speech

Cited by: 2
Authors
Park, Byeongseon [1]
Yamamoto, Ryuichi [1]
Tachibana, Kentaro [1]
Affiliations
[1] LINE Corp, Tokyo, Japan
Source
INTERSPEECH 2022 | 2022
Keywords
Accent estimation; multi-task learning; accent sandhi; text-to-speech; Japanese; EMBEDDINGS; MODEL
DOI
10.21437/Interspeech.2022-334
Chinese Library Classification number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
We propose a unified accent estimation method for Japanese text-to-speech (TTS). Unlike the conventional two-stage methods, which separately train two models for predicting accent phrase boundaries and accent nucleus positions, our method merges the two models and jointly optimizes the entire model in a multi-task learning framework. Furthermore, considering the hierarchical linguistic structure of intonation phrases (IPs), accent phrases, and accent nuclei, we generalize the proposed approach to simultaneously model the IP boundaries with accent information. Objective evaluation results reveal that the proposed method achieves an accent estimation accuracy of 80.4%, which is 6.67% higher than the conventional two-stage method. When the proposed method is incorporated into a neural TTS framework, the system achieves a 4.29 mean opinion score with respect to prosody naturalness.
Pages: 1931-1935
Number of pages: 5