Expressive Text-to-Speech Synthesis using Text Chat Dataset with Speaking Style Information

Cited by: 0
Authors
Homma Y. [1]
Kanagawa H. [1]
Kobayashi N. [1]
Ijima Y. [1]
Saito K. [1]
Affiliations
[1] NTT Human Information Laboratories, NTT Corporation
Keywords
speaking style information; spoken dialogue system; text-to-speech
DOI
10.1527/tjsai.38-3_F-MA7
Abstract
This paper aims to generate expressive speech for dialogue systems embodied by robots and AI characters. To generate expressive speech, prior work has proposed using labels that express specific dialogue acts and emotions (i.e., speaking style information). Our approach uses speaking style information as an intermediate representation and trains two models independently: a model that infers the speaking style information from text, and a speech synthesis model conditioned on it. With this style-inference model, the method can generate expressive speech for dialogue-domain text that lies outside the domain of the speech synthesis training data. The method first estimates the speaking style labels for the input text; the estimated labels and the input text are then fed to the speech synthesis model to generate speech. Experiments show that our method improves the accuracy of classifying speaking style labels from text, and subjective evaluation experiments show that it produces more expressive speech than conventional methods. © 2023, Japanese Society for Artificial Intelligence. All rights reserved.
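The abstract describes a two-stage pipeline: a text classifier first predicts a speaking style label for the input text, and a style-conditioned speech synthesis model then generates speech from the text together with that label. The following Python code is a minimal sketch of that flow only, not the authors' implementation; the class names, the label set, and the rule-based classifier stand-in are hypothetical placeholders.

# Minimal sketch of the two-stage pipeline described in the abstract.
# Stage 1: a text classifier (e.g., a fine-tuned BERT model) predicts a speaking
# style label from the input text. Stage 2: a style-conditioned TTS model
# synthesizes speech from the text and the predicted label.
# All names below are hypothetical placeholders, not the authors' code.

from dataclasses import dataclass

STYLE_LABELS = ["neutral", "question", "apology", "joy"]  # example label set

@dataclass
class StylePrediction:
    label: str
    confidence: float

class StyleClassifier:
    """Text -> speaking style label (stand-in for a fine-tuned text classifier)."""
    def predict(self, text: str) -> StylePrediction:
        # A real implementation would run a trained encoder here;
        # this toy rule only illustrates the interface.
        label = "question" if text.strip().endswith("?") else "neutral"
        return StylePrediction(label=label, confidence=0.9)

class StyleConditionedTTS:
    """(text, style label) -> waveform (stand-in for a label-conditioned TTS model)."""
    def synthesize(self, text: str, style_label: str) -> bytes:
        # A real implementation would condition the acoustic model on the label.
        return f"<waveform of '{text}' in style '{style_label}'>".encode()

def generate_expressive_speech(text: str,
                               classifier: StyleClassifier,
                               tts: StyleConditionedTTS) -> bytes:
    style = classifier.predict(text)          # Stage 1: infer style label from text
    return tts.synthesize(text, style.label)  # Stage 2: synthesize with the label

if __name__ == "__main__":
    audio = generate_expressive_speech("Are you free tomorrow?",
                                       StyleClassifier(), StyleConditionedTTS())
    print(audio)

The key design point reflected here is that the two stages are trained independently, so the style classifier can be trained on text chat data from the target dialogue domain even when the speech synthesis corpus does not cover that domain.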