ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading

Cited by: 3
Authors
Xiao, Yujia [1 ]
Zhang, Shaofei [2 ]
Wang, Xi [2 ]
Tan, Xu [2 ]
He, Lei [2 ]
Zhao, Sheng [2 ]
Soong, Frank K. [2 ]
Lee, Tan [1 ]
Affiliations
[1] Chinese Univ Hong Kong, Dept Elect Engn, Hong Kong, Peoples R China
[2] Microsoft, Beijing, Peoples R China
Source
INTERSPEECH 2023 | 2023
Keywords
Text-to-Speech; Contextual Modeling
DOI
10.21437/Interspeech.2023-122
CLC Number
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
While state-of-the-art Text-to-Speech (TTS) systems can generate highly natural speech at the sentence level, they still face great challenges in paragraph/long-form speech generation. These deficiencies stem from i) the neglect of cross-sentence contextual information, and ii) the high computation and memory cost of long-form synthesis. To address these issues, this work develops a lightweight yet effective TTS system, ContextSpeech. Specifically, we first design a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding. We then construct hierarchically structured textual semantics to broaden the scope of global context enhancement. Additionally, we integrate linearized self-attention to improve model efficiency. Experiments show that ContextSpeech significantly improves voice quality and prosody expressiveness in paragraph reading, with competitive model efficiency. Audio samples are available at: https://contextspeech.github.io/demo/
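The abstract names linearized self-attention as the model-efficiency component. As an illustration only (the function names and the `elu(x)+1` feature map below are common choices in the linear-attention literature, not details taken from this paper), a minimal NumPy sketch of kernel-based linear attention looks like:

```python
import numpy as np

def elu_feature_map(x):
    # phi(x) = elu(x) + 1: a positive feature map commonly used
    # to approximate the exponential kernel in softmax attention
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Approximates softmax(Q K^T) V as phi(Q) (phi(K)^T V), normalized
    per query. Cost drops from O(n^2 d) to O(n d^2) in sequence length n,
    which is what makes long-form (paragraph) synthesis tractable."""
    Qf, Kf = elu_feature_map(Q), elu_feature_map(K)  # (n, d) each
    KV = Kf.T @ V                                    # (d, d) summary of keys/values
    Z = Qf @ Kf.sum(axis=0)                          # (n,) per-query normalizer
    return (Qf @ KV) / Z[:, None]

# Toy usage: random queries, keys, and values
rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)
print(out.shape)  # (6, 4)
```

Because the feature map is strictly positive, each output row is a convex combination of the value rows, mirroring the normalized-weights property of softmax attention.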
Pages: 4883-4887
Page count: 5