ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading

Cited by: 3

Authors:

Xiao, Yujia [1]

Zhang, Shaofei [2]

Wang, Xi [2]

Tan, Xu [2]

He, Lei [2]

Zhao, Sheng [2]

Soong, Frank K. [2]

Lee, Tan [1]

Affiliations:

[1] Chinese University of Hong Kong, Department of Electronic Engineering, Hong Kong, People's Republic of China

[2] Microsoft, Beijing, People's Republic of China

Source:

INTERSPEECH 2023 | 2023

Keywords:

Text-to-Speech; Contextual Modeling

DOI:

10.21437/Interspeech.2023-122

Chinese Library Classification (CLC) number:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

While state-of-the-art Text-to-Speech (TTS) systems can generate natural, high-quality speech at the sentence level, they still face great challenges in speech generation for paragraph / long-form reading. These deficiencies are due to i) neglect of cross-sentence contextual information, and ii) the high computation and memory cost of long-form synthesis. To address these issues, this work develops a lightweight yet effective TTS system, ContextSpeech. Specifically, we first design a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding. Then we construct hierarchically-structured textual semantics to broaden the scope for global context enhancement. Additionally, we integrate linearized self-attention to improve model efficiency. Experiments show that ContextSpeech significantly improves voice quality and prosody expressiveness in paragraph reading with competitive model efficiency. Audio samples are available at: https://contextspeech.github.io/demo/
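The abstract names linearized self-attention as the efficiency component but gives no formulation. As a rough illustration only, the sketch below shows the general idea of kernel-based linear attention with an elu(x) + 1 feature map. The feature map, the function names, and the toy inputs are all assumptions made for this example; the record does not confirm that ContextSpeech uses this exact variant. The point it shows is that the key/value summary is accumulated once and reused for every query, so cost grows linearly with sequence length, while the result matches the explicit pairwise form.

```python
import math


def feature_map(x):
    # Assumed positive feature map: elu(v) + 1, i.e. v + 1 for v > 0, exp(v) otherwise.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(queries, keys, values):
    """Kernelized attention computed in time linear in the number of keys."""
    d_k, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    k_sum = [0.0] * d_k                     # running sum of phi(k)
    for k, v in zip(keys, values):
        phi_k = feature_map(k)
        for i in range(d_k):
            k_sum[i] += phi_k[i]
            for j in range(d_v):
                kv[i][j] += phi_k[i] * v[j]
    out = []
    for q in queries:
        phi_q = feature_map(q)
        norm = sum(phi_q[i] * k_sum[i] for i in range(d_k))
        out.append([
            sum(phi_q[i] * kv[i][j] for i in range(d_k)) / norm
            for j in range(d_v)
        ])
    return out


def quadratic_attention(queries, keys, values):
    """Same kernel, computed pairwise (quadratic cost), for comparison."""
    d_v = len(values[0])
    out = []
    for q in queries:
        phi_q = feature_map(q)
        weights = [sum(a * b for a, b in zip(phi_q, feature_map(k))) for k in keys]
        total = sum(weights)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(d_v)
        ])
    return out
```

Calling both functions on the same small lists of query, key, and value vectors should give results that agree up to floating-point rounding; only the order of the summations differs.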


Pages: 4883-4887

Page count: 5
