Expressive Text-to-Speech using Style Tag

Cited by: 10

Authors:

Kim, Minchan [1,2]

Cheon, Sung Jun [1,2]

Choi, Byoung Jin [1,2]

Kim, Jong Jin [3]

Kim, Nam Soo [1,2]

Affiliations:

[1] Seoul Natl Univ, Dept Elect & Comp Engn, Seoul, South Korea

[2] Seoul Natl Univ, INMC, Seoul, South Korea

[3] SK Telecom, Seoul, South Korea

Source:

INTERSPEECH 2021 | 2021

Keywords:

speech synthesis; expressive TTS; language model; non-autoregressive TTS

DOI:

10.21437/Interspeech.2021-465

Chinese Library Classification (CLC) number:

R36 [Pathology]; R76 [Otorhinolaryngology]

Discipline classification code:

100104; 100213

Abstract:

As recent text-to-speech (TTS) systems have rapidly improved in speech quality and generation speed, many researchers now focus on a more challenging issue: expressive TTS. To control speaking style, existing expressive TTS models use a categorical style index or reference speech as the style input. In this work, we propose StyleTagging-TTS (ST-TTS), a novel expressive TTS model that uses a style tag written in natural language. Using a style-tagged TTS dataset and a pre-trained language model, we model the relationship between the linguistic embedding and the speaking style domain, which enables our model to work even with style tags unseen during training. Because the style tag is written in natural language, it can control speaking style in a more intuitive, interpretable, and scalable way than a style index or reference speech. In addition, in terms of model architecture, we propose an efficient non-autoregressive (NAR) TTS architecture with single-stage training. Experimental results show that ST-TTS outperforms the existing expressive TTS model, Tacotron2-GST, in speech quality and expressiveness.

Pages: 4663-4667

Number of pages: 5

