ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading

Cited by: 2
Authors
Xiao, Yujia [1]
Zhang, Shaofei [2]
Wang, Xi [2]
Tan, Xu [2]
He, Lei [2]
Zhao, Sheng [2]
Soong, Frank K. [2]
Lee, Tan [1]
Affiliations
[1] Chinese Univ Hong Kong, Dept Elect Engn, Hong Kong, Peoples R China
[2] Microsoft, Beijing, Peoples R China
Source
INTERSPEECH 2023 | 2023
Keywords
Text-to-Speech; Contextual Modeling;
DOI
10.21437/Interspeech.2023-122
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206; 082403;
Abstract
While state-of-the-art Text-to-Speech systems can generate natural speech of very high quality at sentence level, they still meet great challenges in speech generation for paragraph / long-form reading. Such deficiencies are due to i) ignorance of cross-sentence contextual information, and ii) high computation and memory cost for long-form synthesis. To address these issues, this work develops a lightweight yet effective TTS system, ContextSpeech. Specifically, we first design a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding. Then we construct hierarchically-structured textual semantics to broaden the scope for global context enhancement. Additionally, we integrate linearized self-attention to improve model efficiency. Experiments show that ContextSpeech significantly improves the voice quality and prosody expressiveness in paragraph reading with competitive model efficiency. Audio samples are available at: https://contextspeech.github.io/demo/
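The abstract names linearized self-attention as the efficiency component but does not give its formulation. The sketch below shows one common kernelized variant, under assumptions that are not taken from the paper: the feature map elu(x) + 1, the non-causal setting, and all function names are illustrative. It computes attention through running sums, so the cost grows linearly with sequence length, and compares the result with an explicit N x N reference built on the same similarity function.

```python
import math

def feature_map(x):
    """elu(x) + 1, a positive feature map (illustrative choice, not from the paper)."""
    return x + 1.0 if x > 0 else math.exp(x)

def linear_attention(queries, keys, values):
    """Kernelized attention whose cost grows linearly with sequence length.

    queries, keys: lists of N vectors of size d; values: list of N vectors of size d_v.
    """
    d, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d)]   # sum over positions of phi(k) v^T
    k_sum = [0.0] * d                      # sum over positions of phi(k)
    for k, v in zip(keys, values):
        phi_k = [feature_map(x) for x in k]
        for i in range(d):
            k_sum[i] += phi_k[i]
            for j in range(d_v):
                kv[i][j] += phi_k[i] * v[j]
    outputs = []
    for q in queries:
        phi_q = [feature_map(x) for x in q]
        norm = sum(a * b for a, b in zip(phi_q, k_sum))
        outputs.append([sum(phi_q[i] * kv[i][j] for i in range(d)) / norm
                        for j in range(d_v)])
    return outputs

def quadratic_attention(queries, keys, values):
    """Reference: same similarity function, with explicit N x N attention weights."""
    outputs = []
    for q in queries:
        phi_q = [feature_map(x) for x in q]
        w = [sum(a * feature_map(b) for a, b in zip(phi_q, k)) for k in keys]
        total = sum(w)
        outputs.append([sum(w_n * v[j] for w_n, v in zip(w, values)) / total
                        for j in range(len(values[0]))])
    return outputs

# Toy inputs: 3 positions, d = 2, d_v = 2 (made-up numbers for illustration).
queries = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
keys = [[0.1, 0.4], [-0.7, 1.2], [0.9, -0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
linear_out = linear_attention(queries, keys, values)
reference_out = quadratic_attention(queries, keys, values)
```

Both functions return the same values up to floating-point error. The linear version never materializes the N x N weight matrix, which is the property that makes this family of attention attractive for long-form synthesis.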
Pages: 4883-4887
Number of pages: 5