CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis

Cited: 0
Authors
Meng, Yi [1 ,3 ]
Li, Xiang [1 ]
Wu, Zhiyong [1 ,2 ]
Li, Tingtian [3 ]
Sun, Zixun [3 ]
Xiao, Xinyu [3 ]
Sun, Chi [3 ]
Zhan, Hui [3 ]
Meng, Helen [1 ,2 ]
Affiliations
[1] Tsinghua Univ, Tsinghua CUHK Joint Res Ctr Media Sci Technol & S, Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[2] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Hong Kong, Peoples R China
[3] Tencent, Shanghai, Peoples R China
Source
INTERSPEECH 2022, 2022
Funding
National Natural Science Foundation of China
Keywords
text-to-speech synthesis; speaking style; reference selection; style-related text features;
DOI
10.21437/Interspeech.2022-11275
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
To further improve the speaking styles of synthesized speech, current text-to-speech (TTS) synthesis systems commonly employ reference speeches, rather than only the input text, to stylize their outputs. These reference speeches are either selected manually, which is resource-consuming, or selected by semantic features. However, semantic features contain not only style-related information but also style-irrelevant information. The style-irrelevant information in the text can interfere with reference audio selection and result in improper speaking styles. To improve reference selection, we propose the Contrastive Acoustic-Linguistic Module (CALM) to extract a Style-related Text Feature (STF) from the text. CALM optimizes the correlation between the speaking style embedding and the extracted STF with contrastive learning. Thus, a certain number of the most appropriate reference speeches for the input text are selected by retrieving the speeches with the top STF similarities. The corresponding style embeddings are then summed, weighted by their STF similarities, and used to stylize the synthesized speech. Experimental results demonstrate the effectiveness of the proposed approach: in both objective and subjective evaluations of the speaking styles of the synthesized speech, it outperforms a baseline with semantic-feature-based reference selection.
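The retrieval-and-weighting step described in the abstract can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code: the cosine similarity measure, the top-k cutoff, and the softmax weighting of style embeddings are all assumptions; a toy InfoNCE-style contrastive loss is included to suggest how the STF could be aligned with style embeddings during training.

```python
import numpy as np

def info_nce_loss(text_feats, style_feats, tau=0.1):
    """Toy contrastive loss: pull each text feature toward its paired
    style embedding and away from the other pairs in the batch."""
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    s = style_feats / np.linalg.norm(style_feats, axis=1, keepdims=True)
    logits = (t @ s.T) / tau                     # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # matched pairs on diagonal

def select_style_embedding(stf_query, stf_refs, style_embs, k=3):
    """Retrieve the k reference speeches whose STFs are most similar to the
    input text's STF, then return their style embeddings summed with
    similarity-based (softmax, assumed) weights."""
    q = stf_query / np.linalg.norm(stf_query)
    r = stf_refs / np.linalg.norm(stf_refs, axis=1, keepdims=True)
    sims = r @ q                                 # cosine similarity, shape (N,)
    top = np.argsort(sims)[::-1][:k]             # indices of the top-k references
    w = np.exp(sims[top]) / np.exp(sims[top]).sum()
    return w @ style_embs[top]                   # weighted style embedding, (D,)
```

With `k=1` this reduces to picking the single closest reference's style embedding; larger `k` interpolates among the best matches.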
Pages: 5533-5537 (5 pages)
Related Papers (50 records)
  • [1] Expressive Text-to-Speech Synthesis using Text Chat Dataset with Speaking Style Information
    Homma Y.
    Kanagawa H.
    Kobayashi N.
    Ijima Y.
    Saito K.
    Transactions of the Japanese Society for Artificial Intelligence, 2023, 38 (03)
  • [2] Expressive Text-to-Speech using Style Tag
    Kim, Minchan
    Cheon, Sung Jun
    Choi, Byoung Jin
    Kim, Jong Jin
    Kim, Nam Soo
    INTERSPEECH 2021, 2021, : 4663 - 4667
  • [3] Modeling the Acoustic Correlates of Expressive Elements in Text Genres for Expressive Text-to-Speech Synthesis
    Yang, Hongwu
    Meng, Helen M.
    Cai, Lianhong
    INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, 2006, : 1806 - 1809
  • [4] DETECTION AND EMPHATIC REALIZATION OF CONTRASTIVE WORD PAIRS FOR EXPRESSIVE TEXT-TO-SPEECH SYNTHESIS
    Li, Chunrong
    Wu, Zhiyong
    Meng, Fanbo
    Meng, Helen
    Cai, Lianhong
    2012 8TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, 2012, : 93 - 97
  • [5] Enhancing Speech Intelligibility in Text-To-Speech Synthesis using Speaking Style Conversion
    Paul, Dipjyoti
    Shifas, Muhammed P. V.
    Pantazis, Yannis
    Stylianou, Yannis
    INTERSPEECH 2020, 2020, : 1361 - 1365
  • [6] MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis
    Guan, Wenhao
    Li, Yishuang
    Li, Tao
    Huang, Hukai
    Wang, Feng
    Lin, Jiayan
    Huang, Lingyan
    Li, Lin
    Hong, Qingyang
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 16, 2024, : 18117 - 18125
  • [7] Cross-modal Contrastive Learning for Speech Translation
    Ye, Rong
    Wang, Mingxuan
    Li, Lei
    NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 5099 - 5113
  • [8] PREDICTING EXPRESSIVE SPEAKING STYLE FROM TEXT IN END-TO-END SPEECH SYNTHESIS
    Stanton, Daisy
    Wang, Yuxuan
    Skerry-Ryan, R. J.
    2018 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2018), 2018, : 595 - 602
  • [9] Speech Modification for Prosody Conversion in Expressive Marathi Text-to-Speech Synthesis
    Anil, Manjare Chandraprabha
    Shirbahadurkar, S. D.
    2014 INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING AND INTEGRATED NETWORKS (SPIN), 2014, : 56 - 58
  • [10] Hierarchical stress modeling and generation in mandarin for expressive Text-to-Speech
    Li, Ya
    Tao, Jianhua
    Hirose, Keikichi
    Xu, Xiaoying
    Lai, Wei
    SPEECH COMMUNICATION, 2015, 72 : 59 - 73