CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis
Cited by: 0
Authors:
Meng, Yi [1,3]; Li, Xiang [1]; Wu, Zhiyong [1,2]; Li, Tingtian [3]; Sun, Zixun [3]; Xiao, Xinyu [3]; Sun, Chi [3]; Zhan, Hui [3]; Meng, Helen [1,2]
Affiliations:
[1] Tsinghua Univ, Tsinghua CUHK Joint Res Ctr Media Sci Technol & S, Shenzhen Int Grad Sch, Shenzhen, Peoples R China
[2] Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Hong Kong, Peoples R China
[3] Tencent, Shanghai, Peoples R China
Source:
INTERSPEECH 2022 | 2022
Funding:
National Natural Science Foundation of China;
Keywords:
text-to-speech synthesis;
speaking style;
reference selection;
style-related text features;
DOI:
10.21437/Interspeech.2022-11275
Chinese Library Classification (CLC):
O42 [Acoustics];
Subject Classification Codes:
070206; 082403;
Abstract:
To further improve the speaking styles of synthesized speech, current text-to-speech (TTS) systems commonly employ reference speeches, in addition to the input texts, to stylize their outputs. These reference speeches are either selected manually, which is resource-consuming, or selected by semantic features. However, semantic features contain not only style-related information but also style-irrelevant information. The style-irrelevant information in the text can interfere with reference audio selection and result in inappropriate speaking styles. To improve reference selection, we propose the Contrastive Acoustic-Linguistic Module (CALM) to extract the Style-related Text Feature (STF) from the text. CALM optimizes the correlation between the speaking style embedding and the extracted STF with contrastive learning. Thus, a certain number of the most appropriate reference speeches for the input text are selected by retrieving the speeches with the highest STF similarities. The style embeddings are then summed with weights determined by their STF similarities and used to stylize the synthesized speech. Experimental results demonstrate the effectiveness of the proposed approach: both objective and subjective evaluations of the speaking styles of the synthesized speech outperform a baseline approach with semantic-feature-based reference selection.
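The abstract summarizes two mechanisms: a contrastive objective that ties the extracted STF to the paired speaking-style embedding, and inference-time retrieval of the most similar references followed by a similarity-weighted sum of their style embeddings. The minimal PyTorch-style sketch below illustrates these two steps; the function names, tensor shapes, the InfoNCE-style loss form, cosine similarity, the softmax weighting, and the temperature value are all illustrative assumptions rather than the paper's exact formulation.

import torch
import torch.nn.functional as F

def contrastive_loss(stf, style, temperature=0.07):
    # Symmetric InfoNCE-style objective (assumed form): each text's STF is
    # pulled toward the style embedding of its paired utterance and pushed
    # away from the style embeddings of other utterances in the batch.
    # stf:   (B, d) style-related text features
    # style: (B, d) speaking-style embeddings from the paired reference audio
    stf = F.normalize(stf, dim=-1)
    style = F.normalize(style, dim=-1)
    logits = stf @ style.t() / temperature                # (B, B) similarity matrix
    targets = torch.arange(stf.size(0), device=stf.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def select_and_fuse_style(stf_query, stf_bank, style_bank, k=5):
    # Retrieve the k reference speeches whose STFs are most similar to the
    # input text's STF, then fuse their style embeddings with weights derived
    # from those similarities (softmax weighting is an assumption).
    # stf_query:  (d,)   STF of the input text
    # stf_bank:   (N, d) STFs of candidate reference speeches
    # style_bank: (N, s) speaking-style embeddings of the same references
    sims = F.cosine_similarity(stf_query.unsqueeze(0), stf_bank, dim=-1)   # (N,)
    top_sims, top_idx = sims.topk(k)                       # top-k most similar references
    weights = F.softmax(top_sims, dim=0)                   # (k,) similarity-based weights
    return (weights.unsqueeze(-1) * style_bank[top_idx]).sum(dim=0)        # (s,) fused style

In such a setup, the contrastive loss would be used during training of CALM, while the retrieval-and-fusion function would run at synthesis time to produce the style embedding that conditions the TTS model.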