Speech Rhythm-Based Speaker Embeddings Extraction from Phonemes and Phoneme Duration for Multi-Speaker Speech Synthesis

Cited by: 1
Authors
Fujita, Kenichi [1]
Ando, Atsushi [1]
Ijima, Yusuke [1]
Affiliations
[1] NTT Corp, NTT Human Informat Labs, Yokosuka 2390847, Japan
Keywords
speaker embedding; phoneme duration; speech synthesis; speech rhythm
DOI
10.1587/transinf.2023EDP7039
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
This paper proposes a speech rhythm-based method for extracting speaker embeddings that model phoneme duration from only a few utterances by the target speaker. Speech rhythm is one of the essential speaker characteristics, along with acoustic features such as F0, for reproducing individual utterances in speech synthesis. A novel feature of the proposed method is that the rhythm-based embeddings are extracted from phonemes and their durations, which are known to be related to speaking rhythm. They are extracted with a speaker identification model similar to the conventional spectral feature-based one. We conducted three experiments to evaluate performance: speaker embedding generation, speech synthesis with the generated embeddings, and embedding space analysis. The proposed method demonstrated moderate speaker identification performance (15.2% EER), even with only phonemes and their duration information. The objective and subjective evaluation results demonstrated that the proposed method can synthesize speech with speech rhythm closer to the target speaker than the conventional method. We also visualized the embeddings to evaluate the relationship between the distance of the embeddings and the perceptual similarity. The visualization of the embedding space and the analysis of how embedding closeness relates to similarity indicated that the distribution of embeddings reflects the subjective and objective similarity.
Pages: 93-104
Number of pages: 12