Automatic Speech Disentanglement for Voice Conversion using Rank Module and Speech Augmentation

Cited by: 2
Authors
Liu, Zhonghua [1]
Wang, Shijun [2]
Chen, Ning [1]
Affiliations
[1] East China Univ Sci & Technol, Shanghai, Peoples R China
[2] Univ St Gallen, St Gallen, Switzerland
Source
INTERSPEECH 2023 | 2023
Funding
National Natural Science Foundation of China;
Keywords
voice conversion; speech disentanglement; speech augmentation;
DOI
10.21437/Interspeech.2023-1602
Chinese Library Classification
O42 [Acoustics];
Discipline Codes
070206; 082403;
Abstract
Voice Conversion (VC) converts the voice of a source utterance to that of a target speaker while preserving the source's linguistic content. Speech can be decomposed into four main components: content, timbre, rhythm, and pitch. Unfortunately, most related works account for only content and timbre, which results in less natural converted speech. Some recent works can disentangle speech into several components, but they require laborious bottleneck tuning or various hand-crafted features, each assumed to carry one disentangled piece of speech information. In this paper, we propose a VC model that automatically disentangles speech into the four components using only two augmentation functions, without multiple hand-crafted features or laborious bottleneck tuning. The proposed model is straightforward yet efficient, and empirical results demonstrate that it outperforms the baseline in both disentanglement effectiveness and speech naturalness.
Pages: 2298-2302
Page count: 5
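As a rough illustration of the augmentation-driven disentanglement the abstract describes, the sketch below shows two generic speech augmentations of the kind used in such work: random pitch shifting (perturbs pitch, preserves content and rhythm) and random time-stretching (perturbs rhythm, preserves pitch). This is a minimal sketch using librosa; the record does not specify the paper's actual two augmentation functions, so the function names and parameter ranges here are assumptions, not the authors' method.

```python
# Hedged sketch (NOT the paper's exact augmentations): two generic
# perturbations commonly used to keep an encoder from encoding pitch
# or rhythm, so those factors can be disentangled elsewhere.
import numpy as np
import librosa


def perturb_pitch(wav: np.ndarray, sr: int, max_semitones: float = 4.0) -> np.ndarray:
    # Shift pitch by a random number of semitones; linguistic content and
    # timing are preserved, so pitch becomes an unreliable cue.
    n_steps = float(np.random.uniform(-max_semitones, max_semitones))
    return librosa.effects.pitch_shift(wav, sr=sr, n_steps=n_steps)


def perturb_rhythm(wav: np.ndarray, min_rate: float = 0.8, max_rate: float = 1.2) -> np.ndarray:
    # Stretch or compress the waveform in time at a random rate; pitch is
    # preserved, so rhythm/duration becomes an unreliable cue.
    rate = float(np.random.uniform(min_rate, max_rate))
    return librosa.effects.time_stretch(wav, rate=rate)


# Usage: perturb a 16 kHz utterance before feeding it to an encoder.
wav, sr = librosa.load(librosa.example("libri1"), sr=16000)
augmented = perturb_rhythm(perturb_pitch(wav, sr))
```

Applying both perturbations to the encoder input is one common way to force the model to route pitch and rhythm information through dedicated pathways rather than the content branch; whether the paper composes its two functions this way is an assumption here.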