Cross-lingual voice conversion based on F0 multi-scale modeling with VITS

Cited by: 0
Authors
Cao, Danyang [1 ]
Zhang, Zeyi [1 ]
Affiliations
[1] North China University of Technology, School of Information & Technology, Beijing 100144, People's Republic of China
Source
Proceedings of the 2024 3rd International Conference on Cyber Security, Artificial Intelligence and Digital Economy (CSAIDE 2024) | 2024
Keywords
Cross-lingual; Voice Conversion; F0 Multi-scale Modeling
DOI
10.1145/3672919.3672988
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
This paper introduces a cross-lingual voice conversion method that uses an F0 predictor for multi-scale modeling of the fundamental frequency (F0). Built on the VITS architecture, the method generates high-quality speech in an end-to-end manner. In cross-lingual conversion, because the source and target voices involve different languages, the converted speech often carries an unnatural foreign accent. To address this, Whisper is introduced as a content extractor that captures fine-grained speech content, including the specific accent of the source voice, which is crucial for effective cross-lingual conversion. In addition, the F0 predictor models the fundamental frequency contour at multiple scales, helping the conversion preserve the accent characteristics of the source voice. Although trained solely on the English VCTK dataset, the model achieves cross-lingual conversion across several languages and markedly reduces foreign-accent artifacts. A series of ablation experiments examines the contribution of the F0 predictor, and both objective and subjective evaluations demonstrate the effectiveness of the proposed method.
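The two front-end steps the abstract names, Whisper-based content extraction and multi-scale F0 modeling, can be illustrated with a minimal sketch. This is not the authors' code: the paper record does not specify the pitch tracker, the pooling scales, the Whisper checkpoint, or file names, so librosa's pYIN tracker, the window sizes (1, 4, 16), the "base" Whisper model, and "source.wav" below are all illustrative assumptions.

```python
# Minimal sketch (assumptions noted above), not the paper's implementation.
import numpy as np
import librosa
import torch
import whisper  # pip install openai-whisper

def extract_f0(wav_path, sr=16000, hop_length=160):
    """Track the F0 contour with pYIN (an assumed choice of tracker)."""
    y, _ = librosa.load(wav_path, sr=sr)
    f0, voiced_flag, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),  # ~65 Hz floor
        fmax=librosa.note_to_hz("C7"),  # ~2093 Hz ceiling
        sr=sr,
        hop_length=hop_length,
    )
    return np.nan_to_num(f0)  # pYIN marks unvoiced frames as NaN; map to 0

def multiscale_f0(f0, scales=(1, 4, 16)):
    """Average-pool the F0 contour at several temporal scales (scales assumed).

    Coarse windows keep the slow intonation trend that carries accent; the
    1-frame scale keeps local pitch detail. Each pooled contour is upsampled
    back to the frame rate so the scales stack into one multi-scale target.
    """
    T = len(f0)
    layers = []
    for s in scales:
        pad = (-T) % s                               # pad so T divides evenly
        padded = np.pad(f0, (0, pad), mode="edge")
        pooled = padded.reshape(-1, s).mean(axis=1)  # mean over each window
        layers.append(np.repeat(pooled, s)[:T])      # nearest-neighbour upsample
    return np.stack(layers)                          # shape: (len(scales), T)

def whisper_content(wav_path, model_name="base"):
    """Encode speech with the Whisper encoder as a content representation."""
    model = whisper.load_model(model_name)
    audio = whisper.pad_or_trim(whisper.load_audio(wav_path))
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    with torch.no_grad():
        return model.encoder(mel.unsqueeze(0))  # (1, frames, hidden_dim)

if __name__ == "__main__":
    f0 = extract_f0("source.wav")        # hypothetical input file
    print(multiscale_f0(f0).shape)       # (3, T)
    print(whisper_content("source.wav").shape)
```

Under these assumptions, the stacked contours give an F0 representation at every frame in which the coarse scales encode sentence-level intonation and the fine scale encodes local pitch movement, matching the abstract's motivation for multi-scale modeling; how the paper's F0 predictor consumes such targets inside VITS is not described in this record.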
Pages: 375-379 (5 pages)