Reimagining speech: a scoping review of deep learning-based methods for non-parallel voice conversion

Times Cited: 0
Authors
Bargum, Anders R. [1 ,2 ]
Serafin, Stefania [1 ]
Erkut, Cumhur [1 ]
Affiliations
[1] Aalborg Univ, Dept Architecture Design & Media Technol, Multisensory Experience Lab, Copenhagen, Denmark
[2] Khora VR, Heka, Copenhagen, Denmark
Source
FRONTIERS IN SIGNAL PROCESSING | 2024, Vol. 4
Keywords
voice conversion; voice transformations; voice control; deep learning; disentanglement; speech representation learning; REPRESENTATION DISENTANGLEMENT; MODEL; TIME;
DOI
10.3389/frsip.2024.1339159
Chinese Library Classification (CLC): TM [Electrical engineering]; TN [Electronic technology, communication technology]
Discipline Codes: 0808; 0809
Abstract
Research on deep learning-powered voice conversion (VC) in speech-to-speech scenarios is gaining increasing popularity. Although many works in the field of voice conversion share a common global pipeline, there is considerable diversity in the underlying structures, methods, and neural sub-blocks used across research efforts. Obtaining a comprehensive understanding of why particular methods are chosen when training voice conversion models can therefore be challenging, and the actual hurdles in the proposed solutions are often unclear. To shed light on these aspects, this paper presents a scoping review that explores the use of deep learning in speech analysis, synthesis, and disentangled speech representation learning within modern voice conversion systems. We screened 628 publications from more than 38 venues between 2017 and 2023, followed by an in-depth review of a final database of 130 eligible studies. Based on the review, we summarise the most frequently used deep learning approaches to voice conversion and highlight common pitfalls. We condense the knowledge gathered to identify the main challenges, supply solutions grounded in the analysis, and provide recommendations for future research directions.
Pages: 25