A noise-robust voice conversion method with controllable background sounds

Cited by: 0

Authors:

Chen, Lele [1]

Zhang, Xiongwei [1]

Li, Yihao [1]

Sun, Meng [1]

Chen, Weiwei [1]

Affiliations:

[1] Army Engn Univ PLA, Coll Command & Control Engn, Nanjing 210007, Peoples R China

Source:

COMPLEX & INTELLIGENT SYSTEMS | 2024, Volume 10, Issue 3

Funding:

National Natural Science Foundation of China;

Keywords:

Noise-robust voice conversion; Dual-decoder structure; Bridge module; Cycle loss; Speech disentanglement; SPEECH ENHANCEMENT; FRAMEWORK;

DOI:

10.1007/s40747-024-01375-6

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Background noises are usually treated as redundant or even harmful to voice conversion. Therefore, when converting noisy speech, a pretrained speech separation module is usually deployed to estimate clean speech prior to the conversion. However, this can lead to speech distortion due to the mismatch between the separation module and the conversion module. In this paper, a noise-robust voice conversion model is proposed, in which a user can freely choose to retain or remove the background sounds. First, a speech separation module with a dual-decoder structure is proposed, where two decoders decode the denoised speech and the background sounds, respectively. A bridge module captures the interactions between the denoised speech and the background sounds in parallel layers through information exchange. Subsequently, a voice conversion module with multiple encoders is used to convert the estimated clean speech from the speech separation module. Finally, the speech separation and voice conversion modules are jointly trained using a loss function that combines cycle loss and mutual information loss, aiming to improve the decoupling of speech content, pitch, and speaker identity. Experimental results show that the proposed model obtains significant improvements in both subjective and objective evaluation metrics compared with the existing baselines. The speech naturalness and speaker similarity of the converted speech are 3.47 and 3.43, respectively.


Pages: 3981-3994

Page count: 14
