A noise-robust voice conversion method with controllable background sounds

Cited by: 0
Authors
Chen, Lele [1 ]
Zhang, Xiongwei [1 ]
Li, Yihao [1 ]
Sun, Meng [1 ]
Chen, Weiwei [1 ]
Affiliations
[1] Army Engn Univ PLA, Coll Command & Control Engn, Nanjing 210007, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Noise-robust voice conversion; Dual-decoder structure; Bridge module; Cycle loss; Speech disentanglement; SPEECH ENHANCEMENT; FRAMEWORK;
DOI
10.1007/s40747-024-01375-6
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Background noises are usually treated as redundant or even harmful to voice conversion. Therefore, when converting noisy speech, a pretrained speech separation module is usually deployed to estimate clean speech prior to the conversion. However, this can lead to speech distortion due to the mismatch between the separation module and the conversion one. In this paper, a noise-robust voice conversion model is proposed, in which a user can freely choose to retain or to remove the background sounds. Firstly, a speech separation module with a dual-decoder structure is proposed, where two decoders decode the denoised speech and the background sounds, respectively. A bridge module is used to capture the interactions between the denoised speech and the background sounds by exchanging information between parallel layers. Subsequently, a voice conversion module with multiple encoders is used to convert the estimated clean speech produced by the speech separation module. Finally, the speech separation and voice conversion modules are jointly trained using a loss function combining cycle loss and mutual information loss, aiming to improve the decoupling efficacy among speech content, pitch, and speaker identity. Experimental results show that the proposed model obtains significant improvements in both subjective and objective evaluation metrics compared with the existing baselines. The speech naturalness and speaker similarity of the converted speech reach 3.47 and 3.43, respectively.
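As a rough illustration of the dual-decoder separator with a bridge module described in the abstract, the sketch below shows one way such a structure could be wired up in PyTorch. This is not the authors' implementation: the layer types (GRUs), feature dimensions, number of parallel layers, and the simple additive projection used for the information exchange are all illustrative assumptions.

```python
# Minimal sketch of a dual-decoder separator with per-layer bridge modules.
# All module choices, sizes, and the additive exchange rule are assumptions,
# not details taken from the paper.
import torch
import torch.nn as nn


class BridgeBlock(nn.Module):
    """Exchange information between the speech and background branches at one layer."""

    def __init__(self, dim):
        super().__init__()
        self.to_speech = nn.Linear(dim, dim)      # background branch -> speech branch
        self.to_background = nn.Linear(dim, dim)  # speech branch -> background branch

    def forward(self, h_speech, h_background):
        # Each branch receives a projected copy of the other branch's hidden state.
        return (h_speech + self.to_speech(h_background),
                h_background + self.to_background(h_speech))


class DualDecoderSeparator(nn.Module):
    """Shared encoder, two parallel decoders (speech / background), bridged per layer."""

    def __init__(self, feat_dim=80, hidden=256, num_layers=4):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.speech_layers = nn.ModuleList(
            [nn.GRU(hidden, hidden, batch_first=True) for _ in range(num_layers)])
        self.background_layers = nn.ModuleList(
            [nn.GRU(hidden, hidden, batch_first=True) for _ in range(num_layers)])
        self.bridges = nn.ModuleList(
            [BridgeBlock(hidden) for _ in range(num_layers)])
        self.speech_out = nn.Linear(hidden, feat_dim)
        self.background_out = nn.Linear(hidden, feat_dim)

    def forward(self, noisy_feats):               # (batch, time, feat_dim)
        shared, _ = self.encoder(noisy_feats)
        h_s, h_b = shared, shared
        for s_layer, b_layer, bridge in zip(
                self.speech_layers, self.background_layers, self.bridges):
            h_s, _ = s_layer(h_s)
            h_b, _ = b_layer(h_b)
            h_s, h_b = bridge(h_s, h_b)           # parallel-layer information exchange
        return self.speech_out(h_s), self.background_out(h_b)


if __name__ == "__main__":
    model = DualDecoderSeparator()
    noisy = torch.randn(2, 100, 80)               # dummy log-mel batch
    clean_est, background_est = model(noisy)
    print(clean_est.shape, background_est.shape)  # both (2, 100, 80)
```

Because the two decoders emit the denoised speech and the background sounds separately, the background estimate can either be discarded or mixed back into the converted speech, which is what allows the user-controllable background described in the abstract.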
Pages: 3981-3994
Page count: 14