A noise-robust voice conversion method with controllable background sounds

Cited by: 0

Authors:

Chen, Lele [1]

Zhang, Xiongwei [1]

Li, Yihao [1]

Sun, Meng [1]

Chen, Weiwei [1]

Affiliation:

[1] Army Engn Univ PLA, Coll Command & Control Engn, Nanjing 210007, Peoples R China

Source:

COMPLEX & INTELLIGENT SYSTEMS | 2024, Vol. 10, No. 03

Funding:

National Natural Science Foundation of China;

Keywords:

Noise-robust voice conversion; Dual-decoder structure; Bridge module; Cycle loss; Speech disentanglement; SPEECH ENHANCEMENT; FRAMEWORK;

DOI:

10.1007/s40747-024-01375-6

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Background noise is usually treated as redundant or even harmful to voice conversion. Therefore, when converting noisy speech, a pretrained speech separation module is usually deployed to estimate the clean speech prior to conversion. However, this can lead to speech distortion due to the mismatch between the separation module and the conversion module. In this paper, a noise-robust voice conversion model is proposed in which the user can freely choose to retain or remove the background sounds. First, a speech separation module with a dual-decoder structure is proposed, where the two decoders decode the denoised speech and the background sounds, respectively. A bridge module captures the interactions between the denoised speech and the background sounds in parallel layers through information exchange. Subsequently, a voice conversion module with multiple encoders is used to convert the estimated clean speech from the speech separation module. Finally, the speech separation and voice conversion modules are jointly trained using a loss function that combines cycle loss and mutual information loss, aiming to improve the decoupling of speech content, pitch, and speaker identity. Experimental results show that the proposed model obtains significant improvements in both subjective and objective evaluation metrics compared with the existing baselines. The speech naturalness and speaker similarity scores of the converted speech are 3.47 and 3.43, respectively.
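The dual-decoder and bridge idea described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration only: the abstract does not specify layer types, the form of the bridge, or how the background is recombined. Here the bridge is a simple additive exchange with a hypothetical mixing weight `alpha`, the layers are plain matrix products, and `mix_output` is a hypothetical helper standing in for the user's choice to keep or drop the background sounds.

```python
def linear(x, w):
    """Multiply vector x by matrix w (given as a list of rows)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def relu(x):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in x]


def bridge(h_speech, h_noise, alpha=0.5):
    """Assumed bridge: each branch receives a weighted copy of the other."""
    new_speech = [s + alpha * n for s, n in zip(h_speech, h_noise)]
    new_noise = [n + alpha * s for s, n in zip(h_speech, h_noise)]
    return new_speech, new_noise


def dual_decoder(latent, speech_layers, noise_layers, alpha=0.5):
    """Run two parallel decoders over a shared latent vector.

    After every pair of parallel layers the bridge exchanges information
    between the speech branch and the background branch.
    """
    h_speech, h_noise = list(latent), list(latent)
    for w_speech, w_noise in zip(speech_layers, noise_layers):
        h_speech = relu(linear(h_speech, w_speech))
        h_noise = relu(linear(h_noise, w_noise))
        h_speech, h_noise = bridge(h_speech, h_noise, alpha)
    return h_speech, h_noise


def mix_output(converted, background, keep_background):
    """Add the estimated background back only if the user asks for it."""
    gain = 1.0 if keep_background else 0.0
    return [c + gain * b for c, b in zip(converted, background)]
```

In this sketch the voice conversion step itself is omitted; `converted` would be the output of a conversion model applied to the speech branch, and `keep_background` plays the role of the user-controllable switch mentioned in the abstract.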

Pages: 3981-3994

Page count: 14
