A noise-robust voice conversion method with controllable background sounds

Cited by: 0
Authors
Chen, Lele [1]
Zhang, Xiongwei [1]
Li, Yihao [1]
Sun, Meng [1]
Chen, Weiwei [1]
Affiliations
[1] Army Engn Univ PLA, Coll Command & Control Engn, Nanjing 210007, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Noise-robust voice conversion; Dual-decoder structure; Bridge module; Cycle loss; Speech disentanglement; SPEECH ENHANCEMENT; FRAMEWORK
DOI
10.1007/s40747-024-01375-6
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Background noises are usually treated as redundant or even harmful to voice conversion. Therefore, when converting noisy speech, a pretrained speech separation module is usually deployed to estimate clean speech prior to the conversion. However, this can lead to speech distortion due to the mismatch between the separation module and the conversion module. In this paper, a noise-robust voice conversion model is proposed in which a user can freely choose to retain or remove the background sounds. First, a speech separation module with a dual-decoder structure is proposed, where two decoders decode the denoised speech and the background sounds, respectively. A bridge module captures the interactions between the denoised speech and the background sounds in parallel layers through information exchange. Subsequently, a voice conversion module with multiple encoders converts the estimated clean speech from the speech separation module. Finally, the speech separation and voice conversion modules are jointly trained using a loss function combining cycle loss and mutual information loss, aiming to improve the decoupling of speech content, pitch, and speaker identity. Experimental results show that the proposed model obtains significant improvements in both subjective and objective evaluation metrics compared with the existing baselines. The speech naturalness and speaker similarity of the converted speech are 3.47 and 3.43, respectively.
Pages: 3981-3994
Number of pages: 14
Related papers
43 in total
  • [21] Two-stage deep spectrum fusion for noise-robust end-to-end speech recognition
    Fan, Cunhang
    Ding, Mingming
    Yi, Jiangyan
    Li, Jinpeng
    Lv, Zhao
    APPLIED ACOUSTICS, 2023, 212
  • [22] Structural similarity-based Bi-representation through true noise level for noise-robust face super-resolution
    Nagar, Surendra
    Jain, Ankush
    Singh, Pramod Kumar
    Kumar, Ajay
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (17) : 26255 - 26288
  • [23] Wav2code: Restore Clean Speech Representations via Codebook Lookup for Noise-Robust ASR
    Hu, Yuchen
    Chen, Chen
    Zhu, Qiushi
    Chng, Eng Siong
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 1145 - 1156
  • [24] Speech Enhancement for a Noise-Robust Text-to-Speech Synthesis System using Deep Recurrent Neural Networks
    Valentini-Botinhao, Cassia
    Wang, Xin
    Takaki, Shinji
    Yamagishi, Junichi
    17TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2016), VOLS 1-5: UNDERSTANDING SPEECH PROCESSING IN HUMANS AND MACHINES, 2016, : 352 - 356
  • [25] Unsupervised Speech Enhancement Based on Multichannel NMF-Informed Beamforming for Noise-Robust Automatic Speech Recognition
    Shimada, Kazuki
    Bando, Yoshiaki
    Mimura, Masato
    Itoyama, Katsutoshi
    Yoshii, Kazuyoshi
    Kawahara, Tatsuya
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2019, 27 (05) : 960 - 971
  • [26] Adversarially Learning Disentangled Speech Representations for Robust Multi-factor Voice Conversion
    Wang, Jie
    Li, Jingbei
    Zhao, Xintao
    Wu, Zhiyong
    Kang, Shiyin
    Meng, Helen
    INTERSPEECH 2021, 2021, : 846 - 850
  • [27] A Noise-type and Level-dependent MPO-based Speech Enhancement Architecture with Variable Frame Analysis for Noise-robust Speech Recognition
    Mitra, Vikramjit
    Borgstrom, Bengt J.
    Espy-Wilson, Carol Y.
    Alwan, Abeer
    INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 2731 - +
  • [28] Dealing with Unreliable Annotations: A Noise-Robust Network for Semantic Segmentation through A Transformer-Improved Encoder and Convolution Decoder
    Wang, Ziyang
    Voiculescu, Irina
    APPLIED SCIENCES-BASEL, 2023, 13 (13):
  • [29] Analyzing the Influence of Diverse Background Noises on Voice Transmission: A Deep Learning Approach to Noise Suppression
    Nogales, Alberto
    Caracuel-Cayuela, Javier
    Garcia-Tejedor, Alvaro J.
    APPLIED SCIENCES-BASEL, 2024, 14 (02):
  • [30] Robust Speaker Recognition against Background Noise in an Enhanced Multi-Condition Domain
    Kim, Kichul
    Kim, Moo Young
    IEEE TRANSACTIONS ON CONSUMER ELECTRONICS, 2010, 56 (03) : 1684 - 1688