A noise-robust voice conversion method with controllable background sounds

Cited by: 0
Authors
Chen, Lele [1 ]
Zhang, Xiongwei [1 ]
Li, Yihao [1 ]
Sun, Meng [1 ]
Chen, Weiwei [1 ]
Affiliations
[1] Army Engn Univ PLA, Coll Command & Control Engn, Nanjing 210007, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Noise-robust voice conversion; Dual-decoder structure; Bridge module; Cycle loss; Speech disentanglement; SPEECH ENHANCEMENT; FRAMEWORK;
DOI
10.1007/s40747-024-01375-6
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Background noises are usually treated as redundant or even harmful to voice conversion. Therefore, when converting noisy speech, a pretrained speech separation module is usually deployed to estimate clean speech prior to the conversion. However, this can lead to speech distortion due to the mismatch between the separation module and the conversion one. In this paper, a noise-robust voice conversion model is proposed, in which a user can freely choose to retain or to remove the background sounds. Firstly, a speech separation module with a dual-decoder structure is proposed, where two decoders decode the denoised speech and the background sounds, respectively. A bridge module is used to capture the interactions between the denoised speech and the background sounds by exchanging information between parallel layers. Subsequently, a voice conversion module with multiple encoders is used to convert the estimated clean speech produced by the speech separation module. Finally, the speech separation and voice conversion modules are jointly trained using a loss function combining cycle loss and mutual information loss, aiming to improve the decoupling efficacy among speech content, pitch, and speaker identity. Experimental results show that the proposed model obtains significant improvements in both subjective and objective evaluation metrics compared with the existing baselines. The speech naturalness and speaker similarity of the converted speech reach 3.47 and 3.43, respectively.
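As a rough illustration of the dual-decoder separator with a bridge module described in the abstract, the sketch below shows one way such a structure could be wired up in PyTorch. This is not the authors' implementation: the layer types (GRUs), feature dimensions, number of parallel layers, and the simple additive projection used for the information exchange are all illustrative assumptions.

```python
# Minimal sketch of a dual-decoder separator with per-layer bridge modules.
# All module choices, sizes, and the additive exchange rule are assumptions,
# not details taken from the paper.
import torch
import torch.nn as nn


class BridgeBlock(nn.Module):
    """Exchange information between the speech and background branches at one layer."""

    def __init__(self, dim):
        super().__init__()
        self.to_speech = nn.Linear(dim, dim)      # background branch -> speech branch
        self.to_background = nn.Linear(dim, dim)  # speech branch -> background branch

    def forward(self, h_speech, h_background):
        # Each branch receives a projected copy of the other branch's hidden state.
        return (h_speech + self.to_speech(h_background),
                h_background + self.to_background(h_speech))


class DualDecoderSeparator(nn.Module):
    """Shared encoder, two parallel decoders (speech / background), bridged per layer."""

    def __init__(self, feat_dim=80, hidden=256, num_layers=4):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.speech_layers = nn.ModuleList(
            [nn.GRU(hidden, hidden, batch_first=True) for _ in range(num_layers)])
        self.background_layers = nn.ModuleList(
            [nn.GRU(hidden, hidden, batch_first=True) for _ in range(num_layers)])
        self.bridges = nn.ModuleList(
            [BridgeBlock(hidden) for _ in range(num_layers)])
        self.speech_out = nn.Linear(hidden, feat_dim)
        self.background_out = nn.Linear(hidden, feat_dim)

    def forward(self, noisy_feats):               # (batch, time, feat_dim)
        shared, _ = self.encoder(noisy_feats)
        h_s, h_b = shared, shared
        for s_layer, b_layer, bridge in zip(
                self.speech_layers, self.background_layers, self.bridges):
            h_s, _ = s_layer(h_s)
            h_b, _ = b_layer(h_b)
            h_s, h_b = bridge(h_s, h_b)           # parallel-layer information exchange
        return self.speech_out(h_s), self.background_out(h_b)


if __name__ == "__main__":
    model = DualDecoderSeparator()
    noisy = torch.randn(2, 100, 80)               # dummy log-mel batch
    clean_est, background_est = model(noisy)
    print(clean_est.shape, background_est.shape)  # both (2, 100, 80)
```

Because the two decoders emit the denoised speech and the background sounds separately, the background estimate can either be discarded or mixed back into the converted speech, which is what allows the user-controllable background described in the abstract.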
Pages: 3981-3994
Page count: 14