A noise-robust voice conversion method with controllable background sounds

Cited by: 0
Authors
Chen, Lele [1]
Zhang, Xiongwei [1]
Li, Yihao [1]
Sun, Meng [1]
Chen, Weiwei [1]
Affiliation
[1] Army Engn Univ PLA, Coll Command & Control Engn, Nanjing 210007, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Noise-robust voice conversion; Dual-decoder structure; Bridge module; Cycle loss; Speech disentanglement; SPEECH ENHANCEMENT; FRAMEWORK;
DOI
10.1007/s40747-024-01375-6
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Background noises are usually treated as redundant or even harmful to voice conversion. Therefore, when converting noisy speech, a pretrained speech separation module is usually deployed to estimate clean speech prior to the conversion. However, this can lead to speech distortion due to the mismatch between the separation module and the conversion module. In this paper, a noise-robust voice conversion model is proposed in which a user can freely choose to retain or remove the background sounds. First, a speech separation module with a dual-decoder structure is proposed, in which two decoders decode the denoised speech and the background sounds, respectively. A bridge module captures the interactions between the denoised speech and the background sounds in parallel layers through information exchange. Subsequently, a voice conversion module with multiple encoders converts the estimated clean speech from the speech separation module. Finally, the speech separation and voice conversion modules are jointly trained using a loss function combining cycle loss and mutual information loss, aiming to improve the decoupling efficacy among speech content, pitch, and speaker identity. Experimental results show that the proposed model obtains significant improvements in both subjective and objective evaluation metrics compared with the existing baselines. The speech naturalness and speaker similarity of the converted speech are 3.47 and 3.43, respectively.
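The dual-decoder idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the layer type (`dense`), the additive `bridge` with weight `alpha`, the random weights, and the `remix` helper are all hypothetical stand-ins, chosen only to show two parallel decoders that exchange information between layers and an output stage where the background estimate can be kept or dropped.

```python
import numpy as np

def dense(x, w):
    """Hypothetical decoder layer: a linear map followed by tanh."""
    return np.tanh(x @ w)

def bridge(h_speech, h_noise, alpha=0.5):
    """Hypothetical bridge: each branch receives a weighted copy of the other."""
    return h_speech + alpha * h_noise, h_noise + alpha * h_speech

def dual_decoder(latent, speech_weights, noise_weights, alpha=0.5):
    """Run two parallel decoders on a shared latent representation.

    Information is exchanged through the bridge after every layer
    except the last, so the two outputs stay separate estimates.
    """
    h_s, h_n = latent, latent
    last = len(speech_weights) - 1
    for i, (w_s, w_n) in enumerate(zip(speech_weights, noise_weights)):
        h_s, h_n = dense(h_s, w_s), dense(h_n, w_n)
        if i < last:
            h_s, h_n = bridge(h_s, h_n, alpha)
    return h_s, h_n

def remix(speech, background, keep_background):
    """Assumed output stage: add the background estimate back only on request."""
    return speech + background if keep_background else speech

# Toy data: 4 frames of an 8-dimensional latent, 3 layers per decoder.
rng = np.random.default_rng(0)
frames, dim, layers = 4, 8, 3
latent = rng.standard_normal((frames, dim))
speech_weights = [0.1 * rng.standard_normal((dim, dim)) for _ in range(layers)]
noise_weights = [0.1 * rng.standard_normal((dim, dim)) for _ in range(layers)]

speech_est, background_est = dual_decoder(latent, speech_weights, noise_weights)
with_background = remix(speech_est, background_est, keep_background=True)
without_background = remix(speech_est, background_est, keep_background=False)
```

Setting `alpha=0.0` in this sketch disables the exchange, so the speech branch no longer depends on the background branch; in the described model the speech estimate would additionally pass through the voice conversion module before any remixing.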
Pages: 3981-3994
Number of pages: 14