Denoising-and-Dereverberation Hierarchical Neural Vocoder for Statistical Parametric Speech Synthesis

Cited by: 4
Authors
Ai, Yang [1]
Ling, Zhen-Hua [1]
Wu, Wei-Lu [2]
Li, Ang [2]
Affiliations
[1] Univ Sci & Technol China, Natl Engn Res Ctr Speech & Language Informat Proc, Hefei 230027, Peoples R China
[2] Natl Univ Def Technol, Hefei 230037, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Vocoders; Speech enhancement; Noise reduction; Task analysis; Noise measurement; Speech synthesis; Hidden Markov models; Neural vocoder; denoising; dereverberation; speech enhancement; statistical parametric speech synthesis; ENHANCEMENT;
DOI
10.1109/TASLP.2022.3182268
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification codes
070206 ; 082403 ;
Abstract
This paper presents a denoising and dereverberation hierarchical neural vocoder (DNR-HiNet) that converts noisy and reverberant acoustic features into clean speech waveforms. The DNR-HiNet vocoder is built by modifying the amplitude spectrum predictor (ASP) in the original HiNet vocoder. The modified denoising and dereverberation ASP (DNR-ASP) predicts clean log amplitude spectra from degraded input acoustic features. To achieve this, the DNR-ASP first predicts three sets of log amplitude spectra: those of the noisy and reverberant speech, of the additive noise, and of the room impulse response (RIR). It then performs initial denoising and dereverberation using signal processing algorithms. The initially processed log amplitude spectra are then enhanced by another neural network to obtain the final clean log amplitude spectra. We also introduce a bandwidth extension model and a frequency resolution extension model into the DNR-ASP to further improve its performance. Finally, a statistical parametric speech synthesis (SPSS) method with DNR-HiNet is proposed to handle cases where the target speaker's recordings are degraded by noise and reverberation. Experimental results indicate that the DNR-HiNet vocoder was able to generate denoised and dereverberated waveforms given noisy and reverberant acoustic features and outperformed the original HiNet vocoder and a few other neural vocoders. On speech enhancement tasks, its performance was competitive with several advanced speech enhancement methods. Furthermore, the SPSS method with DNR-HiNet achieved better synthetic speech quality than the conventional approach, which directly applied speech enhancement to the degraded adaptation data.
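The abstract does not say which signal processing algorithms perform the initial denoising and dereverberation. As a rough illustration only, the sketch below assumes power spectral subtraction for the noise and a per-bin division by the RIR amplitude (a subtraction in the log domain) for the reverberation, which is a reasonable approximation only when the RIR is short relative to the analysis window. The function name, the `floor` parameter, and the array shapes are hypothetical and are not taken from the paper.

```python
import numpy as np


def initial_denoise_dereverb(log_amp_degraded, log_amp_noise, log_amp_rir, floor=1e-5):
    """Illustrative initial processing of predicted log amplitude spectra.

    Assumptions (not from the paper): natural-log amplitudes, arrays of
    shape (frames, bins) for speech and noise, shape (bins,) for the RIR.
    """
    amp_degraded = np.exp(log_amp_degraded)
    amp_noise = np.exp(log_amp_noise)

    # Assumed denoising step: power spectral subtraction with a small floor
    # so the logarithm below stays finite.
    power = np.maximum(amp_degraded ** 2 - amp_noise ** 2, floor ** 2)
    log_amp_denoised = 0.5 * np.log(power)

    # Assumed dereverberation step: convolution with the RIR is treated as a
    # per-bin multiplication, so it is undone by subtraction in the log domain.
    return log_amp_denoised - log_amp_rir


# Synthetic check: build degraded spectra from known clean spectra.
rng = np.random.default_rng(0)
clean = rng.uniform(0.5, 2.0, size=(4, 8))
rir = rng.uniform(0.5, 1.5, size=8)
noise = rng.uniform(0.01, 0.1, size=(4, 8))
degraded = np.sqrt((clean * rir) ** 2 + noise ** 2)

estimate = initial_denoise_dereverb(np.log(degraded), np.log(noise), np.log(rir))
```

In the method described above, the output of such a step would be only an intermediate result: a further neural network refines the initially processed log amplitude spectra into the final clean ones.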
Pages: 2036-2048
Page count: 13