DENOISING-AND-DEREVERBERATION HIERARCHICAL NEURAL VOCODER FOR ROBUST WAVEFORM GENERATION

Cited: 4
Authors
Ai, Yang [1 ]
Li, Haoyu [2 ]
Wang, Xin [2 ]
Yamagishi, Junichi [2 ]
Ling, Zhenhua [1 ]
Affiliations
[1] University of Science and Technology of China, Hefei, China
[2] National Institute of Informatics, Tokyo, Japan
Source
2021 IEEE Spoken Language Technology Workshop (SLT) | 2021
Funding
National Natural Science Foundation of China
Keywords
neural vocoder; denoising; dereverberation; speech enhancement
DOI
10.1109/SLT48900.2021.9383611
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
This paper presents a denoising and dereverberation hierarchical neural vocoder (DNR-HiNet) that converts noisy and reverberant acoustic features into a clean speech waveform. We implement it mainly by modifying the amplitude spectrum predictor (ASP) of the original HiNet vocoder. The modified denoising and dereverberation ASP (DNR-ASP) predicts clean log amplitude spectra (LAS) from degraded input acoustic features. To achieve this, the DNR-ASP first predicts the noisy and reverberant LAS, the noise LAS (capturing the noise information), and the room impulse response (capturing the reverberation information), and then performs initial denoising and dereverberation. The initially processed LAS are then refined by another neural network into the final clean LAS. To further improve the quality of the generated clean LAS, we also introduce a bandwidth extension model and a frequency resolution extension model into the DNR-ASP. The experimental results indicate that the DNR-HiNet vocoder was able to generate a denoised and dereverberated waveform given noisy and reverberant acoustic features, outperforming the original HiNet vocoder and several other neural vocoders. We also applied the DNR-HiNet vocoder to speech enhancement tasks, where its performance was competitive with several advanced speech enhancement methods.
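The abstract describes an initial denoising and dereverberation step that combines a predicted noisy-reverberant LAS, a noise LAS, and a room impulse response (RIR). The sketch below is a minimal illustration of how such a step could look, assuming power-domain spectral subtraction for the noise and log-domain subtraction of the RIR's amplitude spectrum for the reverberation (time-domain convolution with the RIR multiplies amplitude spectra, hence adds their logarithms); it is not the authors' implementation, and all names are hypothetical:

```python
import numpy as np

def initial_denoise_dereverb(las_noisy_rev, las_noise, rir, n_fft=1024, floor=1e-5):
    """Hypothetical sketch of the initial denoising/dereverberation step.

    las_noisy_rev, las_noise: (frames, bins) predicted log amplitude spectra.
    rir: estimated room impulse response, 1-D time-domain array.
    """
    # Denoising: subtract the estimated noise power from the degraded power
    # (spectral subtraction), flooring the result to keep the log defined.
    power = np.exp(2.0 * las_noisy_rev) - np.exp(2.0 * las_noise)
    las_denoised = 0.5 * np.log(np.maximum(power, floor))

    # Dereverberation: convolution with the RIR in time multiplies amplitude
    # spectra in frequency, i.e. adds their logs, so subtract the RIR's LAS.
    las_rir = np.log(np.maximum(np.abs(np.fft.rfft(rir, n=n_fft)), floor))
    return las_denoised - las_rir[np.newaxis, :las_denoised.shape[1]]
```

Per the abstract, the output of this initial step is then refined by a further enhancement network, together with bandwidth extension and frequency resolution extension models, before the final clean LAS is used for waveform generation.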
Pages: 477-484
Number of pages: 8