SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping

Cited by: 5
Authors
Koizumi, Yuma [1 ]
Zen, Heiga [1 ]
Yatabe, Kohei [2 ]
Chen, Nanxin [1 ]
Bacchiani, Michiel [1 ]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
[2] Tokyo Univ Agr & Technol, Fuchu, Tokyo, Japan
Source
INTERSPEECH 2022 | 2022
Keywords
Denoising diffusion probabilistic model; neural vocoder; spectral envelope; speech enhancement
DOI
10.21437/Interspeech.2022-301
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
Neural vocoder using denoising diffusion probabilistic model (DDPM) has been improved by adaptation of the diffusion noise distribution to given acoustic features. In this study, we propose SpecGrad that adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram. This adaptation by time-varying filtering improves the sound quality especially in the high-frequency bands. It is processed in the time-frequency domain to keep the computational cost almost the same as the conventional DDPM-based neural vocoders. Experimental results showed that SpecGrad generates higher-fidelity speech waveforms than conventional DDPM-based neural vocoders in both analysis-synthesis and speech enhancement scenarios. Audio demos are available at wavegrad.github.io/specgrad/.
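The core idea described in the abstract, shaping the diffusion noise with a time-varying filter applied in the time-frequency domain, can be sketched in a few lines. The following Python snippet is only an illustrative approximation, not the paper's exact filter design: SpecGrad derives its filter from the spectral envelope of the conditioning features, whereas this sketch simply projects the log-mel spectrogram back to the linear frequency axis with a mel pseudo-inverse. All function names and parameter values here are hypothetical.

import numpy as np
import librosa

def shape_noise(log_mel, sr=24000, n_fft=1024, hop=256):
    """Shape white Gaussian noise so its time-varying spectral envelope
    roughly follows a conditioning (natural-log) mel spectrogram of
    shape (n_mels, n_frames). Illustrative sketch only."""
    n_mels, n_frames = log_mel.shape

    # Approximate a linear-frequency envelope from the log-mel input
    # via a pseudo-inverse of the mel filterbank (a crude stand-in for
    # the paper's spectral-envelope estimate).
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    env = np.maximum(np.linalg.pinv(mel_fb) @ np.exp(log_mel), 1e-5)

    # Draw standard Gaussian noise long enough to cover all frames.
    noise = np.random.randn((n_frames - 1) * hop + n_fft)

    # Time-varying filtering in the time-frequency domain:
    # STFT -> multiply by the envelope frame by frame -> inverse STFT.
    spec = librosa.stft(noise, n_fft=n_fft, hop_length=hop)[:, :n_frames]
    return librosa.istft(spec * env, hop_length=hop)

Because the filtering is a per-bin multiplication on the STFT of the noise, its cost is dominated by two FFT passes per frame, which matches the abstract's point that the adaptation keeps the computational cost close to that of conventional DDPM-based vocoders.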
Pages: 803-807
Page count: 5