SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping

Cited by: 5
Authors
Koizumi, Yuma [1 ]
Zen, Heiga [1 ]
Yatabe, Kohei [2 ]
Chen, Nanxin [1 ]
Bacchiani, Michiel [1 ]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
[2] Tokyo Univ Agr & Technol, Fuchu, Tokyo, Japan
Source
INTERSPEECH 2022 | 2022
Keywords
Denoising diffusion probabilistic model; neural vocoder; spectral envelope; speech enhancement
DOI
10.21437/Interspeech.2022-301
Chinese Library Classification
O42 [Acoustics]
Discipline Classification Codes
070206; 082403
Abstract
Neural vocoder using denoising diffusion probabilistic model (DDPM) has been improved by adaptation of the diffusion noise distribution to given acoustic features. In this study, we propose SpecGrad that adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram. This adaptation by time-varying filtering improves the sound quality especially in the high-frequency bands. It is processed in the time-frequency domain to keep the computational cost almost the same as the conventional DDPM-based neural vocoders. Experimental results showed that SpecGrad generates higher-fidelity speech waveforms than conventional DDPM-based neural vocoders in both analysis-synthesis and speech enhancement scenarios. Audio demos are available at wavegrad.github.io/specgrad/.
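The core idea described in the abstract, shaping the diffusion noise with a time-varying filter applied in the time-frequency domain, can be sketched in a few lines. The following Python snippet is only an illustrative approximation, not the paper's exact filter design: SpecGrad derives its filter from the spectral envelope of the conditioning features, whereas this sketch simply projects the log-mel spectrogram back to the linear frequency axis with a mel pseudo-inverse. All function names and parameter values here are hypothetical.

import numpy as np
import librosa

def shape_noise(log_mel, sr=24000, n_fft=1024, hop=256):
    """Shape white Gaussian noise so its time-varying spectral envelope
    roughly follows a conditioning (natural-log) mel spectrogram of
    shape (n_mels, n_frames). Illustrative sketch only."""
    n_mels, n_frames = log_mel.shape

    # Approximate a linear-frequency envelope from the log-mel input
    # via a pseudo-inverse of the mel filterbank (a crude stand-in for
    # the paper's spectral-envelope estimate).
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    env = np.maximum(np.linalg.pinv(mel_fb) @ np.exp(log_mel), 1e-5)

    # Draw standard Gaussian noise long enough to cover all frames.
    noise = np.random.randn((n_frames - 1) * hop + n_fft)

    # Time-varying filtering in the time-frequency domain:
    # STFT -> multiply by the envelope frame by frame -> inverse STFT.
    spec = librosa.stft(noise, n_fft=n_fft, hop_length=hop)[:, :n_frames]
    return librosa.istft(spec * env, hop_length=hop)

Because the filtering is a per-bin multiplication on the STFT of the noise, its cost is dominated by two FFT passes per frame, which matches the abstract's point that the adaptation keeps the computational cost close to that of conventional DDPM-based vocoders.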
Pages: 803-807
Page count: 5