Unsupervised Speech Enhancement Using Dynamical Variational Autoencoders

Cited by: 25
Authors
Bie, Xiaoyu [1]
Leglaive, Simon [2]
Alameda-Pineda, Xavier [1]
Girin, Laurent [3]
Affiliations
[1] Univ Grenoble Alpes, Inria Grenoble Rhone Alpes, F-38000 Grenoble, France
[2] Cent Supelec, IETR UMR CNRS 6164, F-35576 Cesson Sevigne, France
[3] Univ Grenoble Alpes, GIPSA Lab, CNRS, Grenoble INP, F-38402 Grenoble, France
Funding
European Union Horizon 2020;
Keywords
Speech enhancement; Noise measurement; Training; Recording; Inference algorithms; Time-domain analysis; Time series analysis; dynamical variational autoencoders; nonnegative matrix factorization; variational inference; NONNEGATIVE MATRIX FACTORIZATION; SEMI-SUPERVISED SEPARATION; ALGORITHM; NOISE;
DOI
10.1109/TASLP.2022.3207349
Chinese Library Classification number
O42 [Acoustics];
Subject classification code
070206; 082403;
Abstract
Dynamical variational autoencoders (DVAEs) are a class of deep generative models with latent variables, dedicated to modeling time series of high-dimensional data. DVAEs can be considered as extensions of the variational autoencoder (VAE) that include temporal dependencies between successive observed and/or latent vectors. Previous work has shown the benefit of using DVAEs over the VAE for speech spectrogram modeling. Independently, the VAE has been successfully applied to speech enhancement in noise, in an unsupervised noise-agnostic set-up that requires neither noise samples nor noisy speech samples at training time, but only clean speech signals. In this paper, we extend these works to DVAE-based single-channel unsupervised speech enhancement, hence exploiting both unsupervised representation learning of speech signals and dynamics modeling. We propose an unsupervised speech enhancement algorithm that combines a DVAE speech prior pre-trained on clean speech signals with a noise model based on nonnegative matrix factorization, and we derive a variational expectation-maximization (VEM) algorithm to perform speech enhancement. The algorithm is presented with the most general DVAE formulation and is then applied with three specific DVAE models to illustrate the versatility of the framework. Experimental results show that the proposed DVAE-based approach outperforms its VAE-based counterpart, as well as several supervised and unsupervised noise-dependent baselines, especially when the noise type is unseen during training.
Pages: 2993-3007
Number of pages: 15
Related papers
79 in total
  • [1] Aksan E., 2018, PROC INT C LEARN REP
  • [2] Improving deep speech denoising by Noisy2Noisy signal mapping
    Alamdari, N.
    Azarang, A.
    Kehtarnavaz, N.
    [J]. APPLIED ACOUSTICS, 2021, 172
  • [3] [Anonymous], 2015, ICLR
  • [4] Adaptive Neural Speech Enhancement with a Denoising Variational Autoencoder
    Bando, Yoshiaki
    Sekiguchi, Kouhei
    Yoshii, Kazuyoshi
    [J]. INTERSPEECH 2020, 2020, : 2437 - 2441
  • [5] Bando Y, 2018, 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P716, DOI 10.1109/ICASSP.2018.8461530
  • [6] Bayer J, 2015, Arxiv, DOI arXiv:1411.7610
  • [7] Benesty J., 2006, Speech enhancement
  • [8] Benesty J, 2008, SPRINGER TOP SIGN PR, V1, P1
  • [9] Bengio S, 2015, ADV NEUR IN, V28
  • [10] A Benchmark of Dynamical Variational Autoencoders applied to Speech Spectrogram Modeling
    Bie, Xiaoyu
    Girin, Laurent
    Leglaive, Simon
    Hueber, Thomas
    Alameda-Pineda, Xavier
    [J]. INTERSPEECH 2021, 2021, : 46 - 50