Unsupervised Speech Enhancement Using Dynamical Variational Autoencoders

Cited by: 25

Authors:

Bie, Xiaoyu [1]

Leglaive, Simon [2]

Alameda-Pineda, Xavier [1]

Girin, Laurent [3]

Affiliations:

[1] Univ Grenoble Alpes, Inria Grenoble Rhone Alpes, F-38000 Grenoble, France

[2] Cent Supelec, IETR UMR CNRS 6164, F-35576 Cesson Sevigne, France

[3] Univ Grenoble Alpes, GIPSA Lab, CNRS, Grenoble INP, F-38402 Grenoble, France

Source:

IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING | 2022 / Vol. 30

Funding:

European Union Horizon 2020;

Keywords:

Speech enhancement; Noise measurement; Training; Recording; Inference algorithms; Time-domain analysis; Time series analysis; dynamical variational autoencoders; nonnegative matrix factorization; variational inference; NONNEGATIVE MATRIX FACTORIZATION; SEMI-SUPERVISED SEPARATION; ALGORITHM; NOISE;

DOI:

10.1109/TASLP.2022.3207349

Chinese Library Classification (CLC) number:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

Dynamical variational autoencoders (DVAEs) are a class of deep generative models with latent variables, dedicated to modeling time series of high-dimensional data. DVAEs can be considered as extensions of the variational autoencoder (VAE) that include temporal dependencies between successive observed and/or latent vectors. Previous work has shown the benefit of using DVAEs over the VAE for speech spectrogram modeling. Independently, the VAE has been successfully applied to speech enhancement in noise, in an unsupervised, noise-agnostic setup that requires neither noise samples nor noisy speech samples at training time, only clean speech signals. In this paper, we extend these works to DVAE-based single-channel unsupervised speech enhancement, thereby exploiting both unsupervised representation learning of speech signals and the modeling of their dynamics. We propose an unsupervised speech enhancement algorithm that combines a DVAE speech prior pre-trained on clean speech signals with a noise model based on nonnegative matrix factorization, and we derive a variational expectation-maximization (VEM) algorithm to perform speech enhancement. The algorithm is presented with the most general DVAE formulation and is then applied to three specific DVAE models to illustrate the versatility of the framework. Experimental results show that the proposed DVAE-based approach outperforms its VAE-based counterpart, as well as several supervised and unsupervised noise-dependent baselines, especially when the noise type is unseen during training.
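The abstract pairs a pre-trained DVAE speech prior with an NMF noise model inside a VEM loop. The NumPy sketch below illustrates only the two noise-side ingredients, under stated assumptions: a local Gaussian model in the STFT domain (x = s + n, with s and n independent zero-mean complex Gaussians), a noise variance given by the NMF product W @ H, and a speech variance that in the paper would come from the DVAE but here is just a fixed placeholder array. The function names `update_noise_nmf` and `wiener_estimate` are hypothetical, and the Itakura-Saito multiplicative updates are a generic stand-in, not the paper's exact E-step or M-step.

```python
import numpy as np


def update_noise_nmf(P, speech_var, W, H, n_iter=50, eps=1e-12):
    """Itakura-Saito NMF multiplicative updates for the noise factors.

    Fits the noisy power spectrogram P = |X|^2 (F x T) with the variance
    V = speech_var + W @ H, holding speech_var fixed. W (F x K) and H (K x T)
    are the nonnegative noise NMF factors. Assumption: a generic IS-NMF
    update used for illustration, not the paper's exact M-step.
    """
    W = W.copy()
    H = H.copy()
    for _ in range(n_iter):
        V = speech_var + W @ H + eps
        H *= (W.T @ (P * V ** -2)) / (W.T @ (V ** -1) + eps)
        V = speech_var + W @ H + eps
        W *= ((P * V ** -2) @ H.T) / ((V ** -1) @ H.T + eps)
    return W, H


def wiener_estimate(X, speech_var, noise_var):
    """Posterior mean E[s | x] for x = s + n, where s and n are independent
    zero-mean complex Gaussians with variances speech_var and noise_var."""
    return speech_var / (speech_var + noise_var) * X


# Toy usage on random data. speech_var is a hypothetical placeholder for the
# DVAE-derived speech variance, which this sketch does not compute.
rng = np.random.default_rng(0)
F, T, K = 4, 6, 2
X = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))
speech_var = np.full((F, T), 0.5)
W, H = update_noise_nmf(np.abs(X) ** 2, speech_var,
                        rng.random((F, K)) + 0.1, rng.random((K, T)) + 0.1)
S_hat = wiener_estimate(X, speech_var, W @ H)
```

Because both variances are nonnegative, the Wiener gain lies in [0, 1], so the estimate only attenuates each time-frequency bin of the noisy STFT. In a full VEM scheme the speech variance would be re-estimated from the DVAE posterior at each iteration rather than held constant.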


Pages: 2993–3007

Page count: 15

Number of references: 79
