DISENTANGLEMENT LEARNING FOR VARIATIONAL AUTOENCODERS APPLIED TO AUDIO-VISUAL SPEECH ENHANCEMENT

Cited by: 6
Authors
Carbajal, Guillaume [1]
Richter, Julius [1]
Gerkmann, Timo [1]
Affiliations
[1] Univ Hamburg, Signal Proc SP, Hamburg, Germany
Source
2021 IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS (WASPAA) | 2021
Keywords
Speech enhancement; conditional generative model; variational autoencoder; disentanglement learning; adversarial training; semi-supervised learning; audio-visual
DOI
10.1109/WASPAA52581.2021.9632676
Chinese Library Classification
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Recently, the standard variational autoencoder has been successfully used to learn a probabilistic prior over speech signals, which is then used to perform speech enhancement. Variational autoencoders have since been conditioned on a label describing a high-level speech attribute (e.g., speech activity), which allows more explicit control of speech generation. However, the label is not guaranteed to be disentangled from the other latent variables, which results in limited performance improvements compared to the standard variational autoencoder. In this work, we propose an adversarial training scheme for variational autoencoders that disentangles the label from the other latent variables. During training, we use a discriminator that competes with the encoder of the variational autoencoder. Simultaneously, we use an additional encoder that estimates the label for the decoder of the variational autoencoder, which proves crucial for learning disentanglement. We show the benefit of the proposed disentanglement learning when a voice activity label, estimated from visual data, is used for speech enhancement.
Pages: 126-130
Page count: 5