VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking

Cited by: 169
Authors
Wang, Quan [1 ]
Muckenhirn, Hannah [2 ,3 ,4 ]
Wilson, Kevin [1 ]
Sridhar, Prashant [1 ]
Wu, Zelin [1 ]
Hershey, John R. [1 ]
Saurous, Rif A. [1 ]
Weiss, Ron J. [1 ]
Jia, Ye [1 ]
Moreno, Ignacio Lopez [1 ]
Affiliations
[1] Google Inc, Mountain View, CA 94043 USA
[2] Idiap Res Inst, Martigny, Switzerland
[3] Ecole Polytech Fed Lausanne, Lausanne, Switzerland
[4] Google, Mountain View, CA 94043 USA
Source
INTERSPEECH 2019, 2019
Keywords
Source separation; speaker recognition; spectrogram masking; speech recognition
DOI
10.21437/Interspeech.2019-1101
Abstract
In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) a speaker recognition network that produces speaker-discriminative embeddings; (2) a spectrogram masking network that takes both the noisy spectrogram and the speaker embedding as input and produces a mask. Our system significantly reduces the speech recognition word error rate (WER) on multi-speaker signals, with minimal WER degradation on single-speaker signals.
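The data flow described in the abstract can be sketched in a few lines of plain Python: each frame of the noisy spectrogram is concatenated with the speaker embedding, a network maps the result to a soft mask in [0, 1], and the mask is multiplied element-wise with the noisy spectrogram. The toy linear layer below (`weights`, `bias`, and the function name are illustrative assumptions) stands in for the paper's actual CNN+LSTM masking network; only the conditioning-and-masking structure matches the description.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def voicefilter_mask_sketch(noisy_spec, d_vector, weights, bias):
    """Structural sketch of speaker-conditioned spectrogram masking.

    noisy_spec: T x F magnitude spectrogram (list of F-length frames)
    d_vector:   speaker embedding of length D (from a separately
                trained speaker recognition network)
    weights:    (F + D) x F toy linear layer; in the paper this is a
                CNN+LSTM network (assumption for illustration)
    Returns the masked (enhanced) T x F spectrogram.
    """
    enhanced = []
    for frame in noisy_spec:
        x = frame + d_vector  # concatenate speaker embedding to each frame
        # one sigmoid unit per frequency bin -> soft mask in (0, 1)
        mask = [
            sigmoid(sum(xi * weights[i][f] for i, xi in enumerate(x)) + bias[f])
            for f in range(len(frame))
        ]
        # element-wise masking of the noisy spectrogram
        enhanced.append([m * s for m, s in zip(mask, frame)])
    return enhanced

# tiny example: 2 frames, 3 frequency bins, 2-dim speaker embedding
random.seed(0)
T, F, D = 2, 3, 2
noisy = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
dvec = [0.1, -0.2]
W = [[random.uniform(-0.1, 0.1) for _ in range(F)] for _ in range(F + D)]
b = [0.0] * F
out = voicefilter_mask_sketch(noisy, dvec, W, b)
```

Because the mask is a sigmoid output, every masked magnitude lies strictly between zero and the corresponding noisy magnitude; the masked spectrogram would then be combined with the noisy phase and inverted back to a waveform.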
Pages: 2728-2732
Page count: 5