VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking

Cited by: 169
Authors
Wang, Quan [1 ]
Muckenhirn, Hannah [2 ,3 ,4 ]
Wilson, Kevin [1 ]
Sridhar, Prashant [1 ]
Wu, Zelin [1 ]
Hershey, John R. [1 ]
Saurous, Rif A. [1 ]
Weiss, Ron J. [1 ]
Jia, Ye [1 ]
Moreno, Ignacio Lopez [1 ]
Affiliations
[1] Google Inc, Mountain View, CA 94043 USA
[2] Idiap Res Inst, Martigny, Switzerland
[3] Ecole Polytech Fed Lausanne, Lausanne, Switzerland
[4] Google, Mountain View, CA 94043 USA
Source
INTERSPEECH 2019, 2019
Keywords
Source separation; speaker recognition; spectrogram masking; speech recognition
DOI
10.21437/Interspeech.2019-1101
Abstract
In this paper, we present a novel system that separates the voice of a target speaker from multi-speaker signals by making use of a reference signal from the target speaker. We achieve this by training two separate neural networks: (1) a speaker recognition network that produces speaker-discriminative embeddings; (2) a spectrogram masking network that takes both the noisy spectrogram and the speaker embedding as input and produces a mask. Our system significantly reduces the speech recognition word error rate (WER) on multi-speaker signals, with minimal WER degradation on single-speaker signals.
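The data flow described in the abstract can be sketched in a few lines of plain Python: each frame of the noisy spectrogram is concatenated with the speaker embedding, a network maps the result to a soft mask in [0, 1], and the mask is multiplied element-wise with the noisy spectrogram. The toy linear layer below (`weights`, `bias`, and the function name are illustrative assumptions) stands in for the paper's actual CNN+LSTM masking network; only the conditioning-and-masking structure matches the description.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def voicefilter_mask_sketch(noisy_spec, d_vector, weights, bias):
    """Structural sketch of speaker-conditioned spectrogram masking.

    noisy_spec: T x F magnitude spectrogram (list of F-length frames)
    d_vector:   speaker embedding of length D (from a separately
                trained speaker recognition network)
    weights:    (F + D) x F toy linear layer; in the paper this is a
                CNN+LSTM network (assumption for illustration)
    Returns the masked (enhanced) T x F spectrogram.
    """
    enhanced = []
    for frame in noisy_spec:
        x = frame + d_vector  # concatenate speaker embedding to each frame
        # one sigmoid unit per frequency bin -> soft mask in (0, 1)
        mask = [
            sigmoid(sum(xi * weights[i][f] for i, xi in enumerate(x)) + bias[f])
            for f in range(len(frame))
        ]
        # element-wise masking of the noisy spectrogram
        enhanced.append([m * s for m, s in zip(mask, frame)])
    return enhanced

# tiny example: 2 frames, 3 frequency bins, 2-dim speaker embedding
random.seed(0)
T, F, D = 2, 3, 2
noisy = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
dvec = [0.1, -0.2]
W = [[random.uniform(-0.1, 0.1) for _ in range(F)] for _ in range(F + D)]
b = [0.0] * F
out = voicefilter_mask_sketch(noisy, dvec, W, b)
```

Because the mask is a sigmoid output, every masked magnitude lies strictly between zero and the corresponding noisy magnitude; the masked spectrogram would then be combined with the noisy phase and inverted back to a waveform.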
Pages: 2728-2732
Page count: 5