Listen only to me! How well can target speech extraction handle false alarms?

Cited by: 9

Authors:

Delcroix, Marc [1]

Kinoshita, Keisuke [1]

Ochiai, Tsubasa [1]

Zmolikova, Katerina [2]

Sato, Hiroshi [1]

Nakatani, Tomohiro [1]

Affiliations:

[1] NTT Corp, Tokyo, Japan

[2] Brno Univ Technol, Speech FIT, Brno, Czech Republic

Source:

INTERSPEECH 2022 | 2022

Keywords:

Speech enhancement; Target speech extraction; Inactive speaker; Speaker extraction; Network

DOI:

10.21437/Interspeech.2022-11252

Chinese Library Classification:

O42 [Acoustics]

Discipline classification codes:

070206; 082403

Abstract:

Target speech extraction (TSE) extracts the speech of a target speaker from a mixture, given auxiliary clues that characterize the speaker, such as an enrollment utterance. TSE thus addresses the challenging problem of performing separation and speaker identification simultaneously. Extraction performance has progressed considerably following the recent development of neural networks for speech enhancement and separation. Most studies have focused on mixtures in which the target speaker is actively speaking. In practice, however, the target speaker is sometimes silent, a case referred to as inactive speaker (IS). A typical TSE system tends to output a signal in IS cases, causing false alarms. This is a severe problem for the practical deployment of TSE systems. This paper aims to better understand how well TSE systems can handle IS cases. We consider two approaches to dealing with IS: (1) training a system to directly output zero signals, or (2) detecting IS with an extra speaker verification module. We perform an extensive experimental comparison of these schemes in terms of extraction performance and IS detection using the LibriMix dataset, and reveal their pros and cons.
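The two approaches named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the zero output is detected or how the verification score is computed, so the energy threshold, the cosine-similarity score, the threshold values, and all function names below are assumptions chosen for illustration. Speaker embeddings are assumed to come from some external speaker verification model that is not shown.

```python
import math


def is_silent(signal, energy_threshold_db=-60.0):
    """Approach (1), illustrative: if a system was trained to output zeros
    for an inactive speaker, IS can be flagged from the output energy.
    The -60 dB threshold is an arbitrary assumption."""
    if not signal:
        return True
    power = sum(x * x for x in signal) / len(signal)
    if power == 0.0:
        return True
    return 10.0 * math.log10(power) < energy_threshold_db


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gate_extracted_signal(extracted, enroll_emb, extracted_emb, threshold=0.5):
    """Approach (2), illustrative: compare a speaker embedding of the
    extracted signal with the enrollment embedding and mute the output
    when the score falls below a (hypothetical) threshold.

    Returns (signal, is_inactive)."""
    score = cosine_similarity(enroll_emb, extracted_emb)
    if score < threshold:
        return [0.0] * len(extracted), True
    return list(extracted), False
```

In this sketch a low verification score is treated as an inactive target speaker and the output is replaced by zeros, which suppresses the false alarm at the cost of muting the target if the score is wrong.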

Citation

Pages: 216-220

Page count: 5

