Listen only to me! How well can target speech extraction handle false alarms?

Cited by: 9
Authors
Delcroix, Marc [1]
Kinoshita, Keisuke [1]
Ochiai, Tsubasa [1]
Zmolikova, Katerina [2]
Sato, Hiroshi [1]
Nakatani, Tomohiro [1]
Affiliations
[1] NTT Corp, Tokyo, Japan
[2] Brno Univ Technol, Speech FIT, Brno, Czech Republic
Source
INTERSPEECH 2022 | 2022
Keywords
Speech enhancement; Target speech extraction; Inactive speaker; Speaker extraction; Network
DOI
10.21437/Interspeech.2022-11252
Chinese Library Classification
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Target speech extraction (TSE) extracts the speech of a target speaker from a mixture given auxiliary clues characterizing the speaker, such as an enrollment utterance. TSE thus addresses the challenging problem of simultaneously performing separation and speaker identification. Extraction performance has progressed considerably with the recent development of neural networks for speech enhancement and separation. Most studies have focused on processing mixtures where the target speaker is actively speaking. In practice, however, the target speaker is sometimes silent, i.e., an inactive speaker (IS). A typical TSE system will tend to output a signal in IS cases, causing false alarms. This is a severe problem for the practical deployment of TSE systems. This paper aims at better understanding how well TSE systems can handle IS cases. We consider two approaches to deal with IS: (1) training a system to directly output a zero signal, or (2) detecting IS with an extra speaker verification module. We perform an extensive experimental comparison of these schemes in terms of extraction performance and IS detection using the LibriMix dataset and reveal their pros and cons.
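To make the first strategy concrete, below is a minimal PyTorch sketch of a multi-task training loss of the kind the abstract describes: a standard negative SI-SDR loss when the target speaker is active, and an energy-suppression term that drives the extractor output toward a zero signal when the target is inactive. This is an illustrative assumption about the formulation, not the paper's exact loss; all function and variable names are hypothetical.

    import torch

    def neg_si_sdr(est, ref, eps=1e-8):
        # Negative scale-invariant SDR, averaged over the batch (active-speaker case).
        est = est - est.mean(dim=-1, keepdim=True)
        ref = ref - ref.mean(dim=-1, keepdim=True)
        proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
        noise = est - proj
        return -(10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)).mean()

    def inactive_loss(est, mix, eps=1e-8):
        # Output-to-mixture energy ratio in dB; minimizing it pushes the output toward zero.
        return (10 * torch.log10(est.pow(2).sum(-1) + eps)
                - 10 * torch.log10(mix.pow(2).sum(-1) + eps)).mean()

    def tse_loss(est, ref, mix, active):
        # est, ref, mix: (batch, time) waveforms; active: boolean flag per mixture.
        loss = est.new_zeros(())
        if active.any():
            loss = loss + neg_si_sdr(est[active], ref[active])
        if (~active).any():
            loss = loss + inactive_loss(est[~active], mix[~active])
        return loss

The second strategy would instead leave the extractor unchanged and gate its output with a speaker verification score, e.g., thresholding the similarity between embeddings of the extracted signal and the enrollment utterance.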
Pages: 216-220 (5 pages)