Listen only to me! How well can target speech extraction handle false alarms?

Cited by: 9
Authors
Delcroix, Marc [1]
Kinoshita, Keisuke [1]
Ochiai, Tsubasa [1]
Zmolikova, Katerina [2]
Sato, Hiroshi [1]
Nakatani, Tomohiro [1]
Affiliations
[1] NTT Corp, Tokyo, Japan
[2] Brno Univ Technol, Speech FIT, Brno, Czech Republic
Source
INTERSPEECH 2022 | 2022
Keywords
Speech enhancement; Target speech extraction; Inactive speaker; Speaker extraction; Network
DOI
10.21437/Interspeech.2022-11252
Chinese Library Classification
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Target speech extraction (TSE) extracts the speech of a target speaker from a mixture given auxiliary clues characterizing the speaker, such as an enrollment utterance. TSE thus addresses the challenging problem of simultaneously performing separation and speaker identification. Extraction performance has progressed considerably with the recent development of neural networks for speech enhancement and separation. Most studies have focused on processing mixtures where the target speaker is actively speaking. In practice, however, the target speaker is sometimes silent, i.e., an inactive speaker (IS). A typical TSE system will tend to output a signal in IS cases, causing false alarms. This is a severe problem for the practical deployment of TSE systems. This paper aims at better understanding how well TSE systems can handle IS cases. We consider two approaches to deal with IS: (1) training a system to directly output a zero signal, or (2) detecting IS with an extra speaker verification module. We perform an extensive experimental comparison of these schemes in terms of extraction performance and IS detection using the LibriMix dataset and reveal their pros and cons.
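To make the first strategy concrete, below is a minimal PyTorch sketch of a multi-task training loss of the kind the abstract describes: a standard negative SI-SDR loss when the target speaker is active, and an energy-suppression term that drives the extractor output toward a zero signal when the target is inactive. This is an illustrative assumption about the formulation, not the paper's exact loss; all function and variable names are hypothetical.

    import torch

    def neg_si_sdr(est, ref, eps=1e-8):
        # Negative scale-invariant SDR, averaged over the batch (active-speaker case).
        est = est - est.mean(dim=-1, keepdim=True)
        ref = ref - ref.mean(dim=-1, keepdim=True)
        proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
        noise = est - proj
        return -(10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)).mean()

    def inactive_loss(est, mix, eps=1e-8):
        # Output-to-mixture energy ratio in dB; minimizing it pushes the output toward zero.
        return (10 * torch.log10(est.pow(2).sum(-1) + eps)
                - 10 * torch.log10(mix.pow(2).sum(-1) + eps)).mean()

    def tse_loss(est, ref, mix, active):
        # est, ref, mix: (batch, time) waveforms; active: boolean flag per mixture.
        loss = est.new_zeros(())
        if active.any():
            loss = loss + neg_si_sdr(est[active], ref[active])
        if (~active).any():
            loss = loss + inactive_loss(est[~active], mix[~active])
        return loss

The second strategy would instead leave the extractor unchanged and gate its output with a speaker verification score, e.g., thresholding the similarity between embeddings of the extracted signal and the enrollment utterance.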
Pages: 216-220 (5 pages)