Listen only to me! How well can target speech extraction handle false alarms?

Cited by: 9
Authors
Delcroix, Marc [1 ]
Kinoshita, Keisuke [1 ]
Ochiai, Tsubasa [1 ]
Zmolikova, Katerina [2 ]
Sato, Hiroshi [1 ]
Nakatani, Tomohiro [1 ]
Affiliations
[1] NTT Corp, Tokyo, Japan
[2] Brno Univ Technol, Speech FIT, Brno, Czech Republic
Source
INTERSPEECH 2022 | 2022
Keywords
Speech enhancement; Target speech extraction; Inactive speaker; Speaker extraction; Network
DOI
10.21437/Interspeech.2022-11252
CLC number
O42 [Acoustics];
Discipline codes
070206; 082403;
Abstract
Target speech extraction (TSE) extracts the speech of a target speaker from a mixture given auxiliary clues characterizing the speaker, such as an enrollment utterance. TSE thus addresses the challenging problem of performing separation and speaker identification simultaneously. Extraction performance has improved considerably with the recent development of neural networks for speech enhancement and separation. However, most studies have focused on processing mixtures in which the target speaker is actively speaking; in practice, the target speaker may be silent, i.e., an inactive speaker (IS). A typical TSE system will still tend to output a signal in IS cases, causing false alarms, which is a severe problem for the practical deployment of TSE systems. This paper aims to better understand how well TSE systems can handle IS cases. We consider two approaches to dealing with IS: (1) training the system to directly output a zero signal, or (2) detecting IS with an extra speaker verification module. We perform an extensive experimental comparison of these schemes in terms of extraction performance and IS detection on the LibriMix dataset and reveal their pros and cons.
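The first approach can be sketched as an inactive-speaker-aware training loss: use a signal-reconstruction loss (e.g., SI-SDR) when the target is active, and penalize residual output energy when it is not. The following PyTorch sketch is a minimal illustration of that idea, assuming waveform tensors of shape (batch, samples); the function names, the boolean activity mask, and the exact energy penalty are illustrative assumptions, not the loss used in the paper.

```python
import torch

def si_sdr_loss(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative scale-invariant SDR, used when the target speaker is active."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to obtain the scaled target.
    scale = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps)
    target = scale * ref
    noise = est - target
    si_sdr = 10 * torch.log10(target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)
    return -si_sdr.mean()

def inactive_loss(est: torch.Tensor, mix: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Energy penalty for inactive-speaker cases: push the output toward zero
    by minimizing the output-to-mixture energy ratio (in dB)."""
    ratio = est.pow(2).sum(-1) / (mix.pow(2).sum(-1) + eps)
    return (10 * torch.log10(ratio + eps)).mean()

def is_aware_loss(est, ref, mix, active: torch.Tensor) -> torch.Tensor:
    """Batch loss switching between the two cases; `active` is a boolean mask
    marking which examples contain the target speaker (hypothetical helper)."""
    loss = est.new_zeros(())
    if active.any():
        loss = loss + si_sdr_loss(est[active], ref[active])
    if (~active).any():
        loss = loss + inactive_loss(est[~active], mix[~active])
    return loss
```

For the second approach described in the abstract, one would instead run the extractor unconditionally and compare a speaker embedding of the extracted signal against the enrollment embedding, declaring an inactive speaker when the verification score falls below a threshold.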
Pages: 216-220
Number of pages: 5