Text-aware Speech Separation for Multi-talker Keyword Spotting

Citations: 0
Authors
Li, Haoyu [1 ]
Yang, Baochen [1 ]
Xi, Yu [1 ]
Yu, Linfeng [1 ]
Tan, Tian [1 ]
Li, Hao [2 ]
Yu, Kai [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, MoE Key Lab Artificial Intelligence, AI Inst, X LANCE Lab, Shanghai, Peoples R China
[2] AISpeech Ltd, Beijing, Peoples R China
Source
Keywords
multi-talker keyword spotting; text-aware speech separation; robustness;
DOI
10.21437/Interspeech.2024-789
Chinese Library Classification (CLC) number
TP18 [Theory of artificial intelligence];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Ensuring the robustness of keyword spotting (KWS) systems in noisy environments is essential. While much research has focused on noisy KWS, less attention has been paid to multi-talker mixed speech scenarios. Unlike the usual cocktail party problem, where multi-talker speech is separated using speaker clues, the key challenge here is to extract the target speech for KWS based on text clues. To address this, this paper proposes a novel Text-aware Permutation Determinization Training method for multi-talker KWS with a clue-based Speech Separation front-end (TPDT-SS). Our research highlights the critical role of SS front-ends and shows that incorporating keyword-specific clues into these models can greatly enhance their effectiveness. TPDT-SS successfully addresses permutation problems in mixed keyword speech, thereby greatly boosting the performance of the backend. Additionally, fine-tuning our system on unseen mixed speech yields further performance improvement.
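The abstract does not spell out how TPDT-SS is trained, so the following is only a generic illustration of the permutation problem it refers to, not the paper's method. It contrasts a permutation-invariant loss (the best score over all output-to-source pairings) with a deterministic assignment in which the keyword-bearing source is always paired with output channel 0. The function names `pit_loss` and `fixed_order_loss`, the `keyword_index` parameter, the mean-squared-error criterion, and the synthetic signals are all assumptions made for this sketch.

```python
import itertools
import numpy as np


def mse(a, b):
    """Mean squared error between two equal-length signals."""
    return float(np.mean((a - b) ** 2))


def pit_loss(estimates, references):
    """Permutation-invariant loss: lowest average MSE over all pairings.

    perm[i] is the index of the reference paired with output channel i.
    """
    n = len(references)
    best_loss, best_perm = None, None
    for perm in itertools.permutations(range(n)):
        loss = sum(mse(estimates[i], references[p]) for i, p in enumerate(perm)) / n
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


def fixed_order_loss(estimates, references, keyword_index):
    """Deterministic pairing (assumption): keyword source always goes to channel 0."""
    order = [keyword_index] + [i for i in range(len(references)) if i != keyword_index]
    loss = sum(mse(estimates[i], references[r]) for i, r in enumerate(order)) / len(order)
    return loss, tuple(order)


# Synthetic stand-ins for two overlapping talkers (not real speech).
rng = np.random.default_rng(0)
keyword_speech = rng.standard_normal(160)
other_speech = rng.standard_normal(160)
references = [other_speech, keyword_speech]  # keyword is reference index 1

# Hypothetical separator output that places the keyword talker on channel 0.
estimates = [
    keyword_speech + 0.01 * rng.standard_normal(160),
    other_speech + 0.01 * rng.standard_normal(160),
]

pit, pit_perm = pit_loss(estimates, references)
fixed, fixed_order = fixed_order_loss(estimates, references, keyword_index=1)
print("PIT pairing:", pit_perm, "loss:", pit)
print("Fixed pairing:", fixed_order, "loss:", fixed)
```

With a permutation-invariant loss, the channel that ends up carrying the keyword can differ from one mixture to the next, so a downstream KWS model does not know which output to read. Fixing the assignment removes that ambiguity, at the cost of needing some clue (here, the known `keyword_index`) to identify the target source during training.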
Pages: 337-341
Page count: 5