Text-aware Speech Separation for Multi-talker Keyword Spotting

Cited by: 0
Authors
Li, Haoyu [1 ]
Yang, Baochen [1 ]
Xi, Yu [1 ]
Yu, Linfeng [1 ]
Tan, Tian [1 ]
Li, Hao [2 ]
Yu, Kai [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, MoE Key Lab Artificial Intelligence, AI Inst, X-LANCE Lab, Shanghai, Peoples R China
[2] AISpeech Ltd, Beijing, Peoples R China
Source
INTERSPEECH 2024
Keywords
multi-talker keyword spotting; text-aware speech separation; robustness
DOI
10.21437/Interspeech.2024-789
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Ensuring the robustness of keyword spotting (KWS) systems in noisy environments is essential. While much research has focused on noisy KWS, less attention has been paid to multi-talker mixed-speech scenarios. Unlike the usual cocktail-party problem, where multi-talker speech is separated using speaker clues, the key challenge here is to extract the target speech for KWS based on text clues. To address this, this paper proposes a novel Text-aware Permutation Determinization Training method for multi-talker KWS with a clue-based Speech Separation front-end (TPDT-SS). Our research highlights the critical role of the SS front-end and shows that incorporating keyword-specific clues into these models greatly enhances their effectiveness. TPDT-SS shows remarkable success in resolving the permutation problem in mixed keyword speech, thereby greatly boosting backend performance. Additionally, fine-tuning the system on unseen mixed speech yields further improvement.
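To make the permutation issue concrete, the sketch below (a minimal, hypothetical PyTorch illustration, not the authors' released code) contrasts standard permutation-invariant training, which searches every output/reference assignment, with a text-aware "determinized" variant that fixes the assignment up front from a keyword clue, in the spirit of the TPDT-SS objective described above. The names pit_loss, text_aware_determinized_loss, and the clue_scores input are illustrative assumptions.

# Hypothetical sketch of PIT vs. text-aware permutation determinization;
# function names and the clue_scores input are assumptions, not the paper's code.
import itertools

import torch
import torch.nn.functional as F


def pit_loss(est: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    """Utterance-level PIT: evaluate every output/reference permutation and
    keep the cheapest one per utterance. est, ref: (batch, num_spk, time)."""
    num_spk = est.size(1)
    per_perm = []
    for perm in itertools.permutations(range(num_spk)):
        # Per-utterance MSE summed over speakers for this permutation: (batch,)
        loss = sum(
            F.mse_loss(est[:, p], ref[:, i], reduction="none").mean(dim=-1)
            for i, p in enumerate(perm)
        )
        per_perm.append(loss)
    # Best permutation chosen independently for each utterance.
    return torch.stack(per_perm, dim=1).min(dim=1).values.mean()


def text_aware_determinized_loss(
    est: torch.Tensor, ref: torch.Tensor, clue_scores: torch.Tensor
) -> torch.Tensor:
    """Text-aware determinization (2-speaker sketch): rather than searching
    permutations, pick the output channel whose keyword-clue score is highest
    (e.g. similarity between a keyword text embedding and an output
    embedding), bind it to the keyword reference ref[:, 0], and assign the
    other channel to the interference ref[:, 1].
    clue_scores: (batch, 2), higher = more keyword-like."""
    batch = torch.arange(est.size(0))
    kw_idx = clue_scores.argmax(dim=1)  # channel fixed as the keyword stream
    other_idx = 1 - kw_idx              # remaining channel -> interference
    return F.mse_loss(est[batch, kw_idx], ref[:, 0]) + F.mse_loss(
        est[batch, other_idx], ref[:, 1]
    )


if __name__ == "__main__":
    est = torch.randn(4, 2, 16000)   # separated outputs
    ref = torch.randn(4, 2, 16000)   # [keyword speech, interference]
    clue = torch.randn(4, 2)         # hypothetical keyword-clue scores
    print("PIT loss:", pit_loss(est, ref).item())
    print("Determinized loss:", text_aware_determinized_loss(est, ref, clue).item())

The payoff suggested by the abstract is visible in the second function: the keyword clue removes the permutation search entirely and pins the keyword stream to a fixed output channel for the KWS backend.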
Pages: 337-341
Page count: 5
Related Papers
50 records in total
  • [41] Speech-derived haptic stimulation enhances speech recognition in a multi-talker background
    Răutu, I. Sabina
    De Tiège, Xavier
    Jousmäki, Veikko
    Bourguignon, Mathieu
    Bertels, Julie
    SCIENTIFIC REPORTS, 2023, 13 (01)
  • [43] A Speaker-Dependent Deep Learning Approach to Joint Speech Separation and Acoustic Modeling for Multi-Talker Automatic Speech Recognition
    Tu, Yan-Hui
    Du, Jun
    Dai, Li-Rong
    Lee, Chin-Hui
    2016 10TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP), 2016,
  • [44] Single-channel multi-talker speech recognition with permutation invariant training
    Qian, Yanmin
    Chang, Xuankai
    Yu, Dong
    SPEECH COMMUNICATION, 2018, 104 : 1 - 11
  • [45] Deep Neural Networks for Single-Channel Multi-Talker Speech Recognition
    Weng, Chao
    Yu, Dong
    Seltzer, Michael L.
    Droppo, Jasha
    IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2015, 23 (10) : 1670 - 1679
  • [46] Deep neural networks based binary classification for single channel speaker independent multi-talker speech separation
    Saleem, Nasir
    Khattak, Muhammad Irfan
    APPLIED ACOUSTICS, 2020, 167
  • [47] JOINT SEPARATION AND DENOISING OF NOISY MULTI-TALKER SPEECH USING RECURRENT NEURAL NETWORKS AND PERMUTATION INVARIANT TRAINING
    Kolbæk, Morten
    Yu, Dong
    Tan, Zheng-Hua
    Jensen, Jesper
    2017 IEEE 27TH INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, 2017,
  • [48] ADAPTATION OF RNN TRANSDUCER WITH TEXT-TO-SPEECH TECHNOLOGY FOR KEYWORD SPOTTING
    Sharma, Eva
    Ye, Guoli
    Wei, Wenning
    Zhao, Rui
    Tian, Yao
    Wu, Jian
    He, Lei
    Lin, Ed
    Gong, Yifan
    2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 2020 : 7484 - 7488
  • [49] The effects of selective attention and speech acoustics on neural speech-tracking in a multi-talker scene
    Rimmele, Johanna M.
    Golumbic, Elana Zion
    Schröger, Erich
    Poeppel, David
    CORTEX, 2015, 68 : 144 - 154
  • [50] Super-human multi-talker speech recognition: A graphical modeling approach
    Hershey, John R.
    Rennie, Steven J.
    Olsen, Peder A.
    Kristjansson, Trausti T.
    COMPUTER SPEECH AND LANGUAGE, 2010, 24 (01) : 45 - 66