Text-aware Speech Separation for Multi-talker Keyword Spotting

Citations: 0

Authors:

Li, Haoyu [1]

Yang, Baochen [1]

Xi, Yu [1]

Yu, Linfeng [1]

Tan, Tian [1]

Li, Hao [2]

Yu, Kai [1]

Affiliations:

[1] Shanghai Jiao Tong Univ, MoE Key Lab Artificial Intelligence, AI Inst, X LANCE Lab, Shanghai, Peoples R China

[2] AISpeech Ltd, Beijing, Peoples R China

Source:

INTERSPEECH 2024 | 2024

Keywords:

multi-talker keyword spotting; text-aware speech separation; robustness

DOI:

10.21437/Interspeech.2024-789

Chinese Library Classification number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

For noisy environments, ensuring the robustness of keyword spotting (KWS) systems is essential. While much research has focused on noisy KWS, less attention has been paid to multi-talker mixed speech scenarios. Unlike the usual cocktail party problem where multi-talker speech is separated using speaker clues, the key challenge here is to extract the target speech for KWS based on text clues. To address it, this paper proposes a novel Text-aware Permutation Determinization Training method for multi-talker KWS with a clue-based Speech Separation front-end (TPDT-SS). Our research highlights the critical role of SS front-ends and shows that incorporating keyword-specific clues into these models can greatly enhance the effectiveness. TPDT-SS shows remarkable success in addressing permutation problems in mixed keyword speech, thereby greatly boosting the performance of the backend. Additionally, fine-tuning our system on unseen mixed speech results in further performance improvement.


Pages: 337-341

Number of pages: 5
