PhonMatchNet: Phoneme-Guided Zero-Shot Keyword Spotting for User-Defined Keywords

Citations: 5
Authors
Lee, Yong-Hyeok [1]
Cho, Namhyun [1]
Affiliations
[1] NCSOFT Corp, Speech AI Lab, Seoul, South Korea
Source
INTERSPEECH 2023 | 2023
Keywords
keyword spotting; user-defined; zero-shot; open-vocabulary
DOI
10.21437/Interspeech.2023-597
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
This study presents a novel zero-shot user-defined keyword spotting model that utilizes the audio-phoneme relationship of the keyword to improve performance. Unlike the previous approach that estimates at utterance level, we use both utterance and phoneme level information. Our proposed method comprises a two-stream speech encoder architecture, self-attention-based pattern extractor, and phoneme-level detection loss for high performance in various pronunciation environments. Based on experimental results, our proposed model outperforms the baseline model and achieves competitive performance compared with full-shot keyword spotting models. Our proposed model significantly improves the EER and AUC across all datasets, including familiar words, proper nouns, and indistinguishable pronunciations, with an average relative improvement of 67% and 80%, respectively. The implementation code of our proposed model is available at https://github.com/ncsoft/PhonMatchNet.
Pages: 3964-3968
Page count: 5