Combined Keyword Spotting and Localization Network Based on Multi-Task Learning

Authors
Ko, Jungbeom [1]
Kim, Hyunchul [2]
Kim, Jungsuk [3]
Affiliations
[1] Gachon Univ, Gachon Adv Inst Hlth Sci & Technol GAIHST, Dept Hlth Sci & Technol, Incheon 21936, South Korea
[2] Univ Calif Berkeley, Sch Informat, 102 South Hall 4600, Berkeley, CA 94720 USA
[3] Gachon Univ, Coll IT Convergence, Dept Biomed Engn, Seongnam Si 13120, South Korea
Funding
National Research Foundation, Singapore
Keywords
deep neural network; keyword spotting; sound source localization; multi-task learning
DOI
10.3390/math12213309
Chinese Library Classification (CLC) number
O1 [Mathematics]
Discipline classification code
0701; 070101
Abstract
The advent of voice assistant technology and its integration into smart devices has enabled many useful services, such as texting and application execution. However, most voice assistants cannot do what a human listener does: localize the speaker and selectively spot meaningful keywords. Because keyword spotting (KWS) and sound source localization (SSL) are both essential and must operate in real time, the neural network model must be efficient in memory and computation. In this paper, a single neural network model for KWS and SSL is proposed to overcome the limitations of running KWS and SSL sequentially, which requires more memory and inference time. The proposed model uses multi-task learning to make efficient use of the limited resources of the device. A shared encoder serves as the initial layers and extracts common features from the multichannel audio data. Task-specific parallel layers then use these features for KWS and SSL. The proposed model was evaluated on a synthetic dataset with multiple speakers, and a 7-module shared encoder structure was identified as optimal in terms of KWS accuracy, direction of arrival (DOA) accuracy, DOA error, and latency. It achieved a KWS accuracy of 94.51%, a DOA error of 12.397 degrees, and a DOA accuracy of 89.86%. Because the encoder is shared between the two tasks, the proposed model requires significantly less memory and shortens inference time without compromising KWS accuracy, DOA error, or DOA accuracy.
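The memory argument in the abstract can be illustrated with a small parameter count. The sketch below is not the authors' architecture: the layer type (1-D convolutions), the 4 input channels, the 32 feature channels, the 12 keyword classes, and the 36 DOA classes are all hypothetical values chosen for illustration. Only the idea of a 7-module shared encoder feeding two task-specific heads comes from the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Conv1dSpec:
    """Shape of one 1-D convolution layer (hypothetical building block)."""
    in_ch: int
    out_ch: int
    kernel: int

    def params(self) -> int:
        # Weight tensor plus one bias per output channel.
        return self.in_ch * self.out_ch * self.kernel + self.out_ch


def total_params(layers: List[Conv1dSpec]) -> int:
    """Sum the trainable parameters of a stack of layers."""
    return sum(layer.params() for layer in layers)


def compare(encoder: List[Conv1dSpec],
            kws_head: List[Conv1dSpec],
            ssl_head: List[Conv1dSpec]) -> Tuple[int, int]:
    """Return (sequential, shared) parameter counts.

    sequential: two independent models, each with its own encoder copy.
    shared:     one encoder whose features feed both task heads.
    """
    heads = total_params(kws_head) + total_params(ssl_head)
    sequential = 2 * total_params(encoder) + heads
    shared = total_params(encoder) + heads
    return sequential, shared


# Assumed sizes: 4 microphone channels in, 7 encoder modules of width 32.
ENCODER = [Conv1dSpec(4, 32, 3)] + [Conv1dSpec(32, 32, 3) for _ in range(6)]
KWS_HEAD = [Conv1dSpec(32, 12, 1)]   # assumed 12 keyword classes
SSL_HEAD = [Conv1dSpec(32, 36, 1)]   # assumed 36 DOA classes (10-degree bins)

if __name__ == "__main__":
    sequential, shared = compare(ENCODER, KWS_HEAD, SSL_HEAD)
    print(f"sequential: {sequential} parameters")
    print(f"shared:     {shared} parameters")
    print(f"saved:      {sequential - shared} parameters")
```

With these made-up sizes the encoder dominates the parameter count, so sharing it roughly halves the stored weights. The actual saving depends on how large the task-specific heads are relative to the encoder.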
Pages: 14