Joint Framework of Curriculum Learning and Knowledge Distillation for Noise-Robust and Small-Footprint Keyword Spotting

Cited by: 1
Authors
Lim, Jaebong [1]
Baek, Yunju [1]
Affiliation
[1] Pusan Natl Univ, Sch Comp Sci & Engn, Busan 46241, South Korea
Keywords
Curriculum learning; data augmentation; joint framework; knowledge distillation; neural network compression; noise-robust keyword spotting; small-footprint keyword spotting
DOI
10.1109/ACCESS.2023.3314191
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Spoken keyword spotting, which is characterized by simplicity and low latency, has been widely used in consumer electronics to facilitate always-on voice interfaces. Small-footprint keyword spotting based on tiny convolutional neural networks can be implemented on resource-constrained, yet energy-efficient, microcontrollers in real time. However, it is difficult for tiny neural networks to learn the noise-robustness properties essential for successful voice interfaces. To overcome this problem, this study proposes a joint framework of curriculum learning and knowledge distillation for noise-robust small-footprint keyword spotting. The proposed joint framework applies noise-mixture curriculum learning to a network that is sufficiently large, to learn various noise situations. Subsequently, knowledge distillation is applied to compress the large network into a sufficiently small network for use in an onboard microcontroller. To enhance the effectiveness of the joint framework, a curriculum learning approach is proposed with a new noise mixture strategy along with knowledge distillation that employs an effective ensemble of neural network snapshots for each curriculum stage. The proposed methods enable large networks to effectively learn noisy situations, thereby transferring noise robustness to small networks. The effectiveness of the joint framework was illustrated on the Google Speech Commands dataset with noise mixtures incorporated from various public noise datasets. The performance of the joint framework was superior in noisy situations compared to that of state-of-the-art noise-robust keyword-spotting methods. Consequently, the proposed framework significantly improves the usability of voice interfaces in consumer electronics.
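The two building blocks named in the abstract, noise mixing under a curriculum and distillation from an ensemble of teacher snapshots, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the mixture strategy, the curriculum stages, or the loss, so the SNR schedule `CURRICULUM_SNR_DB`, the temperature, and all function names below are hypothetical choices made only for illustration.

```python
import numpy as np

# Hypothetical curriculum: each stage mixes noise at a lower (harder) SNR.
# The actual stages used in the paper are not stated in the abstract.
CURRICULUM_SNR_DB = [20.0, 10.0, 0.0, -5.0]


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power matches `snr_db`, then add it."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # guard against silent noise clips
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis with a temperature."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def ensemble_soft_targets(snapshot_logits, temperature):
    """Average the softened predictions of several teacher snapshots.

    Assumption: one snapshot per curriculum stage, combined by a plain mean.
    """
    return np.mean([softmax(l, temperature) for l in snapshot_logits], axis=0)


def distillation_loss(student_logits, soft_targets, temperature):
    """Cross-entropy between teacher soft targets and softened student outputs."""
    log_p = np.log(softmax(student_logits, temperature) + 1e-12)
    return float(-np.mean(np.sum(soft_targets * log_p, axis=-1)))


# Toy data: a 1-second 440 Hz tone stands in for speech, white noise for background.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
speech = np.sin(2.0 * np.pi * 440.0 * t)
noise = rng.standard_normal(16000)

# One noisy version of the utterance per curriculum stage.
noisy_per_stage = [mix_at_snr(speech, noise, snr) for snr in CURRICULUM_SNR_DB]

# Toy logits: 4 teacher snapshots and 1 student, batch of 8, 12 keyword classes.
teacher_snapshots = [rng.standard_normal((8, 12)) for _ in CURRICULUM_SNR_DB]
student_logits = rng.standard_normal((8, 12))
targets = ensemble_soft_targets(teacher_snapshots, temperature=4.0)
loss = distillation_loss(student_logits, targets, temperature=4.0)
```

In a real training loop, the large network would be trained stage by stage on `noisy_per_stage`-style data, a snapshot would be saved after each stage, and the small network would minimize `distillation_loss` (typically combined with a hard-label loss) against the averaged snapshot predictions.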
Pages: 100540-100553
Page count: 14