IMPROVING RNN TRANSDUCER MODELING FOR SMALL-FOOTPRINT KEYWORD SPOTTING

Cited: 16
Authors
Tian, Yao [1]
Yao, Haitao [1]
Cai, Meng [1]
Liu, Yaming [1]
Ma, Zejun [1]
Affiliations
[1] Bytedance AI Lab, Beijing, People's Republic of China
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
keyword spotting; RNN-T; CTC; multi-task; transfer learning
DOI
10.1109/ICASSP39728.2021.9414339
CLC Number
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
The recurrent neural network transducer (RNN-T) model has recently proven effective for keyword spotting (KWS). However, compared with cross-entropy (CE) or connectionist temporal classification (CTC) based models, the additional prediction network in the RNN-T model increases the model size and computational cost. Moreover, since keyword training data usually contain only the keyword sequence, the prediction network is prone to over-fitting. In this paper, we improve RNN-T modeling for small-footprint keyword spotting in three aspects. First, to address the over-fitting issue, we explore multi-task training in which a CTC loss is added to the encoder. The CTC loss is calculated on both KWS data and ASR data, while the RNN-T loss is calculated on ASR data only, so that only the encoder is augmented with KWS data. Second, we replace the LSTM with a feed-forward neural network for prediction network modeling, so that all possible prediction network outputs can be pre-computed before decoding. Third, we further improve the model with transfer learning, where a model trained on 160 thousand hours of ASR data is used to initialize the KWS model. On a self-collected far-field wake-word test set, the proposed RNN-T system greatly improves performance compared with a strong "keyword-filler" baseline.
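
The multi-task objective can be illustrated with a minimal PyTorch sketch (module and argument names here are illustrative, not taken from the paper): a shared encoder feeds both the RNN-T joint network and an auxiliary CTC head; the CTC loss is applied to every batch, while the RNN-T loss is applied to ASR batches only.

import torch
import torch.nn as nn
import torchaudio.functional as audio_F

class MultiTaskRNNT(nn.Module):
    # Shared encoder with two heads: the RNN-T joint network and an
    # auxiliary CTC projection (an assumed decomposition of the model).
    def __init__(self, encoder, predictor, joiner, enc_dim, vocab_size, blank=0):
        super().__init__()
        self.encoder = encoder        # updated by both losses
        self.predictor = predictor    # updated by the RNN-T loss only
        self.joiner = joiner
        self.ctc_head = nn.Linear(enc_dim, vocab_size)
        self.ctc_loss = nn.CTCLoss(blank=blank, zero_infinity=True)
        self.blank = blank

    def forward(self, feats, feat_lens, targets, target_lens, is_asr_batch):
        enc, enc_lens = self.encoder(feats, feat_lens)      # (B, T, D)
        # CTC branch: computed for both ASR and KWS batches, so only the
        # encoder is augmented with keyword data.
        ctc_logp = self.ctc_head(enc).log_softmax(dim=-1)   # (B, T, V)
        loss = self.ctc_loss(ctc_logp.transpose(0, 1),      # CTCLoss expects (T, B, V)
                             targets, enc_lens, target_lens)
        # RNN-T branch: computed on ASR batches only, so the prediction
        # network never sees keyword-only label sequences.
        if is_asr_batch:
            pred = self.predictor(targets)                  # (B, U+1, D); start symbol prepended inside
            logits = self.joiner(enc, pred)                 # (B, T, U+1, V)
            loss = loss + audio_F.rnnt_loss(
                logits, targets.int(), enc_lens.int(), target_lens.int(),
                blank=self.blank)
        return loss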
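
The feed-forward prediction network can be sketched as follows, assuming a single-label context (the abstract does not state the exact context size used). Because the network is stateless, its output for every possible previous label can be tabulated once, turning the prediction network into a table lookup at decode time.

import torch
import torch.nn as nn

class FeedForwardPredictor(nn.Module):
    # Stateless prediction network over the previous label only
    # (a modeling assumption for this sketch).
    def __init__(self, vocab_size, embed_dim=64, out_dim=320):
        super().__init__()
        self.embed = nn.Embedding(vocab_size + 1, embed_dim)  # +1 for blank/start
        self.ffn = nn.Sequential(nn.Linear(embed_dim, out_dim), nn.ReLU())

    def forward(self, prev_labels):                 # (B, U) previous-label ids
        return self.ffn(self.embed(prev_labels))    # (B, U, out_dim)

    @torch.no_grad()
    def precompute_table(self):
        # One pass over all possible previous labels; decoding then reads
        # prediction-network outputs from this table instead of running the net.
        all_labels = torch.arange(self.embed.num_embeddings)
        return self.ffn(self.embed(all_labels))     # (vocab_size + 1, out_dim)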
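
The transfer-learning step amounts to initializing the KWS model from an ASR-trained RNN-T checkpoint before fine-tuning; a hypothetical sketch (the checkpoint filename is invented, and kws_model is assumed to be an instance of the MultiTaskRNNT sketch above):

import torch

asr_state = torch.load("rnnt_asr_checkpoint.pt", map_location="cpu")
# strict=False because the auxiliary CTC head has no counterpart in the
# ASR checkpoint and stays randomly initialized.
kws_model.load_state_dict(asr_state, strict=False)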
Pages: 5624-5628 (5 pages)