Incremental Layer-Wise Self-Supervised Learning for Efficient Unsupervised Speech Domain Adaptation On Device

Cited by: 1
Authors
Huo, Zhouyuan [1 ]
Hwang, Dongseong [1 ]
Sim, Khe Chai [1 ]
Garg, Shefali [1 ]
Misra, Ananya [1 ]
Siddhartha, Nikhil [1 ]
Strohman, Trevor [1 ]
Beaufays, Francoise [1 ]
Affiliations
[1] Google LLC, Mountain View, CA 94043 USA
Source
INTERSPEECH 2022 | 2022
Keywords
speech recognition; domain adaptation; layer-wise; self-supervised; low-resource;
D O I
10.21437/Interspeech.2022-10904
Chinese Library Classification (CLC)
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
Streaming end-to-end speech recognition models have been widely deployed on mobile devices and show significant efficiency improvements. These models are typically trained on the server using transcribed speech data. However, the server data distribution can differ substantially from the data distribution on user devices, which can degrade model performance. On-device training faces two main challenges: the scarcity of reliable labels and limited training memory. While self-supervised learning algorithms can mitigate the mismatch between domains using unlabeled data, they are not directly applicable on mobile devices because of the memory constraint. In this paper, we propose an incremental layer-wise self-supervised learning algorithm for efficient unsupervised speech domain adaptation on mobile devices, in which only one layer is updated at a time. Extensive experimental results demonstrate that the proposed algorithm achieves a 24.2% relative Word Error Rate (WER) improvement on the target domain compared to a supervised baseline and costs 95.7% less training memory than the end-to-end self-supervised learning algorithm.
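The core idea in the abstract is that only one layer's parameters are trainable at any given time, so activations and gradients need only be kept for that layer. A minimal sketch of such an incremental layer-wise update schedule is shown below; the function name, the round-robin ordering, and the steps-per-layer parameter are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of an incremental layer-wise training schedule: at each training
# step exactly one layer index is marked trainable, cycling through the
# layers. In a real trainer, all other layers would be frozen (no gradient
# storage), which is the source of the memory savings described above.

def layerwise_schedule(num_layers, num_steps, steps_per_layer):
    """Yield (step, trainable_layer_index) pairs, one layer at a time."""
    for step in range(num_steps):
        layer = (step // steps_per_layer) % num_layers
        yield step, layer

# Example: a 4-layer encoder trained for 8 steps, 2 steps per layer.
trainable = [layer for _, layer in layerwise_schedule(4, 8, 2)]
print(trainable)  # → [0, 0, 1, 1, 2, 2, 3, 3]
```

In a framework such as PyTorch, this schedule would typically be applied by toggling `requires_grad` on each layer's parameters, so that backpropagation allocates gradient buffers for only the active layer.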
Pages: 4845-4849
Page count: 5