Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition

Cited by: 9
Authors
Ding, Shaojin [1 ]
Rikhye, Rajeev [1 ]
Liang, Qiao [1 ]
He, Yanzhang [1 ]
Wang, Quan [1 ]
Narayanan, Arun [1 ]
O'Malley, Tom [1 ]
McGraw, Ian [1 ]
Affiliations
[1] Google LLC, Mountain View, CA 94043 USA
Source
INTERSPEECH 2022, 2022
Keywords
personal VAD; voice activity detection; speech recognition; on-device;
DOI
10.21437/Interspeech.2022-856
CLC number
O42 [Acoustics]
Subject classification codes
070206; 082403
Abstract
Personalization of on-device speech recognition (ASR) has seen explosive growth in recent years, largely due to the increasing popularity of personal assistant features on mobile devices and smart home speakers. In this work, we present Personal VAD 2.0, a personalized voice activity detector that detects the voice activity of a target speaker, as part of a streaming on-device ASR system. Although previous proof-of-concept studies have validated the effectiveness of Personal VAD, several critical challenges remain before this model can be used in production: first, the quality must be satisfactory in both enrollment and enrollment-less scenarios; second, it should operate in a streaming fashion; and finally, the model must be small enough to fit within a limited latency and CPU/memory budget. To meet these multi-faceted requirements, we propose a series of novel designs: 1) advanced speaker embedding modulation methods; 2) a new training paradigm to generalize to enrollment-less conditions; 3) architecture and runtime optimizations for latency and resource restrictions. Extensive experiments on a realistic speech recognition system demonstrate the state-of-the-art performance of our proposed method.
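The "speaker embedding modulation" mentioned in the abstract refers to conditioning the VAD's frame-level acoustic features on an enrolled speaker's embedding (e.g., a d-vector). One common form of such conditioning is feature-wise linear modulation (FiLM), in which the embedding is projected to per-feature scale and shift vectors. The sketch below is an illustration of that general idea, not the paper's exact architecture; the function and weight names, and all dimensions, are hypothetical.

```python
import numpy as np

def film_modulate(features, spk_emb, w_gamma, w_beta):
    """FiLM-style conditioning of acoustic frame features on a
    speaker embedding: out = gamma * x + beta, where gamma and
    beta are linear projections of the embedding (hypothetical
    sketch, not the Personal VAD 2.0 implementation)."""
    gamma = spk_emb @ w_gamma       # (feat_dim,) per-feature scale
    beta = spk_emb @ w_beta         # (feat_dim,) per-feature shift
    return gamma * features + beta  # broadcasts over time frames

# Toy dimensions: 256-dim speaker embedding, 10 frames of
# 80-dim log-mel features (all values randomly generated).
rng = np.random.default_rng(0)
emb = rng.standard_normal(256)
x = rng.standard_normal((10, 80))
w_g = rng.standard_normal((256, 80)) * 0.01
w_b = rng.standard_normal((256, 80)) * 0.01
y = film_modulate(x, emb, w_g, w_b)
print(y.shape)  # (10, 80)
```

Because the same gamma and beta apply to every frame, the modulation adds negligible per-frame cost at streaming inference time, which is consistent with the on-device latency constraints the abstract describes.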
Pages: 3744-3748
Number of pages: 5