Speaking style compensation on synthetic audio for robust keyword spotting

Cited by: 1
Authors
Huang, Houjun [1 ,2 ]
Qian, Yanmin [2 ]
Affiliations
[1] AISpeech Ltd, Suzhou, Peoples R China
[2] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, MoE Key Lab Artificial Intelligence, AI Inst, XLANCE Lab, Shanghai, Peoples R China
Source
2022 13TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING (ISCSLP) | 2022
Keywords
Keyword Spotting; Text-To-Speech; Data Augmentation; DCCRN; Speaking Style Compensation;
DOI
10.1109/ISCSLP57327.2022.10038031
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With the rise of intelligent speech processing applications, quickly building keyword spotting (KWS) models under low-resource conditions has become increasingly important in recent years. Multi-speaker text-to-speech (TTS) has proven to be an effective data augmentation technique for KWS, helping to compensate for inadequacies in the training data. However, in previous work, KWS systems built with TTS-augmented data could not match the performance of systems trained on real recordings, because synthesized speech failed to fully capture the target speakers' speaking style. This work focuses on making synthesized speech more similar to a reference speaker's speaking style under a specific metric. The speaker classification accuracy of synthesized keyword data, measured on a speaker recognition model trained on real recordings, is used as the objective metric, and a deep complex convolution recurrent network (DCCRN) is used to optimize it. Experimental results show that TTS augmentation improves the KWS system's robustness; moreover, compensating the speaking style of the synthetic data yields a significant further improvement.
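The optimization loop the abstract describes can be sketched in miniature. This is an illustrative toy, not the paper's implementation: the DCCRN operating on audio is replaced by a learnable affine map over an invented 8-dimensional embedding, and the speaker recognition model trained on real recordings by a frozen random linear classifier. All dimensions and values below are assumptions made for the sketch; only the structure (frozen speaker classifier as the objective, compensation transform as the only trainable part) mirrors the paper.

```python
# Toy sketch of speaking style compensation: a frozen speaker classifier
# (standing in for a model trained on real recordings) scores a "synthetic"
# embedding, and only the compensation transform (standing in for DCCRN)
# is optimized so the embedding is classified as the target speaker.
import numpy as np

rng = np.random.default_rng(0)
D, S = 8, 3                      # embedding dim, number of speakers (toy values)

W = rng.normal(size=(S, D))      # frozen speaker classifier weights
x = rng.normal(size=D)           # "synthetic" embedding, off-style
y = 1                            # target speaker index

A, b = np.eye(D), np.zeros(D)    # compensation transform (DCCRN stand-in)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(A, b):
    h = A @ x + b                # compensated embedding
    z = W @ h                    # frozen classifier logits
    p = softmax(z)
    return z, p, -np.log(p[y])   # speaker cross-entropy = style objective

_, _, loss0 = forward(A, b)
for _ in range(500):             # gradient descent on the compensation only
    _, p, _ = forward(A, b)
    g = p.copy(); g[y] -= 1.0    # d(loss)/d(logits) for softmax + CE
    dh = W.T @ g                 # backprop through the frozen classifier
    A -= 0.1 * np.outer(dh, x)   # the classifier W itself is never updated
    b -= 0.1 * dh

z, p, loss = forward(A, b)
print(int(np.argmax(z)), loss < loss0)
```

After training, the compensated embedding is classified as the target speaker while the classifier stays fixed, which is the same division of labor as in the paper: the speaker model supplies the objective, the compensation network absorbs the style mismatch.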
Pages: 448-452
Page count: 5
References
25 references
[1]  
[Anonymous], 2011, IEEE 2011 WORKSHOP
[2]  
Bu H, 2017, 2017 20TH CONFERENCE OF THE ORIENTAL CHAPTER OF THE INTERNATIONAL COORDINATING COMMITTEE ON SPEECH DATABASES AND SPEECH I/O SYSTEMS AND ASSESSMENT (O-COCOSDA), P58, DOI 10.1109/ICSDA.2017.8384449
[3]  
Choi S, 2019, Arxiv, DOI arXiv:1904.03814
[4]  
de Andrade DC, 2018, Arxiv, DOI arXiv:1808.08929
[5]  
Desplanques B, 2020, Arxiv, DOI arXiv:2005.07143
[6]  
Du JY, 2018, Arxiv, DOI arXiv:1808.10583
[7]   DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement [J].
Hu, Yanxin ;
Liu, Yun ;
Lv, Shubo ;
Xing, Mengtao ;
Zhang, Shimin ;
Fu, Yihui ;
Wu, Jian ;
Zhang, Bihong ;
Xie, Lei .
INTERSPEECH 2020, 2020, :2472-2476
[8]   AISPEECH-SJTU ACCENT IDENTIFICATION SYSTEM FOR THE ACCENTED ENGLISH SPEECH RECOGNITION CHALLENGE [J].
Huang, Houjun ;
Xiang, Xu ;
Yang, Yexin ;
Ma, Rao ;
Qian, Yanmin .
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, :6254-6258
[9]   UNIT SELECTION SYNTHESIS BASED DATA AUGMENTATION FOR FIXED PHRASE SPEAKER VERIFICATION [J].
Huang, Houjun ;
Xiang, Xu ;
Zhao, Fei ;
Wang, Shuai ;
Qian, Yanmin .
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, :5849-5853
[10]   SYNTH2AUG: CROSS-DOMAIN SPEAKER RECOGNITION WITH TTS SYNTHESIZED SPEECH [J].
Huang, Yiling ;
Chen, Yutian ;
Pelecanos, Jason ;
Wang, Quan .
2021 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP (SLT), 2021, :316-322