Robust Automatic Speech Recognition via WavAugment Guided Phoneme Adversarial Training

Citations: 0
Authors
Qi, Gege [1]
Chen, Yuefeng [1]
Mao, Xiaofeng [1]
Jia, Xiaojun [2]
Duan, Ranjie [1]
Zhang, Rong [1]
Xue, Hui [1]
Affiliations
[1] Alibaba Group, Hangzhou, People's Republic of China
[2] Chinese Academy of Sciences, Beijing, People's Republic of China
Source
INTERSPEECH 2023 | 2023
Keywords
robust automatic speech recognition; data augmentation; adversarial training
DOI
10.21437/Interspeech.2023-1556
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Developing a practically robust automatic speech recognition (ASR) system is challenging, since the model should not only maintain its original performance on clean samples but also achieve consistent efficacy under small-volume perturbations and large domain shifts. To address this problem, we propose a novel WavAugment Guided Phoneme Adversarial Training (WAPAT). WAPAT uses adversarial examples in phoneme space as augmentation to make the model invariant to minor fluctuations in phoneme representation while preserving performance on clean samples. In addition, WAPAT utilizes the phoneme representation of augmented samples to guide the generation of adversaries, which helps to find more stable and diverse gradient directions, resulting in improved generalization. Extensive experiments demonstrate the effectiveness of WAPAT on the End-to-end Speech Challenge Benchmark (ESB). Notably, SpeechLM-WAPAT outperforms the original model with a 6.28% WER reduction on ESB, achieving a new state of the art.
Pages: 561-565
Number of pages: 5