Time-Domain Speech Enhancement for Robust Automatic Speech Recognition

Cited by: 0
Authors
Yang, Yufeng [1]
Pandey, Ashutosh [1]
Wang, DeLiang [1,2]
Affiliations
[1] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA
[2] Ohio State Univ, Ctr Cognit & Brain Sci, Columbus, OH USA
Source
INTERSPEECH 2023 | 2023
Keywords
CHiME-2; robust ASR; speech distortion; time-domain speech enhancement
DOI
10.21437/Interspeech.2023-167
CLC number
O42 [Acoustics]
Discipline codes
070206; 082403
Abstract
It has been shown that speech enhancement algorithms can improve the intelligibility of noisy speech. However, speech enhancement has not been established as an effective frontend for robust automatic speech recognition (ASR) in noisy conditions, compared with an ASR model trained directly on noisy speech. This divide between speech enhancement and ASR impedes the progress of robust ASR systems, especially as speech enhancement has made major strides in recent years. In this work, we focus on eliminating this divide with a time-domain enhancement model based on an attentive recurrent network (ARN). The proposed system fully decouples speech enhancement from an acoustic model trained only on clean speech. Results on the CHiME-2 corpus show that ARN-enhanced speech translates to improved ASR results. The proposed system achieves an average word error rate of 6.28%, a 19.3% relative improvement over the previous best.
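
Below is a minimal, illustrative sketch of the decoupled pipeline the abstract describes: a time-domain enhancement model maps the noisy waveform to an enhanced waveform, which is then handed to an acoustic model trained only on clean speech. The ARN architecture itself is not reproduced here; TimeDomainEnhancer is a hypothetical stand-in (a small convolutional encoder/decoder around an LSTM), and asr_model stands for any clean-trained recognizer.

import torch
import torch.nn as nn

class TimeDomainEnhancer(nn.Module):
    # Hypothetical stand-in for the ARN enhancement model: it maps a noisy
    # waveform directly to an enhanced waveform in the time domain.
    def __init__(self, channels=64, kernel=16, stride=8):
        super().__init__()
        self.encoder = nn.Conv1d(1, channels, kernel, stride=stride)
        self.rnn = nn.LSTM(channels, channels, batch_first=True)
        self.decoder = nn.ConvTranspose1d(channels, 1, kernel, stride=stride)

    def forward(self, noisy):                  # noisy: (batch, samples)
        x = self.encoder(noisy.unsqueeze(1))   # (batch, channels, frames)
        x, _ = self.rnn(x.transpose(1, 2))     # temporal modeling over frames
        return self.decoder(x.transpose(1, 2)).squeeze(1)

def recognize(noisy_wave, enhancer, asr_model):
    # Fully decoupled inference: the enhancer and the ASR model are trained
    # separately; the enhanced audio is passed to the recognizer as if it
    # were clean speech.
    with torch.no_grad():
        enhanced = enhancer(noisy_wave)
        return asr_model(enhanced)

# Usage with random audio and a stub recognizer, just to show the data flow.
enhancer = TimeDomainEnhancer().eval()
noisy = torch.randn(1, 16000)                  # 1 second of 16 kHz audio
print(recognize(noisy, enhancer, asr_model=lambda w: w.shape))

Because the two components are trained independently, the acoustic model never sees enhanced (and possibly distorted) speech during training; the paper's claim is that a strong time-domain enhancer makes this fully decoupled setup competitive.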
Pages: 4913-4917
Number of pages: 5