An end-to-end integration of speech separation and recognition with self-supervised learning representation

Cited: 0
Authors
Masuyama, Yoshiki [1 ,2 ]
Chang, Xuankai [3 ]
Zhang, Wangyou [4 ]
Cornell, Samuele [3 ]
Wang, Zhong-Qiu [5 ]
Ono, Nobutaka [2 ]
Qian, Yanmin [4 ]
Watanabe, Shinji [3 ]
Affiliations
[1] Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA 02139, USA
[2] Tokyo Metropolitan University, Department of Computer Science, Tokyo, Japan
[3] Carnegie Mellon University, Language Technologies Institute, Pittsburgh, PA, USA
[4] Shanghai Jiao Tong University, Department of Computer Science and Engineering, Shanghai, China
[5] Southern University of Science and Technology, Department of Computer Science and Engineering, Shenzhen, China
Keywords
Speech separation; Automatic speech recognition; Self-supervised learning; Joint training; Multi-task learning; Enhancement; Dereverberation
DOI
10.1016/j.csl.2025.101813
CLC number
TP18 (Artificial intelligence theory)
Discipline codes
081104; 0812; 0835; 1405
Abstract
Multi-speaker automatic speech recognition (ASR) has attracted growing attention in a wide range of applications, including conversation analysis and human-computer interaction. Speech separation and enhancement (SSE) and single-speaker ASR have seen remarkable performance improvements with the rapid advances in deep learning. Complex spectral mapping, which predicts the short-time Fourier transform (STFT) coefficients of each speaker, has achieved promising results on several SSE benchmarks. Meanwhile, self-supervised learning representations (SSLRs) have demonstrated significant advantages in single-speaker ASR. In this work, we push forward the performance of multi-speaker ASR under noisy reverberant conditions by integrating powerful SSE, self-supervised learning (SSL), and ASR models in an end-to-end manner. We systematically investigate both monaural and multi-channel SSE methods and various feature representations. Our experiments demonstrate the advantages of the recently proposed complex spectral mapping and of SSLRs in multi-speaker ASR. The results also confirm that end-to-end fine-tuning with an ASR criterion is important for achieving state-of-the-art word error rates (WERs), even with powerful pre-trained models. Moreover, we show a performance trade-off between SSE and ASR and mitigate it with a multi-task learning framework that uses both SSE and ASR criteria.
Pages: 18