An end-to-end integration of speech separation and recognition with self-supervised learning representation

Cited: 0
Authors
Masuyama, Yoshiki [1 ,2 ]
Chang, Xuankai [3 ]
Zhang, Wangyou [4 ]
Cornell, Samuele [3 ]
Wang, Zhong-Qiu [5 ]
Ono, Nobutaka [2 ]
Qian, Yanmin [4 ]
Watanabe, Shinji [3 ]
Affiliations
[1] Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA 02139, USA
[2] Tokyo Metropolitan University, Department of Computer Science, Tokyo, Japan
[3] Carnegie Mellon University, Language Technologies Institute, Pittsburgh, PA, USA
[4] Shanghai Jiao Tong University, Department of Computer Science and Engineering, Shanghai, China
[5] Southern University of Science and Technology, Department of Computer Science and Engineering, Shenzhen, China
Keywords
Speech separation; Automatic speech recognition; Self-supervised learning; Joint training; Multi-task learning; Enhancement; Dereverberation
DOI
10.1016/j.csl.2025.101813
CLC number
TP18 (Artificial intelligence theory)
Discipline codes
081104; 0812; 0835; 1405
Abstract
Multi-speaker automatic speech recognition (ASR) has attracted growing attention in a wide range of applications, including conversation analysis and human-computer interaction. Speech separation and enhancement (SSE) and single-speaker ASR have seen remarkable performance improvements with the rapid advances in deep learning. Complex spectral mapping, which predicts the short-time Fourier transform (STFT) coefficients of each speaker, has achieved promising results on several SSE benchmarks. Meanwhile, self-supervised learning representations (SSLRs) have demonstrated significant advantages in single-speaker ASR. In this work, we push forward the performance of multi-speaker ASR under noisy reverberant conditions by integrating powerful SSE, self-supervised learning (SSL), and ASR models in an end-to-end manner. We systematically investigate both monaural and multi-channel SSE methods and various feature representations. Our experiments demonstrate the advantages of the recently proposed complex spectral mapping and of SSLRs in multi-speaker ASR. The results also confirm that end-to-end fine-tuning with an ASR criterion is important for achieving state-of-the-art word error rates (WERs), even with powerful pre-trained models. Moreover, we show a performance trade-off between SSE and ASR and mitigate it with a multi-task learning framework that uses both SSE and ASR criteria.
Pages: 18