An end-to-end integration of speech separation and recognition with self-supervised learning representation

Times Cited: 0
Authors
Masuyama, Yoshiki [1 ,2 ]
Chang, Xuankai [3 ]
Zhang, Wangyou [4 ]
Cornell, Samuele [3 ]
Wang, Zhong-Qiu [5 ]
Ono, Nobutaka [2 ]
Qian, Yanmin [4 ]
Watanabe, Shinji [3 ]
Affiliations
[1] Mitsubishi Elect Res Labs MERL, Cambridge, MA 02139 USA
[2] Tokyo Metropolitan Univ, Dept Comp Sci, Tokyo, Japan
[3] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA USA
[4] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, Shanghai, Peoples R China
[5] Southern Univ Sci & Technol, Dept Comp Sci & Engn, Shenzhen, Peoples R China
Keywords
Speech separation; Automatic speech recognition; Self-supervised learning; Joint training; Multi-task learning; Enhancement; Dereverberation
DOI
10.1016/j.csl.2025.101813
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Multi-speaker automatic speech recognition (ASR) has gained growing attention in a wide range of applications, including conversation analysis and human-computer interaction. Speech separation and enhancement (SSE) and single-speaker ASR have seen remarkable performance improvements with the rapid advances in deep learning. Complex spectral mapping, which predicts the short-time Fourier transform (STFT) coefficients of each speaker, has achieved promising results on several SSE benchmarks. Meanwhile, self-supervised learning representation (SSLR) has demonstrated a significant advantage in single-speaker ASR. In this work, we push forward the performance of multi-speaker ASR under noisy reverberant conditions by integrating powerful SSE, SSLR, and ASR models in an end-to-end manner. We systematically investigate both monaural and multi-channel SSE methods and various feature representations. Our experiments demonstrate the advantages of recently proposed complex spectral mapping and SSLRs in multi-speaker ASR. The experimental results also confirm that end-to-end fine-tuning with an ASR criterion is important for achieving state-of-the-art word error rates (WERs), even with powerful pre-trained models. Moreover, we identify a performance trade-off between SSE and ASR and mitigate it with a multi-task learning framework that combines SSE and ASR criteria.
Pages: 18
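
As a rough illustration of the joint training described in the abstract, below is a minimal PyTorch sketch of a multi-task objective that combines an ASR criterion with a permutation-invariant SSE criterion on complex STFT coefficients. This is a sketch under assumed tensor shapes; the function names, the CTC stand-in for the ASR criterion, and the weight `mtl_weight` are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of a joint SSE + ASR multi-task loss.
# All names and shapes here are assumptions for exposition only.
import torch
import torch.nn.functional as F

def sse_loss(est_stft, ref_stft):
    # Complex spectral mapping criterion: L1 distance between estimated
    # and reference complex STFT coefficients (shape: batch x freq x time).
    return (est_stft - ref_stft).abs().mean()

def pit_loss(est, ref, loss_fn):
    # Two-speaker permutation-invariant training (PIT): evaluate both
    # speaker orderings and keep the cheaper one. For simplicity the
    # permutation is chosen at the batch level; per-utterance PIT would
    # reduce over frequency/time only before taking the minimum.
    loss_a = loss_fn(est[:, 0], ref[:, 0]) + loss_fn(est[:, 1], ref[:, 1])
    loss_b = loss_fn(est[:, 0], ref[:, 1]) + loss_fn(est[:, 1], ref[:, 0])
    return torch.minimum(loss_a, loss_b)

def joint_loss(asr_log_probs, targets, input_lengths, target_lengths,
               est_stft, ref_stft, mtl_weight=0.3):
    # Multi-task objective: an ASR criterion (CTC here, as an example)
    # plus a weighted SSE criterion on the separated complex spectra
    # (est_stft, ref_stft: batch x 2 speakers x freq x time).
    # asr_log_probs: (time, batch, vocab) log-softmax outputs.
    loss_asr = F.ctc_loss(asr_log_probs, targets,
                          input_lengths, target_lengths)
    loss_sse = pit_loss(est_stft, ref_stft, sse_loss)
    return loss_asr + mtl_weight * loss_sse
```

In the paper's framework, the ASR criterion is computed by the end-to-end ASR model on SSLR features extracted from the separated signals; CTC merely stands in for that criterion here, and the weighting between the two losses is a tunable hyperparameter.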