An end-to-end integration of speech separation and recognition with self-supervised learning representation

Times Cited: 0
Authors
Masuyama, Yoshiki [1 ,2 ]
Chang, Xuankai [3 ]
Zhang, Wangyou [4 ]
Cornell, Samuele [3 ]
Wang, Zhong-Qiu [5 ]
Ono, Nobutaka [2 ]
Qian, Yanmin [4 ]
Watanabe, Shinji [3 ]
Affiliations
[1] Mitsubishi Elect Res Labs MERL, Cambridge, MA 02139 USA
[2] Tokyo Metropolitan Univ, Dept Comp Sci, Tokyo, Japan
[3] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA USA
[4] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, Shanghai, Peoples R China
[5] Southern Univ Sci & Technol, Dept Comp Sci & Engn, Shenzhen, Peoples R China
Keywords
Speech separation; Automatic speech recognition; Self-supervised learning; Joint training; Multi-task learning; Enhancement; Dereverberation
DOI
10.1016/j.csl.2025.101813
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Multi-speaker automatic speech recognition (ASR) has gained growing attention in a wide range of applications, including conversation analysis and human-computer interaction. Speech separation and enhancement (SSE) and single-speaker ASR have seen remarkable performance improvements with the rapid advances in deep learning. Complex spectral mapping, which predicts the short-time Fourier transform (STFT) coefficients of each speaker, has achieved promising results on several SSE benchmarks. Meanwhile, self-supervised learning representation (SSLR) has demonstrated a significant advantage in single-speaker ASR. In this work, we push forward the performance of multi-speaker ASR under noisy reverberant conditions by integrating powerful SSE, SSLR, and ASR models in an end-to-end manner. We systematically investigate both monaural and multi-channel SSE methods and various feature representations. Our experiments demonstrate the advantages of recently proposed complex spectral mapping and SSLRs in multi-speaker ASR. The experimental results also confirm that end-to-end fine-tuning with an ASR criterion is important for achieving state-of-the-art word error rates (WERs), even with powerful pre-trained models. Moreover, we identify a performance trade-off between SSE and ASR and mitigate it with a multi-task learning framework that combines SSE and ASR criteria.
Pages: 18
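
As a rough illustration of the joint training described in the abstract, below is a minimal PyTorch sketch of a multi-task objective that combines an ASR criterion with a permutation-invariant SSE criterion on complex STFT coefficients. This is a sketch under assumed tensor shapes; the function names, the CTC stand-in for the ASR criterion, and the weight `mtl_weight` are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of a joint SSE + ASR multi-task loss.
# All names and shapes here are assumptions for exposition only.
import torch
import torch.nn.functional as F

def sse_loss(est_stft, ref_stft):
    # Complex spectral mapping criterion: L1 distance between estimated
    # and reference complex STFT coefficients (shape: batch x freq x time).
    return (est_stft - ref_stft).abs().mean()

def pit_loss(est, ref, loss_fn):
    # Two-speaker permutation-invariant training (PIT): evaluate both
    # speaker orderings and keep the cheaper one. For simplicity the
    # permutation is chosen at the batch level; per-utterance PIT would
    # reduce over frequency/time only before taking the minimum.
    loss_a = loss_fn(est[:, 0], ref[:, 0]) + loss_fn(est[:, 1], ref[:, 1])
    loss_b = loss_fn(est[:, 0], ref[:, 1]) + loss_fn(est[:, 1], ref[:, 0])
    return torch.minimum(loss_a, loss_b)

def joint_loss(asr_log_probs, targets, input_lengths, target_lengths,
               est_stft, ref_stft, mtl_weight=0.3):
    # Multi-task objective: an ASR criterion (CTC here, as an example)
    # plus a weighted SSE criterion on the separated complex spectra
    # (est_stft, ref_stft: batch x 2 speakers x freq x time).
    # asr_log_probs: (time, batch, vocab) log-softmax outputs.
    loss_asr = F.ctc_loss(asr_log_probs, targets,
                          input_lengths, target_lengths)
    loss_sse = pit_loss(est_stft, ref_stft, sse_loss)
    return loss_asr + mtl_weight * loss_sse
```

In the paper's framework, the ASR criterion is computed by the end-to-end ASR model on SSLR features extracted from the separated signals; CTC merely stands in for that criterion here, and the weighting between the two losses is a tunable hyperparameter.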