CONTINUOUS SPEECH SEPARATION WITH CONFORMER

Cited by: 87
Authors
Chen, Sanyuan [2 ]
Wu, Yu [1 ]
Chen, Zhuo [1 ]
Wu, Jian [1 ]
Li, Jinyu [1 ]
Yoshioka, Takuya [1 ]
Wang, Chengyi [1 ]
Liu, Shujie [1 ]
Zhou, Ming [1 ]
Affiliations
[1] Microsoft Corp, Redmond, WA 98052 USA
[2] Harbin Inst Technol, Harbin, Peoples R China
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
Multi-speaker ASR; Transformer; Conformer; Continuous speech separation;
DOI
10.1109/ICASSP39728.2021.9413423
CLC Number (Chinese Library Classification)
O42 [Acoustics];
Discipline Classification Code
070206; 082403
Abstract
Continuous speech separation was recently proposed to deal with overlapped speech in natural conversations. While it was shown to significantly improve the speech recognition performance for multi-channel conversation transcription, its effectiveness has yet to be proven for a single-channel recording scenario. This paper examines the use of the Conformer architecture in lieu of recurrent neural networks for the separation model. Conformer allows the separation model to efficiently capture both local and global context information, which is helpful for speech separation. Experimental results using the LibriCSS dataset show that the Conformer separation model achieves state-of-the-art results for both single-channel and multi-channel settings. Results for real meeting recordings are also presented, showing significant performance gains in both word error rate (WER) and speaker-attributed WER.
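For orientation, the sketch below shows a minimal Conformer block of the kind the abstract refers to: a self-attention module captures global context while a depthwise convolution module captures local context, sandwiched between two half-step feed-forward modules. This is an illustrative PyTorch sketch, not the authors' code; all hyperparameters (d_model=256, 4 heads, kernel size 33) are assumptions for illustration and do not reflect the separation model configuration reported in the paper.

import torch
import torch.nn as nn


class ConformerBlock(nn.Module):
    """Illustrative Conformer block: FF -> self-attention -> conv -> FF (macaron style)."""

    def __init__(self, d_model=256, n_heads=4, conv_kernel=33, ff_mult=4, dropout=0.1):
        super().__init__()
        # First half-step feed-forward module
        self.ff1 = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, ff_mult * d_model),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(ff_mult * d_model, d_model),
            nn.Dropout(dropout),
        )
        # Multi-head self-attention module (global context)
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        # Convolution module (local context): pointwise conv + GLU, depthwise conv, pointwise conv
        self.conv_norm = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, 2 * d_model, kernel_size=1),
            nn.GLU(dim=1),
            nn.Conv1d(d_model, d_model, kernel_size=conv_kernel,
                      padding=conv_kernel // 2, groups=d_model),
            nn.BatchNorm1d(d_model),
            nn.SiLU(),
            nn.Conv1d(d_model, d_model, kernel_size=1),
            nn.Dropout(dropout),
        )
        # Second half-step feed-forward module
        self.ff2 = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, ff_mult * d_model),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(ff_mult * d_model, d_model),
            nn.Dropout(dropout),
        )
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (batch, time, d_model) sequence of frame-level features
        x = x + 0.5 * self.ff1(x)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        c = self.conv_norm(x).transpose(1, 2)   # (batch, d_model, time) for Conv1d
        x = x + self.conv(c).transpose(1, 2)
        x = x + 0.5 * self.ff2(x)
        return self.final_norm(x)


if __name__ == "__main__":
    block = ConformerBlock()
    frames = torch.randn(2, 100, 256)           # e.g. 100 STFT frames per utterance
    print(block(frames).shape)                   # torch.Size([2, 100, 256])

In a continuous speech separation setup, a stack of such blocks would map a sequence of mixture spectrogram frames to time-frequency masks for a fixed number of output channels; the exact stacking and mask estimation head used by the authors are described in the paper itself.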
Pages: 5749-5753
Number of pages: 5