Overlap Aware Continuous Speech Separation without Permutation Invariant Training

Cited by: 0
Authors
Yu, Linfeng [1 ]
Zhang, Wangyou [1 ]
Li, Chenda [1 ]
Qian, Yanmin [1 ]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, X-LANCE Lab, MoE Key Lab of Artificial Intelligence, AI Inst, Shanghai, Peoples R China
Source
INTERSPEECH 2023 | 2023
Keywords
Continuous speech separation; overlapping speech detection; permutation-free training
DOI
10.21437/Interspeech.2023-1530
CLC Classification
O42 [Acoustics]
Discipline Codes
070206; 082403
Abstract
Continuous speech separation (CSS) aims to separate a long-form signal with multiple partially overlapped utterances into a set of non-overlapped speech signals. While most existing CSS methods rely on the permutation invariant training (PIT) algorithm for training and inference, we argue that one may not need PIT at all to achieve promising CSS performance. In this paper, we propose a novel overlap-aware CSS method, which explicitly identifies the non-overlapped segments in the long-form input to guide the separation of overlapped segments. We show that with the help of an external overlapping speech detection (OSD) model, an overlap-aware CSS model can be trained without PIT. In addition, an overlap-aware inference algorithm is proposed to greatly reduce the computational cost while preserving strong performance. Experimental results show that our proposed methods outperform the conventional stitching-based CSS approach, with over 1 dB signal-to-noise ratio (SNR) improvement.
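The abstract outlines an inference scheme in which an external OSD model decides where the separator needs to run. As a rough illustration only, the Python sketch below shows one plausible organization of such overlap-aware inference; the osd_model and separator interfaces, the two-channel output convention, the hop size, and the channel-routing rule are all assumptions for this sketch, not details taken from the paper.

```python
import numpy as np

def overlap_aware_inference(signal, osd_model, separator, hop=160):
    """Sketch of OSD-gated CSS inference (all interfaces hypothetical).

    The OSD model labels each frame as overlapped (1) or single-speaker
    (0); the expensive separator runs only on overlapped regions, while
    single-speaker regions are routed straight to one output channel.
    """
    # 1) Frame-level overlap detection: one 0/1 flag per analysis frame.
    flags = osd_model.predict(signal)  # assumed shape: (num_frames,)

    # 2) Run-length group consecutive frames that share the same flag.
    segments, start = [], 0
    for i in range(1, len(flags)):
        if flags[i] != flags[start]:
            segments.append((start, i, flags[start]))
            start = i
    segments.append((start, len(flags), flags[start]))

    # 3) Separate overlapped segments; pass the rest through unchanged.
    out = [np.zeros_like(signal), np.zeros_like(signal)]
    for f_start, f_end, is_overlap in segments:
        s, e = f_start * hop, min(f_end * hop, len(signal))
        if is_overlap:
            est1, est2 = separator(signal[s:e])  # two-speaker separation
            out[0][s:e], out[1][s:e] = est1, est2
        else:
            out[0][s:e] = signal[s:e]  # single speaker: nothing to permute
    return out
```

Gating the separator this way is what would yield the claimed computational savings: single-speaker regions skip the separator entirely and also sidestep the output-permutation ambiguity that PIT is normally needed to resolve.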
Pages: 3512-3516
Page count: 5