ANALYSIS OF ROBUSTNESS OF DEEP SINGLE-CHANNEL SPEECH SEPARATION USING CORPORA CONSTRUCTED FROM MULTIPLE DOMAINS

Times Cited: 0
Authors
Maciejewski, Matthew [1]
Sell, Gregory [2]
Fujita, Yusuke [1,3]
Garcia-Perera, Leibny Paola [1]
Watanabe, Shinji [1]
Khudanpur, Sanjeev [1,2]
Affiliations
[1] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
[2] Johns Hopkins Univ, Human Language Technol Ctr Excellence, Baltimore, MD 21218 USA
[3] Hitachi Ltd, Res & Dev Grp, Tokyo, Japan
Source
2019 IEEE WORKSHOP ON APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS (WASPAA), 2019
Keywords
single-channel speech separation; deep learning; far-field speech;
DOI
10.1109/WASPAA.2019.8937153
Chinese Library Classification
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Deep-learning-based single-channel speech separation has been studied with great success, though evaluations have typically been limited to relatively controlled environments based on clean, near-field, and read speech. This work investigates the robustness of such representative techniques in more realistic environments with multiple and diverse conditions. To this end, we first construct datasets from the Mixer 6 and CHiME-5 corpora, featuring studio interviews and dinner parties respectively, using a procedure carefully designed to generate synthetic overlap data suitable both for evaluation and for training deep learning models. Using these new datasets, we demonstrate the substantial shortcomings of these separation techniques in mismatched conditions. Although multi-condition training largely mitigates the performance degradation in near-field conditions, an important finding is that both matched and multi-condition training fall significantly short of oracle performance in far-field conditions, which argues for extending existing separation techniques to handle far-field, highly reverberant speech mixtures.
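The synthetic-overlap construction the abstract describes rests on a standard core step: mixing a target and an interfering utterance at a controlled signal-to-interference ratio. A minimal sketch of that step, assuming equal-length waveforms; the function name `mix_at_snr` and the exact scaling convention are illustrative assumptions, not the paper's corpus-specific pipeline (which additionally handles segment selection and far-field recording conditions):

```python
import numpy as np

def mix_at_snr(target, interferer, snr_db):
    """Mix two equal-length waveforms at a given target-to-interferer SNR (dB).

    The interferer is rescaled so that
        10 * log10(power(target) / power(scaled interferer)) == snr_db,
    then added to the target. Returns (mixture, scaled_interferer).
    Illustrative only; not the paper's actual overlap-generation procedure.
    """
    target = np.asarray(target, dtype=np.float64)
    interferer = np.asarray(interferer, dtype=np.float64)
    p_t = np.mean(target ** 2)          # average power of the target
    p_i = np.mean(interferer ** 2)      # average power of the interferer
    scale = np.sqrt(p_t / (p_i * 10.0 ** (snr_db / 10.0)))
    scaled = interferer * scale
    return target + scaled, scaled
```

Evaluation against the unmixed sources (e.g. with SDR or SI-SNR) is then possible because the clean target is retained alongside the mixture.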
Pages: 165-169 (5 pages)