FRAME-BASED OVERLAPPING SPEECH DETECTION USING CONVOLUTIONAL NEURAL NETWORKS

Cited by: 0
Authors
Yousefi, Midia [1]
Hansen, John H. L. [1]
Affiliations
[1] Univ Texas Dallas, Erik Jonsson Sch Engn, Ctr Robust Speech Syst CRSS, Richardson, TX 75083 USA
Source
2020 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING | 2020
Keywords
overlapping speech detection; co-channel speech detection; mixed speech; source counting; convolutional neural networks; speech separation; diarization
DOI
10.1109/icassp40776.2020.9053108
Chinese Library Classification number
O42 [Acoustics]
Discipline classification codes
070206; 082403
Abstract
Naturalistic speech recordings usually contain speech signals from multiple speakers. This overlap can degrade the performance of speech technologies because tracing and recognizing individual speakers becomes more complex. In this study, we investigate the detection of overlapping speech on segments as short as 25 ms using Convolutional Neural Networks. We evaluate the detection performance using different spectral features, and show that pyknogram features outperform other commonly used speech features. The proposed system can predict overlapping speech with an accuracy of 84% and an F-score of 88% on a dataset of mixed speech generated from the GRID dataset.
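As a rough illustration of the frame-level setup described in the abstract, the following minimal Python sketch cuts a signal into 25 ms frames and scores binary per-frame predictions with accuracy and F-score. It is not the authors' implementation: the function names, the 16 kHz sample rate, the use of non-overlapping frames, and the label convention (1 = overlapping speech, 0 = single speaker) are all assumptions made for this example, and the CNN and pyknogram feature extraction are omitted.

```python
def split_into_frames(signal, sample_rate, frame_ms=25.0):
    """Cut a 1-D sequence into non-overlapping frames of frame_ms milliseconds.

    Assumption: frames do not overlap; trailing samples that do not fill
    a whole frame are dropped.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    n_frames = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def accuracy_and_f_score(y_true, y_pred):
    """Return (accuracy, F-score) for binary frame labels.

    Assumption: label 1 marks an overlapping-speech frame (positive class).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)


# One second of placeholder audio at an assumed 16 kHz sample rate.
frames = split_into_frames([0.0] * 16000, sample_rate=16000)

# Hypothetical reference labels and predictions for five frames.
acc, f_score = accuracy_and_f_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Under these assumptions each frame holds 400 samples, so one second of audio yields 40 frames, each of which would receive one overlap/no-overlap decision.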
Pages: 6744-6748
Number of pages: 5