SYNTHETIC DATA FOR DNN-BASED DOA ESTIMATION OF INDOOR SPEECH

Cited by: 11
Authors
Gelderblom, Femke B. [1 ,2 ]
Liu, Yi [2 ]
Kvam, Johannes [2 ]
Myrvoll, Tor Andre [1 ,2 ]
Affiliations
[1] NTNU, Trondheim, Norway
[2] SINTEF, Trondheim, Norway
Source
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) | 2021
Keywords
synthetic data; speech source localization; direction of arrival estimation; room impulse response; deep neural network; generalized cross correlation features; TIME;
DOI
10.1109/ICASSP39728.2021.9414415
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification codes
070206 ; 082403 ;
Abstract
This paper investigates the use of different room impulse response (RIR) simulation methods for synthesizing training data for deep neural network-based direction of arrival (DOA) estimation of speech in reverberant rooms. Different sets of synthetic RIRs are obtained using the image source method (ISM) and more advanced methods including diffuse reflections and/or source directivity. Multi-layer perceptron (MLP) deep neural network (DNN) models are trained on generalized cross correlation (GCC) features extracted for each set. Finally, models are tested on features obtained from measured RIRs. This study shows the importance of training with RIRs from directive sources, as resultant DOA models achieved up to 51% error reduction compared to the steered response power with phase transform (SRP-PHAT) baseline (significant with p << .01), while models trained with RIRs from omnidirectional sources did worse than the baseline. The performance difference was specifically present when estimating the azimuth of speakers not facing the array directly.
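As a rough illustration of the generalized cross correlation features mentioned in the abstract, the sketch below computes a GCC vector for one microphone pair over a limited lag range. It assumes PHAT weighting, which the abstract names only for the SRP-PHAT baseline and does not confirm for the DNN input features. The function name `gcc_phat` and the parameter `max_lag` are hypothetical, and nothing here reproduces the authors' actual feature extraction settings.

```python
import numpy as np

def gcc_phat(x, y, max_lag):
    """GCC with PHAT weighting for lags -max_lag..max_lag (in samples).

    Hypothetical helper for illustration; max_lag must be >= 1.
    A positive peak lag means x is a delayed copy of y.
    """
    n = len(x) + len(y)              # zero-pad to avoid circular wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)           # cross-power spectrum
    cross /= np.abs(cross) + 1e-12   # PHAT: discard magnitude, keep phase
    cc = np.fft.irfft(cross, n=n)
    # Negative lags sit at the end of cc; move them in front of lag 0.
    return np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))

# Usage: white noise delayed by 5 samples on the first channel.
rng = np.random.default_rng(0)
s = rng.standard_normal(1024)
delay = 5
x = np.concatenate((np.zeros(delay), s[:-delay]))
features = gcc_phat(x, s, max_lag=16)
estimated_delay = int(np.argmax(features)) - 16
```

In a setup like the one the abstract describes, vectors of this kind for each microphone pair would be concatenated and passed to the MLP; how many pairs and lags were used is not stated in this record.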
Pages: 4390-4394
Page count: 5
Related papers
19 in total
[1]   IMAGE METHOD FOR EFFICIENTLY SIMULATING SMALL-ROOM ACOUSTICS [J].
ALLEN, JB ;
BERKLEY, DA .
JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 1979, 65 (04) :943-950
[2]  
[Anonymous], 2000, THESIS BROWN U PROVI
[3]   Time-delay estimation via linear interpolation and cross correlation [J].
Benesty, J ;
Chen, JD ;
Huang, YT .
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, 2004, 12 (05) :509-519
[4]  
Bergstra J., 2011, ADV NEURAL INFORM PR, V24
[5]   ROBUST TESTS FOR EQUALITY OF VARIANCES [J].
BROWN, MB ;
FORSYTHE, AB .
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1974, 69 (346) :364-367
[6]  
Chakrabarty S, 2017, IEEE WORK APPL SIG, P136, DOI 10.1109/WASPAA.2017.8170010
[7]  
Diaz-Guerra D, 2018, PR IEEE SEN ARRAY, P617, DOI 10.1109/SAM.2018.8448492
[8]   Reverberation time in "dead" rooms [J].
Eyring, CF .
JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA, 1930, 1 (02) :217-241
[9]   Real-time passive source localization: A practical linear-correction least-squares approach [J].
Huang, YT ;
Benesty, J ;
Elko, GW ;
Mersereau, RM .
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, 2001, 9 (08) :943-956
[10]   Generation of large-scale simulated utterances in virtual rooms to train deep-neural networks for far-field speech recognition in Google Home [J].
Kim, Chanwoo ;
Misra, Ananya ;
Chin, Kean ;
Hughes, Thad ;
Narayanan, Arun ;
Sainath, Tara ;
Bacchiani, Michiel .
18TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2017), VOLS 1-6: SITUATED INTERACTION, 2017, :379-383