Dual input neural networks for positional sound source localization

Cited by: 3
Authors
Grinstein, Eric [1]
Neo, Vincent W. [1]
Naylor, Patrick A. [1]
Affiliations
[1] Imperial Coll London, Dept Elect & Elect Engn, London, England
Funding
Engineering and Physical Sciences Research Council (EPSRC), UK;
Keywords
Sound source localization; Multichannel audio processing; Multimodal machine learning; Convolutional recurrent neural networks; TIME;
DOI
10.1186/s13636-023-00301-x
Chinese Library Classification (CLC) number
O42 [Acoustics];
Discipline classification codes
070206 ; 082403 ;
Abstract
In many signal processing applications, metadata may be advantageously used in conjunction with a high-dimensional signal to produce a desired output. In the case of classical Sound Source Localization (SSL) algorithms, information from high-dimensional, multichannel audio signals received by many distributed microphones is combined with information describing acoustic properties of the scene, such as the microphones' coordinates in space, to estimate the position of a sound source. We introduce Dual Input Neural Networks (DI-NNs) as a simple and effective way to model these two data types in a neural network. We train and evaluate our proposed DI-NN on scenarios of varying difficulty and realism and compare it against an alternative architecture, a classical Least-Squares (LS) method, and a classical Convolutional Recurrent Neural Network (CRNN). Our results show that the DI-NN significantly outperforms the baselines, achieving a localization error five times lower than that of the LS method and two times lower than that of the CRNN on a test dataset of real recordings.
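The dual-input idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the two inputs are fused, so the concatenation-based fusion below is an assumption, as are all names (`dense`, `init_layer`, `dual_input_forward`), the layer sizes, and the random untrained weights. It is not the authors' architecture; it only shows a high-dimensional signal branch being combined with low-dimensional metadata (microphone coordinates) to produce a 2D position estimate.

```python
import math
import random


def dense(x, weights, bias):
    """Affine layer followed by tanh: one output per row of `weights`."""
    return [
        math.tanh(sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def init_layer(n_in, n_out, rng):
    """Hypothetical initializer: small random weights, zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dual_input_forward(signal_features, metadata, params):
    # Branch 1: embed the high-dimensional signal features.
    embedding = dense(signal_features, *params["signal"])
    # Fusion (assumed): concatenate the embedding with the metadata vector.
    fused = embedding + list(metadata)
    hidden = dense(fused, *params["fusion"])
    # Linear output layer: estimated (x, y) source coordinates.
    weights, bias = params["out"]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(weights, bias)]


rng = random.Random(0)
n_features, n_mics = 64, 4  # illustrative sizes, not taken from the paper
params = {
    "signal": init_layer(n_features, 16, rng),
    "fusion": init_layer(16 + 2 * n_mics, 16, rng),
    "out": init_layer(16, 2, rng),
}

signal_features = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
mic_coords = [0.0, 0.0, 4.0, 0.0, 4.0, 3.0, 0.0, 3.0]  # (x, y) per microphone, in metres
estimate = dual_input_forward(signal_features, mic_coords, params)
```

Because the weights are untrained, `estimate` is meaningless as a position; the point is only that the output depends on both the signal features and the microphone coordinates.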
Pages: 12