Localizing speakers in multiple rooms by using Deep Neural Networks

Cited by: 28
Authors
Vesperini, Fabio [1]
Vecchitti, Paolo [1]
Principi, Emanuele [1]
Squartini, Stefano [1]
Piazza, Francesco [1]
Affiliations
[1] Univ Politecn Marche, Dept Informat Engn, Via Brecce Bianche, I-60131 Ancona, Italy
Keywords
Acoustic source localization; Speaker localization; GCC-PHAT; Deep Neural Networks; Convolutional Neural Networks; Computational Audio Processing; SOURCE LOCALIZATION; COMMAND RECOGNITION; TIME-DELAY
DOI
10.1016/j.csl.2017.12.002
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In the field of human speech capturing systems, a fundamental role is played by source localization algorithms. In this paper, a Speaker Localization algorithm (SLOC) based on Deep Neural Networks (DNN) is evaluated and compared with state-of-the-art approaches. The speaker position in the room under analysis is determined directly by the DNN, making the proposed algorithm fully data-driven. Two different neural network architectures are investigated: the Multi Layer Perceptron (MLP) and Convolutional Neural Networks (CNN). GCC-PHAT (Generalized Cross Correlation-PHAse Transform) Patterns, computed from the audio signals captured by the microphones, are used as input features for the DNN. In particular, a multi-room case study is addressed, where the acoustic scene of each room is influenced by sounds emitted in the other rooms. The algorithm is tested on the home-recorded DIRHA dataset, characterized by multiple wall and ceiling microphone signals for each room. In detail, the focus is on the speaker localization task in two distinct neighboring rooms. As a term of comparison, two algorithms proposed in the literature for the addressed application context are evaluated: the Cross power Spectrum Phase Speaker Localization (CSP-SLOC) and the Steered Response Power using the Phase Transform speaker localization (SRP-SLOC). Besides providing an extensive analysis of the proposed method, the article shows how the DNN-based algorithm significantly outperforms the state-of-the-art approaches evaluated on the DIRHA dataset, providing an average localization error, expressed in terms of Root Mean Square Error (RMSE), equal to 324 mm and 367 mm, respectively, for the Simulated and the Real subsets. © 2017 Elsevier Ltd. All rights reserved.
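The abstract names GCC-PHAT patterns between microphone signals as the network input. As a rough illustration only, the following is a minimal NumPy sketch of GCC-PHAT for a single microphone pair. It is not the authors' implementation: the function name `gcc_phat`, its parameters (`fs`, `max_tau`), and the zero-padding and normalization choices are assumptions made for this example.

```python
import numpy as np


def gcc_phat(sig, ref, fs, max_tau=None):
    """Hypothetical helper: GCC-PHAT between two 1-D signals.

    Returns (delay_seconds, cc), where delay_seconds is the estimated
    delay of `sig` relative to `ref` and cc is the PHAT-weighted
    cross-correlation over lags -max_shift..+max_shift.
    """
    # Zero-pad so the circular correlation does not wrap around.
    n = len(sig) + len(ref)
    sig_f = np.fft.rfft(sig, n=n)
    ref_f = np.fft.rfft(ref, n=n)

    # Cross-power spectrum, with PHAT weighting: keep phase only.
    cross = sig_f * np.conj(ref_f)
    cross /= np.maximum(np.abs(cross), 1e-12)  # guard against division by zero

    cc = np.fft.irfft(cross, n=n)

    # Restrict to physically plausible lags if max_tau is given (assumption).
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)

    # Reorder so negative lags come first, then lag 0, then positive lags.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))

    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs, cc
```

In a setup like the one described, the `cc` vector for each microphone pair (rather than only the peak delay) would presumably be stacked into the feature pattern fed to the MLP or CNN; the exact feature layout is not specified in the abstract.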
Pages: 83-106
Number of pages: 24