Deep Audio-Visual Beamforming for Speaker Localization

Cited by: 7

Authors:

Qian, Xinyuan [1]

Zhang, Qiquan [1]

Guan, Guohui [1]

Xue, Wei [2]

Affiliations:

[1] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore 117583, Singapore

[2] Hong Kong Baptist Univ, Dept Comp Sci, Hong Kong, Peoples R China

Source:

IEEE SIGNAL PROCESSING LETTERS | 2022 / Vol. 29

Keywords:

Microphones; Location awareness; Correlation; Array signal processing; Visualization; Feature extraction; Delay effects; Audio-visual fusion; speaker localization; variational auto-encoder

DOI:

10.1109/LSP.2022.3165466

Chinese Library Classification:

TM [Electrical Engineering]; TN [Electronic and Communication Technology]

Subject classification codes:

0808; 0809

Abstract:

Generalized Cross Correlation (GCC) has been the most popular localization technique over the past decades and can be extended with a beamforming method, e.g., Steered Response Power (SRP), when multiple microphone pairs exist. Considering the promising results of Deep Learning (DL) strategies over classical approaches, in this work, instead of directly using GCC, SRP is derived from the DL-learnt ideal correlation functions for each microphone pair of an array. To exploit visual information, we explore the Conditional Variational Auto-Encoder (CVAE) framework, in which the audio generative process is conditioned on the visual features encoded by face detections. The vision-derived auxiliary correlation function eventually contributes to the back-end beamformer for improved localization performance. To the best of our knowledge, this is the first deep-generative audio-visual method for speaker localization. Experimental results demonstrate superior performance over other competitive methods, especially when the speech signal is corrupted by noise.
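For orientation, the classical pipeline that the abstract takes as its starting point (GCC per microphone pair, then SRP summed over pairs) can be sketched as follows. This is only an illustrative baseline under stated assumptions, not the paper's DL-based method: the PHAT weighting, the function names gcc_phat and srp_power, and the integer-lag candidate grid are all assumptions introduced here, since the abstract does not specify them.

```python
import numpy as np


def gcc_phat(x1, x2, max_lag):
    """PHAT-weighted GCC of two equal-rate signals (PHAT is an assumed choice).

    Returns the correlation for integer lags -max_lag..+max_lag (max_lag >= 1).
    A peak at lag d > 0 means x1 is delayed by d samples relative to x2.
    """
    n = len(x1) + len(x2)              # zero-pad so the correlation is linear, not circular
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)           # cross-power spectrum
    cross /= np.abs(cross) + 1e-12     # PHAT: keep phase only, drop magnitude
    cc = np.fft.irfft(cross, n=n)
    # Negative lags sit at the end of the inverse FFT output; move them to the front.
    return np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))


def srp_power(pair_ccs, max_lag, candidate_lags):
    """Steered response power over a hypothetical grid of candidate locations.

    pair_ccs:       (n_pairs, 2*max_lag+1) correlation functions, one per microphone pair
    candidate_lags: (n_candidates, n_pairs) integer lag each candidate implies for each pair
    Returns one power value per candidate: the sum over pairs of the
    correlation sampled at that candidate's lag.
    """
    pair_ccs = np.asarray(pair_ccs)
    idx = np.asarray(candidate_lags) + max_lag      # shift lag to array index
    pairs = np.arange(pair_ccs.shape[0])
    return pair_ccs[pairs, idx].sum(axis=1)
```

The candidate with the largest srp_power value would be taken as the location estimate. In the approach the abstract describes, the per-pair correlation functions fed to the beamformer are learnt by a network (and complemented by a vision-derived one) rather than computed directly as above.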

Pages: 1132-1136

Page count: 5
