Deep Audio-Visual Beamforming for Speaker Localization

Cited by: 7
Authors
Qian, Xinyuan [1]
Zhang, Qiquan [1]
Guan, Guohui [1]
Xue, Wei [2]
Affiliations
[1] Natl Univ Singapore, Dept Elect & Comp Engn, Singapore 117583, Singapore
[2] Hong Kong Baptist Univ, Dept Comp Sci, Hong Kong, Peoples R China
Keywords
Microphones; Location awareness; Correlation; Array signal processing; Visualization; Feature extraction; Delay effects; Audio-visual fusion; speaker localization; variational auto-encoder
DOI
10.1109/LSP.2022.3165466
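Chinese Library Classification (CLC) number
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification code
0808; 0809
Abstract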
Generalized Cross Correlation (GCC) has been the most popular localization technique over the past decades, and it can be extended with a beamforming method, e.g., Steered Response Power (SRP), when multiple microphone pairs exist. Considering the promising results of Deep Learning (DL) strategies over classical approaches, in this work, instead of directly using GCC, SRP is derived from DL-learnt ideal correlation functions for each microphone pair of an array. To exploit visual information, we explore the Conditional Variational Auto-Encoder (CVAE) framework, in which the audio generative process is conditioned on visual features encoded from face detections. The vision-derived auxiliary correlation function eventually contributes to the back-end beamformer for improved localization performance. To the best of our knowledge, this is the first deep-generative audio-visual method for speaker localization. Experimental results demonstrate superior performance over other competitive methods, especially when the speech signal is corrupted by noise.
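For readers unfamiliar with the classical baseline named in the abstract, the following is a minimal sketch of GCC combined with SRP over all microphone pairs. It is not the authors' method: the paper replaces the per-pair GCC with DL-learnt correlation functions and adds a vision-conditioned CVAE, neither of which is reproduced here. The PHAT weighting, the far-field planar geometry, the 343 m/s speed of sound, and the function names `gcc_phat` and `srp_phat_azimuth` are all assumptions made for illustration.

```python
import itertools
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s (assumed value, not taken from the paper)


def gcc_phat(x, y, max_lag):
    """GCC with PHAT weighting (an assumed weighting) for lags -max_lag..max_lag.

    A peak at lag d (array index max_lag + d) means y lags x by d samples.
    """
    n = len(x) + len(y)                    # zero-pad for linear correlation
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = Y * np.conj(X)                 # cross-power spectrum
    cross /= np.abs(cross) + 1e-12         # PHAT: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)
    # Reorder so that negative lags come first.
    return np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))


def srp_phat_azimuth(signals, mic_xy, fs, n_angles=360):
    """Steered response power over a far-field azimuth grid (hypothetical helper).

    signals: sequence of equal-length 1-D arrays, one per microphone.
    mic_xy:  (M, 2) microphone positions in metres.
    Returns (angles_deg, power); argmax of power is the estimated azimuth.
    """
    mic_xy = np.asarray(mic_xy, dtype=float)
    pairs = list(itertools.combinations(range(len(mic_xy)), 2))
    angles = np.arange(n_angles) * 2.0 * np.pi / n_angles
    # Unit vectors pointing from the array towards each candidate direction.
    directions = np.stack((np.cos(angles), np.sin(angles)), axis=1)

    aperture = max(np.linalg.norm(mic_xy[i] - mic_xy[j]) for i, j in pairs)
    max_lag = int(np.ceil(aperture / SPEED_OF_SOUND * fs)) + 1

    power = np.zeros(n_angles)
    for i, j in pairs:
        cc = gcc_phat(signals[i], signals[j], max_lag)
        # Far-field model: mic j lags mic i by (p_i - p_j) . u / c seconds.
        tdoa = directions @ (mic_xy[i] - mic_xy[j]) / SPEED_OF_SOUND
        lags = np.rint(tdoa * fs).astype(int)
        power += cc[lags + max_lag]        # sum correlations at steered lags
    return np.degrees(angles), power
```

Calling `srp_phat_azimuth(signals, mic_xy, fs)` and taking the `argmax` of the returned power gives an azimuth estimate on a 1-degree grid. Because candidate delays are rounded to whole samples, the resolution of this sketch is limited by the sampling rate and array aperture; in the paper's setting, the learnt correlation functions would take the place of `gcc_phat` inside the pair loop.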
Pages: 1132-1136
Number of pages: 5
Related papers
34 entries in total
  • [1] [Anonymous], 1998, SPOKEN DIALOGUE COMP
  • [2] Antoine D, 2013, THESIS GRENOBLE
  • [3] Ba, 2015, 3 INT C LEARN REPR I, P1412
  • [4] Semi-Supervised Source Localization in Reverberant Environments With Deep Generative Modeling
    Bianco, Michael J.
    Gannot, Sharon
    Fernandez-Grande, Efren
    Gerstoft, Peter
    [J]. IEEE ACCESS, 2021, 9 : 84956 - 84970
  • [5] Energy-based sensor network source localization via projection onto convex sets
    Blatt, Doron
    Hero, Alfred O., III
    [J]. IEEE TRANSACTIONS ON SIGNAL PROCESSING, 2006, 54 (09) : 3614 - 3619
  • [6] Variational Inference: A Review for Statisticians
    Blei, David M.
    Kucukelbir, Alp
    McAuliffe, Jon D.
    [J]. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2017, 112 (518) : 859 - 877
  • [7] Brandstein M., 2013, Microphone Arrays: Signal Processing Techniques and Applications
  • [8] Multi-Speaker DOA Estimation Using Deep Convolutional Networks Trained With Noise Signals
    Chakrabarty, Soumitro
    Habets, Emanuel A. P.
    [J]. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2019, 13 (01) : 8 - 21
  • [9] RetinaFace: Single-shot Multi-level Face Localisation in the Wild
    Deng, Jiankang
    Guo, Jia
    Ververas, Evangelos
    Kotsia, Irene
    Zafeiriou, Stefanos
    [J]. 2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2020 : 5202 - 5211
  • [10] DiBiase JH, 2001, DIGITAL SIGNAL PROC, P157