Multi-speaker Direction of Arrival Estimation Using Audio and Visual Modalities with Convolutional Neural Network

Times cited: 0
Authors
Wu, Yulin [1]
Hu, Ruimin [1]
Wang, Xiaochen [1]
Affiliations
[1] Wuhan Univ, Hubei Key Lab Multimedia & Network Commun Engn, Sch Comp Sci, Natl Engn Res Ctr Multimedia Software, Wuhan, Peoples R China
Source
2023 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, ICME | 2023
Keywords
DoA estimation; 3D-CNNs and 2D-CNNs; residual dense; audio and visual modalities; LOCALIZATION;
DOI
10.1109/ICME55011.2023.00115
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In reality, audible and visible sound sources are closely aligned, which helps humans locate sources accurately. To exploit the complementarity between audio and visual data in multi-speaker direction of arrival (DoA) estimation, we propose a novel network that combines 3D convolutional neural networks (3D-CNNs) and 2D-CNNs with residual dense blocks. It has two main advantages: 1) both the audio and visual input features are low-level signal representations (the real and imaginary parts of the short-time Fourier transform (STFT) coefficients for the audio feature, and pixel coordinates for the visual feature), which allows the network to learn to extract the most informative high-level features; 2) 3D-CNNs with residual dense blocks perform audio and visual feature mapping along the time and frequency axes, and the subsequent 2D-CNNs combine the high-level features along the DoA axis. Experimental results demonstrate promising sound source localization (SSL) performance.
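The low-level audio feature named in the abstract (real and imaginary parts of the STFT coefficients) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `stft_real_imag`, the frame length, the hop size, the Hann window, the 16 kHz sample rate, and the 4-microphone input are all assumptions chosen only for illustration, since the record does not state them.

```python
import numpy as np

def stft_real_imag(signal, frame_len=512, hop=256):
    """Return an array of shape (2, n_frames, n_bins) holding the real and
    imaginary parts of a Hann-windowed STFT (all parameters are assumed)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping frames and apply the window.
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # One-sided FFT per frame: shape (n_frames, frame_len // 2 + 1).
    spec = np.fft.rfft(frames, axis=1)
    # Keep real and imaginary parts as two separate channels.
    return np.stack([spec.real, spec.imag])

# Hypothetical 4-microphone recording: 1 s of noise per channel at 16 kHz.
rng = np.random.default_rng(0)
mics = rng.standard_normal((4, 16000))

# Feature tensor laid out as (microphone, real/imag, time, frequency).
features = np.stack([stft_real_imag(ch) for ch in mics])
print(features.shape)  # (4, 2, 61, 257)
```

A tensor laid out this way keeps explicit time and frequency axes, which is the kind of input a 3D convolution could slide over; how the paper actually arranges its channels is not specified in this record.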
Pages: 636-641
Number of pages: 6