Multimodal System for Audio Scene Source Counting and Analysis

Cited by: 1
Authors
Nigro, Michael [1]
Krishnan, Sridhar [1]
Affiliations
[1] Ryerson Univ, Dept Elect Comp & Biomed Engn, Toronto, ON M5B 2K3, Canada
Keywords
Audio scene analysis; source counting; speaker count estimation; source separation; diarization
DOI
10.1109/TASLP.2022.3156795
Chinese Library Classification number
O42 [Acoustics]
Discipline classification code
070206; 082403
Abstract
Audio scene analysis (ASA) is a challenging and multifaceted task in audio signal processing that uncovers information about the nature of an audio recording. Regardless of the analysis goal, a number of audio sources are observed in any audio scene. However, this consideration is usually not explored or given considerable thought in research. This work aims to demonstrate the utility of audio source counting with a novel solution consisting of a multimodal system for ASA. Both the speaker counting and sound event counting techniques use deep neural networks (DNNs) to predict the number of sources. We present competitive results for audio source counting: for speaker counting we achieve a prediction accuracy of 46.03%, and 89.57% within a margin of error of ±1, which outperforms state-of-the-art systems for similar tasks. For sound event counting we achieve a prediction accuracy of 50.55%, and 86.59% within a margin of error of ±1, which establishes a clear baseline. Our system also demonstrates real-time aspects, with an overall processing time of approximately 0.4614 s per audio recording.
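The two figures reported per task can be read as exact-match accuracy and accuracy within a tolerance of ±1 source. The sketch below illustrates that reading only; the function name `counting_accuracy` and the sample counts are hypothetical and do not come from the paper, and the authors' actual evaluation code may differ.

```python
from typing import Sequence


def counting_accuracy(
    true_counts: Sequence[int],
    predicted_counts: Sequence[int],
    tolerance: int = 0,
) -> float:
    """Fraction of recordings whose predicted source count lies within
    `tolerance` of the true count (tolerance=0 means exact match)."""
    if len(true_counts) != len(predicted_counts):
        raise ValueError("true and predicted counts must have the same length")
    if len(true_counts) == 0:
        raise ValueError("at least one recording is required")
    hits = sum(
        abs(t - p) <= tolerance for t, p in zip(true_counts, predicted_counts)
    )
    return hits / len(true_counts)


# Made-up example data (not from the paper): five recordings.
true_counts = [1, 2, 3, 4, 2]
predicted_counts = [1, 3, 3, 2, 2]

exact = counting_accuracy(true_counts, predicted_counts)  # 3 of 5 exact
within_one = counting_accuracy(true_counts, predicted_counts, tolerance=1)  # 4 of 5

print(f"exact accuracy: {exact:.2%}")
print(f"accuracy within +/-1: {within_one:.2%}")
```

Under this reading the ±1 figure can never be lower than the exact-match figure, which is consistent with the pairs of values quoted in the abstract.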
Pages: 1073-1082
Page count: 10