Zero-Shot Single-Microphone Sound Classification and Localization in a Building Via the Synthesis of Unseen Features

Cited by: 2

Authors:

Lee, Seungjun [1]

Yang, Haesang [1]

Choi, Hwiyong [1]

Seong, Woojae [2, 3]

Affiliations:

[1] Seoul Natl Univ, Dept Naval Architecture & Ocean Engn, Seoul 08826, South Korea

[2] Seoul Natl Univ, Dept Naval Architecture & Ocean Engn, Seoul 08826, South Korea

[3] Seoul Natl Univ, Res Inst Marine Syst Engn, Seoul 08826, South Korea

Source:

IEEE TRANSACTIONS ON MULTIMEDIA | 2022 / Vol. 24

Keywords:

Location awareness; Microphones; Buildings; Feature extraction; Training; Reverberation; Data models; Generative adversarial network; sound classification; sound source localization; zero-shot learning; EVENT LOCALIZATION; NEURAL-NETWORKS; NOISE;

DOI:

10.1109/TMM.2021.3079705

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

In this paper, we propose a learning-based approach to identify the type and position of sounds using a single microphone in a real-world building. We attempt to treat this problem as a joint classification problem in which we predict the exact positions of sounds while classifying the types that are assumed to be from pre-defined types of sounds. The most problematic issue is that while the types are readily classified under supervised learning frameworks with one-hot encoded labels, it is difficult to predict the exact positions of the sound from unseen positions during training. To address this potential discrepancy, we formulate the position identification problem as a zero-shot learning problem inspired by the human ability to perceive new concepts from previously learned concepts. We extract feature representations from audio data and vectorize the type and position of the sound source as 'type/position-aware attributes,' instead of labeling each class with a simple one-hot vector. We then train a promising generative model to bridge the extracted features and the attributes by learning the class-invariant structure to transfer the knowledge from seen to unseen classes through their attributes; generative adversarial networks are conditioned on the class-embeddings. Our proposed methods are evaluated on an indoor noise dataset, SNU-B36-EX, a real-world dataset collected inside a building.
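The idea of describing each class by a type/position-aware attribute vector, rather than a one-hot label, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the type names, the floor and distance coordinates, the normalization ranges, and the helper `class_attribute` are hypothetical and are not taken from the paper or from SNU-B36-EX.

```python
import numpy as np

# Hypothetical label space (assumption): the real sound types are not listed in this record.
SOUND_TYPES = ["type_a", "type_b", "type_c"]
# Assumed normalization ranges for the two position coordinates.
MAX_FLOOR, MAX_DISTANCE_M = 3.0, 12.0

def class_attribute(sound_type, floor, distance_m):
    """Concatenate a one-hot type code with normalized position coordinates."""
    type_part = np.zeros(len(SOUND_TYPES))
    type_part[SOUND_TYPES.index(sound_type)] = 1.0
    position_part = np.array([floor / MAX_FLOOR, distance_m / MAX_DISTANCE_M])
    return np.concatenate([type_part, position_part])

# Two positions assumed to be present in training data.
seen_near = class_attribute("type_a", floor=1, distance_m=0.0)
seen_far = class_attribute("type_a", floor=1, distance_m=6.0)
# A position assumed to be absent from training data.
unseen_mid = class_attribute("type_a", floor=1, distance_m=3.0)

print(unseen_mid)
```

With a one-hot label, the unseen position would have no relation to any training class. With this kind of attribute vector, it lies between the two seen positions, which is the sort of structure a generator conditioned on class embeddings could exploit when synthesizing features for unseen classes.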


Pages: 2339-2351

Page count: 13
