Zero-Shot Single-Microphone Sound Classification and Localization in a Building Via the Synthesis of Unseen Features

Cited by: 2
Authors
Lee, Seungjun [1 ]
Yang, Haesang [1 ]
Choi, Hwiyong [1 ]
Seong, Woojae [2 ,3 ]
Affiliations
[1] Seoul Natl Univ, Dept Naval Architecture & Ocean Engn, Seoul 08826, South Korea
[2] Seoul Natl Univ, Dept Naval Architecture & Ocean Engn, Seoul 08826, South Korea
[3] Seoul Natl Univ, Res Inst Marine Syst Engn, Seoul 08826, South Korea
Keywords
Location awareness; Microphones; Buildings; Feature extraction; Training; Reverberation; Data models; Generative adversarial network; sound classification; sound source localization; zero-shot learning; EVENT LOCALIZATION; NEURAL-NETWORKS; NOISE
DOI
10.1109/TMM.2021.3079705
CLC Number
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
In this paper, we propose a learning-based approach to identify the type and position of sounds using a single microphone in a real-world building. We treat this task as a joint classification problem in which we predict the exact positions of sounds while classifying their types, which are assumed to come from a pre-defined set of sound types. The key difficulty is that, although the types are readily classified under a supervised learning framework with one-hot encoded labels, it is hard to predict the exact position of a sound originating from a position unseen during training. To address this discrepancy, we formulate position identification as a zero-shot learning problem, inspired by the human ability to perceive new concepts from previously learned ones. We extract feature representations from the audio data and vectorize the type and position of the sound source as 'type/position-aware attributes,' instead of labeling each class with a simple one-hot vector. We then train a conditional generative model to bridge the extracted features and the attributes: generative adversarial networks conditioned on the class embeddings learn a class-invariant structure that transfers knowledge from seen to unseen classes through their attributes. The proposed methods are evaluated on SNU-B36-EX, a real-world indoor noise dataset collected inside a building.
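To make the feature-synthesis idea in the abstract concrete, the sketch below illustrates an attribute-conditioned GAN that generates audio features for unseen type/position classes. This is a minimal illustration, not the authors' implementation: the dimensions FEAT_DIM, ATTR_DIM, and NOISE_DIM, the layer sizes, and the plain binary cross-entropy GAN loss are assumptions for readability, and the paper's actual conditioning and loss details may differ.

```python
# Sketch (not the paper's code) of attribute-conditioned feature synthesis
# for zero-shot classes: a generator maps (noise, class attribute) pairs to
# synthetic audio features; a discriminator judges them given the attribute.
import torch
import torch.nn as nn

FEAT_DIM = 128   # assumed dimensionality of extracted audio features
ATTR_DIM = 16    # assumed dimensionality of type/position-aware attributes
NOISE_DIM = 32   # assumed latent-noise dimensionality


class Generator(nn.Module):
    """Maps (noise, class attribute) pairs to synthetic audio features."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + ATTR_DIM, 256), nn.ReLU(),
            nn.Linear(256, FEAT_DIM),
        )

    def forward(self, z, attr):
        return self.net(torch.cat([z, attr], dim=1))


class Discriminator(nn.Module):
    """Scores how realistic a feature is, given the same class attribute."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM + ATTR_DIM, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, feat, attr):
        return self.net(torch.cat([feat, attr], dim=1))


def train_step(G, D, opt_g, opt_d, real_feat, attr):
    """One adversarial update on (feature, attribute) pairs from seen classes.
    After training, G(z, unseen_attr) yields synthetic features for unseen
    positions, on which an ordinary classifier can then be fitted."""
    bce = nn.BCEWithLogitsLoss()
    n = real_feat.size(0)
    z = torch.randn(n, NOISE_DIM)

    # Discriminator: real features vs. generated features for the same attributes.
    fake_feat = G(z, attr).detach()
    d_loss = bce(D(real_feat, attr), torch.ones(n, 1)) + \
             bce(D(fake_feat, attr), torch.zeros(n, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: produce features the discriminator accepts as real.
    g_loss = bce(D(G(z, attr), attr), torch.ones(n, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

Given a hypothetical attribute vector for an unseen position class, sampling G(torch.randn(n, NOISE_DIM), attr) would produce n synthetic features for that class, turning zero-shot position identification into a standard supervised classification problem.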
Pages: 2339-2351
Number of pages: 13