Multichannel environmental sound segmentation with separately trained spectral and spatial features

Cited by: 0
Authors
Yui Sudo
Katsutoshi Itoyama
Kenji Nishida
Kazuhiro Nakadai
Affiliations
[1] Tokyo Institute of Technology, Department of Systems and Control Engineering, School of Engineering
[2] Honda Research Institute Japan Co., Ltd.
Source
Applied Intelligence | 2021 / Vol. 51
Keywords
Environmental sound segmentation; Sound source separation; Inter-channel phase difference; Semantic segmentation;
DOI
Not available
Abstract
This paper proposes a multichannel environmental sound segmentation method. Environmental sound segmentation is an integrated approach that achieves sound source localization, sound source separation and classification simultaneously. When multiple microphones are available, spatial features can be used to improve the localization and separation accuracy of sounds arriving from different directions; however, conventional methods have three drawbacks: (a) When sound source localization and separation using spatial features and classification using spectral features are trained in the same neural network, the network may overfit to the relationship between the direction of arrival and the class of a sound, reducing its reliability on novel events. (b) Although permutation invariant training used in automatic speech recognition could be extended, it is impractical for environmental sounds, which can include an unlimited number of sound sources. (c) Various features, such as the complex values of the short-time Fourier transform and inter-channel phase differences, have been used as spatial features, but no study has compared them. This paper proposes a multichannel environmental sound segmentation method comprising two discrete blocks: a sound source localization and separation block and a sound source separation and classification block. By separating the blocks, overfitting to the relationship between the direction of arrival and the class is avoided. Simulation experiments on synthesized datasets containing 75 classes of environmental sounds showed that the root mean squared error of the proposed method was lower than that of conventional methods.
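To make the spatial features mentioned in (c) concrete, the following is a minimal sketch, not the authors' implementation, of extracting inter-channel phase difference (IPD) features from a multichannel recording. The function name `ipd_features` and the STFT parameters (`fs`, `n_fft`, `hop`, `ref_ch`) are illustrative assumptions, and SciPy is used for the STFT.

```python
import numpy as np
from scipy.signal import stft

def ipd_features(waveform, fs=16000, n_fft=512, hop=160, ref_ch=0):
    """waveform: (channels, samples) array.
    Returns IPD features of shape (channels-1, 2, freqs, frames)."""
    # STFT of every channel; Zxx has shape (channels, freqs, frames)
    _, _, Zxx = stft(waveform, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    phase = np.angle(Zxx)
    # Phase of each non-reference channel relative to the reference channel
    mask = np.arange(phase.shape[0]) != ref_ch
    ipd = phase[mask] - phase[ref_ch]
    # cos/sin encoding keeps the feature continuous across the +/- pi wrap
    return np.stack([np.cos(ipd), np.sin(ipd)], axis=1)

if __name__ == "__main__":
    mix = np.random.randn(4, 16000)   # 1 s of 4-channel noise as a stand-in signal
    feats = ipd_features(mix)
    print(feats.shape)                # (3, 2, 257, number_of_frames)
```

Encoding the IPD as cosine and sine pairs, as in this sketch, is one common way to avoid the discontinuity at ±π; the abstract notes that complex STFT values are an alternative spatial representation.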
Pages: 8245-8259
Number of pages: 14