Multichannel environmental sound segmentation with separately trained spectral and spatial features

Cited by: 0

Authors:

Yui Sudo

Katsutoshi Itoyama

Kenji Nishida

Kazuhiro Nakadai

Affiliations:

[1] Tokyo Institute of Technology, Department of Systems and Control Engineering, School of Engineering

[2] Honda Research Institute Japan Co., Ltd.

Source:

Applied Intelligence | 2021 / Vol. 51

Keywords:

Environmental sound segmentation; Sound source separation; Inter-channel phase difference; Semantic segmentation

DOI:

Not available

Abstract:

This paper proposes a multichannel environmental sound segmentation method. Environmental sound segmentation is an integrated method that performs sound source localization, sound source separation, and classification simultaneously. When multiple microphones are available, spatial features can be used to improve the localization and separation accuracy of sounds arriving from different directions; however, conventional methods have three drawbacks: (a) When sound source localization and separation using spatial features and classification using spectral features are trained in the same neural network, the network may overfit to the relationship between the direction of arrival and the class of a sound, reducing its reliability on novel events. (b) Although permutation invariant training, as used in automatic speech recognition, could be extended, it is impractical for environmental sounds, which include an unlimited number of sound sources. (c) Various features, such as the complex values of the short-time Fourier transform and interchannel phase differences, have been used as spatial features, but no study has compared them. The proposed method comprises two discrete blocks: a sound source localization and separation block, and a sound source separation and classification block. Separating the blocks avoids overfitting to the relationship between the direction of arrival and the class. Simulation experiments using created datasets containing 75 classes of environmental sounds showed that the root mean squared error of the proposed method was lower than that of conventional methods.
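To make the spatial features named in the abstract concrete, the following is a minimal sketch of how an interchannel phase difference (IPD) can be derived from the short-time Fourier transforms of two microphone channels. It is not the authors' implementation: the function names `stft` and `ipd_features`, the Hann window, the frame and hop sizes, the cos/sin encoding of the IPD, and the toy two-microphone signal are all assumptions chosen for illustration.

```python
import cmath
import math


def stft(signal, frame_len=64, hop=32):
    """Naive STFT with a periodic Hann window (illustrative, not optimized).

    Returns a list of frames, each a list of frame_len // 2 + 1 complex bins.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        frames.append([
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(n_bins)
        ])
    return frames


def ipd_features(ref_spec, other_spec):
    """Return (cos IPD, sin IPD) per time-frequency bin for two channels.

    The cos/sin encoding is an assumption; it avoids the 2*pi wrap-around
    of the raw phase difference.
    """
    cos_ipd, sin_ipd = [], []
    for ref_frame, other_frame in zip(ref_spec, other_spec):
        # The phase of X_other * conj(X_ref) is the phase difference
        # between the two channels in that bin.
        phases = [cmath.phase(o * r.conjugate())
                  for r, o in zip(ref_frame, other_frame)]
        cos_ipd.append([math.cos(p) for p in phases])
        sin_ipd.append([math.sin(p) for p in phases])
    return cos_ipd, sin_ipd


# Hypothetical toy input: a sinusoid centered on bin 8 that reaches the
# second microphone 2 samples later than the first.
frame_len, bin_idx, delay = 64, 8, 2
omega = 2 * math.pi * bin_idx / frame_len
mic1 = [math.sin(omega * n) for n in range(256)]
mic2 = [math.sin(omega * (n - delay)) for n in range(256)]

cos_ipd, sin_ipd = ipd_features(stft(mic1, frame_len), stft(mic2, frame_len))
```

In this toy case the IPD in the bin that carries the sinusoid equals minus `omega * delay`, so the delay between microphones (and hence the direction of arrival) is encoded in the phase difference rather than in the magnitude spectrum. Bins with negligible energy in both channels yield arbitrary phase values and would normally be masked or down-weighted.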

Pages: 8245-8259

Page count: 14
