Overlapping sound event recognition using local spectrogram features and the generalised Hough transform

Cited by: 53
Authors
Dennis, J. [1,2]
Tran, H. D. [1]
Chng, E. S. [2]
Affiliations
[1] Institute for Infocomm Research, Singapore 138632, Singapore
[2] Nanyang Technological University, School of Computer Engineering, Singapore 639798, Singapore
Keywords
Overlapping sound event recognition; Local spectrogram features; Keypoint detection; Generalised Hough Transform; Automatic speech recognition; Scale; Noise
DOI
10.1016/j.patrec.2013.02.015
CLC Classification
TP18 [Theory of Artificial Intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
In this paper, we address the challenging task of simultaneous recognition of overlapping sound events from single-channel audio. Conventional frame-based methods are not well suited to the problem, as each time frame contains a mixture of information from multiple sources. Missing-feature masks can improve recognition in such cases, but are limited by the accuracy of the mask, which is a non-trivial problem. In this paper, we propose an approach based on Local Spectrogram Features (LSFs), which represent local spectral information extracted from the two-dimensional region surrounding "keypoints" detected in the spectrogram. The keypoints are designed to locate the sparse, discriminative peaks in the spectrogram, so that we can model sound events through a set of representative LSF clusters and their occurrences in the spectrogram. To recognise overlapping sound events, we use a Generalised Hough Transform (GHT) voting system, which sums the information over many independent keypoints to produce onset hypotheses and can detect an arbitrary combination of sound events in the spectrogram. Each hypothesis is then scored against the class distribution models to recognise the presence of the sound in the spectrogram. Experiments on a set of five overlapping sound events, in the presence of non-stationary background noise, demonstrate the potential of our approach. (C) 2013 Elsevier B.V. All rights reserved.
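As a rough illustration of the voting scheme the abstract describes, the Python sketch below detects spectrogram keypoints as sparse local spectral peaks and lets each matched keypoint cast Generalised Hough Transform votes for candidate onset frames. The function names, the peak-picking parameters (neighborhood, min_db) and the codebook/offset data structures are assumptions made for this example; LSF extraction and the final scoring of each onset hypothesis against the class distribution models are omitted, and this is not the authors' actual implementation.

import numpy as np
from scipy.ndimage import maximum_filter

def detect_keypoints(spectrogram, neighborhood=5, min_db=-40.0):
    """Return (time, freq) indices of sparse local spectral peaks."""
    # A bin is a keypoint if it equals the maximum of its local neighbourhood
    # and exceeds a (hypothetical) energy floor.
    local_max = maximum_filter(spectrogram, size=neighborhood) == spectrogram
    peaks = np.argwhere(local_max & (spectrogram > min_db))
    return [(int(t), int(f)) for f, t in peaks]  # spectrogram is (freq, time)

def ght_onset_votes(keypoints, lsfs, codebook, offsets, n_frames):
    """Accumulate Hough votes for candidate onset frames per class.

    codebook -- dict: class -> (n_clusters, d) array of LSF cluster centres
    offsets  -- dict: class -> per-cluster list of keypoint-to-onset frame
                displacements learned from training spectrograms
    """
    accumulator = {cls: np.zeros(n_frames) for cls in codebook}
    for (t, _), lsf in zip(keypoints, lsfs):
        for cls, centres in codebook.items():
            # Match the local feature to its nearest cluster centre.
            k = int(np.argmin(np.linalg.norm(centres - lsf, axis=1)))
            for dt in offsets[cls][k]:
                onset = t - dt
                if 0 <= onset < n_frames:
                    accumulator[cls][onset] += 1.0  # one independent vote
    return accumulator

# Peaks in each class accumulator would then serve as onset hypotheses, to be
# verified against the class distribution models as described in the abstract.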
Pages: 1085-1093
Number of pages: 9