Unsupervised Temporal Feature Learning Based on Sparse Coding Embedded BoAW for Acoustic Event Recognition

Cited by: 12
Authors
Zhang Liwen [1 ]
Han Jiqing [1 ]
Deng Shiwen [2 ]
Affiliations
[1] Harbin Inst Technol, Harbin, Heilongjiang, Peoples R China
[2] Harbin Normal Univ, Harbin, Heilongjiang, Peoples R China
Source
19TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2018), VOLS 1-6: SPEECH RESEARCH FOR EMERGING MARKETS IN MULTILINGUAL SOCIETIES | 2018
Funding
National Natural Science Foundation of China;
Keywords
acoustic event recognition; temporal feature learning; bag of audio words; sparse coding; CLASSIFICATION; SYSTEM;
DOI
10.21437/Interspeech.2018-1243
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
The performance of an Acoustic Event Recognition (AER) system depends heavily on both the statistical information and the temporal dynamics in audio signals. Although the traditional Bag of Audio Words (BoAW) and Gaussian Mixture Model (GMM) approaches capture more statistical information than frame-level feature learning methods by aggregating multiple frame-level descriptors over an audio segment, the temporal information is discarded. Recently, many Deep Neural Network (DNN) based AER methods have been proposed to capture the temporal information in audio signals and have achieved better performance; however, these methods usually require manually annotated labels and fixed-length inputs during feature learning. In this paper, we propose a novel unsupervised temporal feature learning method that effectively captures the temporal dynamics of an entire audio signal of arbitrary duration by building direct connections between the sequence of BoAW histograms and its time indexes using a non-linear Support Vector Regression (SVR) model. Furthermore, to give the feature representation better signal reconstruction ability, we embed sparse coding in the conventional BoAW framework. Compared with the BoAW and Convolutional Neural Network (CNN) baselines, experimental results show that our method brings improvements of 9.7% and 4.1%, respectively.
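The abstract describes two components: sparse coding in place of hard vector quantization inside BoAW, and a temporal feature obtained by regressing time indexes from the histogram sequence with an SVR. The following is a minimal sketch of that pipeline, not the authors' implementation: it assumes scikit-learn, random stand-in frame descriptors and dictionary, and illustrative window/hop sizes; the paper uses a non-linear SVR, whereas a linear kernel is used here so the learned weight vector can serve directly as the fixed-length feature.

```python
import numpy as np
from sklearn.decomposition import SparseCoder
from sklearn.svm import SVR

# Hypothetical frame-level descriptors for one clip (n_frames x n_dims)
# and a pre-learned dictionary (n_atoms x n_dims); names and sizes are
# illustrative assumptions, not values from the paper.
rng = np.random.default_rng(0)
frames = rng.standard_normal((500, 60))
dictionary = rng.standard_normal((128, 60))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

# Sparse-coding-embedded BoAW: encode each frame as a sparse code over
# the dictionary, then pool codes within sliding windows into a
# sequence of histogram-like vectors.
coder = SparseCoder(dictionary=dictionary,
                    transform_algorithm="lasso_lars", transform_alpha=0.1)
codes = np.abs(coder.transform(frames))            # (n_frames, n_atoms)

win, hop = 50, 25                                  # assumed window/hop
histograms = np.array([codes[t:t + win].sum(axis=0)
                       for t in range(0, len(codes) - win + 1, hop)])
histograms /= np.linalg.norm(histograms, axis=1, keepdims=True) + 1e-8

# Temporal feature learning: regress each histogram's time index from
# the histogram itself; the learned SVR weights summarize the clip's
# temporal ordering with a fixed length, whatever the clip's duration.
time_idx = np.arange(len(histograms), dtype=float)
svr = SVR(kernel="linear", C=1.0).fit(histograms, time_idx)
temporal_feature = svr.coef_.ravel()               # (n_atoms,) vector
print(temporal_feature.shape)
```

Because the SVR is fit per clip on the ordering of its own histograms, the resulting weight vector has a fixed dimensionality regardless of clip length and requires no labels, which is the property the abstract contrasts with DNN-based feature learning.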
Pages: 3284-3288
Page count: 5