Generalized Filter-bank Features for Robust Speech Recognition Against Reverberation

Cited by: 0
Authors
Pardede, Hilman F. [1 ]
Zilvan, Vicky [1 ]
Krisnandi, Dikdik [1 ]
Heryana, Ana [1 ]
Kusumo, R. Budiarianto S. [1 ]
Affiliations
[1] Indonesian Inst Sci, Res Ctr Informat, Bandung, Indonesia
Source
2019 INTERNATIONAL CONFERENCE ON COMPUTER, CONTROL, INFORMATICS AND ITS APPLICATIONS (IC3INA) | 2019
Keywords
non-extensive statistics; q-logarithm; deep learning; deep belief networks; reverberation; filter-bank
DOI
10.1109/ic3ina48034.2019.8949593
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Traditionally, automatic speech recognition (ASR) uses a Hidden Markov Model with Gaussian Mixture Models (HMM-GMM) as the acoustic model and hand-designed features such as Mel-frequency Cepstral Coefficients (MFCC) as acoustic features. The feature components are usually assumed to be uncorrelated, which makes it possible to use diagonal covariances in the GMM. This assumption generally holds because the Discrete Cosine Transform (DCT) de-correlates the speech spectra. However, the DCT can also discard information, such as the correlations between feature components. Current ASR systems based on Deep Neural Networks (DNN) have been shown to perform better, especially in reverberant conditions, when more primitive features such as filter-bank (FBANK) features are used. This may be because DNNs are better at modeling non-linear relations between the components of the features. However, the short-time processing used in FBANK may cause the loss of long-term correlations in the speech pattern. To tackle this, we propose a new feature, q-FBANK, which is a generalization of FBANK. Results on artificially reverberated speech show that the proposed features outperform MFCC and FBANK in DNN-HMM systems, with average error reductions of up to 39.73% and 13.5%, respectively.
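The abstract does not spell out the q-FBANK definition, but the keywords (non-extensive statistics, q-logarithm) suggest that the natural logarithm applied to the mel filter-bank energies in standard FBANK is replaced by the Tsallis q-logarithm, ln_q(x) = (x^(1-q) - 1)/(1-q), which recovers ln(x) as q approaches 1. A minimal Python sketch under that assumption follows; the function names, the default q, and the q_fbank interface are illustrative, not taken from the paper.

import numpy as np

def q_log(x, q):
    # Tsallis q-logarithm: ln_q(x) = (x**(1 - q) - 1) / (1 - q).
    # As q -> 1 this converges to the natural logarithm, i.e. standard FBANK.
    if np.isclose(q, 1.0):
        return np.log(x)
    return (np.power(x, 1.0 - q) - 1.0) / (1.0 - q)

def q_fbank(power_spectrum, mel_filters, q=0.9, eps=1e-10):
    # Hypothetical q-FBANK sketch: apply a mel filter bank to the short-time
    # power spectrum, then compress the energies with ln_q instead of ln.
    # power_spectrum: (n_frames, n_fft_bins); mel_filters: (n_mels, n_fft_bins).
    energies = power_spectrum @ mel_filters.T      # (n_frames, n_mels)
    return q_log(np.maximum(energies, eps), q)     # floor avoids ln_q(0)

On this reading, standard MFCCs would correspond to taking the q = 1 output and applying a DCT; feeding the (q-)filter-bank output directly to the DNN-HMM, without the DCT, preserves the correlations between feature components that the abstract argues the DCT discards.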
Pages: 19-24
Page count: 6