Acoustic Feature Optimization Based on F-Ratio for Robust Speech Recognition

Cited by: 5

Authors:

Sun, Yanqing [1]

Zhou, Yu [1]

Zhao, Qingwei [1]

Yan, Yonghong [1]

Affiliations:

[1] Chinese Acad Sci, Inst Acoust, ThinkIT Speech Lab, Beijing 100864, Peoples R China

Source:

IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS | 2010 / Vol. E93D / No. 09

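Funding:

National Natural Science Foundation of China; National High Technology Research and Development Program of China (863 Program);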

Keywords:

mismatched speech; robust speech recognition; F-Ratio; subband design; feature optimization;

DOI:

10.1587/transinf.E93.D.2417

Chinese Library Classification number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

This paper addresses performance degradation in mismatched speech recognition. The F-Ratio analysis method is used to measure the significance of different frequency bands for speech unit classification. We find that frequencies around 1 kHz and 3 kHz, which are the upper bounds of the first and second formants for most vowels, should be emphasized more than they are in the Mel-frequency cepstral coefficients (MFCC). The analysis result is further observed to be stable in several typical mismatched situations. By analogy with the Mel-frequency scale, a new frequency scale called the F-Ratio-scale is therefore proposed to optimize the filter bank design for the MFCC features, so that each subband carries equal significance for speech unit classification. Under comparable conditions, the modified features yield a relative decrease in sentence error rate, compared with the MFCC, of 43.20% for emotion-affected speech recognition, 35.54% and 23.03% for noisy speech recognition at 15 dB and 0 dB SNR (signal-to-noise ratio) respectively, and 64.50% for three years of 863 test data. Applying the F-Ratio analysis to the clean training set of the Aurora2 database demonstrates its robustness across languages, texts, and sampling rates.
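The two ideas in the abstract, scoring each frequency band by an F-ratio and then placing subband edges so that every subband holds an equal share of that score, can be illustrated with a minimal sketch. This is not the authors' implementation: the F-ratio below uses the common textbook form (variance of the class means divided by the mean within-class variance), and the function names `f_ratio` and `equal_significance_edges`, the input layout, and the trapezoidal integration are all assumptions made for illustration.

```python
import numpy as np


def f_ratio(features, labels):
    """Per-band F-ratio (assumed textbook form, not taken from the paper).

    features: array of shape (n_frames, n_bands), e.g. log band energies.
    labels:   array of shape (n_frames,), one speech-unit label per frame.
    Returns the variance of the class means divided by the mean
    within-class variance, computed independently for each band.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    class_means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    class_vars = np.stack([features[labels == c].var(axis=0) for c in classes])
    between = class_means.var(axis=0)   # spread of the class means
    within = class_vars.mean(axis=0)    # average spread inside each class
    return between / within


def equal_significance_edges(freqs, significance, n_bands):
    """Subband edges such that each subband holds an equal share of significance.

    freqs:        increasing frequencies (Hz) where significance is sampled.
    significance: positive significance value (e.g. F-ratio) at each frequency.
    n_bands:      number of subbands; n_bands + 1 edges are returned.
    """
    freqs = np.asarray(freqs, dtype=float)
    significance = np.asarray(significance, dtype=float)
    # Cumulative trapezoidal integral of the significance curve.
    steps = 0.5 * (significance[1:] + significance[:-1]) * np.diff(freqs)
    cumulative = np.concatenate([[0.0], np.cumsum(steps)])
    cumulative /= cumulative[-1]
    # Invert the cumulative curve at equally spaced targets.
    targets = np.linspace(0.0, 1.0, n_bands + 1)
    return np.interp(targets, cumulative, freqs)


if __name__ == "__main__":
    # Hypothetical significance curve with a peak near 1 kHz.
    freqs = np.linspace(0.0, 4000.0, 81)
    curve = 1.0 + 4.0 * np.exp(-((freqs - 1000.0) / 300.0) ** 2)
    print(equal_significance_edges(freqs, curve, n_bands=8))
```

With such a mapping, regions where the significance curve is high receive narrower and therefore more numerous subbands, which is one way to read the abstract's statement that frequencies around 1 kHz and 3 kHz should be emphasized.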

Pages: 2417-2430

Page count: 14

