A novel Wake-Up-Word speech recognition system, Wake-Up-Word recognition task, technology and evaluation

Cited by: 25

Authors:

Kepuska, Z. [1]

Klein, T. B. [1]

Affiliations:

[1] Florida Inst Technol, Dept Elect & Comp Engn, Melbourne, FL 32901 USA

Source:

NONLINEAR ANALYSIS-THEORY METHODS & APPLICATIONS | 2009 / Volume 71 / Issue 12

Funding:

U.S. National Science Foundation;

Keywords:

Wake-Up-Word; Speech recognition; Hidden Markov Models; Support Vector Machines; Mel-scale cepstral coefficients; Linear prediction spectrum; Enhanced spectrum; HTK; Microsoft SAPI;

DOI:

10.1016/j.na.2009.06.089

Chinese Library Classification (CLC) number:

O29 [Applied Mathematics];

Discipline classification code:

070104;

Abstract:

Wake-Up-Word (WUW) is a new paradigm in speech recognition (SR) that is not yet widely recognized. This paper defines and investigates WUW speech recognition, describes details of this novel solution and the technology that implements it. WUW SR is defined as detection of a single word or phrase when spoken in the alerting context of requesting attention, while rejecting all other words, phrases, sounds, noises and other acoustic events and the same word or phrase spoken in non-alerting context with virtually 100% accuracy. In order to achieve this accuracy, the following innovations were accomplished: (1) Hidden Markov Model triple scoring with Support Vector Machine classification, (2) Combining multiple speech feature streams: Mel-scale Filtered Cepstral Coefficients (MFCCs), Linear Prediction Coefficients (LPC)-smoothed MFCCs, and Enhanced MFCC, and (3) Improved Voice Activity Detector with Support Vector Machines. WUW detection and recognition performance is 2514%, or 26 times better than HTK for the same training & testing data, and 2271%, or 24 times better than Microsoft SAPI 5.1 recognizer. The out-of-vocabulary rejection performance is over 65,233%, or 653 times better than HTK, and 5900% to 42,900%, or 60 to 430 times better than the Microsoft SAPI 5.1 recognizer. This solution that utilizes a new recognition paradigm applies not only to WUW task but also to any general Speech Recognition tasks. (C) 2009 Elsevier Ltd. All rights reserved.
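Note on the figures above: the abstract pairs each percentage with an "N times better" factor but does not state how the two relate. The pairs are consistent with reading "P% better" as a relative improvement over the baseline, so that the factor is 1 + P/100. The short sketch below checks that reading against the reported numbers. The function name and the interpretation are assumptions made for illustration, not something the paper's abstract specifies.

```python
def percent_better_to_factor(percent_better: float) -> float:
    """Convert 'P% better' into an 'N times better' factor.

    Assumption (not stated in the abstract): P is a relative
    improvement over the baseline, so factor = 1 + P / 100.
    """
    return 1.0 + percent_better / 100.0


# Percentages as reported in the abstract; labels are descriptive only.
reported = {
    "WUW recognition vs. HTK": 2514.0,
    "WUW recognition vs. Microsoft SAPI 5.1": 2271.0,
    "OOV rejection vs. HTK": 65233.0,
    "OOV rejection vs. Microsoft SAPI 5.1 (low)": 5900.0,
    "OOV rejection vs. Microsoft SAPI 5.1 (high)": 42900.0,
}

for label, pct in reported.items():
    factor = percent_better_to_factor(pct)
    print(f"{label}: {pct:g}% -> about {round(factor)} times")
```

Under this assumption the rounded factors come out as 26, 24, 653, 60 and 430, matching the figures quoted in the abstract.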

Citation:

Pages: E2772-E2789

Page count: 18
