BOOSTING ATTRIBUTE AND PHONE ESTIMATION ACCURACIES WITH DEEP NEURAL NETWORKS FOR DETECTION-BASED SPEECH RECOGNITION

Cited by: 0

Authors:

Yu, Dong [1]

Siniscalchi, Sabato Marco [2]

Deng, Li [1]

Lee, Chin-Hui [3]

Affiliations:

[1] Microsoft Res, Speech Res Grp, Redmond, WA USA

[2] Kore Univ, Dept Telemat, Enna, Italy

[3] Sch Elect & Comp Engn, Georgia Inst Technol, Atlanta, GA USA

Source:

2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) | 2012

Keywords:

automatic speech attribute transcription; deep neural networks; detection-based ASR; phonological features; attribute detection; phone recognition; FEATURES;

DOI:

Not available

Chinese Library Classification number:

O42 [Acoustics];

Subject classification codes:

070206 ; 082403 ;

Abstract:

Generation of high-precision sub-phonetic attribute (also known as phonological feature) and phone lattices is a key frontend component for detection-based bottom-up speech recognition. In this paper we employ deep neural networks (DNNs) to improve detection accuracy over conventional shallow MLPs (multi-layer perceptrons) with one hidden layer. A range of DNN architectures with five to seven hidden layers and up to 2048 hidden units per layer has been explored. Trained on the SI84 and tested on the Nov92 WSJ data, the proposed DNNs achieve significant improvements over the shallow MLPs, producing frame-level attribute estimation accuracies greater than 90% for all 21 attributes tested for the full system. On the phone detection task, we also obtain a frame-level accuracy of 86.6%. This level of high-precision detection of basic speech units opens the door to a new family of flexible speech recognition system designs for both top-down and bottom-up lattice-based search strategies and knowledge integration.
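
The abstract reports results as frame-level accuracy. As a minimal sketch of how such a figure is commonly computed (not the authors' evaluation code), the snippet below counts the fraction of frames whose highest-scoring class matches the reference label. The function name `frame_accuracy` and the toy posteriors are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def frame_accuracy(posteriors: Sequence[Sequence[float]],
                   labels: Sequence[int]) -> float:
    """Fraction of frames whose highest-scoring class matches the reference label."""
    if len(posteriors) != len(labels):
        raise ValueError("posteriors and labels must cover the same frames")
    if not labels:
        raise ValueError("at least one frame is required")
    correct = 0
    for frame_scores, ref in zip(posteriors, labels):
        # Pick the class index with the largest score for this frame.
        predicted = max(range(len(frame_scores)), key=frame_scores.__getitem__)
        correct += int(predicted == ref)
    return correct / len(labels)


# Toy data (assumed, not from the paper): 4 frames from a binary attribute
# detector, where class 1 means "attribute present".
toy_posteriors = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
toy_labels = [0, 1, 1, 1]
toy_accuracy = frame_accuracy(toy_posteriors, toy_labels)
```

The same function would apply to a phone classifier by widening each frame's score vector to one entry per phone class; how the paper handles details such as silence frames is not stated in the abstract and is not modeled here.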


Pages: 4169-4172

Page count: 4

