Convolutional Neural Network Bottleneck Features for bi-directional Generalized Variable Parameter HMMs

Cited by: 0

Authors:

Su, Rongfeng [1,2]

Liu, Xunying [1,2]

Wang, Lan [1,2]

Affiliations:

[1] Chinese Acad Sci, Shenzhen Inst Adv Technol, Guangdong Prov Key Lab Robot & Intelligent Syst, Beijing, Peoples R China

[2] Chinese Univ Hong Kong, Hong Kong, Hong Kong, Peoples R China

Source:

2016 IEEE INTERNATIONAL CONFERENCE ON INFORMATION AND AUTOMATION (ICIA) | 2016

Funding:

National Natural Science Foundation of China;

Keywords:

generalized variable parameter HMM; convolutional neural network; bottleneck features; robust speech recognition; SPEECH; FRAMEWORK;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Recently, convolutional neural networks (CNNs) have been applied successfully to acoustic modelling in speech recognition. Because the bottleneck features from CNNs contain inherently discriminative and rich context information, the standard approach is to augment the conventional acoustic features with the CNN bottleneck features in a tandem framework. To better capture the highly complex relationship between them, a novel bi-directional generalized variable parameter HMM (GVP-HMM) based approach is proposed in this paper. In this approach, the trajectories of the continuous acoustic feature space HMM parameters, as well as of the model space linear transforms, against CNN bottleneck features are modelled by polynomial functions. The optimal GVP-HMM model structure for each direction, which is determined by the locally varying polynomial parameters and degrees, can be learnt automatically using model selection techniques. The proposed bi-directional GVP-HMM based approach gave a word error rate of 12.22% on the Aurora 4 task. In particular, a significant relative error rate reduction of 18.09% was obtained over the baseline tandem HMM system using CNN bottleneck features on the secondary microphone channel condition.
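The abstract states that parameter trajectories are modelled by polynomial functions of the CNN bottleneck features. The following is a minimal sketch of that idea for a single scalar Gaussian, assuming one scalar auxiliary feature `v` stands in for a bottleneck feature. The function names and coefficient values are hypothetical and chosen only for illustration; this is not the authors' implementation, and it does not cover the model space linear transforms or the model selection step.

```python
import math


def poly_eval(coeffs, v):
    """Evaluate c0 + c1*v + c2*v**2 + ... with Horner's method."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * v + c
    return result


def gaussian_log_likelihood(x, mean_coeffs, var_coeffs, v):
    """Log-density of scalar x under a Gaussian whose mean and variance
    are polynomial functions of the auxiliary feature v."""
    mean = poly_eval(mean_coeffs, v)
    # Assumption: a small floor keeps the variance strictly positive.
    var = max(poly_eval(var_coeffs, v), 1e-6)
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


# Hypothetical coefficients, for illustration only.
mean_coeffs = [1.0, 0.5, -0.25]  # mean(v) = 1.0 + 0.5*v - 0.25*v**2
var_coeffs = [1.0]               # degree-0 polynomial: constant variance

mean_at_2 = poly_eval(mean_coeffs, 2.0)
loglik = gaussian_log_likelihood(1.0, mean_coeffs, var_coeffs, 2.0)
```

Here the polynomial degree can differ per parameter (degree 2 for the mean, degree 0 for the variance), which loosely mirrors the abstract's mention of locally varying polynomial degrees.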


Pages: 1126-1131

Number of pages: 6

