Convolutional Neural Network Bottleneck Features for bi-directional Generalized Variable Parameter HMMs

Cited by: 0
Authors
Su, Rongfeng [1, 2]
Liu, Xunying [1, 2]
Wang, Lan [1, 2]
Affiliations
[1] Chinese Acad Sci, Shenzhen Inst Adv Technol, Guangdong Prov Key Lab Robot & Intelligent Syst, Beijing, Peoples R China
[2] Chinese Univ Hong Kong, Hong Kong, Hong Kong, Peoples R China
Source
2016 IEEE INTERNATIONAL CONFERENCE ON INFORMATION AND AUTOMATION (ICIA) | 2016
Funding
National Natural Science Foundation of China
Keywords
generalized variable parameter HMM; convolutional neural network; bottleneck features; robust speech recognition; SPEECH; FRAMEWORK;
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Recently, convolutional neural networks (CNNs) have been applied successfully to acoustic modelling in speech recognition. As the bottleneck features from CNNs contain inherently discriminative and rich context information, the standard approach is to augment the conventional acoustic features with the CNN bottleneck features in a tandem framework. To better capture the highly complex relationship between them, a novel bi-directional generalized variable parameter HMM (GVP-HMM) based approach is proposed in this paper. In this approach, the trajectories of continuous acoustic feature space HMM parameters, as well as the model space linear transforms against CNN bottleneck features, are modelled by polynomial functions. The optimal GVP-HMM model structure for each direction, which is determined by the locally varying polynomial parameters and degrees, can be automatically learnt using model selection techniques. The proposed bi-directional GVP-HMM based approach gave a word error rate of 12.22% on the Aurora 4 task. In particular, a significant error rate reduction of 18.09% relative was obtained over the baseline tandem HMM system using CNN bottleneck features on the secondary microphone channel condition.
Pages: 1126-1131
Page count: 6
Related papers
36 in total
  • [11] Chou W, 1999, INT CONF ACOUST SPEE, P345
  • [12] A study of variable-parameter Gaussian mixture hidden Markov modeling for noisy speech recognition
    Cui, Xiaodong
    Gong, Yifan
    [J]. IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2007, 15 (04): 1366-1376
  • [13] Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition
    Dahl, George E.
    Yu, Dong
    Deng, Li
    Acero, Alex
    [J]. IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2012, 20 (01): 30-42
  • [14] MAXIMUM LIKELIHOOD FROM INCOMPLETE DATA VIA EM ALGORITHM
    DEMPSTER, AP
    LAIRD, NM
    RUBIN, DB
    [J]. JOURNAL OF THE ROYAL STATISTICAL SOCIETY SERIES B-METHODOLOGICAL, 1977, 39 (01): 1-38
  • [15] Grézl F, 2007, INT CONF ACOUST SPEE, P757
  • [16] Hermansky H, 2000, INT CONF ACOUST SPEE, P1635, DOI 10.1109/ICASSP.2000.862024
  • [17] Deep Neural Networks for Acoustic Modeling in Speech Recognition
    Hinton, Geoffrey
    Deng, Li
    Yu, Dong
    Dahl, George E.
    Mohamed, Abdel-rahman
    Jaitly, Navdeep
    Senior, Andrew
    Vanhoucke, Vincent
    Nguyen, Patrick
    Sainath, Tara N.
    Kingsbury, Brian
    [J]. IEEE SIGNAL PROCESSING MAGAZINE, 2012, 29 (06): 82-97
  • [18] Li Y, 2013, INTERSPEECH, P2967
  • [19] Li Y, 2012, 2012 8TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, P136, DOI 10.1109/ISCSLP.2012.6423526
  • [20] Articulatory Control of HMM-Based Parametric Speech Synthesis Using Feature-Space-Switched Multiple Regression
    Ling, Zhen-Hua
    Richmond, Korin
    Yamagishi, Junichi
    [J]. IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2013, 21 (01): 205-217