Feature joint-state posterior estimation in factorial speech processing models using deep neural networks

Cited by: 2
Authors
Khademian, Mandi [1]
Homayounpour, Mohammad Mehdi [1]
Affiliations
[1] Amirkabir Univ Technol, LIMP, Tehran, Iran
Keywords
Factorial speech processing models; Deep neural networks; Factorial hidden Markov models; State-conditional observation distribution; Model combination using vector Taylor series; Feature joint-state posterior
DOI
10.1016/j.compeleceng.2017.06.028
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
This paper proposes a new method that uses deep neural networks to calculate joint-state posteriors of mixed-audio features for use in factorial speech processing models. Factorial models require this joint-state posterior information to perform joint decoding. The novelty of the work lies in its architecture, which enables the network to infer joint-state posteriors from the pairs of state posteriors of stereo features. The paper defines an objective function for solving an underdetermined system of equations, which the network uses to extract joint-state posteriors, and develops the expressions required for fine-tuning the network in a unified way. Experiments compare the decoding results of the proposed network with those of the vector Taylor series method and show a 2.3% absolute performance improvement on the monaural speech separation and recognition challenge. This improvement is substantial given the simplicity of joint-state posterior extraction provided by deep neural networks. (C) 2017 Elsevier Ltd. All rights reserved.
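The abstract does not spell out the underdetermined system, so the following is only a toy sketch of one plausible reading: if the per-source state posteriors are treated as the marginals of the joint-state posterior, the marginals alone cannot pin down the joint table. The state counts, the posterior values, and the helper names `marginals` and `outer` are all hypothetical, and this is not the paper's objective function or network.

```python
def marginals(joint):
    """Return (row sums, column sums) of a joint-state posterior table."""
    rows = [sum(row) for row in joint]
    cols = [sum(col) for col in zip(*joint)]
    return rows, cols


def outer(p, q):
    """Joint table obtained by assuming the two sources are independent."""
    return [[pi * qj for qj in q] for pi in p]


# Hypothetical per-source state posteriors (two states per source).
p1 = [0.5, 0.5]
p2 = [0.5, 0.5]

# Two different joint-state posteriors...
joint_a = outer(p1, p2)             # independent sources
joint_b = [[0.5, 0.0], [0.0, 0.5]]  # perfectly correlated states

# ...that share exactly the same marginals.
same_marginals = marginals(joint_a) == marginals(joint_b)

# Counting argument: K1*K2 unknowns against K1+K2-1 independent constraints
# (one constraint is redundant because both marginals sum to 1).
unknowns = len(p1) * len(p2)
constraints = len(p1) + len(p2) - 1

print(same_marginals, unknowns, constraints)
```

Because the number of unknowns grows as the product of the state counts while the constraints grow only as their sum, some additional criterion is needed to select a single joint table; in the paper that role is played by the objective function the network is trained with.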
Pages: 574-587
Page count: 14