Bayesian Recurrent Neural Network for Language Modeling

Cited by: 95
Authors
Chien, Jen-Tzung [1 ]
Ku, Yuan-Chu [1 ]
Affiliations
[1] Natl Chiao Tung Univ, Dept Elect & Comp Engn, Hsinchu 30010, Taiwan
Keywords
Bayesian learning; Hessian matrix; language model; rapid approximation; recurrent neural network; FRAMEWORK;
DOI
10.1109/TNNLS.2015.2499302
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
A language model (LM) calculates the probability of a word sequence and provides the solution to word prediction for a variety of information systems. A recurrent neural network (RNN) is powerful for learning the large-span dynamics of a word sequence in continuous space. However, training an RNN-LM is an ill-posed problem because of the large number of parameters induced by a large dictionary and a high-dimensional hidden layer. This paper presents a Bayesian approach to regularizing the RNN-LM and applies it to continuous speech recognition. We aim to penalize an overly complex RNN-LM by compensating for the uncertainty of the estimated model parameters, which is represented by a Gaussian prior. The objective function of the Bayesian classification network is formed as the regularized cross-entropy error function. The regularized model is constructed not only by computing the regularized parameters according to the maximum a posteriori criterion but also by estimating the Gaussian hyperparameter through maximization of the marginal likelihood. A rapid approximation to the Hessian matrix, which selects a small set of salient outer products, is developed to implement the Bayesian RNN-LM (BRNN-LM). The proposed BRNN-LM achieves a sparser model than the RNN-LM. Experiments on different corpora show robust system performance when the rapid BRNN-LM is applied under different conditions.
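The abstract's two key computational ideas can be sketched in code: a MAP objective that adds a zero-mean Gaussian-prior penalty to the cross-entropy error, and an outer-product approximation to the Hessian that keeps only a few salient terms. This is a minimal illustrative sketch, not the authors' implementation; the function names, toy dimensions, and the gradient-norm saliency rule are assumptions.

```python
import numpy as np


def regularized_cross_entropy(probs, targets, weights, alpha):
    """MAP objective sketched in the abstract: cross-entropy error on the
    predicted word distributions plus the Gaussian-prior penalty
    (alpha/2) * ||w||^2, where alpha is the prior precision (hyperparameter).

    probs   : (T, V) predicted distributions over a V-word vocabulary
    targets : (T,) index of the correct word at each step
    weights : flat vector of model parameters
    """
    nll = -np.sum(np.log(probs[np.arange(len(targets)), targets]))
    return nll + 0.5 * alpha * np.dot(weights, weights)


def rapid_hessian_approx(per_sample_grads, k):
    """Outer-product (Gauss-Newton-style) Hessian approximation built from
    only the k most salient per-sample gradients. Using the largest
    Euclidean norm as the saliency measure is an assumption here."""
    norms = np.linalg.norm(per_sample_grads, axis=1)
    salient = per_sample_grads[np.argsort(norms)[-k:]]
    return salient.T @ salient  # sum of k outer products g g^T


# Toy usage: 3 prediction steps over a 4-word vocabulary, 10 parameters.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
targets = np.array([1, 0, 3])
w = rng.normal(size=10)

loss = regularized_cross_entropy(probs, targets, w, alpha=0.1)
H = rapid_hessian_approx(rng.normal(size=(50, 10)), k=5)
```

Keeping only k outer products reduces the cost of forming and inverting the Hessian from quadratic in the number of training gradients to linear in k, which is the "rapid approximation" the abstract refers to.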
Pages: 361-374
Page count: 14