Bayesian Recurrent Neural Network for Language Modeling

Cited by: 93
Authors
Chien, Jen-Tzung [1 ]
Ku, Yuan-Chu [1 ]
Affiliations
[1] Natl Chiao Tung Univ, Dept Elect & Comp Engn, Hsinchu 30010, Taiwan
Keywords
Bayesian learning; Hessian matrix; language model; rapid approximation; recurrent neural network; framework
DOI
10.1109/TNNLS.2015.2499302
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
A language model (LM) assigns a probability to a word sequence and provides the solution to word prediction in a variety of information systems. A recurrent neural network (RNN) is powerful for learning the large-span dynamics of a word sequence in a continuous space. However, training an RNN-LM is an ill-posed problem because of the large number of parameters induced by a large dictionary size and a high-dimensional hidden layer. This paper presents a Bayesian approach to regularizing the RNN-LM and applies it to continuous speech recognition. We aim to penalize an overly complex RNN-LM by compensating for the uncertainty of the estimated model parameters, which is represented by a Gaussian prior. The objective function in the Bayesian classification network is formed as a regularized cross-entropy error function. The regularized model is constructed not only by calculating the regularized parameters according to the maximum a posteriori (MAP) criterion but also by estimating the Gaussian hyperparameter through maximizing the marginal likelihood. A rapid approximation to the Hessian matrix, based on selecting a small set of salient outer products, is developed to implement the Bayesian RNN-LM (BRNN-LM). The proposed BRNN-LM achieves a sparser model than the RNN-LM. Experiments on different corpora show the robustness of system performance obtained by applying the rapid BRNN-LM under different conditions.
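The ingredients described in the abstract can be illustrated with a minimal NumPy sketch. A plain softmax classifier stands in for the RNN-LM output layer; the function names, the gradient-norm saliency criterion, and the MacKay-style evidence update for the Gaussian hyperparameter are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def map_objective(W, X, Y, alpha):
    """Regularized cross-entropy: E(w) = -sum log p(y|x,w) + (alpha/2)||w||^2.

    The quadratic penalty is the MAP contribution of a zero-mean Gaussian
    prior with precision alpha on the flattened parameters.
    """
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    ce = -np.log(P[np.arange(len(Y)), Y] + 1e-12).sum()
    return ce + 0.5 * alpha * np.sum(W ** 2), P

def outer_product_hessian(X, P, Y, k):
    """Gauss-Newton-style Hessian built from per-example gradient outer
    products, keeping only the k largest-norm ("salient") examples --
    a hypothetical stand-in for the paper's rapid approximation."""
    grads = []
    for i in range(len(X)):
        err = P[i].copy()
        err[Y[i]] -= 1.0                               # softmax - one-hot
        grads.append(np.outer(X[i], err).ravel())      # grad of data term
    G = np.array(grads)
    top = np.argsort(np.linalg.norm(G, axis=1))[-k:]   # most salient examples
    return sum(np.outer(G[i], G[i]) for i in top)

def evidence_update_alpha(W, H, alpha):
    """Evidence-framework hyperparameter re-estimation:
    alpha_new = gamma / ||w||^2, gamma = sum_i lambda_i / (lambda_i + alpha),
    with lambda_i the (clipped) eigenvalues of the approximate Hessian."""
    lam = np.clip(np.linalg.eigvalsh(H), 0.0, None)
    gamma = np.sum(lam / (lam + alpha))                # effective # parameters
    return gamma / max(np.sum(W ** 2), 1e-12)
```

Alternating a few MAP gradient steps on `map_objective` with calls to `evidence_update_alpha` mimics the two-level scheme in the abstract: parameters are fitted under the current prior, and the prior precision is then re-estimated by marginal-likelihood maximization.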
Pages: 361-374
Page count: 14
Related Papers
41 in total
  • [1] Anonymous, 2014, Proc. INTERSPEECH 2014.
  • [2] Arisoy E., 2012, Proc. NAACL-HLT 2012 Workshop, p. 20.
  • [3] Bengio Y., Simard P., Frasconi P., "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, 1994, 5(2): 157-166.
  • [4] Bengio Y., 2001, Advances in Neural Information Processing Systems, vol. 13, p. 932.
  • [5] Bishop C., 2006, Pattern Recognition and Machine Learning, p. 423.
  • [6] Blei D.M., Ng A.Y., Jordan M.I., "Latent Dirichlet allocation," Journal of Machine Learning Research, 2003, 3: 993-1022.
  • [7] Brown P.F., 1992, Computational Linguistics, 18: 467.
  • [8] Chen S.F., 2009, Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, p. 468.
  • [9] Chen S.F., Goodman J., "An empirical study of smoothing techniques for language modeling," Computer Speech and Language, 1999, 13(4): 359-394.
  • [10] Chien J.-T., "Association pattern language modelling," IEEE Transactions on Audio, Speech, and Language Processing, 2006, 14(5): 1719-1728.