Bayesian Recurrent Neural Network for Language Modeling

Cited by: 95
Authors
Chien, Jen-Tzung [1 ]
Ku, Yuan-Chu [1 ]
Affiliations
[1] Natl Chiao Tung Univ, Dept Elect & Comp Engn, Hsinchu 30010, Taiwan
Keywords
Bayesian learning; Hessian matrix; language model; rapid approximation; recurrent neural network; FRAMEWORK;
DOI
10.1109/TNNLS.2015.2499302
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
A language model (LM) calculates the probability of a word sequence and provides the solution to word prediction for a variety of information systems. A recurrent neural network (RNN) is powerful for learning the large-span dynamics of a word sequence in continuous space. However, training an RNN-LM is an ill-posed problem because of the large number of parameters induced by a large dictionary and a high-dimensional hidden layer. This paper presents a Bayesian approach to regularizing the RNN-LM and applies it to continuous speech recognition. We aim to penalize an overly complex RNN-LM by compensating for the uncertainty of the estimated model parameters, which is represented by a Gaussian prior. The objective function of the Bayesian classification network is formed as the regularized cross-entropy error function. The regularized model is constructed not only by computing the regularized parameters according to the maximum a posteriori criterion but also by estimating the Gaussian hyperparameter through maximization of the marginal likelihood. A rapid approximation to the Hessian matrix, which selects a small set of salient outer products, is developed to implement the Bayesian RNN-LM (BRNN-LM). The proposed BRNN-LM achieves a sparser model than the RNN-LM. Experiments on different corpora show robust system performance when the rapid BRNN-LM is applied under different conditions.
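The abstract's two key computational ideas can be sketched in code: a MAP objective that adds a zero-mean Gaussian-prior penalty to the cross-entropy error, and an outer-product approximation to the Hessian that keeps only a few salient terms. This is a minimal illustrative sketch, not the authors' implementation; the function names, toy dimensions, and the gradient-norm saliency rule are assumptions.

```python
import numpy as np


def regularized_cross_entropy(probs, targets, weights, alpha):
    """MAP objective sketched in the abstract: cross-entropy error on the
    predicted word distributions plus the Gaussian-prior penalty
    (alpha/2) * ||w||^2, where alpha is the prior precision (hyperparameter).

    probs   : (T, V) predicted distributions over a V-word vocabulary
    targets : (T,) index of the correct word at each step
    weights : flat vector of model parameters
    """
    nll = -np.sum(np.log(probs[np.arange(len(targets)), targets]))
    return nll + 0.5 * alpha * np.dot(weights, weights)


def rapid_hessian_approx(per_sample_grads, k):
    """Outer-product (Gauss-Newton-style) Hessian approximation built from
    only the k most salient per-sample gradients. Using the largest
    Euclidean norm as the saliency measure is an assumption here."""
    norms = np.linalg.norm(per_sample_grads, axis=1)
    salient = per_sample_grads[np.argsort(norms)[-k:]]
    return salient.T @ salient  # sum of k outer products g g^T


# Toy usage: 3 prediction steps over a 4-word vocabulary, 10 parameters.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
targets = np.array([1, 0, 3])
w = rng.normal(size=10)

loss = regularized_cross_entropy(probs, targets, w, alpha=0.1)
H = rapid_hessian_approx(rng.normal(size=(50, 10)), k=5)
```

Keeping only k outer products reduces the cost of forming and inverting the Hessian from quadratic in the number of training gradients to linear in k, which is the "rapid approximation" the abstract refers to.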
Pages: 361-374
Page count: 14