A neural probabilistic language model

Cited by: 2239
Authors
Bengio, Y. [1]
Ducharme, R. [1]
Vincent, P. [1]
Jauvin, C. [1]
Affiliation
[1] Univ Montreal, Ctr Rech Math, Dept Informat & Rech Operat, Montreal, PQ H3C 3J7, Canada
Keywords
statistical language modeling; artificial neural networks; distributed representation; curse of dimensionality
DOI
10.1162/153244303322533223
Chinese Library Classification (CLC)
TP [automation technology; computer technology]
Subject Classification Code
0812
Abstract
A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words that allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model simultaneously learns (1) a distributed representation for each word and (2) the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that the proposed approach allows the model to take advantage of longer contexts.
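To make the abstract's two-part model concrete, the following is a minimal numpy sketch (not the authors' code) of a single forward pass. It implements the feed-forward parameterization reported in the paper, y = b + Wx + U tanh(d + Hx) followed by a softmax, where x is the concatenation of the learned embeddings of the n-1 context words; all dimensions below (vocabulary size V, embedding size m, hidden size h, context length) are illustrative values chosen here, not the paper's experimental settings.

```python
# Minimal sketch of the neural probabilistic language model forward pass.
# Assumptions: illustrative dimensions; random untrained weights; the W
# (direct input-to-output) connections are optional in the paper.
import numpy as np

rng = np.random.default_rng(0)

V, m, h, context = 10_000, 60, 50, 3      # vocab, embedding dim, hidden units, n-1 words

C = rng.normal(0, 0.1, (V, m))            # shared word-feature (embedding) matrix
H = rng.normal(0, 0.1, (h, context * m))  # input-to-hidden weights
d = np.zeros(h)                           # hidden bias
U = rng.normal(0, 0.1, (V, h))            # hidden-to-output weights
W = rng.normal(0, 0.1, (V, context * m))  # optional direct input-to-output weights
b = np.zeros(V)                           # output bias

def next_word_probs(word_ids):
    """P(w_t | w_{t-n+1}, ..., w_{t-1}) for a list of n-1 word indices."""
    x = C[word_ids].reshape(-1)           # look up and concatenate context embeddings
    y = b + W @ x + U @ np.tanh(d + H @ x)
    e = np.exp(y - y.max())               # numerically stable softmax
    return e / e.sum()

p = next_word_probs([17, 4, 256])         # indices of three context words
print(p.shape, p.sum())                   # (10000,) 1.0
```

Because the embedding matrix C is shared across all context positions and trained jointly with H, U, and W, words that occur in similar contexts end up with nearby rows of C, which is what lets an unseen word sequence inherit probability mass from similar seen sequences.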
Pages: 1137-1155
Page count: 19