A neural probabilistic language model

Cited by: 2239
Authors
Bengio, Y. [1]
Ducharme, R. [1]
Vincent, P. [1]
Jauvin, C. [1]
Affiliation
[1] Univ Montreal, Ctr Rech Math, Dept Informat & Rech Operat, Montreal, PQ H3C 3J7, Canada
Keywords
statistical language modeling; artificial neural networks; distributed representation; curse of dimensionality
DOI
10.1162/153244303322533223
Chinese Library Classification (CLC)
TP [automation technology; computer technology]
Subject Classification Code
0812
Abstract
A goal of statistical language modeling is to learn the joint probability function of sequences of words in a language. This is intrinsically difficult because of the curse of dimensionality: a word sequence on which the model will be tested is likely to be different from all the word sequences seen during training. Traditional but very successful approaches based on n-grams obtain generalization by concatenating very short overlapping sequences seen in the training set. We propose to fight the curse of dimensionality by learning a distributed representation for words that allows each training sentence to inform the model about an exponential number of semantically neighboring sentences. The model simultaneously learns (1) a distributed representation for each word and (2) the probability function for word sequences, expressed in terms of these representations. Generalization is obtained because a sequence of words that has never been seen before gets high probability if it is made of words that are similar (in the sense of having a nearby representation) to words forming an already seen sentence. Training such large models (with millions of parameters) within a reasonable time is itself a significant challenge. We report on experiments using neural networks for the probability function, showing on two text corpora that the proposed approach significantly improves on state-of-the-art n-gram models, and that the proposed approach allows the model to take advantage of longer contexts.
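To make the abstract's two-part model concrete, the following is a minimal numpy sketch (not the authors' code) of a single forward pass. It implements the feed-forward parameterization reported in the paper, y = b + Wx + U tanh(d + Hx) followed by a softmax, where x is the concatenation of the learned embeddings of the n-1 context words; all dimensions below (vocabulary size V, embedding size m, hidden size h, context length) are illustrative values chosen here, not the paper's experimental settings.

```python
# Minimal sketch of the neural probabilistic language model forward pass.
# Assumptions: illustrative dimensions; random untrained weights; the W
# (direct input-to-output) connections are optional in the paper.
import numpy as np

rng = np.random.default_rng(0)

V, m, h, context = 10_000, 60, 50, 3      # vocab, embedding dim, hidden units, n-1 words

C = rng.normal(0, 0.1, (V, m))            # shared word-feature (embedding) matrix
H = rng.normal(0, 0.1, (h, context * m))  # input-to-hidden weights
d = np.zeros(h)                           # hidden bias
U = rng.normal(0, 0.1, (V, h))            # hidden-to-output weights
W = rng.normal(0, 0.1, (V, context * m))  # optional direct input-to-output weights
b = np.zeros(V)                           # output bias

def next_word_probs(word_ids):
    """P(w_t | w_{t-n+1}, ..., w_{t-1}) for a list of n-1 word indices."""
    x = C[word_ids].reshape(-1)           # look up and concatenate context embeddings
    y = b + W @ x + U @ np.tanh(d + H @ x)
    e = np.exp(y - y.max())               # numerically stable softmax
    return e / e.sum()

p = next_word_probs([17, 4, 256])         # indices of three context words
print(p.shape, p.sum())                   # (10000,) 1.0
```

Because the embedding matrix C is shared across all context positions and trained jointly with H, U, and W, words that occur in similar contexts end up with nearby rows of C, which is what lets an unseen word sequence inherit probability mass from similar seen sequences.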
Pages: 1137-1155
Page count: 19