Smoothing methods in maximum entropy language modeling

Cited by: 6
Authors
Martin, SC [1]
Ney, H [1]
Zaplo, J [1]
Affiliation
[1] Rhein Westfal TH Aachen, Rhein Westfal TH Aachen, Lehrstuhl Informat 6, D-52056 Aachen, Germany
Source
ICASSP '99: 1999 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, PROCEEDINGS VOLS I-VI | 1999
DOI
10.1109/ICASSP.1999.758183
Chinese Library Classification number
O42 [Acoustics];
Discipline classification code
070206 ; 082403 ;
Abstract
This paper discusses various aspects of smoothing techniques in maximum entropy language modeling, a topic not sufficiently covered by previous publications. We show (1) that straightforward maximum entropy models with nested features, e.g. tri-, bi-, and unigrams, result in unsmoothed relative frequencies models; (2) that maximum entropy models with nested features and discounted feature counts approximate backing-off smoothed relative frequencies models with Kneser's advanced marginal backoff distribution; this explains some of the reported success of maximum entropy models in the past; (3) perplexity results for nested and non-nested features, e.g. trigrams and distance-trigrams, on a 4-million word subset of the Wall Street Journal Corpus, showing that the smoothing method has more effect on the perplexity than the method to combine information.
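As a rough illustration of why the choice of smoothing matters for perplexity, the sketch below compares unsmoothed relative frequencies with absolute discounting on a toy bigram model. It is not the paper's maximum entropy training procedure. For brevity it uses the interpolated form of absolute discounting rather than strict backing-off, with a Kneser-style marginal distribution as the lower-order model. The toy sentences, the discount d = 0.5, and all function names are assumptions made for this example.

```python
import math
from collections import Counter

def train_counts(tokens):
    """Collect bigram counts N(v, w) and history counts N(v)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    history = Counter(tokens[:-1])
    return bigrams, history

def relative_frequency(bigrams, history):
    """Unsmoothed estimate p(w | v) = N(v, w) / N(v); zero for unseen bigrams."""
    def p(w, v):
        return bigrams[(v, w)] / history[v] if history[v] else 0.0
    return p

def absolute_discounting(bigrams, history, d=0.5):
    """Interpolated absolute discounting with a Kneser-style marginal backoff.

    beta(w) is proportional to the number of distinct predecessors of w,
    not to the raw unigram count of w.
    """
    predecessors = Counter(w for (_, w) in bigrams)   # distinct left contexts of w
    total_types = sum(predecessors.values())          # number of distinct bigrams
    successors = Counter(v for (v, _) in bigrams)     # distinct right contexts of v

    def beta(w):
        return predecessors[w] / total_types

    def p(w, v):
        if history[v] == 0:
            return beta(w)                            # unseen history: back off fully
        discounted = max(bigrams[(v, w)] - d, 0.0) / history[v]
        backoff_mass = d * successors[v] / history[v] # mass freed by discounting
        return discounted + backoff_mass * beta(w)
    return p

def perplexity(p, tokens):
    """exp of the negative mean log-probability; infinite if any event has p = 0."""
    log_sum, n = 0.0, 0
    for v, w in zip(tokens, tokens[1:]):
        prob = p(w, v)
        if prob == 0.0:
            return math.inf
        log_sum += math.log(prob)
        n += 1
    return math.exp(-log_sum / n)

# Toy data (hypothetical): the test text contains the unseen bigram ("mat", "sat").
train = "the cat sat on the mat the dog sat on the rug".split()
test = "the mat sat on the cat".split()

bigrams, history = train_counts(train)
p_rf = relative_frequency(bigrams, history)
p_ad = absolute_discounting(bigrams, history, d=0.5)

pp_rf = perplexity(p_rf, test)   # unsmoothed model assigns zero to the unseen bigram
pp_ad = perplexity(p_ad, test)   # smoothed model keeps every event above zero
print("relative frequencies:", pp_rf)
print("absolute discounting:", pp_ad)
```

The unsmoothed model gives zero probability to any bigram absent from the training text, so a single unseen event makes its test perplexity infinite. The discounted model moves a fixed amount of mass from each seen bigram to the marginal distribution and stays finite.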
Pages: 545-548
Page count: 4