A study of n-gram and decision tree letter language modeling methods

Cited by: 25
Authors
Potamianos, G
Jelinek, F
Affiliations
[1] AT&T Bell Labs, Res, Speech & Image Proc Serv Res Lab, Florham Pk, NJ 07932 USA
[2] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA
Keywords
language modeling; n-grams; decision trees; smoothing; laws of succession; back-off language model; deleted interpolation; Brown corpus;
DOI
10.1016/S0167-6393(98)00018-1
CLC classification number
O42 [Acoustics];
Subject classification number
070206; 082403;
Abstract
The goal of this paper is to investigate various language model smoothing techniques and decision tree based language model design algorithms. For this purpose, we build language models for printable characters (letters), based on the Brown corpus. We consider two classes of models for the text generation process: the n-gram language model and various decision tree based language models. In the first part of the paper, we compare the most popular smoothing algorithms applied to the former. We conclude that the bottom-up deleted interpolation algorithm performs best in the task of n-gram letter language model smoothing, significantly outperforming the back-off smoothing technique for large values of n. In the second part of the paper, we consider various decision tree development algorithms. Among them, a K-means clustering type algorithm for the design of the decision tree questions gives the best results. However, the n-gram language model outperforms the decision tree language models for letter language modeling. We believe that this is due to the predictive nature of letter strings, which seems to be naturally modeled by n-grams. (C) 1998 Elsevier Science B.V. All rights reserved.
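To make the smoothing idea concrete, here is a minimal sketch (not the paper's implementation) of a character bigram model with interpolation-style smoothing: the bigram maximum-likelihood estimate is mixed with the unigram estimate via a weight `lam`. In the deleted interpolation algorithm the paper evaluates, such weights are estimated from held-out data; here `lam` is a hypothetical fixed constant for illustration.

```python
# Sketch of an interpolated character bigram model:
#   P(c | h) = lam * P_ML(c | h) + (1 - lam) * P_ML(c)
# In full deleted interpolation, lam would be tuned on held-out data (e.g. by EM);
# a fixed lam is used here purely to illustrate the mixture.
from collections import Counter

def train_interpolated_bigram(text, lam=0.7):
    unigrams = Counter(text)                    # letter counts
    bigrams = Counter(zip(text, text[1:]))      # adjacent letter-pair counts
    n = len(text)

    def prob(c, h):
        p_uni = unigrams[c] / n
        p_bi = bigrams[(h, c)] / unigrams[h] if unigrams[h] else 0.0
        return lam * p_bi + (1 - lam) * p_uni

    return prob

prob = train_interpolated_bigram("the quick brown fox jumps over the lazy dog")
```

Because both component distributions are proper probability distributions over the observed alphabet, the mixture also sums to one for any seen history, which is the property smoothing must preserve.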
Pages: 171-192
Page count: 22