Statistical language models of Lithuanian based on word clustering and morphological decomposition

Cited by: 0
Authors
Vaiciunas, A
Kaminskas, V
Raskinis, G
Affiliations
[1] Vytautas Magnus Univ, Dept Appl Informat, LT-3035 Kaunas, Lithuania
[2] Vytautas Magnus Univ, Ctr Computat Linguist, LT-3000 Kaunas, Lithuania
Keywords
language models; n-grams; class-based models; morphology; inflections; interpolation; perplexity reduction; out-of-vocabulary words
DOI
Not available
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This paper describes our research on statistical language modeling of Lithuanian. We investigated whether sparse n-gram models of the highly inflected Lithuanian language can be improved by interpolating them with more complex n-gram models based on word clustering and morphological word decomposition. Words, word base forms and part-of-speech tags were clustered into 50 to 5000 automatically generated classes. Multiple 3-gram and 4-gram class-based language models were built and evaluated on a Lithuanian text corpus of 85 million words. Class-based models linearly interpolated with the 3-gram model reduced perplexity by up to 13% compared with the baseline 3-gram model. Morphological models decreased the out-of-vocabulary word rate from 1.5% to 1.02%.
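The two measurements named in the abstract, perplexity under linear interpolation and the change in out-of-vocabulary rate, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the interpolation weight `lam`, and the per-token probabilities are hypothetical values chosen for illustration; only the 1.5% and 1.02% out-of-vocabulary rates come from the abstract.

```python
import math


def interpolate(p_word, p_class, lam):
    """Linearly interpolate two model probabilities for the same token.

    lam is the weight of the word-based model (an assumed, untuned value here).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_word + (1.0 - lam) * p_class


def perplexity(probs):
    """Perplexity: exp of the mean negative log-probability per token."""
    if not probs:
        raise ValueError("probs must be non-empty")
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


def relative_reduction(before, after):
    """Relative reduction of a quantity, as a fraction of its starting value."""
    return (before - after) / before


# Hypothetical per-token probabilities assigned by a word 3-gram model and by
# a class-based model to the same four test tokens (illustrative values only).
p_word = [0.10, 0.02, 0.30, 0.01]
p_class = [0.08, 0.06, 0.20, 0.05]

mixed = [interpolate(w, c, lam=0.6) for w, c in zip(p_word, p_class)]

baseline_ppl = perplexity(p_word)
mixed_ppl = perplexity(mixed)
ppl_reduction = relative_reduction(baseline_ppl, mixed_ppl)

# Out-of-vocabulary rates reported in the abstract: 1.5% before, 1.02% after.
oov_reduction = relative_reduction(1.5, 1.02)
```

In this toy data the interpolated model helps because the class-based model assigns more mass to the tokens that the word model treats as rare, which is the effect interpolation is meant to exploit on sparse data. Applied to the reported out-of-vocabulary rates, `relative_reduction(1.5, 1.02)` gives 0.32, so the drop of 0.48 percentage points is a relative reduction of about 32%.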
Pages: 565-580
Page count: 16