Statistical language models of Lithuanian based on word clustering and morphological decomposition

Cited by: 0

Authors:

Vaiciunas, A

Kaminskas, V

Raskinis, G

Affiliations:

[1] Vytautas Magnus Univ, Dept Appl Informat, LT-3035 Kaunas, Lithuania

[2] Vytautas Magnus Univ, Ctr Computat Linguist, LT-3000 Kaunas, Lithuania

Source:

INFORMATICA | 2004 / Vol. 15 / No. 4

Keywords:

language models; n-grams; class-based models; morphology; inflections; interpolation; perplexity reduction; out-of-vocabulary words

DOI:

Not available

Chinese Library Classification (CLC):

TP [Automation technology, computer technology]

Subject classification code:

0812

Abstract:

This paper describes our research on statistical language modeling of Lithuanian. We investigated the idea of improving sparse n-gram models of the highly inflected Lithuanian language by interpolating them with complex n-gram models based on word clustering and morphological word decomposition. Words, word base forms, and part-of-speech tags were clustered into 50 to 5000 automatically generated classes. Multiple 3-gram and 4-gram class-based language models were built and evaluated on a Lithuanian text corpus containing 85 million words. Class-based models linearly interpolated with the 3-gram model reduced perplexity by up to 13% compared with the baseline 3-gram model. Morphological models decreased the out-of-vocabulary word rate from 1.5% to 1.02%.
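The two quantities the abstract relies on, linear interpolation of model probabilities and perplexity, can be illustrated with a minimal sketch. This is not the authors' implementation: the per-word probabilities, the weight `lam`, and the function names are hypothetical and chosen only for illustration, and the record does not say how the interpolation weights were actually estimated.

```python
import math


def interpolate(p_word, p_class, lam):
    """Linearly interpolate two probability estimates for the same word."""
    return lam * p_word + (1.0 - lam) * p_class


def perplexity(probs):
    """Perplexity of a word sequence from the probability a model gave each word."""
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))


# Hypothetical per-word probabilities on a four-word test sequence.
# The word 3-gram model is sparse (very low probability for rare words);
# the class-based model is smoother.
word_probs = [0.10, 0.001, 0.05, 0.002]
class_probs = [0.08, 0.010, 0.04, 0.015]

lam = 0.6  # hypothetical interpolation weight for the word 3-gram model

mixed_probs = [interpolate(pw, pc, lam) for pw, pc in zip(word_probs, class_probs)]

baseline_ppl = perplexity(word_probs)
mixed_ppl = perplexity(mixed_probs)

# Relative perplexity reduction of the interpolated model over the baseline.
reduction = 1.0 - mixed_ppl / baseline_ppl
print(f"baseline: {baseline_ppl:.1f}, interpolated: {mixed_ppl:.1f}, reduction: {reduction:.0%}")
```

In practice the weight would typically be tuned on held-out text rather than fixed by hand, and the toy reduction here is far larger than the up-to-13% figure reported in the abstract.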

Citation

Pages: 565-580

Page count: 16
