Croatian Language N-Gram System

Authors
Dembitz, Sandor [1]
Blaskovic, Bruno [1]
Gledec, Gordan [1]
Affiliation
[1] Univ Zagreb, Fac Elect Engn & Comp, HR-10000 Zagreb, Croatia
Source
ADVANCES IN KNOWLEDGE-BASED AND INTELLIGENT INFORMATION AND ENGINEERING SYSTEMS | 2012 / Vol. 243
Keywords
Croatian; lexical n-gram; language modeling; Heaps' law
DOI
10.3233/978-1-61499-105-2-696
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Large-scale n-gram models are available for only a small number of languages, and until now Croatian was not one of them. The research presented in this paper describes the development of an n-gram database system suitable for large-scale language modeling in Croatian. The n-gram collection process relies on Hascheck, the Croatian academic online spellchecker, which has been publicly available since 1993 and is today a popular language service with average daily traffic exceeding one million tokens. The approach demonstrated in this paper eliminates the need for n-gram data cleaning in the post-processing phase, which is a serious issue in other languages. The spellchecker dynamics allowed Heaps' law modeling to be applied to Croatian n-grams, which in turn enabled prediction of n-gram count growth.
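As a rough illustration of the Heaps' law modeling mentioned in the abstract, the sketch below fits the usual form V(N) = K * N^beta, where N is the number of tokens processed and V is the number of distinct n-grams observed, and then extrapolates to a larger N. It is not the authors' procedure: the log-log least-squares fit, the function names `fit_heaps` and `predict_distinct`, and the synthetic data points are all assumptions made for the example.

```python
import math


def fit_heaps(tokens_seen, distinct_counts):
    """Fit V = K * N**beta by ordinary least squares on log-transformed data.

    tokens_seen and distinct_counts are hypothetical measurements: the
    cumulative token count N and the distinct n-gram count V at each point.
    """
    xs = [math.log(n) for n in tokens_seen]
    ys = [math.log(v) for v in distinct_counts]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                          # slope in log-log space
    k = math.exp(mean_y - beta * mean_x)      # intercept mapped back to K
    return k, beta


def predict_distinct(k, beta, n_tokens):
    """Extrapolate the distinct n-gram count for a given token count."""
    return k * n_tokens ** beta


# Synthetic data generated from K = 50, beta = 0.6 (illustrative values only,
# not figures from the paper).
sizes = [10**3, 10**4, 10**5, 10**6]
counts = [50 * n ** 0.6 for n in sizes]

k, beta = fit_heaps(sizes, counts)
forecast = predict_distinct(k, beta, 10**7)
```

With real measurements the points will not lie exactly on a line in log-log space, so the fitted K and beta should be treated as estimates, and any forecast far beyond the observed range deserves caution.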
Pages: 696-705
Page count: 10