Croatian Language N-Gram System

Cited by: 0

Authors:

Dembitz, Sandor [1]

Blaskovic, Bruno [1]

Gledec, Gordan [1]

Affiliation:

[1] Univ Zagreb, Fac Elect Engn & Comp, HR-10000 Zagreb, Croatia

Source:

ADVANCES IN KNOWLEDGE-BASED AND INTELLIGENT INFORMATION AND ENGINEERING SYSTEMS | 2012 / Vol. 243

Keywords:

Croatian; lexical n-gram; language modeling; Heaps' law;

DOI:

10.3233/978-1-61499-105-2-696

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Subject classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Large-scale n-gram models are available for only a small number of languages. Until now, Croatian has not been one of them. The research presented in this paper describes the development of an n-gram database system suitable for large-scale language modeling in Croatian. The n-gram collection process relies on Hascheck, the Croatian academic online spellchecker, which has been publicly available since 1993 and is today a popular language service, with average daily traffic exceeding one million tokens. The approach demonstrated in this paper eliminates the need for n-gram data cleaning in the post-processing phase, which is a serious issue for other languages. The spellchecker dynamics allowed Heaps' law modeling to be applied to Croatian n-grams, which enabled the prediction of n-gram count growth.
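Illustrative note: Heaps' law is commonly written as V(N) = K * N^beta, where N is the number of tokens processed and V(N) is the number of distinct items (here, n-grams) observed. The sketch below shows one way such a model could be fitted and used to extrapolate n-gram count growth, using ordinary least squares on log-transformed values. It is not the authors' procedure: the function names `fit_heaps` and `predict_heaps` are hypothetical, and the sample data are synthetic values generated from assumed parameters (K = 30, beta = 0.6), not figures from the paper.

```python
import math


def fit_heaps(tokens_seen, distinct_ngrams):
    """Fit V = K * N**beta by least squares on log V = log K + beta * log N.

    Hypothetical helper for illustration; returns the pair (K, beta).
    """
    xs = [math.log(n) for n in tokens_seen]
    ys = [math.log(v) for v in distinct_ngrams]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    # Slope and intercept of the regression line in log-log space.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


def predict_heaps(k, beta, n_tokens):
    """Predicted number of distinct n-grams after n_tokens tokens."""
    return k * n_tokens ** beta


# Synthetic observations (assumed K = 30, beta = 0.6), NOT data from the paper.
sample_n = [10**3, 10**4, 10**5, 10**6]
sample_v = [30.0 * n ** 0.6 for n in sample_n]

k_hat, beta_hat = fit_heaps(sample_n, sample_v)
# Extrapolate to a corpus ten times larger than the largest observation.
forecast = predict_heaps(k_hat, beta_hat, 10**7)
```

With real measurements, the pairs (N, V) would come from periodic snapshots of the n-gram database, and the fitted beta would indicate how quickly new n-grams keep appearing as traffic accumulates.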


Pages: 696-705

Page count: 10