Development of the N-gram Model for Azerbaijani Language

Cited by: 0
Authors
Bannayeva, Aliya [1 ]
Aslanov, Mustafa [1 ]
Affiliations
[1] ADA Univ, Sch Informat & Technol, Baku, Azerbaijan
Source
2020 IEEE 14TH INTERNATIONAL CONFERENCE ON APPLICATION OF INFORMATION AND COMMUNICATION TECHNOLOGIES (AICT2020) | 2020
Keywords
N-grams; Markov Model; word prediction; Azerbaijani language;
DOI
10.1109/AICT50176.2020.9368645
Chinese Library Classification
TP301 [Theory, Methods];
Subject Classification Code
081202;
Abstract
This research focuses on a text prediction model for the Azerbaijani language. A parsed and cleaned dump of the Azerbaijani Wikipedia is used as the corpus for the language model. In total, the corpus contains more than a million distinct words and sentences, and over seven hundred million characters. For the language model itself, a statistical n-gram model is implemented. N-grams are contiguous sequences of n words or characters drawn from a given sample of text or speech. A Markov chain is used as the model to predict the next word: under the Markov assumption, the probability of the next word depends only on the preceding words within the n-gram, rather than on the entire corpus history. This simplifies the task and reduces computational overhead while still producing sensible results. Logically, the higher the n in the n-grams, the more sensible the resulting prediction. Concretely, bigrams, trigrams, quadgrams, and fivegrams are implemented. The model is evaluated intrinsically by computing its perplexity.
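
To make the approach concrete, here is a minimal sketch in Python (not the authors' code): it builds an unsmoothed bigram Markov chain over a toy corpus standing in for the Wikipedia dump, samples the next word from P(w | prev) = C(prev, w) / C(prev), and computes perplexity PP(W) = P(w_1 ... w_N)^(-1/N) in log space. The toy data and all names are illustrative assumptions.

import math
import random
from collections import Counter

# Toy stand-in for the parsed and cleaned Azerbaijani Wikipedia corpus.
corpus = "bu bir test bu bir misal bu test bir misal".split()

# Count bigram occurrences and the contexts (first words) they condition on.
bigrams = Counter(zip(corpus, corpus[1:]))
contexts = Counter(corpus[:-1])

def next_word(prev):
    """Sample the next word in proportion to P(w | prev) = C(prev, w) / C(prev)."""
    candidates = {w: c for (p, w), c in bigrams.items() if p == prev}
    if not candidates:
        return None
    words, counts = zip(*candidates.items())
    return random.choices(words, weights=counts)[0]

def perplexity(test):
    """Perplexity PP(W) = P(w_1..w_N)^(-1/N), accumulated in log space."""
    log_prob = 0.0
    for prev, w in zip(test, test[1:]):
        # An unseen pair gives probability 0 and log() fails; real models smooth.
        p = bigrams[(prev, w)] / contexts[prev]
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(test) - 1))

print(next_word("bu"))                                # e.g. 'bir' or 'test'
print(round(perplexity("bu bir misal".split()), 3))   # 1.5 on this toy data

The same counting scheme extends to the trigrams, quadgrams, and fivegrams mentioned in the abstract by widening the context to a tuple of the preceding n-1 words; practical implementations also add smoothing so unseen n-grams do not receive zero probability.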
Pages: 5