Development of the N-gram Model for Azerbaijani Language

Cited by: 0
Authors
Bannayeva, Aliya [1 ]
Aslanov, Mustafa [1 ]
Affiliations
[1] ADA Univ, Sch Informat & Technol, Baku, Azerbaijan
Source
2020 IEEE 14TH INTERNATIONAL CONFERENCE ON APPLICATION OF INFORMATION AND COMMUNICATION TECHNOLOGIES (AICT2020) | 2020
Keywords
N-grams; Markov Model; word prediction; Azerbaijani language;
DOI
10.1109/AICT50176.2020.9368645
Chinese Library Classification
TP301 [Theory, Methods];
Subject Classification Code
081202;
Abstract
This research focuses on a text prediction model for the Azerbaijani language. A parsed and cleaned dump of the Azerbaijani Wikipedia is used as the corpus for the language model. In total, the corpus contains more than a million distinct words and sentences, and over seven hundred million characters. For the language model itself, a statistical n-gram model is implemented. N-grams are contiguous sequences of n words or characters drawn from a given sample of text or speech. A Markov chain is used as the model to predict the next word: under the Markov assumption, the probability of the next word depends only on the preceding words within the n-gram, rather than on the entire corpus history. This simplifies the task and reduces computational overhead while still producing sensible results. Logically, the higher the n in the n-grams, the more sensible the resulting prediction. Concretely, bigrams, trigrams, quadgrams, and fivegrams are implemented. The model is evaluated intrinsically by computing its perplexity.
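
To make the approach concrete, here is a minimal sketch in Python (not the authors' code): it builds an unsmoothed bigram Markov chain over a toy corpus standing in for the Wikipedia dump, samples the next word from P(w | prev) = C(prev, w) / C(prev), and computes perplexity PP(W) = P(w_1 ... w_N)^(-1/N) in log space. The toy data and all names are illustrative assumptions.

import math
import random
from collections import Counter

# Toy stand-in for the parsed and cleaned Azerbaijani Wikipedia corpus.
corpus = "bu bir test bu bir misal bu test bir misal".split()

# Count bigram occurrences and the contexts (first words) they condition on.
bigrams = Counter(zip(corpus, corpus[1:]))
contexts = Counter(corpus[:-1])

def next_word(prev):
    """Sample the next word in proportion to P(w | prev) = C(prev, w) / C(prev)."""
    candidates = {w: c for (p, w), c in bigrams.items() if p == prev}
    if not candidates:
        return None
    words, counts = zip(*candidates.items())
    return random.choices(words, weights=counts)[0]

def perplexity(test):
    """Perplexity PP(W) = P(w_1..w_N)^(-1/N), accumulated in log space."""
    log_prob = 0.0
    for prev, w in zip(test, test[1:]):
        # An unseen pair gives probability 0 and log() fails; real models smooth.
        p = bigrams[(prev, w)] / contexts[prev]
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(test) - 1))

print(next_word("bu"))                                # e.g. 'bir' or 'test'
print(round(perplexity("bu bir misal".split()), 3))   # 1.5 on this toy data

The same counting scheme extends to the trigrams, quadgrams, and fivegrams mentioned in the abstract by widening the context to a tuple of the preceding n-1 words; practical implementations also add smoothing so unseen n-grams do not receive zero probability.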
Pages: 5