Text Compression for Myanmar Information Retrieval

Cited by: 0
Authors
Lin, Nay [1]
Vitaly, Kudinov A. [2]
Soe, Yan Naing [3]
Affiliations
[1] Kursk State Univ, Dept Software & Adm Informat Syst, Kursk, Russia
[2] Kursk State Agr Acad, Kursk, Russia
[3] Southwest State Univ, Dept Mech Mechatron & Robot, Kursk, Russia
Source
NLPIR 2019: 2019 3RD INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND INFORMATION RETRIEVAL | 2019
Keywords
vocabulary file; indexing; Myanmar Natural Language Processing; Myanmar information retrieval; Text Compression; ETDC; Boyer Moore pattern matching;
DOI
10.1145/3342827.3342830
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Myanmar word segmentation is an important task in constructing the dictionary file used for Myanmar information retrieval and Myanmar text compression. Although word segmentation methods based on dictionaries and orthography already exist for the Myanmar language, their performance depends on the coverage of the dictionary and the training dataset. Limited coverage can cause the out-of-vocabulary (OOV) problem, which lowers precision and recall in information retrieval. In addition, before Myanmar text can be compressed, the words in the text need to be recognized. In this paper, we propose a new method for Myanmar word segmentation that uses a local statistical dataset without any additional data (e.g., a training corpus), and a new compressed Myanmar Information Retrieval (MIR) model that uses the End Tagged Dense Code (ETDC) text compression method. The experimental results show that, in the evaluation of the vocabulary file, the method achieves a precision of 75%, a recall of 87%, and an F-measure of 80%, with an average compression ratio of 32% for Myanmar-language texts.
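For readers unfamiliar with End Tagged Dense Code, the following is a minimal sketch of the general ETDC scheme, not the authors' implementation. It assumes the text has already been segmented into word tokens (the segmentation method proposed in the paper is not reproduced here), and the function names `etdc_encode`, `etdc_decode`, `compress`, and `decompress` are hypothetical names chosen for illustration. Each word is ranked by frequency; its codeword is a byte sequence in which only the last byte has its high bit set, so codeword boundaries can be found by scanning for bytes of value 128 or more.

```python
from collections import Counter


def etdc_encode(rank: int) -> bytes:
    """Return the ETDC codeword for a 0-based frequency rank."""
    # Last byte: high bit set (value >= 128) marks the end of the codeword.
    out = [128 + rank % 128]
    rank = rank // 128 - 1
    while rank >= 0:
        # Earlier bytes: high bit clear (value < 128).
        out.append(rank % 128)
        rank = rank // 128 - 1
    return bytes(reversed(out))


def etdc_decode(code: bytes) -> int:
    """Invert etdc_encode for a single complete codeword."""
    rank = 0
    for b in code[:-1]:
        rank = (rank + b + 1) * 128
    return rank + code[-1] - 128


def compress(tokens):
    """Build a frequency-ranked vocabulary and encode the token stream."""
    # Most frequent token gets rank 0 and therefore the shortest codeword.
    vocab = [word for word, _ in Counter(tokens).most_common()]
    rank = {word: i for i, word in enumerate(vocab)}
    data = b"".join(etdc_encode(rank[t]) for t in tokens)
    return vocab, data


def decompress(vocab, data):
    """Split the byte stream at end-tagged bytes and look up each word."""
    tokens, start = [], 0
    for pos, b in enumerate(data):
        if b >= 128:  # end of one codeword
            tokens.append(vocab[etdc_decode(data[start:pos + 1])])
            start = pos + 1
    return tokens
```

In this sketch the 128 most frequent words take one byte each and the next 128 × 128 words take two bytes. Because the end of every codeword is tagged, a byte-oriented pattern matcher such as Boyer-Moore can in principle search the compressed stream for a word's codeword directly, which is the property that makes this family of codes attractive for retrieval over compressed text.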
Pages: 62-67
Number of pages: 6