Dataset of stopwords extracted from Uzbek texts

Cited by: 8
Authors
Madatov, Khabibulla [1]
Bekchanov, Shukurla [1]
Vicic, Jernej [2, 3]
Affiliations
[1] Urgench State Univ, 14 Kh Alimdjan str, Urgench city 220100, Uzbekistan
[2] Fran Ramovs Inst, Slovenian Acad Sci & Arts, Res Ctr, Novi trg 2, Ljubljana 1000, Slovenia
[3] Univ Primorska, FAMNIT, Glagoljaska 8, Koper 6000, Slovenia
Keywords
Stop words; Machine Learning; Unigram; Bigram; Collocation;
DOI
10.1016/j.dib.2022.108351
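Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Subject classification codes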
07 ; 0710 ; 09 ;
Abstract
Filtering stop words is an important task when processing text queries to search for information in large data sets. It reduces the search space without losing semantic meaning. Stop words, which have only grammatical roles and do not contribute to information content, still add to the complexity of the query. Existing mathematical models used to tackle this problem are not suitable for all families of natural languages [1]. For example, they do not cover the language families to which Uzbek belongs. The present work offers a collocation method for this problem, aimed at language families that include Uzbek. The method concerns the so-called agglutinative languages, in which recognizing stop words is much more difficult because the stop words are "masked" in the text. In this work, the unigram, bigram, and collocation methods are applied to the "School corpus", which corresponds to the type of languages being studied. (C) 2022 Published by Elsevier Inc.
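The abstract names the unigram and bigram methods but does not describe how candidates are scored. As a rough illustration only, the sketch below ranks stop word candidates by raw unigram frequency and by adjacent-pair (bigram) counts. The function names, the whitespace tokenizer, and the toy sentence are assumptions made for this example; they are not taken from the paper or from the "School corpus".

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def unigram_candidates(tokens, top_k):
    """Return the top_k tokens with their relative frequency."""
    counts = Counter(tokens)
    total = len(tokens)
    return [(word, count / total) for word, count in counts.most_common(top_k)]


def bigram_candidates(tokens, top_k):
    """Return the top_k adjacent token pairs with their raw counts."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(top_k)


# Toy sentence invented for illustration, not corpus data.
tokens = tokenize("bu kitob va bu daftar va bu qalam")

print(unigram_candidates(tokens, 2))
print(bigram_candidates(tokens, 1))
```

High-frequency tokens and pairs are only candidates here. In an agglutinative language a function word may also be attached to a stem as a suffix, which a plain whitespace tokenizer like this one will not detect.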
Pages: 7
Related papers
13 in total
[1]   ON CENTRAL LIMIT THEOREM FOR PRIME DIVISOR FUNCTION [J].
BILLINGSLEY, P .
AMERICAN MATHEMATICAL MONTHLY, 1969, 76 (02) :132-+
[2]  
Metin SK, 2017, ANADOLU UNIV BILIM T, V18, P1, DOI 10.18038/aubtda.322136
[3]  
Kumova Metin S., 2017, ANADOLU U J SCI TECH, V18, P1, DOI 10.18038/AUBTDA.322136
[4]  
Madatov S.B. Xabibulla, COMPUTER LINGUISTICS, V1, P1
[5]  
Matlatipov S., 2016, OZMU XABARLARI, V2
[6]  
Ousirimaneechai N., 2018, INT J MACH LEARN COM, V8
[7]  
Pradana AW, 2019, KINETIK GAME TECHNOL
[8]   Multi-Class Text Classification of Uzbek News Articles using Machine Learning [J].
Rabbimov, I. M. ;
Kobilov, S. S. .
IV INTERNATIONAL SCIENTIFIC AND TECHNICAL CONFERENCE MECHANICAL SCIENCE AND TECHNOLOGY UPDATE (MSTU-2020), 2020, 1546
[9]   Uzbek News Categorization using Word Embeddings and Convolutional Neural Networks [J].
Rabbimov, Ilyos ;
Kobilov, Sami ;
Mporas, Iosif .
2020 IEEE 14TH INTERNATIONAL CONFERENCE ON APPLICATION OF INFORMATION AND COMMUNICATION TECHNOLOGIES (AICT2020), 2020,
[10]   Information Retrieval for Gujarati Language Using Cosine Similarity Based Vector Space Model [J].
Rakholia, Rajnish M. ;
Saini, Jatinderkumar R. .
PROCEEDINGS OF THE 5TH INTERNATIONAL CONFERENCE ON FRONTIERS IN INTELLIGENT COMPUTING: THEORY AND APPLICATIONS, (FICTA 2016), VOL 2, 2017, 516 :1-9