Developing named entity recognition algorithms for Uzbek: Dataset insights and implementation

Cited by: 6
Authors
Mengliev, Davlatyor [1, 2]
Barakhnin, Vladimir [1, 2, 3]
Abdurakhmonova, Nilufar [4]
Eshkulov, Mukhriddin [5]
Affiliations
[1] Tashkent Univ Informat Technol, Urgench Branch, 110 Al Khorezmi Str, Urgench 220100, Uzbekistan
[2] Novosibirsk State Univ, 2 Pirogova Str, Novosibirsk 630090, Russia
[3] Fed Res Ctr Informat & Computat Technol, 6 Acad MA Lavrentiev Ave, Novosibirsk 630090, Russia
[4] Natl Univ Uzbekistan, 4 Univ St, Tashkent 100174, Uzbekistan
[5] Jizzakh Polytech Inst, 4 Islom Karimov Str, Jizzakh 130100, Uzbekistan
Keywords
Named entity; Low-resource languages; Uzbek language; Language corpus; Linguistic research
DOI
10.1016/j.dib.2024.110413
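Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Subject classification codes
07; 0710; 09
Abstract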
This paper presents a dataset and approaches to named entity recognition (NER) in the Uzbek language, a resource-constrained language environment. Despite the growth of NLP applications, Uzbek remains underrepresented, which underscores the importance of this work. The dataset includes 1,160 sentences with nearly 19,000 word forms annotated for parts of speech and named entities, making it a valuable resource for linguistic research and machine learning applications in Uzbek. For practical application and experiments, the authors have also developed two algorithms that, using this dictionary, identify named entities in Uzbek-language texts. The authors describe the methodology for creating the dataset, the design of the algorithms, and their application to the Uzbek language. This study not only provides an important dataset for future NER tasks in Uzbek, but also offers a methodological basis for vocabulary-based NER or machine learning NER in other low-resource languages (e.g. Karakalpak). The dataset and algorithms can be used to build applications such as improved chatbot systems, text mining applications and other analytical tools for the Uzbek language, contributing to the development of these areas in the region for which the solutions are developed. (c) 2024 The Author(s). Published by Elsevier Inc. This is an open access article under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/)
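The abstract mentions vocabulary-based NER without giving details. As a rough illustration of that general idea only, and not the authors' algorithms, the sketch below looks up token spans in a small gazetteer and prefers the longest match. The gazetteer entries, the tag names `PER` and `LOC`, and the functions `tokenize` and `tag_entities` are all hypothetical and are not taken from the paper or its dataset.

```python
import re

# Hypothetical toy gazetteer (assumption): the paper's real dictionary
# and tag set are not reproduced in this record.
GAZETTEER = {
    ("toshkent",): "LOC",
    ("samarqand",): "LOC",
    ("alisher", "navoiy"): "PER",
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def tokenize(text):
    # Split into runs of letters, keeping word-internal apostrophes,
    # which occur in Uzbek Latin script (e.g. o'zbek).
    return re.findall(r"[^\W\d_]+(?:['ʻ’][^\W\d_]+)*", text)


def tag_entities(text):
    """Return (surface form, label) pairs found by exact gazetteer lookup."""
    tokens = tokenize(text)
    lowered = [t.lower() for t in tokens]
    entities = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first so multiword names win.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            label = GAZETTEER.get(tuple(lowered[i:i + n]))
            if label is not None:
                entities.append((" ".join(tokens[i:i + n]), label))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return entities


print(tag_entities("Alisher Navoiy va Toshkent"))
```

Exact lookup of this kind misses inflected forms such as a place name carrying a case suffix, so a practical system for an agglutinative language like Uzbek would likely need stemming or suffix handling on top of it.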
Pages: 8