Learned Text Representation for Amharic Information Retrieval and Natural Language Processing

Cited by: 5
Authors
Yeshambel, Tilahun [1 ]
Mothe, Josiane [2 ]
Assabie, Yaregal [3 ]
Affiliations
[1] Addis Ababa Univ, IT Doctoral Program, POB 1176, Addis Ababa, Ethiopia
[2] Univ Toulouse Jean Jaures, Composante INSPE, IRIT, UMR5505 CNRS, 118 Rte Narbonne, F-31400 Toulouse, France
[3] Addis Ababa Univ, Dept Comp Sci, POB 1176, Addis Ababa, Ethiopia
Keywords
word embeddings; BERT; pre-trained Amharic BERT model; query expansion; learning text representation; text classification; fine-tuning;
DOI
10.3390/info14030195
CLC Classification
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Over the past few years, word embeddings and bidirectional encoder representations from transformers (BERT) models have brought better solutions to learning text representations for natural language processing (NLP) and other tasks. Many NLP applications rely on pre-trained text representations, leading to the development of a number of neural network language models for various languages. However, this is not the case for Amharic, which is known to be a morphologically complex and under-resourced language; usable pre-trained models for automatic Amharic text processing are not available. This paper presents an investigation into learned text representation for information retrieval and NLP tasks using word embeddings and BERT language models. We explored the most commonly used word embedding methods, including word2vec, GloVe, and fastText, as well as the BERT model. We investigated the performance of query expansion using word embeddings, and we analyzed the use of a pre-trained Amharic BERT model for masked language modeling, next sentence prediction, and text classification tasks. Amharic ad hoc information retrieval test collections that contain word-based, stem-based, and root-based text representations were used for evaluation. We conducted a detailed empirical analysis of the usability of word embeddings and BERT models on word-based, stem-based, and root-based corpora. Experimental results show that word-based query expansion and language modeling perform better than their stem-based and root-based counterparts, and that fastText outperforms the other word embedding methods on the word-based corpus.
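As a rough illustration of the kind of embedding-based query expansion the abstract describes, the sketch below trains a subword-aware fastText model with gensim on a toy word-based Amharic corpus and appends each query term's nearest neighbours to the query. The toy corpus, hyperparameters, and the helper name expand_query are illustrative assumptions, not the paper's actual experimental setup.

```python
# Minimal sketch of embedding-based query expansion with fastText (gensim 4.x).
# The corpus, vector size, and expansion depth are placeholders for illustration.
from gensim.models import FastText

# Toy word-based corpus: each document is a list of surface-form Amharic tokens.
corpus = [
    ["ኢትዮጵያ", "አዲስ", "አበባ", "ከተማ"],
    ["አማርኛ", "ቋንቋ", "ኢትዮጵያ"],
    ["ከተማ", "አስተዳደር", "አዲስ", "አበባ"],
]

# Train subword-aware embeddings; fastText composes character n-gram vectors,
# which helps with out-of-vocabulary forms in morphologically rich languages.
model = FastText(sentences=corpus, vector_size=50, window=3,
                 min_count=1, sg=1, epochs=50)

def expand_query(query_terms, topn=3):
    """Append the topn nearest-neighbour terms of each query word."""
    expanded = list(query_terms)
    for term in query_terms:
        for neighbour, _score in model.wv.most_similar(term, topn=topn):
            if neighbour not in expanded:
                expanded.append(neighbour)
    return expanded

print(expand_query(["ኢትዮጵያ"]))  # original term plus its nearest neighbours
```

In practice the expanded terms would be fed to the retrieval model (e.g., as weighted query terms); the paper compares this kind of expansion across word-based, stem-based, and root-based corpora.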
Pages: 23
Related Papers
50 records
  • [1] Multilevel Natural Language Processing for Intelligent Information Retrieval and Text Mining
    I. V. Smirnov
    Scientific and Technical Information Processing, 2024, 51 (6) : 629 - 635
  • [2] Natural language processing and information retrieval
    Voorhees, EM
    INFORMATION EXTRACTION: TOWARDS SCALABLE, ADAPTABLE SYSTEMS, 1999, 1714 : 32 - 48
  • [3] Natural language processing for information retrieval
    Lewis, DD
    Sparck-Jones, K
    COMMUNICATIONS OF THE ACM, 1996, 39 (01) : 92 - 101
  • [5] An Analytical Analysis of Text Stemming Methodologies in Information Retrieval and Natural Language Processing Systems
    Jabbar, Abdul
    Iqbal, Sajid
    Tamimy, Manzoor Ilahi
    Rehman, Amjad
    Bahaj, Saeed Ali
    Saba, Tanzila
    IEEE ACCESS, 2023, 11 : 133681 - 133702
  • [6] Application of Natural Language Processing for Information Retrieval
    Xi, Su Mei
    Lee, Dae Jong
    Cho, Young Im
    PROCEEDINGS OF THE EIGHTEENTH INTERNATIONAL SYMPOSIUM ON ARTIFICIAL LIFE AND ROBOTICS (AROB 18TH '13), 2013, : 621 - 624
  • [7] Application of Natural Language Processing in Information Retrieval
    Rojas, Yenory
    Ferrandez, Antonio
    Peral, Jesus
    PROCESAMIENTO DEL LENGUAJE NATURAL, 2005, (34):
  • [8] Natural Language Processing for Spreadsheet Information Retrieval
    Flood, Derek
    NATURAL LANGUAGE PROCESSING AND INFORMATION SYSTEMS, 2010, 5723 : 309 - 312
  • [9] Thematic session: Natural language technology in mobile information retrieval and text processing user interfaces
    Kuehn, M
    Leong, MK
    Tanaka-Ishii, K
    NATURAL LANGUAGE PROCESSING - IJCNLP 2004, 2005, 3248 : 743 - 744
  • [10] The Use of Text Retrieval and Natural Language Processing in Software Engineering
    Haiduc, Sonia
    Arnaoudova, Venera
    Marcus, Andrian
    Antoniol, Giuliano
    2016 IEEE/ACM 38TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING COMPANION (ICSE-C), 2016, : 898 - 899