Learned Text Representation for Amharic Information Retrieval and Natural Language Processing

Cited by: 5
Authors
Yeshambel, Tilahun [1 ]
Mothe, Josiane [2 ]
Assabie, Yaregal [3 ]
Affiliations
[1] Addis Ababa Univ, IT Doctoral Program, POB 1176, Addis Ababa, Ethiopia
[2] Univ Toulouse Jean Jaures, Composante INSPE, IRIT, UMR5505 CNRS, 118 Rte Narbonne, F-31400 Toulouse, France
[3] Addis Ababa Univ, Dept Comp Sci, POB 1176, Addis Ababa, Ethiopia
Keywords
word embeddings; BERT; pre-trained Amharic BERT model; query expansion; learning text representation; text classification; fine-tuning;
DOI
10.3390/info14030195
CLC Classification
TP [Automation Technology, Computer Technology]
Discipline Code
0812
Abstract
Over the past few years, word embeddings and bidirectional encoder representations from transformers (BERT) models have brought better solutions to learning text representations for natural language processing (NLP) and other tasks. Many NLP applications rely on pre-trained text representations, leading to the development of a number of neural network language models for various languages. However, this is not the case for Amharic, which is known to be a morphologically complex and under-resourced language; usable pre-trained models for automatic Amharic text processing are not available. This paper presents an investigation into learned text representation for information retrieval and NLP tasks using word embeddings and BERT language models. We explored the most commonly used word embedding methods, including word2vec, GloVe, and fastText, as well as the BERT model. We investigated the performance of query expansion using word embeddings, and we analyzed the use of a pre-trained Amharic BERT model for masked language modeling, next sentence prediction, and text classification tasks. Amharic ad hoc information retrieval test collections that contain word-based, stem-based, and root-based text representations were used for evaluation. We conducted a detailed empirical analysis of the usability of word embeddings and BERT models on word-based, stem-based, and root-based corpora. Experimental results show that word-based query expansion and language modeling perform better than their stem-based and root-based counterparts, and that fastText outperforms the other word embedding methods on the word-based corpus.
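As a rough illustration of the kind of embedding-based query expansion the abstract describes, the sketch below trains a subword-aware fastText model with gensim on a toy word-based Amharic corpus and appends each query term's nearest neighbours to the query. The toy corpus, hyperparameters, and the helper name expand_query are illustrative assumptions, not the paper's actual experimental setup.

```python
# Minimal sketch of embedding-based query expansion with fastText (gensim 4.x).
# The corpus, vector size, and expansion depth are placeholders for illustration.
from gensim.models import FastText

# Toy word-based corpus: each document is a list of surface-form Amharic tokens.
corpus = [
    ["ኢትዮጵያ", "አዲስ", "አበባ", "ከተማ"],
    ["አማርኛ", "ቋንቋ", "ኢትዮጵያ"],
    ["ከተማ", "አስተዳደር", "አዲስ", "አበባ"],
]

# Train subword-aware embeddings; fastText composes character n-gram vectors,
# which helps with out-of-vocabulary forms in morphologically rich languages.
model = FastText(sentences=corpus, vector_size=50, window=3,
                 min_count=1, sg=1, epochs=50)

def expand_query(query_terms, topn=3):
    """Append the topn nearest-neighbour terms of each query word."""
    expanded = list(query_terms)
    for term in query_terms:
        for neighbour, _score in model.wv.most_similar(term, topn=topn):
            if neighbour not in expanded:
                expanded.append(neighbour)
    return expanded

print(expand_query(["ኢትዮጵያ"]))  # original term plus its nearest neighbours
```

In practice the expanded terms would be fed to the retrieval model (e.g., as weighted query terms); the paper compares this kind of expansion across word-based, stem-based, and root-based corpora.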
Pages: 23
Related Papers
50 records
  • [1] Multilevel Natural Language Processing for Intelligent Information Retrieval and Text Mining
    I. V. Smirnov
    Scientific and Technical Information Processing, 2024, 51 (6) : 629 - 635
  • [2] Natural language processing and information retrieval
    Voorhees, EM
    INFORMATION EXTRACTION: TOWARDS SCALABLE, ADAPTABLE SYSTEMS, 1999, 1714 : 32 - 48
  • [3] Natural language processing for information retrieval
    Lewis, DD
    Sparck-Jones, K
    COMMUNICATIONS OF THE ACM, 1996, 39 (01) : 92 - 101
  • [5] An Analytical Analysis of Text Stemming Methodologies in Information Retrieval and Natural Language Processing Systems
    Jabbar, Abdul
    Iqbal, Sajid
    Tamimy, Manzoor Ilahi
    Rehman, Amjad
    Bahaj, Saeed Ali
    Saba, Tanzila
    IEEE ACCESS, 2023, 11 : 133681 - 133702
  • [6] Application of Natural Language Processing for Information Retrieval
    Xi, Su Mei
    Lee, Dae Jong
    Cho, Young Im
    PROCEEDINGS OF THE EIGHTEENTH INTERNATIONAL SYMPOSIUM ON ARTIFICIAL LIFE AND ROBOTICS (AROB 18TH '13), 2013, : 621 - 624
  • [7] Application of Natural Language Processing in Information Retrieval
    Rojas, Yenory
    Ferrandez, Antonio
    Peral, Jesus
    PROCESAMIENTO DEL LENGUAJE NATURAL, 2005, (34):
  • [8] Natural Language Processing for Spreadsheet Information Retrieval
    Flood, Derek
    NATURAL LANGUAGE PROCESSING AND INFORMATION SYSTEMS, 2010, 5723 : 309 - 312
  • [9] Thematic session: Natural language technology in mobile information retrieval and text processing user interfaces
    Kuehn, M
    Leong, MK
    Tanaka-Ishii, K
    NATURAL LANGUAGE PROCESSING - IJCNLP 2004, 2005, 3248 : 743 - 744
  • [10] The Use of Text Retrieval and Natural Language Processing in Software Engineering
    Haiduc, Sonia
    Arnaoudova, Venera
    Marcus, Andrian
    Antoniol, Giuliano
    2016 IEEE/ACM 38TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING COMPANION (ICSE-C), 2016, : 898 - 899